System benchmarks
Brian Bramer, De Montfort University, UK
(bb@dmu.ac.uk)
Contents
1 Introduction
2 Benchmark requirements
3 Important considerations when comparing benchmark
results
4 MIPS (millions of instructions per second)
5 Whetstone benchmark
6 Dhrystone benchmark
7 Linpack benchmark
8 SPEC (Standard Performance Evaluation Corporation)
benchmarks
9 SSBA (Synthetic Suite
of Benchmarks from the AFUU)
10 TPC (Transaction
Processing Performance Council) benchmarks
11 Benchmarks for IBM PC compatible computers
12 Sources of Information
on the Internet
13 Conclusions
References
1 Introduction
When selecting a computer system to satisfy
an end-user requirement a formal procedure would be followed, for example
(Bramer 1989, http://www.cse.dmu.ac.uk/~bb/Teaching//ComputerSystems/ComputerSelection/ComputerSelection.html):
-
Carry out a feasibility study. The need for a new
installation or the possibility of upgrading an existing system is analysed
to determine cost effectiveness in terms of end-user requirements and advantages
gained, e.g. increased productivity of skilled staff, reduced product development
times, a more viable product, etc. The result of the feasibility study
will be a report to be submitted to senior management to request funds
to implement the proposed system.
-
Draw up a detailed requirements specification.
Once a budget has been agreed a requirements specification is drawn up
which describes in detail the facilities of the proposed system and the
acceptance criteria.
-
Draw up an ITT (invitation to tender).
From the requirements specification information is extracted which specifies
system facilities which are to be met by outside organisations (vendors
of hardware and software). This document is sent to a range of suitable
system vendors.
-
Receive tender documents from vendors.
The vendors describe their proposals for a system which will satisfy the
ITT. For a small system this will be a simple quotation.
-
Draw up a shortlist. Using information
in the tender documents a shortlist is formed which will be based on system
facilities, delivery times, maintenance offered, price, etc.
-
Finalise contract to purchase system.
For a small microcomputer system an order may immediately be issued. In
the case of a more complex system details may have to be discussed with
vendors on the shortlist and field evaluations carried out.
-
Order system
-
Install system and carry out acceptance
tests. In the case of a small system it will arrive, be installed and,
if working, put into immediate use. In the case of a large system acceptance
tests will be carried out to ensure that it meets the requirements specification.
-
Training. This may be in-house or offered
by the system vendor as part of the overall deal.
-
Systems maintenance. Daily/weekly maintenance
of system performed by in-house staff, e.g. disk backup, or hardware and
software maintenance performed by the system vendor or a specialist maintenance
organisation.
At various points during the above process
benchmark tests will be used to evaluate alternative computer configurations
to ensure that they meet the requirements generated during the feasibility
study. Benchmark tests may take two forms:
-
a number of tasks representative of the
end-user work load of the proposed system;
-
'standard' benchmark programs which are
used to measure common performance factors of the proposed hardware/software
configurations.
In the final analysis benchmarks based
on end-user tasks are the only real test of a system and will form the
basis of acceptance tests carried out during system installation (to ensure
that the requirements specification is satisfied). However, it is not always
possible to carry out benchmarks based on end-user tasks during the early
stages of the selection process (e.g. when vendors are demonstrating their
products):
-
the end-user packages may not be readily
available on all the configurations due to software licence problems or
the sheer logistics of mounting a large program on every configuration;
-
getting representative end-user data sets
onto every configuration may be very difficult;
-
the end-user requirements may be too broad
to allow the assessment of every task on every possible configuration,
e.g. in an educational environment where a laboratory may be used for a
wide spectrum of teaching tasks.
In such circumstances 'standard' benchmark
programs may be used as an initial filter by assessing common performance
parameters thus enabling the early weeding out of unsuitable configurations.
This paper discusses the important factors which determine system performance
and the standard benchmarks used to assess these.
2 Benchmark requirements
When purchasing computer systems to
satisfy an end-user application (Bramer 1989) it is important to be able
to compare systems not only in terms of cost but also of performance (processor,
disk, network, etc). Although in the final analysis systems should be evaluated
using end-user applications software and data sets (operated by end-users
if possible) it is useful as an initial filter to compare systems using
'standard' benchmark programs (Sill 1995). Such 'standard' benchmark programs
fall into two categories:
-
programs specially written to test particular
performance factors, e.g. Whetstone for floating point numerical, Dhrystone
for integer numerical and string processing;
-
end-user programs used to evaluate performance
factors which are important in particular application areas, e.g. Linpack.
In general, the best benchmark (Weicker
1990):
-
is written in a high-level language, making
it portable across different machines;
-
is representative of some programming
style or application (e.g. systems programming, numerical programming,
commercial programming);
-
can be measured easily;
-
has wide distribution.
Today benchmark systems are often made
up of a large number of tests (e.g. Power Meter for IBM/PC compatibles
tests the CPU, disk, video, etc.) or suites of programs (e.g. SPEC) to
provide a more thorough large scale test of a computer system.
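The requirement that a benchmark "can be measured easily" usually comes down to a small timing harness. The following is a minimal sketch in Python, not part of any standard benchmark: the names `time_workload` and `sample_workload` are assumptions made for illustration. It runs a workload several times and reports the best elapsed time, which reduces the influence of other activity on the machine.

```python
import time


def time_workload(workload, repeats=5):
    """Run workload() `repeats` times; return (best, all) elapsed times in seconds."""
    if repeats < 1:
        raise ValueError("repeats must be at least 1")
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        workload()
        timings.append(time.perf_counter() - start)
    return min(timings), timings


def sample_workload():
    """Hypothetical stand-in for a real benchmark kernel: a sum of squares."""
    return sum(i * i for i in range(100_000))


if __name__ == "__main__":
    best, timings = time_workload(sample_workload)
    print(f"best of {len(timings)} runs: {best:.6f} s")
```

Reporting the minimum rather than the mean is a design choice: background activity can only make a run slower, so the fastest run is usually the least disturbed one.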
3 Important considerations
when comparing benchmark results
Manufacturers often quote the results of 'standard' benchmark programs
in their sales literature and it is important to be aware of:
-
The precise hardware configuration used in terms of processor model, clock
speed, number of CPUs, memory size, cache size, video processor and memory,
bus, disk speed, disk cache, etc. A cache memory can have a major effect
on the performance of programs with large numbers of small loops and/or
subroutines. Even when carrying out benchmark tests in person (e.g. when
shortlisting tenders) one must take care to check the configuration being
tested. Vendors have been known to tender one system and loan a different
(more powerful) system during benchmark tests (e.g. with a faster processor,
larger cache memory, larger main memory, better quality screen, etc).
-
The operating system environment, e.g. OS version, filesystem, whether software
disk caches and compressors are enabled, number of concurrent users, etc.
-
The version of the benchmark being used, e.g. although the current Dhrystone
version is 2.1 some manufacturers still quote the results obtained (if
better) using version 1.1.
-
The language used. The same program implemented in different languages
may yield different execution times dependent upon the precise techniques
used to store variables, call subroutines, access arrays, etc.
-
The compiler used and the level of optimisation during compilation.
-
The library used. For example, the Whetstone floating point benchmark spends
40 to 50% of execution time in mathematical subroutines and the library
used for the test can alter results significantly; for instance, some systems have
two versions of the floating point subroutines; one which complies with
the IEEE floating-point standard and a second which is faster and may give
less accurate results. It is also worth noting that manufacturers are not
above 'tweaking' compilers and subroutine libraries to enhance the results
of 'standard' benchmarks.
Take care: the effects of different hardware configurations can often be
disguised by using a different version of the operating system. If running
benchmarks on machines of the same processor type it is wise to use the
same software configuration, e.g. when testing IBM/PC compatibles boot a
'standard' version of MS-DOS off a floppy disk.
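One practical way to act on the points above is to store a description of the test environment with every result, and to refuse to compare figures obtained with different benchmark versions. The sketch below is in Python and uses only the standard `platform` module; the function names `describe_environment` and `comparable` and the dictionary keys are assumptions made for illustration. Details such as cache size, memory size and disk configuration cannot be read portably from the standard library, so they would have to be entered by hand in the `notes` field.

```python
import platform


def describe_environment(benchmark_name, benchmark_version, notes=""):
    """Collect details that should accompany any quoted benchmark figure."""
    return {
        "benchmark": benchmark_name,
        "benchmark_version": benchmark_version,
        "machine": platform.machine(),
        "processor": platform.processor(),
        "operating_system": platform.platform(),
        "language": "Python " + platform.python_version(),
        "implementation": platform.python_implementation(),
        "notes": notes,  # e.g. cache size, memory size, optimisation level
    }


def comparable(record_a, record_b):
    """Two results are only worth comparing if benchmark and version agree."""
    return (record_a["benchmark"] == record_b["benchmark"]
            and record_a["benchmark_version"] == record_b["benchmark_version"])


if __name__ == "__main__":
    for key, value in describe_environment("Dhrystone", "2.1").items():
        print(f"{key}: {value}")
```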
4 MIPS (millions of instructions
per second)
Since the evolution of RISC machines the literal use of MIPS in terms of
the execution of processor instructions per second has lost all meaning
(Weicker 1990), and the term is sometimes said to stand for Meaningless
Indicator of Processor Speed. Consider for example the following statement
in a high level language (e.g. Pascal) where A, B and C are integer operands
in main memory:
A := B + C
A mainframe CISC machine capable of memory
to memory arithmetic operations upon three operands could use a single
instruction:
ADD B,C,A add B+C return result in A (all memory operands)
Microprocessors such as the Motorola MC68000
family (and Intel 8086) are more limited in that instructions may operate
only on two operands and one of the operands of an ADD instruction must
be in a processor register:
MOVE.W B,D0 load memory operand B into data register D0
ADD.W C,D0 add memory operand C to data register D0
MOVE.W D0,A return result into memory operand A
The only memory operations of a RISC machine
are LOAD and STORE so the code would appear:
LOAD B,reg1 load memory operand B into register 1
LOAD C,reg2 load memory operand C into register 2
ADD reg1,reg2,reg3 add B and C return result in register 3
STORE reg3,A return result to memory operand A
Assuming that the machines execute the
code in the same time and the mainframe was rated at 1 Mips the MC68000
would then be rated at 3 Mips and the RISC machine at 4 Mips. Because of
this problem the term Mips has often been redefined as 'VAX Mips' where
the performance is given relative to a VAX 11/780, which was generally 'rated'
at 1 Mips (the rating depends upon the compiler used, see Weicker 1990). Even
when VAX Mips are quoted it is important to know what programs form the
basis of comparison and what compilers were used on the VAX 11/780.
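The arithmetic behind these ratings can be checked with a short calculation. The sketch below is in Python; the function name `rated_mips` and the figure of one million statements per second are assumptions made for illustration, chosen so that the mainframe comes out at 1 Mips as in the text.

```python
def rated_mips(instructions_per_statement, statements_per_second):
    """Literal MIPS: machine instructions executed per second, in millions."""
    return instructions_per_statement * statements_per_second / 1_000_000


# Instructions needed for A := B + C on each machine, from the listings above.
INSTRUCTIONS = {"mainframe CISC": 1, "MC68000": 3, "RISC": 4}

# Assumption from the text: all machines execute the statement in the same
# time, here one million statements per second.
STATEMENTS_PER_SECOND = 1_000_000

if __name__ == "__main__":
    for machine, count in INSTRUCTIONS.items():
        mips = rated_mips(count, STATEMENTS_PER_SECOND)
        print(f"{machine}: {mips:.0f} Mips")
```

All three machines do exactly the same useful work per second, yet their literal Mips ratings differ by a factor of four, which is why the raw figure says little about performance.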
5 Whetstone benchmark
The Whetstone benchmark was the first
program written specifically to measure computer performance and was designed
to simulate floating point numerical applications:
-
it contains a large percentage of floating
point data and instructions;
-
a high percentage of execution time (approximately
50%) is spent in mathematical library functions;
-
the majority of its variables are global
and the test will not show up the advantages of architectures such as RISC
where the large number of processor registers enhance the handling of local
variables;
-
Whetstone contains a number of very tight
loops and the use of even fairly small instruction caches will enhance
performance considerably;
-
the original program was written in Fortran
using single or double precision calculations.
The source of the benchmark may be obtained
by ftp from netlib.att.com in directory /netlib/benchmark/whetstone*.
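To give a feel for why the mathematical library matters so much, the following Python sketch times a loop that does nothing but call library functions against a loop of plain floating point arithmetic. It is not the Whetstone benchmark: the expression is only loosely modelled on the style of calculation such a benchmark performs, the constants and function names are assumptions, and in an interpreted language loop overhead will blur the difference compared with compiled Fortran.

```python
import math
import time


def math_library_kernel(iterations, x=0.75, t1=0.50025):
    """Repeatedly apply sqrt(exp(log(x) / t1)); the work is all library calls."""
    for _ in range(iterations):
        x = math.sqrt(math.exp(math.log(x) / t1))
    return x


def arithmetic_kernel(iterations, x=0.75):
    """Comparable loop using only floating point add and multiply."""
    for _ in range(iterations):
        x = (x + 1.0) * 0.5
    return x


def elapsed(kernel, iterations):
    """Return the wall-clock time in seconds taken by kernel(iterations)."""
    start = time.perf_counter()
    kernel(iterations)
    return time.perf_counter() - start


if __name__ == "__main__":
    n = 200_000
    print(f"math library loop: {elapsed(math_library_kernel, n):.4f} s")
    print(f"plain arithmetic : {elapsed(arithmetic_kernel, n):.4f} s")
```

Swapping in a different implementation of `sqrt`, `exp` and `log` would change the first timing and leave the second untouched, which is the effect the text warns about.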
6 Dhrystone benchmark
The Dhrystone benchmark was designed
to test performance factors important in non numeric systems programming
(operating systems, compilers, wordprocessors, etc.):
-
it contains no floating point operations;
-
a considerable percentage of time is spent
in string functions making the test very dependent upon the way such operations
are performed (e.g. by in-line code, routines written in assembly language,
etc.), which also makes it susceptible to manufacturers' 'tweaking' of critical routines;
-
it contains hardly any tight loops so
in the case of very small caches the majority of instruction accesses
will be misses; however, the situation changes radically as soon as the
cache reaches a critical size and can hold the main measurement loop;
-
only a small amount of global data is
manipulated (as opposed to Whetstone).
There are two versions of the Dhrystone
benchmark. Version 1.1 contained some 'dead code' which could be removed
by optimising compilers. Version 2.1 corrected this and should be the version
used in practice. Some manufacturers, however, still quote the (better)
results of Version 1.1 so care must be taken when comparing Dhrystone performance
figures to check which version was used.
The source of the benchmark may be
obtained by ftp from ftp.nosc.mil in directory pub/aburto.
7 Linpack benchmark
The Linpack benchmark was derived from
a real application which originated as a collection of linear algebra subroutines
in Fortran. As one would expect it tests floating point performance and
results are presented in Mflops (millions of floating point operations
per second):
-
it has a large percentage of floating
point operations (note that division is not used);
-
it uses no mathematical functions (in
contrast to Whetstone);
-
there are no global variables; operations
being carried out on local variables or an array passed to subroutines
as a parameter;
-
it operates on a two-dimensional array
and when comparing results care must be taken to ensure that the same array
size was used;
-
results are for single or double precision
operations (which should be specified);
-
a large percentage (over 70%) of the execution
time is spent within a single function where even a small instruction cache
can alter results considerably;
-
the benchmark relies heavily on a package
of basic linear algebra subroutines (BLAS) which should be coded in Fortran
(as in the original): some vendors present results where the subroutines
have been rewritten in assembly language which can make a considerable
difference.
The source of the benchmark may be obtained
by ftp from netlib.att.com in directory /netlib/benchmark/linpack*.
A version of the Linpack floating point
program converted to C can be obtained from ftp.nosc.mil directory /pub/aburto;
source in clinpack.c and results in clinpack.dpr, clinpack.dpu, clinpack.spr,
and clinpack.spu.
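The way an Mflops figure is derived from a timed linear solve can be sketched as follows. This is a plain Python illustration, not a port of Linpack or of the BLAS: the function names are assumptions, the solver is ordinary Gaussian elimination with partial pivoting, and the operation count 2n^3/3 + 2n^2 is the figure conventionally used for an n by n solve rather than something stated in the text above.

```python
import random
import time


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting.

    `a` (a list of rows) and `b` are modified in place.
    """
    n = len(a)
    for k in range(n - 1):
        # Bring the largest remaining entry in column k up to row k.
        pivot = max(range(k, n), key=lambda r: abs(a[r][k]))
        if a[pivot][k] == 0.0:
            raise ValueError("matrix is singular")
        a[k], a[pivot] = a[pivot], a[k]
        b[k], b[pivot] = b[pivot], b[k]
        row_k = a[k]
        for i in range(k + 1, n):
            row_i = a[i]
            factor = row_i[k] / row_k[k]
            for j in range(k + 1, n):
                row_i[j] -= factor * row_k[j]  # inner loop dominates run time
            b[i] -= factor * b[k]
    # Back substitution.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        if a[i][i] == 0.0:
            raise ValueError("matrix is singular")
        s = b[i] - sum(a[i][j] * x[j] for j in range(i + 1, n))
        x[i] = s / a[i][i]
    return x


def linpack_flops(n):
    """Conventional operation count for an n x n solve: 2n^3/3 + 2n^2."""
    return 2.0 * n ** 3 / 3.0 + 2.0 * n ** 2


def mflops(n, seconds):
    """Millions of floating point operations per second for one solve."""
    if seconds <= 0:
        raise ValueError("elapsed time must be positive")
    return linpack_flops(n) / seconds / 1e6


if __name__ == "__main__":
    n = 100
    rng = random.Random(1)
    a = [[rng.random() for _ in range(n)] for _ in range(n)]
    b = [sum(row) for row in a]  # chosen so the exact solution is all ones
    start = time.perf_counter()
    x = solve(a, b)
    seconds = time.perf_counter() - start
    print(f"n = {n}, max error = {max(abs(v - 1.0) for v in x):.2e}")
    print(f"{mflops(n, seconds):.3f} Mflops")
```

Because the operation count grows with the cube of n while the data grows with its square, the same machine can report quite different Mflops for different array sizes, which is why the array size must always be quoted.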
8 SPEC (Standard Performance
Evaluation Corporation) benchmarks http://www.spec.org/
It has been recognised for some time
that benchmarks such as Whetstone and Dhrystone, which are small programs
in today's terms, are inadequate when attempting to evaluate the performance
of high powered computer systems running modern large scale software (e.g.
CAD design systems, CASE software engineering tools, large database environments,
AI programming environments, etc). For example:
-
due to the high code locality few page
faults will be generated under a virtual memory environment;
-
modern instruction and data caches can
be of such a size that the whole program and data can fit within them and
the tests then become totally unrepresentative of 'real life' programs.
SPEC (the Standard Performance Evaluation
Corporation) was initially formed by the manufacturers of professional
workstations: Apollo, Hewlett-Packard, MIPS and Sun (many other manufacturers
have since become members). Its aim was to "establish, maintain and endorse
a standardized set of relevant benchmarks that can be applied to the newest
generation of high-performance computers" (quoted from SPEC's bylaws).
In October 1989 SPEC released its first set of ten benchmark programs (SPEC
release 1, see Uniejewski 1990 for full details), supplied under licence
on magnetic tape and comprising approximately 150,000 lines of code.
Currently SPEC covers three groups,
each with their own benchmarks:
Open Systems Group (OSG)
Component- and system-level benchmarks in a UNIX / NT / VMS environment.
High Performance Group (HPG)
Benchmarking in a numeric computing environment, with emphasis on high-performance
numeric computing.
Graphics Performance Characterization Group (GPCG)
Benchmarks for graphical subsystems, OpenGL and X Windows.
Open Systems Group Current Benchmarks
SPEC CPU2000
The current release of SPEC's popular processor performance tests; the
successor to SPEC CPU95.
SPEC JBB2000
A Java Business Benchmark - SPEC's first server-side Java benchmark emulating
middle tier business logic.
SPEC JVM98
Benchmark suite for comparing Java virtual machine (JVM) client platforms.
SPEC SFS 2.0
"System: File Server", a test of NFS server performance.
SPEC WEB99
A standardized performance test for WWW servers, successor of SPECweb96.
SPEC CPU2000 is the next-generation
industry-standardized CPU-intensive benchmark suite. SPEC designed CPU2000
to provide a comparative measure of compute intensive performance
across the widest practical range of hardware. The implementation resulted
in source code benchmarks developed from real user applications. These
benchmarks measure the performance of the processor, memory and compiler
on the tested system.
SPEC CPU2000 comprises two sets (or
suites) of benchmarks:
CINT2000 for measuring compute-intensive integer performance, and CFP2000
for compute-intensive floating point performance. The two suites measure
the performance of a computer's processor, memory architecture and compiler. Improvements
to the new suites include longer run times and larger problems for benchmarks,
more application diversity, greater ease of use, and standard development
platforms that will allow SPEC to produce additional releases for other
operating systems.
Results
SPEC CPU2000 provides performance
measurements for system speed and throughput. The speed metric, SPECint2000,
measures how fast a machine completes running all of the CPU2000 benchmarks. The
throughput metric, SPECint_rate2000, measures how many tasks a computer
can complete in a given amount of time. SPEC CPU2000 has been designed
to measure throughput for single-processor, symmetric-multiprocessor, and
cluster systems.
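The text does not give the formula behind the speed metric, so the following Python sketch rests on a stated assumption: each benchmark's run time is turned into a ratio against a reference machine's time (scaled by 100), and the suite figure is the geometric mean of those ratios. The benchmark names and all the times below are invented for illustration and are not SPEC results.

```python
import math


def spec_ratio(reference_seconds, measured_seconds):
    """Ratio of reference time to measured time, scaled so the reference is 100."""
    if measured_seconds <= 0:
        raise ValueError("measured time must be positive")
    return 100.0 * reference_seconds / measured_seconds


def geometric_mean(values):
    """Geometric mean of a non-empty sequence of positive numbers."""
    values = list(values)
    if not values or any(v <= 0 for v in values):
        raise ValueError("need at least one value, all positive")
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Hypothetical (reference, measured) run times in seconds.
RUN_TIMES = {
    "bench_a": (1400.0, 700.0),
    "bench_b": (1800.0, 450.0),
    "bench_c": (1100.0, 1100.0),
}

if __name__ == "__main__":
    ratios = [spec_ratio(ref, run) for ref, run in RUN_TIMES.values()]
    print(f"suite figure: {geometric_mean(ratios):.1f}")
```

A geometric mean is used in this sketch so that no single benchmark with a very large ratio can dominate the combined figure, as it would with an arithmetic mean.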
The CINT2000 suite comprises 12 application-based
benchmarks written in C and C++ languages.
The CFP2000 floating-point suite comprises
14 CPU-intensive benchmarks written in FORTRAN (77
and 90) and C.
Cost (November 2000)
SPEC CPU2000 (CINT2000 and CFP2000) is available now on CD-ROM from
SPEC. The cost is $500 for new customers and $250 for current licensees.
Universities can acquire the product for $125.
9 SSBA (Synthetic Suite of Benchmarks
from the AFUU)
The SSBA is the result of the studies
of the AFUU (French Association of Unix Users) Benchmark Working Group.
The group set itself the goal of studying the problem of assessing
the performance of data processing systems: collecting as many as possible
of the tests available throughout the world, dissecting the code and results,
discussing their utility, fixing versions and supplying them with
comments and procedures.
It is claimed to be a simple and coherent
tool for end users and specialists, providing a clear and pertinent initial
approximation of the performance of UNIX systems.
10 TPC (Transaction Processing Performance Council)
benchmarks http://www.tpc.org
In general the SPEC benchmarks are suitable
for comparing the performance of high performance workstations and nodal
processors for specialist scientific or engineering numeric processing.
The TPC (Transaction Processing Performance Council) is a consortium (HP, IBM, DEC,
NEC, Fujitsu, Hitachi, Bull, etc.) working on industry wide benchmark standards
applicable to large multi-user transaction processing systems.
The Debit-Credit benchmark was
designed in the 1970's to test the performance of computer systems intended
to run a fledgling on-line teller network at the Bank of America. It was
proposed as an industrial standard by a group of industrial and academic
professionals in the paper "A measure of Transaction Processing Power"
(Datamation 1985).
In essence the Debit-Credit benchmark
represents the transaction processing load of a hypothetical bank with
one or more branches and multiple tellers. Transactions consist of debits
and credits to customers' accounts, with the system keeping track of
each customer's account, the balance of each teller and branch, and a history
of the bank's recent transactions. The major problem was that the proposed
Debit-Credit benchmark was not a real industrial standard, in that certain
aspects of the benchmark were very loosely defined whereas others were very
specific and not readily portable across systems. Vendors therefore took
liberties and redefined the benchmark to match the strengths of their particular
system configurations.
TPC-A (TPC's Debit-Credit Benchmark). Obsolete as of 6/6/95
In November 1989 TPC published its
first benchmark, TPC benchmark A (TPC-A), which tests the fundamental components
of an on-line transaction processing system (interactive terminal I/O,
disk I/O and database access) by simulating a hypothetical bank with a
single transaction type (debit/credit). TPC-A is based on a single, simple,
update-intensive transaction which performs three updates and one insert
across four tables. Transactions originate from terminals, with a requirement
of 100 bytes in and 200 bytes out. There is a fixed scaling between tps
rate, terminals, and database size. TPC-A requires an external RTE (remote
terminal emulator) to drive the SUT (system under test).
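The shape of the debit/credit transaction (three updates and one insert across four tables) can be sketched with an in-memory SQLite database from the Python standard library. This is only an illustration of the transaction profile, not a conforming TPC-A implementation: the table names, column names, tiny table sizes and function names are all assumptions, and there is no terminal emulation, scaling rule or timing.

```python
import sqlite3


def create_bank(conn, branches=1, tellers_per_branch=2, accounts_per_branch=5):
    """Create the four tables and fill them with zero balances."""
    conn.executescript("""
        CREATE TABLE branch  (id INTEGER PRIMARY KEY, balance INTEGER NOT NULL);
        CREATE TABLE teller  (id INTEGER PRIMARY KEY, branch_id INTEGER NOT NULL,
                              balance INTEGER NOT NULL);
        CREATE TABLE account (id INTEGER PRIMARY KEY, branch_id INTEGER NOT NULL,
                              balance INTEGER NOT NULL);
        CREATE TABLE history (account_id INTEGER, teller_id INTEGER,
                              branch_id INTEGER, delta INTEGER);
    """)
    for b in range(branches):
        conn.execute("INSERT INTO branch VALUES (?, 0)", (b,))
        for t in range(tellers_per_branch):
            conn.execute("INSERT INTO teller VALUES (?, ?, 0)",
                         (b * tellers_per_branch + t, b))
        for a in range(accounts_per_branch):
            conn.execute("INSERT INTO account VALUES (?, ?, 0)",
                         (b * accounts_per_branch + a, b))
    conn.commit()


def debit_credit(conn, account_id, teller_id, branch_id, delta):
    """One transaction: three updates and one insert, committed atomically."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute("UPDATE account SET balance = balance + ? WHERE id = ?",
                     (delta, account_id))
        conn.execute("UPDATE teller SET balance = balance + ? WHERE id = ?",
                     (delta, teller_id))
        conn.execute("UPDATE branch SET balance = balance + ? WHERE id = ?",
                     (delta, branch_id))
        conn.execute("INSERT INTO history VALUES (?, ?, ?, ?)",
                     (account_id, teller_id, branch_id, delta))
        row = conn.execute("SELECT balance FROM account WHERE id = ?",
                           (account_id,)).fetchone()
    return row[0]  # new account balance, as returned to the teller


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    create_bank(connection)
    print("new balance:", debit_credit(connection, 3, 1, 0, 50))
```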
TPC-B (TPC's Database Subsystem
Benchmark) Obsolete as of 6/6/95
TPC-B is a more specific test of the
database subsystem of a transaction system (TPC-A being a general on-line
transaction processing system performance benchmark). TPC-B uses the same
transaction profile and database schema as TPC-A, but eliminates the terminals
and reduces the amount of disk capacity which must be priced with the system.
TPC-B is significantly easier to run because an RTE is not required. In
effect, TPC-B is a stress test of the database
system's ability to handle transaction processing and sets out exact rules
by which vendors execute and report test results. TPC-B, like TPC-A, is
not intended to be representative of a complex modern transaction processing
environment, but to be used as an initial filter when comparing different system
configurations.
TPC-C (TPC's order processing benchmark)
Approved in July of 1992, TPC Benchmark
C is like TPC-A in that it is an on-line transaction processing (OLTP)
benchmark. However, TPC-C is more complex
than TPC-A because of its multiple
transaction types, more complex database and overall execution structure.
TPC-C involves a mix of five concurrent transactions
of different types and complexity
either executed on-line or queued for deferred execution. The database
is comprised of nine types of records with a wide range of
record and population sizes. TPC-C
is measured in transactions per minute (tpm).
TPC-C simulates a complete computing
environment where a population of terminal operators executes transactions
against a database. The benchmark is centered
around the principal activities (transactions)
of an order-entry environment. These transactions include entering and
delivering orders, recording payments, checking
the status of orders, and monitoring
the level of stock at the warehouses. While the benchmark portrays the
activity of a wholesale supplier, TPC-C is not limited to
the activity of any particular business
segment, but rather represents any industry that must manage, sell, or
distribute a product or service.
The five types of transactions are:
-
entering a new order (for, on average, ten different
stock items);
-
delivering orders;
-
posting a customer payment;
-
retrieving an order status report;
-
monitoring the inventory level of recently
ordered items.
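A driver for a mixed workload of this kind has to pick the next transaction type according to some mix and then convert a count of completed transactions into a per-minute rate. The Python sketch below shows both steps. The weights in `TRANSACTION_MIX` are assumptions chosen for illustration and are not quoted from the TPC-C specification, and the function names are likewise hypothetical.

```python
import random

# Illustrative weights only (they sum to 100); the real mix is laid down
# by the TPC-C specification.
TRANSACTION_MIX = {
    "new_order": 45,
    "payment": 43,
    "order_status": 4,
    "delivery": 4,
    "stock_level": 4,
}


def generate_mix(count, seed=0):
    """Return `count` transaction type names drawn according to the mix."""
    rng = random.Random(seed)  # seeded so that a run can be repeated
    names = list(TRANSACTION_MIX)
    weights = list(TRANSACTION_MIX.values())
    return rng.choices(names, weights=weights, k=count)


def transactions_per_minute(completed, elapsed_seconds):
    """Convert a count of completed transactions into a tpm figure."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed time must be positive")
    return completed * 60.0 / elapsed_seconds


if __name__ == "__main__":
    workload = generate_mix(1000)
    for name in TRANSACTION_MIX:
        print(f"{name}: {workload.count(name)}")
    print(f"600 transactions in 120 s = {transactions_per_minute(600, 120):.0f} tpm")
```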
TPC-H (Ad-hoc, Decision Support) benchmark
represents decision support environments where users don't know which queries
will be executed against a database system; hence the "ad-hoc" label.
Because the queries are not known in advance, no pre-knowledge of them can be built into
the DBMS system and the query execution times can be very long.
TPC-R (Business Reporting, Decision Support) benchmark represents
decision support environments where users run a standard set of queries
against a database system. In this environment, pre-knowledge of the queries
is taken for granted and the DBMS system can be optimized to run these
standard queries very rapidly.
TPC-W is a transactional web benchmark. The workload is performed
in a controlled internet commerce environment that simulates the activities
of a business oriented transactional web server. The workload exercises
a breadth of system components associated with such environments.
11 Benchmarks for IBM PC compatible
computers
In the area of PCs (except for a few top
end systems) the SPEC and TPC benchmarks are too large and complex. There
are, however, a number of IBM PC compatible specific benchmark utility
programs which are widely available and may be used when comparing alternative
systems, e.g. iCOMP.
See http://www.intel.com/procs/perf/resources/benchmark.htm.
12 Sources of Information on
the Internet
The USENET newsgroup comp.benchmarks (e.g.
using ftp open sunsite.doc.ic.ac.uk and look in directory \usenet\usenet-by-group\comp.lang.c++)
is a useful source of information including FAQs (Frequently Asked Questions).
A search with a WWW (World Wide Web) browser on 'Benchmarks', 'PC
benchmarks', 'SPEC', 'TPC', etc. will turn up lots of information and further
contacts. The comp.benchmarks newsgroup can also be reached from a WWW browser;
you can post questions and look at other users' questions and replies.
12.1 A Performance Database Server (Sill 1995)
A Performance Database Server is available
via the WWW (http://netlib.cs.utk.edu/performance/html/PDStop.html)
and can be used to extract current benchmark data and literature.
Questions and comments for PDS should be mailed to utpds@cs.utk.edu.
13 Conclusions
This paper reviewed a range of issues
critical in system performance evaluation. In particular it is important
to remember that 'standard' benchmarks should only be used as an initial
filter to identify those systems and configurations which appear to meet
the computational requirements of the application, i.e. CPU power, disk
speed, graphics performance, etc. All benchmarks, even those from SPEC
and TPC, are subject to vendor 'tweaking' and should not be used as an
absolute guide to performance. In particular, concentrating on raw processor
power can be very misleading in that the performance of end-user applications
also depends on other factors such as memory size, disk and network speed,
etc. For example, the School of Computing and Mathematical Sciences at
De Montfort University has a number of professional workstations including
Apollo DN5500s (based on the 25MHz Motorola MC68040) and HP9000/720s (based
on the PA-RISC processor). Although the processor performance of the HP9000/720
is rated at two to three times the Apollo DN5500 (57 Mips against 22 Mips)
in practice the School's HPs perform ten to twenty times better than the
Apollos when running large interactive systems under X windows and Motif.
Clearly factors other than CPU power are affecting performance, e.g. main
memory size (the Apollos are now underconfigured with only 8 Mbyte - the
maximum that can be physically fitted) and the high performance X window
graphics accelerator used on the HP machines.
The final phases of evaluating alternative
systems should be carried out using end-user applications with, if possible,
end-user participation.
References
Bramer, B, 1989, 'Selection of computer
systems to meet end-user requirements', IEEE Computer Aided Engineering
Journal, Vol. 6 No. 2, April, pp. 52-58.
Datamation, 1985, 'A measure of Transaction
Processing Power', Datamation, April 1.
Datamation, 1992, 'X terminals X-plor
New Territory', Datamation, Vol. 38, No. 19, September 15th, 1992, pp 108.
Rabbat, G, Furht, B & Kibler, R,
1988, 'Three-dimensional computer performance', IEEE Computer, Vol. 21
No. 7, July, pp 59-60.
Sill, D, 1995, 'Benchmarks FAQ version
0.6', March 26 1995. (available by ftp from USENET newsgroup comp.benchmarks)
Socarras, A E, Cooper, R S, & Stonecypher,
W F, 1991, 'Anatomy of an X terminal', IEEE Spectrum, Vol. 28 No. 3, March,
pp 52-55.
Stallings, W, 1987, 'Computer organisation
and architecture', Macmillan.
Tanenbaum, A S, 1990, 'Structured Computer
Organisation', Prentice-Hall.
Uniejewski, J, 1990, 'Characterising
system performance using application-level benchmarks', Proc. Buscom September
1990, pp 159-167, Partial publication in SPEC newsletter, Vol. 2 No. 3,
Summer 1990, pp 3-4.
Weicker, R P, 1990, 'An overview of
common benchmarks', IEEE Computer, Vol. 23 No. 12, December, pp 65-75.