System benchmarks

Brian Bramer, De Montfort University, UK (bb@dmu.ac.uk)

 
 

Contents

1 Introduction

2 Benchmark requirements

3 Important considerations when comparing benchmark results

4 MIPS (millions of instructions per second)

5 Whetstone benchmark

6 Dhrystone benchmark

7 Linpack benchmark

8 SPEC (Standard Performance Evaluation Corporation) benchmarks

9 SSBA (Synthetic Suite of Benchmarks from the AFUU)

10 TPC (Transaction Processing Performance Council) benchmarks

11 Benchmarks for IBM PC compatible computers

12 Sources of Information on the Internet

13 Conclusions

References
 
 

1 Introduction

When selecting a computer system to satisfy an end-user requirement a formal procedure should be followed, for example (Bramer 1989; http://www.cse.dmu.ac.uk/~bb/Teaching//ComputerSystems/ComputerSelection/ComputerSelection.html):

  1. Carry out a feasibility study. The need for a new installation or the possibility of upgrading an existing system is analysed to determine cost effectiveness in terms of end-user requirements and advantages gained, e.g. increased productivity of skilled staff, reduced product development times, a more viable product, etc. The result of the feasibility study will be a report to be submitted to senior management to request funds to implement the proposed system.
  2. Draw up a detailed requirements specification. Once a budget has been agreed a requirements specification is drawn up which describes in detail the facilities of the proposed system and the acceptance criteria.
  3. Draw up an ITT (invitation to tender). From the requirements specification information is extracted which specifies system facilities which are to be met by outside organisations (vendors of hardware and software). This document is sent to a range of suitable system vendors.
  4. Receive tender documents from vendors. The vendors describe their proposals for a system which will satisfy the ITT. For a small system this will be a simple quotation.
  5. Draw up a shortlist. Using information in the tender documents a shortlist is formed which will be based on system facilities, delivery times, maintenance offered, price, etc.
  6. Finalise contract to purchase system. For a small microcomputer system an order may immediately be issued. In the case of a more complex system details may have to be discussed with vendors on the shortlist and field evaluations carried out.
  7. Order system.
  8. Install system and carry out acceptance tests. In the case of a small system it will arrive, be installed and, if working, put into immediate use. In the case of a large system acceptance tests will be carried out to ensure that it meets the requirements specification.
  9. Training. This may be in-house or offered by the system vendor as part of the overall deal.
  10. Systems maintenance. Daily/weekly maintenance of system performed by in-house staff, e.g. disk backup, or hardware and software maintenance performed by the system vendor or a specialist maintenance organisation.
At various points during the above process benchmark tests will be used to evaluate alternative computer configurations to ensure that they meet the requirements generated during the feasibility study. Benchmark tests may take two forms: tests based on the end-user's own tasks, and 'standard' benchmark programs. In the final analysis benchmarks based on end-user tasks are the only real test of a system and will form the basis of acceptance tests carried out during system installation (to ensure that the requirements specification is satisfied). However, it is not always possible to carry out benchmarks based on end-user tasks during the early stages of the selection process (e.g. when vendors are demonstrating their products):
  1. the end-user packages may not be readily available on all the configurations due to software licence problems or the sheer logistics of mounting a large program on every configuration;
  2. getting representative end-user data sets onto every configuration may be very difficult;
  3. the end-user requirements may be too broad to allow the assessment of every task on every possible configuration, e.g. in an educational environment where a laboratory may be used for a wide spectrum of teaching tasks.
In such circumstances 'standard' benchmark programs may be used as an initial filter by assessing common performance parameters thus enabling the early weeding out of unsuitable configurations. This paper discusses the important factors which determine system performance and the standard benchmarks used to assess these.
 
 

2 Benchmark requirements

When purchasing computer systems to satisfy an end-user application (Bramer 1989) it is important to be able to compare systems not only in terms of cost but also in terms of performance (processor, disk, network, etc). Although in the final analysis systems should be evaluated using end-user applications software and data sets (operated by end-users if possible) it is useful as an initial filter to compare systems using 'standard' benchmark programs (Sill 1995). Such 'standard' benchmark programs fall into two categories:

In general, the best benchmark (Weicker 1990):
  1. is written in a high-level language, making it portable across different machines;
  2. is representative of some programming style or application (e.g. systems programming, numerical programming, commercial programming);
  3. can be measured easily;
  4. has wide distribution.
Today benchmark systems are often made up of a large number of tests (e.g. Power Meter for IBM/PC compatibles tests the CPU, disk, video, etc.) or suites of programs (e.g. SPEC) to provide a more thorough large scale test of a computer system.
 
 

3 Important considerations when comparing benchmark results

Manufacturers often quote the results of 'standard' benchmark programs in their sales literature and it is important to be aware of:

  1. The precise hardware configuration used in terms of processor model, clock speed, number of CPUs, memory size, cache size, video processor and memory, bus, disk speed, disk cache, etc. A cache memory can have a major effect on the performance of programs with large numbers of small loops and/or subroutines. Even when carrying out benchmark tests in person (e.g. when shortlisting tenders) one must take care to check the configuration being tested. Vendors have been known to tender one system and loan a different (more powerful) system during benchmark tests (e.g. with a faster processor, larger cache memory, larger main memory, better quality screen, etc).
  2. The operating system environment, e.g. OS version, filesystem, software disk caches and compressors enabled, number of concurrent users, etc.
  3. The version of the benchmark being used, e.g. although the current Dhrystone version is 2.1 some manufacturers still quote the results obtained (if better) using version 1.1.
  4. The language used. The same program implemented in different languages may yield different execution times dependent upon the precise techniques used to store variables, call subroutines, access arrays, etc.
  5. The compiler used and the level of optimisation during compilation.
  6. The library used. For example, the Whetstone floating point benchmark spends 40 to 50% of execution time in mathematical subroutines and the library used for the test can alter results significantly: some systems have two versions of the floating point subroutines, one which complies with the IEEE floating-point standard and a second which is faster but may give less accurate results. It is also worth noting that manufacturers are not above 'tweaking' compilers and subroutine libraries to enhance the results of 'standard' benchmarks.
Take care: the effects of different hardware configurations can often be disguised by using a different version of the operating system. If running benchmarks on machines of the same processor type it is wise to use the same software configuration, e.g. when testing IBM/PC compatibles boot a 'standard' version of MS-DOS off a floppy disk.
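
One practical way to act on the checklist above is to record the test environment alongside every timing, so that a quoted figure can always be traced back to its configuration. The following is a minimal sketch in Python using only the standard library. The function names describe_environment and timed_run and the toy workload are illustrative assumptions, not part of any standard benchmark, and details such as cache size or disk type still have to be noted by hand.

```python
import os
import platform
import time


def describe_environment():
    """Collect configuration details that should accompany a benchmark result."""
    return {
        "machine": platform.machine(),            # processor architecture
        "processor": platform.processor(),        # may be empty on some systems
        "cpu_count": os.cpu_count(),              # number of logical CPUs
        "os": platform.system(),                  # operating system name
        "os_release": platform.release(),         # operating system version
        "language": platform.python_implementation(),
        "language_version": platform.python_version(),
        "compiler": platform.python_compiler(),   # compiler that built the interpreter
    }


def timed_run(workload, repeats=3):
    """Run the workload several times and return the best elapsed time in seconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        workload()
        best = min(best, time.perf_counter() - start)
    return best


def toy_workload():
    # Hypothetical stand-in for a real benchmark program.
    return sum(i * i for i in range(100_000))


if __name__ == "__main__":
    report = describe_environment()
    report["best_seconds"] = timed_run(toy_workload)
    for key, value in report.items():
        print(f"{key}: {value}")
```

Keeping the environment report and the timing in one record makes it harder to compare two figures that were obtained under different conditions without noticing.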
 
 
 

4 MIPS (millions of instructions per second)

Since the evolution of RISC machines the literal use of MIPS in terms of the execution of processor instructions per second has lost all meaning (Weicker 1990), and the term is sometimes read as an acronym for Meaningless Indicator of Processor Speed. Consider for example the following statement in a high level language (e.g. Pascal) where A, B and C are integer operands in main memory:
        A := B + C
A mainframe CISC machine capable of memory to memory arithmetic operations upon three operands could use a single instruction:
        ADD B,C,A               add B+C return result in A (all memory operands)
Microprocessors such as the Motorola MC68000 family (and Intel 8086) are more limited in that instructions may operate only on two operands and one of the operands of an ADD instruction must be in a processor register:
        MOVE.W B,D0             load memory operand B into data register D0
        ADD.W C,D0              add memory operand C to data register D0
        MOVE.W D0,A             return result into memory operand A
The only memory operations of a RISC machine are LOAD and STORE so the code would appear:
        LOAD B,reg1             load memory operand B into register 1
        LOAD C,reg2             load memory operand C into register 2
        ADD reg1,reg2,reg3      add B and C return result in register 3
        STORE reg3,A            return result to memory operand A
Assuming that the machines execute the code in the same time and the mainframe was rated at 1 MIPS, the MC68000 would then be rated at 3 MIPS and the RISC machine at 4 MIPS. Because of this problem the term MIPS has often been redefined as 'VAX MIPS', where the performance is given relative to a VAX 11/780 which was generally 'rated' at 1 MIPS (the figure depends upon the compiler used, see Weicker 1990). Even when VAX MIPS are quoted it is important to know what programs form the basis of comparison and what compilers were used on the VAX 11/780.
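
The arithmetic behind these ratings can be sketched in a few lines of Python. The instruction counts are taken from the three code sequences above; the one-microsecond execution time is simply the assumption that makes the mainframe come out at 1 MIPS, and the names used are illustrative.

```python
# Instruction counts for A := B + C, taken from the three code sequences above.
INSTRUCTIONS = {
    "mainframe CISC": 1,   # ADD B,C,A
    "MC68000": 3,          # MOVE.W, ADD.W, MOVE.W
    "RISC": 4,             # LOAD, LOAD, ADD, STORE
}

# Assumption from the text: every machine completes the statement in the same
# time, chosen here as one microsecond so that the mainframe rates at 1 MIPS.
STATEMENT_SECONDS = 1e-6


def mips_rating(instruction_count, seconds):
    """Literal MIPS: instructions executed per second, in millions."""
    return instruction_count / seconds / 1e6


ratings = {
    name: mips_rating(count, STATEMENT_SECONDS)
    for name, count in INSTRUCTIONS.items()
}

if __name__ == "__main__":
    for name, rating in ratings.items():
        print(f"{name}: {rating:.0f} MIPS for identical useful work")
```

All three machines do the same useful work in the same time, yet the literal MIPS figure differs by a factor of four, which is why the rating says little without a common reference workload.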
 
 

5 Whetstone benchmark

The Whetstone benchmark was the first program written specifically to measure computer performance and was designed to simulate floating point numerical applications:

  1. it contains a large percentage of floating point data and instructions;
  2. a high percentage of execution time (approximately 50%) is spent in mathematical library functions;
  3. the majority of its variables are global and the test will not show up the advantages of architectures such as RISC where the large number of processor registers enhance the handling of local variables;
  4. Whetstone contains a number of very tight loops and the use of even fairly small instruction caches will enhance performance considerably;
  5. the original program was written in Fortran using single or double precision calculations.
The source of the benchmark may be obtained by ftp from netlib.att.com in directory /netlib/benchmark/whetstone*.
 
 

6 Dhrystone benchmark

The Dhrystone benchmark was designed to test performance factors important in non numeric systems programming (operating systems, compilers, wordprocessors, etc.):

  1. it contains no floating point operations;
  2. a considerable percentage of time is spent in string functions, making the test very dependent upon the way such operations are performed (e.g. by in-line code, routines written in assembly language, etc.) and susceptible to manufacturers' 'tweaking' of critical routines;
  3. it contains hardly any tight loops so in the case of very small caches the majority of instruction accesses will be misses; however, the situation changes radically as soon as the cache reaches a critical size and can hold the main measurement loop;
  4. only a small amount of global data is manipulated (as opposed to Whetstone).
There are two versions of the Dhrystone benchmark. Version 1.1 contained some 'dead code' which could be removed by optimising compilers. Version 2.1 corrected this and should be the version used in practice. Some manufacturers, however, still quote the (better) results of Version 1.1 so care must be taken when comparing Dhrystone performance figures to check which version was used.

The source of the benchmark may be obtained by ftp from ftp.nosc.mil in directory pub/aburto.
 
 

7 Linpack benchmark

The Linpack benchmark was derived from a real application which originated as a collection of linear algebra subroutines in Fortran. As one would expect it tests floating point performance and results are presented in Mflops (millions of floating point operations per second):

  1. it has a large percentage of floating point operations (note that division is not used);
  2. it uses no mathematical functions (in contrast to Whetstone);
  3. there are no global variables; operations being carried out on local variables or an array passed to subroutines as a parameter;
  4. it operates on a two-dimensional array and when comparing results care must be taken to ensure that the same array size was used;
  5. results are for single or double precision operations (which should be specified);
  6. a large percentage (over 70%) of the execution time is spent within a single function where even a small instruction cache can alter results considerably;
  7. the benchmark relies heavily on a package of basic linear algebra subroutines (BLAS) which should be coded in Fortran (as in the original): some vendors present results where the subroutines have been rewritten in assembly language which can make a considerable difference.
The source of the benchmark may be obtained by ftp from netlib.att.com in directory /netlib/benchmark/linpack*.

A version of the Linpack floating point program converted to C can be obtained from ftp.nosc.mil directory /pub/aburto; source in clinpack.c and results in clinpack.dpr, clinpack.dpu, clinpack.spr, and clinpack.spu.
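
To make the Mflops figure and its dependence on array size concrete, the sketch below times the solution of an n by n system of linear equations and converts the elapsed time to Mflops using the operation count conventionally quoted for Linpack, 2n^3/3 + 2n^2. It is an illustration written in plain Python, not the Linpack source: the routine names, the random test matrix and the default size of 100 are assumptions made for the example, and an interpreted language will report far lower figures than compiled Fortran or C.

```python
import random
import time


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting (in place)."""
    n = len(b)
    for k in range(n):
        # Choose the row with the largest pivot to limit rounding error.
        p = max(range(k, n), key=lambda i: abs(a[i][k]))
        a[k], a[p] = a[p], a[k]
        b[k], b[p] = b[p], b[k]
        inverse_pivot = 1.0 / a[k][k]   # one division per column
        row_k = a[k]
        for i in range(k + 1, n):
            row_i = a[i]
            m = row_i[k] * inverse_pivot
            for j in range(k, n):
                row_i[j] -= m * row_k[j]
            b[i] -= m * b[k]
    # Back substitution.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(a[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (b[i] - s) / a[i][i]
    return x


def linpack_mflops(n, seconds):
    """Mflops from the conventional operation count 2n^3/3 + 2n^2."""
    operations = 2.0 * n**3 / 3.0 + 2.0 * n**2
    return operations / seconds / 1e6


def run(n=100, seed=1):
    """Time one solve of a random system whose exact solution is all ones."""
    rng = random.Random(seed)
    a = [[rng.uniform(-1.0, 1.0) for _ in range(n)] for _ in range(n)]
    b = [sum(row) for row in a]
    start = time.perf_counter()
    x = solve(a, b)
    elapsed = time.perf_counter() - start
    worst_error = max(abs(xi - 1.0) for xi in x)
    return linpack_mflops(n, elapsed), worst_error


if __name__ == "__main__":
    mflops, error = run()
    print(f"n=100: {mflops:.2f} Mflops, largest error {error:.2e}")
```

Because the operation count grows with the cube of n while the data grows with its square, changing the array size changes how well the problem fits in cache, which is one reason results for different sizes are not comparable.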
 
 
 
 

8 SPEC (Standard Performance Evaluation Corporation) benchmarks http://www.spec.org/

It has been recognised for some time that benchmarks such as Whetstone and Dhrystone, which are small programs in today's terms, are inadequate when attempting to evaluate the performance of high powered computer systems running modern large scale software (e.g. CAD design systems, CASE software engineering tools, large database environments, AI programming environments, etc). For example:

SPEC (the Standard Performance Evaluation Corporation) was initially formed by the manufacturers of professional workstations: Apollo, Hewlett-Packard, MIPS and Sun (many other manufacturers have since become members). Its aim was to "establish, maintain and endorse a standardized set of relevant benchmarks that can be applied to the newest generation of high-performance computers" (quoted from SPEC's bylaws). In October 1989 SPEC released its first set of ten benchmark programs (SPEC release 1, see Uniejewski 1990 for full details), supplied under licence on magnetic tape and comprising approximately 150,000 lines of code.
 

Currently SPEC covers three groups, each with its own benchmarks:

  Open Systems Group (OSG)
    Component- and system-level benchmarks in a UNIX / NT / VMS environment.
  High Performance Group (HPG)
    Benchmarking in a numeric computing environment, with emphasis on high-performance numeric computing.
  Graphics Performance Characterization Group (GPCG)
    Benchmarks for graphical subsystems, OpenGL and X Windows.
 
 

Open Systems Group Current Benchmarks

  SPEC CPU2000
    The current release of SPEC's popular processor performance tests; the successor to SPEC CPU95.
  SPEC JBB2000
    A Java Business Benchmark - SPEC's first server-side Java benchmark emulating middle tier business logic.
  SPEC JVM98
    Benchmark suite for comparing Java virtual machine (JVM) client platforms.
  SPEC SFS 2.0
    "System: File Server", a test of NFS server performance.
  SPEC WEB99
    A standardized performance test for WWW servers, successor of SPECweb96.

SPEC CPU2000


SPEC CPU2000 is the next-generation industry-standardized CPU-intensive benchmark suite. SPEC designed CPU2000 to provide a comparative measure of  compute intensive performance across the widest practical range of hardware. The implementation resulted in source code benchmarks developed from real user applications. These benchmarks measure the performance of the processor, memory and compiler on the tested system.
 

SPEC CPU2000 comprises two sets (or suites) of benchmarks: CINT2000 for measuring compute-intensive integer performance, and CFP2000 for compute-intensive floating point performance. The two suites measure the performance of a computer's processor, memory architecture and compiler. Improvements to the new suites include longer run times and larger problems for benchmarks, more application diversity, greater ease of use, and standard development platforms that will allow SPEC to produce additional releases for other operating systems.

Results
SPEC CPU2000 provides performance measurements for system speed and throughput. The speed metrics (e.g. SPECint2000) measure how fast a machine completes running all of the benchmarks in a suite. The throughput metrics (e.g. SPECint_rate2000) measure how many tasks a computer can complete in a given amount of time. SPEC CPU2000 has been designed to measure throughput for single-processor, symmetric-multiprocessor, and cluster systems.
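
As generally described, a SPEC speed metric is formed by dividing a reference machine's run time for each benchmark by the measured run time, and then taking the geometric mean of those ratios, scaled so that the reference machine itself scores 100. The sketch below follows that description; the scale factor, the benchmark labels and all run times are assumptions invented for illustration and are not SPEC data.

```python
import math


def spec_ratio(reference_seconds, measured_seconds, scale=100.0):
    """Ratio for one benchmark: larger means faster than the reference machine."""
    return scale * reference_seconds / measured_seconds


def speed_metric(times):
    """Geometric mean of the per-benchmark ratios.

    times maps a benchmark label to (reference_seconds, measured_seconds).
    """
    ratios = [spec_ratio(ref, measured) for ref, measured in times.values()]
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


if __name__ == "__main__":
    # Hypothetical run times in seconds; not taken from any published result.
    hypothetical_times = {
        "bench_a": (1400.0, 700.0),
        "bench_b": (1800.0, 600.0),
        "bench_c": (1100.0, 1100.0),
    }
    print(f"speed metric: {speed_metric(hypothetical_times):.1f}")
```

A geometric mean is used rather than an arithmetic mean so that no single benchmark with an unusually large ratio dominates the overall figure.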

The CINT2000 suite comprises 12 application-based benchmarks written in C and C++.
The CFP2000 floating-point suite comprises 14 CPU-intensive benchmarks written in FORTRAN (77 and 90) and C.

Cost (November 2000)
SPEC CPU2000 (CINT2000 and CFP2000) is available now on CD-ROM from SPEC. The cost is $500 for new customers and $250 for current licensees. Universities can acquire the product for $125.
 

9 SSBA (Synthetic Suite of Benchmarks from the AFUU)

The SSBA is the result of the studies of the AFUU (French Association of Unix Users) Benchmark Working Group. The group set itself the goal of studying the problem of assessing the performance of data processing systems: collecting as many as possible of the tests available throughout the world, dissecting the codes and results, discussing their utility, fixing versions and supplying them with various comments and procedures.

It is claimed to be a simple and coherent tool for end users and specialists, providing a clear and pertinent initial approximation of the performance of UNIX systems.
 
 

10 TPC (Transaction Processing Performance Council) benchmarks http://www.tpc.org

In general the SPEC benchmarks are suitable for comparing the performance of high performance workstations and nodal processors for specialist scientific or engineering numeric processing. The TPC (Transaction Processing Performance Council) is a consortium (HP, IBM, DEC, NEC, Fujitsu, Hitachi, Bull, etc.) working on industry wide benchmark standards applicable to large multi-user transaction processing systems.

The Debit-Credit benchmark was designed in the 1970s to test the performance of computer systems intended to run a fledgling on-line teller network at the Bank of America. It was proposed as an industrial standard by a group of industrial and academic professionals in the paper 'A measure of Transaction Processing Power' (Datamation 1985).

In essence the Debit-Credit benchmark represents the transaction processing load of a hypothetical bank with one or more branches and multiple tellers. Transactions consist of debits and credits to customers' accounts, with the system keeping track of each customer's account, the balance of each teller and branch, and a history of the bank's recent transactions. The major problem was that the proposed Debit-Credit benchmark was not a real industrial standard: certain aspects of the benchmark were very loosely defined whereas others were very specific and not readily portable across systems. Vendors therefore took liberties and redefined the benchmark to match the strengths of their particular system configurations.

TPC-A (TPC's Debit-Credit Benchmark). Obsolete as of 6/6/95

In November 1989 TPC published its first benchmark, TPC benchmark A (TPC-A), which tests the fundamental components of an on-line transaction processing system (interactive terminal I/O, disk I/O and database access) by simulating a hypothetical bank with a single transaction type (debit/credit). TPC-A is based on a single, simple, update-intensive transaction which performs three updates and one insert across four tables. Transactions originate from terminals, with a requirement of 100 bytes in and 200 bytes out. There is a fixed scaling between tps rate, terminals, and database size. TPC-A requires an external RTE (remote terminal emulator) to drive the SUT (system under test).
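
The shape of the debit/credit transaction (three updates and one insert across four tables) can be sketched with Python's standard sqlite3 module. This is an illustration only, not the TPC-A specification: the table and column names, the tiny table sizes and the helper functions create_bank and debit_credit are all assumptions, and the sketch has no terminals, remote terminal emulator or scaling rules.

```python
import sqlite3


def create_bank(conn, branches=1, tellers=10, accounts=100):
    """Create the four tables and populate them with zero balances."""
    conn.executescript("""
        CREATE TABLE branch  (id INTEGER PRIMARY KEY, balance INTEGER NOT NULL);
        CREATE TABLE teller  (id INTEGER PRIMARY KEY, branch_id INTEGER,
                              balance INTEGER NOT NULL);
        CREATE TABLE account (id INTEGER PRIMARY KEY, branch_id INTEGER,
                              balance INTEGER NOT NULL);
        CREATE TABLE history (account_id INTEGER, teller_id INTEGER,
                              branch_id INTEGER, delta INTEGER);
    """)
    conn.executemany("INSERT INTO branch VALUES (?, 0)",
                     [(i,) for i in range(branches)])
    conn.executemany("INSERT INTO teller VALUES (?, ?, 0)",
                     [(i, i % branches) for i in range(tellers)])
    conn.executemany("INSERT INTO account VALUES (?, ?, 0)",
                     [(i, i % branches) for i in range(accounts)])
    conn.commit()


def debit_credit(conn, account_id, teller_id, branch_id, delta):
    """One transaction: three updates and one insert, committed atomically."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute("UPDATE account SET balance = balance + ? WHERE id = ?",
                     (delta, account_id))
        conn.execute("UPDATE teller SET balance = balance + ? WHERE id = ?",
                     (delta, teller_id))
        conn.execute("UPDATE branch SET balance = balance + ? WHERE id = ?",
                     (delta, branch_id))
        conn.execute("INSERT INTO history VALUES (?, ?, ?, ?)",
                     (account_id, teller_id, branch_id, delta))
        row = conn.execute("SELECT balance FROM account WHERE id = ?",
                           (account_id,)).fetchone()
    return row[0]  # new account balance, as returned to the teller


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    create_bank(connection)
    print(debit_credit(connection, account_id=5, teller_id=2, branch_id=0, delta=100))
```

Timing a long run of such transactions and dividing the count by the elapsed time gives a transactions-per-second figure, although one obtained this way is in no sense a TPC-A result.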
 

TPC-B (TPC's Database Subsystem Benchmark) Obsolete as of 6/6/95

TPC-B is a more specific test of the database subsystem of a transaction system (TPC-A being a general on-line transaction processing system performance benchmark). TPC-B uses the same transaction profile and database schema as TPC-A, but eliminates the terminals and reduces the amount of disk capacity which must be priced with the system. TPC-B is significantly easier to run because an RTE is not required. In effect, TPC-B is a stress test of the database system's ability to handle transaction processing and sets out exact rules by which vendors execute and report test results. TPC-B, like TPC-A, is not intended to be representative of a complex modern transaction processing environment, but to be used as an initial filter when comparing different system configurations.

TPC-C (TPC's order processing benchmark)

Approved in July of 1992, TPC Benchmark C is like TPC-A in that it is an on-line transaction processing (OLTP) benchmark. However, TPC-C is more complex than TPC-A because of its multiple transaction types, more complex database and overall execution structure. TPC-C involves a mix of five concurrent transactions of different types and complexity either executed on-line or queued for deferred execution. The database is comprised of nine types of records with a wide range of record and population sizes. TPC-C is measured in transactions per minute (tpm).

TPC-C simulates a complete computing environment where a population of terminal operators executes transactions against a database. The benchmark is centered around the principal activities (transactions) of an order-entry environment. These transactions include entering and delivering orders, recording payments, checking the status of orders, and monitoring the level of stock at the warehouses. While the benchmark portrays the activity of a wholesale supplier, TPC-C is not limited to the activity of any particular business segment, but rather represents any industry that must manage, sell, or distribute a product or service.

The five types of transaction are:

  1. entering a new order (for on average ten different stock items);
  2. delivering orders;
  3. posting a customer payment;
  4. retrieving an order status report;
  5. monitoring the inventory level of recently ordered items.
TPC-H (Ad-hoc, Decision Support) benchmark represents decision support environments where users don't know which queries will be executed against a database system; hence the "ad-hoc" label. Because the queries are ad hoc, no pre-knowledge of them can be built into the DBMS and the query execution times can be very long.

TPC-R (Business Reporting, Decision Support) benchmark represents decision support environments where users run a standard set of queries against a database system. In this environment, pre-knowledge of the queries is taken for granted and the DBMS can be optimized to run these standard queries very rapidly.

TPC-W is a transactional web benchmark. The workload is performed in a controlled internet commerce environment that simulates the activities of a business oriented transactional web server. The workload exercises a breadth of system components associated with such environments.
 
 
 

11 Benchmarks for IBM PC compatible computers

In the area of PCs (except for a few top end systems) the SPEC and TPC benchmarks are too large and complex. There are, however, a number of IBM PC compatible specific benchmark utility programs which are widely available and may be used when comparing alternative systems, e.g. iCOMP (see http://www.intel.com/procs/perf/resources/benchmark.htm).
 
 

12 Sources of Information on the Internet

The USENET newsgroup comp.benchmarks (e.g. using ftp, open sunsite.doc.ic.ac.uk and look in directory \usenet\usenet-by-group\comp.lang.c++) is a useful source of information including FAQs (Frequently Asked Questions). A search on 'Benchmarks', 'PC benchmarks', 'SPEC', 'TPC', etc. using a WWW (World Wide Web) browser will turn up lots of information and further contacts. The comp.benchmarks newsgroup can also be reached from a WWW browser; you can post questions and look at other users' questions and replies.

12.1 A Performance Database Server (Sill 1995)

A Performance Database Server, which can be used to extract current benchmark data and literature, is available via the WWW from http://netlib.cs.utk.edu/performance/html/PDStop.html. Questions and comments for PDS should be mailed to utpds@cs.utk.edu.
 

13 Conclusions

This paper has reviewed a range of issues critical in system performance evaluation. In particular it is important to remember that 'standard' benchmarks should only be used as an initial filter to identify those systems and configurations which appear to meet the computational requirements of the application, i.e. CPU power, disk speed, graphics performance, etc. All benchmarks, even those from SPEC and TPC, are subject to vendor 'tweaking' and should not be used as an absolute guide to performance. In particular, concentrating on raw processor power can be very misleading in that the performance of end-user applications also depends on other factors such as memory size, disk and network speed, etc. For example, the School of Computing and Mathematical Sciences at De Montfort University has a number of professional workstations including Apollo DN5500s (based on the 25MHz Motorola MC68040) and HP9000/720s (based on the PA-RISC processor). Although the processor performance of the HP9000/720 is rated at two to three times that of the Apollo DN5500 (57 MIPS against 22 MIPS), in practice the School's HPs perform ten to twenty times better than the Apollos when running large interactive systems under X windows and Motif. Clearly factors other than CPU power are affecting performance, e.g. main memory size (the Apollos are now underconfigured with only 8 Mbyte - the maximum that can be physically fitted) and the high performance X window graphics accelerator used on the HP machines.

The final phases of evaluating alternative systems should be carried out using end-user applications with, if possible, end-user participation.

References

Bramer, B, 1989, 'Selection of computer systems to meet end-user requirements', IEEE Computer Aided Engineering Journal, Vol. 6 No. 2, April, pp. 52-58.

Datamation, 1985, 'A measure of Transaction Processing Power', Datamation, April 1.

Datamation, 1992, 'X terminals X-plor New Territory', Datamation, Vol. 38, No. 19, September 15th, 1992, pp 108.

Rabbat, G, Furht, B & Kibler, R, 1988, 'Three-dimensional computer performance', IEEE Computer, Vol. 21 No. 7, July, pp 59-60.

Sill, D, 1995, 'Benchmarks FAQ version 0.6', March 26 1995 (available by ftp from USENET newsgroup comp.benchmarks).

Socarras, A E, Cooper, R S, & Stonecypher, W F, 1991, 'Anatomy of an X terminal', IEEE Spectrum, Vol. 28 No. 3, March, pp 52-55.

Stallings, W, 1987, 'Computer organisation and architecture', Macmillan.

Tanenbaum, A S, 1990, 'Structured Computer Organisation', Prentice-Hall.

Uniejewski, J, 1990, 'Characterising system performance using application-level benchmarks', Proc. Buscom September 1990, pp 159-167, Partial publication in SPEC newsletter, Vol. 2 No. 3, Summer 1990, pp 3-4.

Weicker, R P, 1990, 'An overview of common benchmarks', IEEE Computer, Vol. 23 No. 12, December, pp 65-75.
 
 
