peterd@june.cs.washington.edu (Peter C. Damron) (03/16/89)
In a recent discussion with a friend working on compilers I brought up these comments about comparing system X versus system Y. (the names have been changed to protect the innocent.) I don't want to open up a discussion about benchmarking or choosing benchmarks, but I am interested in any techniques for evaluating the "effectiveness" of a compiler or compiler + architecture. I wonder if anyone on the net has any further comments or suggestions. I see three possibilities for comparing the systems: 1. compiler effectiveness 2. architecture effectiveness 3. compiler + architecture effectiveness It is arguable which of these measurements is most important. It depends on your point of view. The end user probably cares most about number 3. Comparing compiler + architecture --------------------------------- Number 3 is the easiest to test and the one usually measured. The usual method is to collect a "suitable" set of benchmarks, compile and run them. The word suitable means the set which proves the point you want to make :-). If you want to be as fair as possible, you should choose benchmarks from a wide variety of application areas and they should be large programs. If you have a particular application in mind, use that as a benchmark. See below under benchmarks. Comparing architectures ----------------------- To measure number 2, architecture effectiveness, you must attempt to hold compiler technology constant. This is where using gcc or pcc or other retargetable compilers might help. You port the compiler to the architectures of interest, then you use some benchmark programs, generate code, and compare the output code. See below about benchmarks and comparing code. The problem with this approach is that the compiler technology you use to do the comparison can severely bias your results. If architecture A depends on very good register allocation to run efficiently, but the compiler you use has a bad register allocator, the results will be biased against architecture A. So, the best you can do is either compare architectures which require roughly the same compiler technology for effective use, or use the latest and greatest compiler technology in all aspects of the compiler. The latter is too difficult, so the former is probably the best you can do. Comparing compilers ------------------- To measure number 1, compiler effectiveness, you must attempt to hold the architecture constant. Again, a set of benchmarks is chosen and run through the compilers. The output code is then evaluated. See benchmarks and code comparison below. If the target architectures of the compilers differs, then trying to make any comparison is very difficult -- too many parameters. In comparing compilers, it is important to understand the different compiler phases and how they interact. The effectiveness of each phase can be measured by designing artificial benchmarks to exercise that phase. However, poor performance on a benchmark can easily be attributed to one phase, but be caused by a bad interaction between other phases of the compiler. Phases which might be of interest to test include: instruction selection, instruction scheduling, register allocation, local optimizations, and global optimizations. Benchmarks ---------- There are 3 kinds of benchmarks: A. synthetic (e.g. dhrystone, whetstone, livermore loops, etc.) B. actual applications C. abstractions from actual applications The most widely used are class A benchmarks, but they are only very rough measures of performance, since any real application will be different. Class B, actual applications, are what the user is actually interested in, but these can be hard to execute and measure accurately. Class C is a compromise. Take your application, figure out what a typical "run" might be (or a series of them), modify the application to do the typical run in an automated and measurable way. This is the most work, but probably gives the best measure of how well the system under measure will perform for your application. Of course, you may have several applications, so you can run more than one benchmark. It is probably a good idea to run many benchmarks anyway, to get an idea of what the variance is in relative performance. Note that compiler libraries, the run-time system, the operating system, and I/O may all effect the outcome of benchmarks. Comparing machine code ---------------------- There are many ways to compare machine code for different machines. Probably none of them are really adequate. First, there are two classes of measurements, static and dynamic. Static measurement attempts to quantify the quality of the code without regard to how many times each instruction might be executed. Static measurements are probably better for comparing compilers. Dynamic measurements attempt to quantify how fast the code will execute, taking into account how many times and in what order instructions are executed. Dynamic measurements are probably better for comparing architectures. Static measurements are typically taken with some sort of instruction counting program. You might measure number of instructions, cycles, memory references, instruction size, or some combination. Dynamic measurements are typically taken by executing the program on some hardware, though instruction counting or simulation is possible. If the code is to be executed, attention must be given to making sure the hardware configurations are equivalent or comparable. In measuring code, you may want to find out certain characteristics about the code, especially for comparing compilers. How effective is the register allocation? How many different instructions from the architecture were actually used? How many addressing modes were used? Which optimizations were performed? Dead code? Loop Invariants? etc. How effective was the scheduler in eliminating NOP's or filling branch delay slots? These questions relate directly back to various portions of the compiler. Some of these measurements are easy, but others are difficult. Also, these measurements are only an indirect measure of the related portion of the compiler, because they will depend on the program too. Summary ------- Any of these types of measurements are difficult to make fairly. There are many factors to consider and many parameters which cannot be held constant for the comparison. Generalizations are probably to be avoided, but some specific findings may hold. (e.g. compiler A does a better job at register allocation than compiler B, or compiler+architecture A is better at numerical analysis than compiler+architecture B, primarily because of the library routines) Just producing a set of numbers is not enough. Careful analysis of the results is necessary to avoid unintended bias and to fully understand why the numbers came out that way. Plug ---- The following technical reports compare various instruction selection techniques (i.e. comparing parts of compilers). Of course, these comparisons used a very controlled experimental testbed. Comparisons with production compilers will be much harder. Robert R. Henry and Peter C. Damron, "Performance of Table-Driven Code Generators Using Tree-Pattern Matching", Technical Report 89-02-02, Dept. of Computer Science, Univ. of Washington, Feb. 1989. Robert R. Henry and Peter C. Damron, "Algorithms for Table-Driven Code Generators Using Tree-Pattern Matching", Technical Report 89-02-03, Dept. of Computer Science, Univ. of Washington, Feb. 1989. Peter C. Damron Dept. of Computer Science, FR-35 University of Washington Seattle, WA 98195 peterd@cs.washington.edu {ucbvax,decvax,etc.}!uw-beaver!uw-june!peterd -- Send compilers articles to ima!compilers or, in a pinch, to Levine@YALE.EDU Plausible paths are { decvax | harvard | yale | bbn}!ima Please send responses to the originator of the message -- I cannot forward mail accidentally sent back to compilers. Meta-mail to ima!compilers-request
wsmith@m.cs.uiuc.edu (Bill Smith) (03/17/89)
>I see three possibilities for comparing [compilers and systems: > >1. compiler effectiveness >2. architecture effectiveness >3. compiler + architecture effectiveness If you are a software developer, there are at least three additional criteria for comparing systems: 4. Compile-time diagnostic messages and helpfulness 5. Run-time debugging support 6. Support for software engineering These can be then factored into the 3 possibilities mentioned above: to compare the effectiveness of the compiler, OS or the two together. Speed isn't everything, but it is the easiest to measure. Bill Smith wsmith@cs.uiuc.edu uiucdcs!wsmith -- Send compilers articles to ima!compilers or, in a pinch, to Levine@YALE.EDU Plausible paths are { decvax | harvard | yale | bbn}!ima Please send responses to the originator of the message -- I cannot forward mail accidentally sent back to compilers. Meta-mail to ima!compilers-request