mash@mips.COM (John Mashey) (06/12/90)
In article <8019@mirsa.inria.fr> jlf@mirsa.inria.fr (Jean-Louis Faraut) writes:
>Therefore, we are looking at new RISC technology and how big CPU
>manufacturers announced performances are is a matter of wondering for
>me :-?
>I try it with different commands to test CPU, I/O etc ... but for the
>sake of brevity, I'll only present here results obtained from a
>slightly modified version of the famous Bill Joy's test program, where
>RISC are supposed to be better than CISC .
>
>Here is my version of the Joy's program :
>========================================
>#!/bin/sh
>echo 99k65536vvvvvp8op | dc
>========================================

Well, unfortunately, there is an unfortunate bug in this benchmark, in
that the behavior of this program in no way resembles most code
typically run on general-purpose UNIX systems, and you absolutely do
NOT want to use it to help choose computers unless your workload
happens to be multiple-precision arithmetic doing lots of multiplies
and divides.

Specifically, most RISC designers, after studying many programs,
decided that integer multiply (and especially divide) were used less
frequently than many other operations, and there is substantial data
from many vendors that backs this up.  RISC designers, depending on
the benchmarks used and the amount of silicon available, allocated
various amounts of silicon to support these operations, from zero up.
The SPARC designers included a Multiply-Step, and no Divide-Step
(i.e., divides are done by a fair-sized hunk of code); HP PA included
M-S and D-S; MIPS & 88K included both integer multiply & divide in
hardware, etc.  However, for example, a typical integer divide on a
MIPS takes about 35 cycles ... and probably about the same on a
typical CISC.

IF YOU WANT TO PROVE THAT A RISC IS NOT VERY MUCH BETTER THAN A CISC,
OR EVEN WORSE, AT CPU PERFORMANCE: use a program consisting of integer
divides of 2 variables that the optimizers can't get rid of (a sketch
of such a loop appears below).
a) This will show a MIPS or 88K at their least advantage.
b) This will prove that a SPARC is the slowest thing in existence
   (well, almost).
While we (MIPS) thought divide was good to have, I must defend the
SPARCers as not being irrational in leaving it out, given the
constraints of the first implementation.

Attached to the end of this is a brief sample of the MIPS
prof/pixstats output, which shows that on a MIPS-based machine (i.e.,
including a DEC 5810), multiply/divide accounts for 1/3 of the cycles
in this benchmark.  This is NOT as bad as the even more classic
"2^4096" dc benchmark, but it's still very high.

(I tried email to jlf, but it bounced for some reason.)  jlf: if you
want some solid CPU (single-user) data on realistic benchmarks, call
the MIPS Paris office (33-1-42-04-0311) and ask for "SPEC Data Helps
Normalize Vendor Mips-Ratings for Sensible Comparison OR Your Mileage
May Vary, But If Your Car Were A Computer, It Would Vary More", Issue
2.0, May 1990.  This is a giant analysis of all the published SPECmark
data, and includes plenty of RISCs & CISCs.....  It doesn't tell you
about multi-user stuff...
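A minimal sketch of the divide-bound loop just described -- the
iteration count and names are arbitrary, and reading the operands from
the command line is simply one way to keep an optimizer from folding
the divide away:

========================================
/* divide-bound microbenchmark (illustrative sketch) */
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    /* take the operands from argv so the compiler cannot
       constant-fold or hoist the divide out of the loop */
    unsigned long a = (argc > 1) ? strtoul(argv[1], 0, 10) : 1234567;
    unsigned long b = (argc > 2) ? strtoul(argv[2], 0, 10) : 89;
    unsigned long r = 0;
    long i;

    if (b == 0)
        b = 1;                  /* avoid a divide fault */
    for (i = 0; i < 1000000; i++) {
        r += a / b;             /* the integer divide dominates */
        a ^= r;                 /* keep the dividend changing */
    }
    printf("%lu\n", r);         /* use the result: not dead code */
    return 0;
}
========================================

On a machine with hardware divide this is merely slow; on a
first-implementation SPARC, every iteration runs through the
fair-sized hunk of divide code mentioned above.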
PROF & PIXSTATS: stop here if you don't want gory details.

Profile listing generated Mon Jun 11 19:49:24 1990 with:
   prof -pixie dc

*  -p[rocedures] using basic-block counts;
*  sorted in descending order by the number of cycles executed in
*  each procedure; unexecuted procedures are excluded

18830106 cycles   **INSTRUCTION CYCLES, NO CACHE MISSES/STALLS**

    cycles %cycles  cum %   cycles  bytes procedure (file)
                             /call  /line

  15732582   83.55  83.55    53695     37 div (dc.c)
   1359698    7.22  90.77     5938     36 mult (dc.c)
    342478    1.82  92.59     3142     32 getdec (dc.c)
    307083    1.63  94.22       32     20 seekc (dc.c)
    294895    1.57  95.79     3597     43 add (dc.c)
**ALL REMAINING FUNCTIONS USED LESS THAN 1% OF THE INSTRUCTION CYCLES**

*  -h[eavy] using basic-block counts;
*  sorted in descending order by the number of cycles executed in
*  each line; unexecuted lines are excluded

procedure (file)     line  bytes   cycles      %  cum %

div (dc.c)            665    144  3260586  17.32  17.32
div (dc.c)            657    124  2781234  14.77  32.09
div (dc.c)            671     92  1911378  10.15  42.24
div (dc.c)            655     72  1648096   8.75  50.99
div (dc.c)            664     52  1124340   5.97  56.96
div (dc.c)            672     40   805894   4.28  61.24
div (dc.c)            658     40   739898   3.93  65.17
div (dc.c)            656     16   412024   2.19  67.36
div (dc.c)            667     12   337302   1.79  69.15
div (dc.c)            659    116   245129   1.30  70.45
mult (dc.c)          1097    100   233586   1.24  71.69
**ALL REMAINING STATEMENTS USED LESS THAN 1% OF THE INSTRUCTION CYCLES**

pixstats dc:
27600801 (1.466) cycles (1.1s @ 25.0MHz)  **INSTR CYCLES, INCL. STALLS**
18830106 (1.000) instructions             **INSTRUCTIONS**
 6807625 (0.362) loads
 1994843 (0.106) stores
 8802468 (0.467) loads+stores
 8802468 (0.467) data bus use
 8770695 (0.466) multiply/divide interlock cycles (12/35 cycles)
   **EXCEPTIONALLY HIGH: VERY FEW REAL PROGRAMS LOOK THIS WAY:
     1/3 OF ALL CYCLES ARE SPENT WAITING FOR MUL OR DIV TO COMPLETE!**
   0.448 load nops per load
   **ALSO VERY HIGH: more typical is .25-.30 lnops/load**

opcode distribution:
     lw     6350450  33.72%
     lnop   3051497  16.21%
     sw     1686506   8.96%
     bnop   1237018   6.57%
     bne    1162741   6.17%
     addu    893980   4.75%
     addiu   799847   4.25%
     beqz    719618   3.82%
     lb      449888   2.39%
     li      326381   1.73%
     sb      307835   1.63%
     sltu    303554   1.61%
     subu    256783   1.36%
     mflo    238028   1.26%
     div     238028   1.26%   **YEP, RIGHT UP THERE**
     bcond   139925   0.74%
     bgez    129581   0.69%
     mfhi    114251   0.61%
     multu   114251   0.61%   **AND THERE'S THE MULTIPLY**
-- 
-john mashey    DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP:   mash@mips.com OR {ames,decwrl,prls,pyramid}!mips!mash
DDD:    408-524-7015, 524-8253 or (main number) 408-720-1700
USPS:   MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
ian@sibyl.eleceng.ua.OZ (Ian Dall) (06/13/90)
In article <39319@mips.mips.COM> mash@mips.COM (John Mashey) writes:
>In article <8019@mirsa.inria.fr> jlf@mirsa.inria.fr (Jean-Louis Faraut) writes:
>>Here is my version of the Joy's program :
>>========================================
>>#!/bin/sh
>>echo 99k65536vvvvvp8op | dc
>>========================================
>
>Well, unfortunately, there is an unfortunate bug in this benchmark, in
>that the behavior of this program in no way resembles most code
>typically run on general-purpose UNIX systems, and you absolutely do
>NOT want to use it to help choose computers unless your workload
>happens to be multiple-precision arithmetic doing lots of multiplies
>and divides.

True, and undoubtedly why the SPARC fared so badly.  What I think is
more interesting is that the 5810 slows down faster than the VAX with
an increasing number of processes.

VAX 8650 (8 mips CISC) Ultrix 1.2
Joy user    0.02  0.02  0.05  0.11  0.21  0.48

DEC 5810 (18.7 mips RISC/MIPS) Ultrix 3.0
Joy user    0.01  0.01  0.08  0.10  0.11  0.46
user/VAX    2.00  2.00  0.63  1.10  1.91  1.04    average = 1.45

The 1-process case is a reasonable approximation to the ratio of the
rated speeds (especially given the resolution of the measurement).
The 32-process case shows the 5810 in a poor light.  Does this have
anything to do with the frequency of multiply instructions, or is it
a feature of frequent context switches?  Cache flushes maybe?  (A
rough harness for this kind of measurement is sketched after this
post.)

>Specifically, most RISC designers, after studying many programs,
>decided that integer multiply (and especially divide) were used less
>frequently than many other operations, and there is substantial data
>from many vendors that backs this up.

I can't help thinking that average speed (over an instruction mix) is,
like most statistics, an inadequate measure.  The trouble is, if you
have a multiply-intensive application, it is a pain if it runs
dramatically slower than you would expect for a machine of that class.
In a sense, one would like to know the worst-case "speed" as well as
the "average" speed of a machine (lots of hand waving here).
-- 
Ian Dall     life (n). A sexually transmitted disease which afflicts
                       some people more severely than others.
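A rough harness for the multi-process measurement above, sketched in
C.  It assumes the Joy-style doubling of process counts from 1 to 32
(suggested by the "1 process" and "32 process" cases) and substitutes
a divide-bound loop for the dc run; note too that the original test
reports per-process user time, where this simplification reports
wall-clock time:

========================================
/* multi-process scaling harness (illustrative sketch) */
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

static void child_work(void)
{
    /* divide-bound stand-in for one "echo ... | dc" run;
       volatile keeps the loop from being optimized away */
    volatile unsigned long a = 1234567, b = 89, r = 0;
    unsigned long i;

    for (i = 0; i < 5000000; i++)
        r += a / b;
}

int main(void)
{
    static const int counts[] = { 1, 2, 4, 8, 16, 32 };
    int c, n;

    for (c = 0; c < 6; c++) {
        time_t start = time(NULL);

        for (n = 0; n < counts[c]; n++)
            if (fork() == 0) {      /* child: do the work, then exit */
                child_work();
                _exit(0);
            }
        while (wait(NULL) > 0)      /* parent: reap every child */
            ;
        printf("%2d processes: %ld seconds elapsed\n",
               counts[c], (long)(time(NULL) - start));
    }
    return 0;
}
========================================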
rcpieter@tuegate.tue.nl (Tiggr) (06/13/90)
In article <675@sibyl.eleceng.ua.OZ> ian@sibyl.OZ (Ian Dall) writes:
>[...] What I think is more interesting is that the 5810 slows down
>faster than the VAX with an increasing number of processes.

Note that the VAX 8650 is running Ultrix 1.2 (gosh!) and the DEC 5810
Ultrix 3.0.  I thought it was well known that every release of Ultrix
is slower and larger than the previous one.

Tiggr
-- 
Don't believe me, I'm only a student.
mash@mips.COM (John Mashey) (06/15/90)
In article <675@sibyl.eleceng.ua.OZ> ian@sibyl.OZ (Ian Dall) writes:
>I can't help thinking that average speed (over an instruction mix) is,
>like most statistics, an inadequate measure.  The trouble is, if you
>have a multiply-intensive application, it is a pain if it runs
>dramatically slower than you would expect for a machine of that class.
>In a sense, one would like to know the worst-case "speed" as well as
>the "average" speed of a machine (lots of hand waving here).

Really, what you want is enough data points that you think you know
not only some measure of centrality but some measure of variation, but
even those are not enough.  You always really want enough benchmarks
to see the patterns of difference: this is why SPEC has always
insisted on making ALL of the benchmark numbers available, because
quite different patterns can be found.

The worst-case performance is not all that interesting: for two cached
machines with different cache organizations, you can usually "prove"
different ratios of relative performance by careful selection of the
most relevant cache-busting code.  For instance, the "compress"
program is often a good example of something that will drag most
machines down to DRAM speed.  Another good one, on a direct-mapped,
virtual-cache machine, is to copy, 1 byte at a time, between two areas
that collide in the cache.  This causes every single byte read to:
	write back the (dirty) cache line to memory
	read the new cache line
and then each byte write:
	flushes the (clean) cache line
	reads the new cache line
	writes the byte into that cache line
(I.e., if you want to artificially show off a SPARC 490 at its worst,
you can probably prove it's slower than a 68020 with such a
benchmark.)  Of course, any given machine can be done in this way.
(A sketch of such a colliding copy follows this post.)

What is useful is to have some cases that show performance in
different usage patterns: the mean & standard deviation, or mean and
min alone, just don't tell you much about what's happening.
-- 
-john mashey    DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP:   mash@mips.com OR {ames,decwrl,prls,pyramid}!mips!mash
DDD:    408-524-7015, 524-8253 or (main number) 408-720-1700
USPS:   MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
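A sketch of the colliding byte-copy described above.  The 64 Kb
direct-mapped cache size is an assumption (substitute the geometry of
the machine under test), and whether the two buffers truly collide
also depends on how the allocator places them and on whether the cache
indexes virtual or physical addresses:

========================================
/* cache-busting colliding copy (illustrative sketch) */
#include <stdio.h>
#include <stdlib.h>

#define CACHE_SIZE (64 * 1024)   /* assumed direct-mapped cache size */

int main(void)
{
    char *arena = malloc(2 * CACHE_SIZE);
    char *src, *dst;
    int pass, i;

    if (arena == NULL)
        return 1;
    src = arena;
    dst = arena + CACHE_SIZE;   /* same cache index as src, line for line */

    for (pass = 0; pass < 100; pass++)
        for (i = 0; i < CACHE_SIZE; i++)
            dst[i] = src[i];    /* every read and write misses */

    printf("%d\n", dst[0]);     /* use the result so the copy stays */
    free(arena);
    return 0;
}
========================================

Each byte moved forces the write-back/refill sequence listed above, so
the loop runs at memory speed rather than cache speed.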
dik@cwi.nl (Dik T. Winter) (06/15/90)
In article <39397@mips.mips.COM> mash@mips.COM (John Mashey) writes:
 > In article <675@sibyl.eleceng.ua.OZ> ian@sibyl.OZ (Ian Dall) writes:
 > >I can't help thinking that average speed (over an instruction mix)
 > >is, like most statistics, an inadequate measure.  The trouble is,
 > >if you have a multiply-intensive application, it is a pain if it
 > >runs dramatically slower than you would expect for a machine of
 > >that class.  In a sense, one would like to know the worst-case
 > >"speed" as well as the "average" speed of a machine (lots of hand
 > >waving here).
 >
 > Really, what you want is enough data points that you think you know
 > not only some measure of centrality but some measure of variation,
 > but even those are not enough.  You always really want enough
 > benchmarks to see the patterns of difference: this is why SPEC has
 > always insisted on making ALL of the benchmark numbers available,
 > because quite different patterns can be found.

I do not think so.  In my opinion you also want to know the worst
case: not only the performance in the worst case, but also what the
worst case is.  Every machine has a kind of program where it does not
work as expected.  Well-known cases are integer multiply and cache
problems (as outlined by John Mashey).  I think that the interested
customer needs to know the cases where this happens.  The SPEC figures
already go in that direction, as outlined above, but more might be
done in this respect.  The alternative is customers who are disgusted
at the performance of their special application (like Robert Silverman
with his integer arithmetic programs some time ago).

What I would like is for SPEC to also require some statement of the
fields where the machine falls below the expectations set by the SPEC
figures.  I have already posted about the SPARC; I have not had enough
time on MIPS processors to get at their weaknesses (but I understand
you might get in trouble with the cache); and so on.  All machines I
have used up to now have their strong points and their weak points.
Of course, the trade rags contain the strong points only, but as SPEC
is trying to give some objective information on processors, in my
opinion the weak points might be mentioned as well.  Of course, this
is a very rough idea (and might become controversial), but some
thoughts in this direction are very desirable.  And it might help the
decision process about the machine.

How many customers know how to benchmark a machine?  Let alone how to
benchmark a machine for their particular situation?  Of course the
manufacturers make use of this (always lay the blame on the
manufacturer :-)).  As a rough estimate, the numbers for the first and
second case above are (I may be 100% off) respectively: zero and zero.
But I hope those numbers will grow.  (As a parallel example: how many
people know how to calculate the speedup on a multi-processor system?)

Currently you read the promotional material and the SPEC figures (if
available), try your own favourite programs, and decide, taking into
account the computing environment, which might be completely
irrelevant.  For larger machines you may simulate the workload you
expect (which is probably wrong or, alternatively, outdated by the
time you have the new system installed).  (And in this context, "you"
is a committee, each member with his own preferences and so on.)
Alas, these days the only way to assess the viability of a machine is
to install it, let it run a few years, and then say that it was the
proper decision (or not, of course).  (Unless it appears after two
months that the machine is not used; in that case you can make an
early decision; but of course that is too late.)

Do I sound pessimistic?  Yes.  Am I pessimistic?  No.  How come?  In
my opinion the SPEC figures are a giant step in the right direction,
and I hope that in time they may become really useful.  Still, the
biggest problem is of course that decisions are made by management,
and not by the users (this is not directed at dcab :-)).

Oh well, these were some rambling thoughts...
-- 
dik t. winter, cwi, amsterdam, nederland
dik@cwi.nl
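For the parenthetical question about multi-processor speedup, the
usual first-order answer is Amdahl's law -- a textbook formula, not
something from this thread.  If a fraction p of the work parallelizes
perfectly over N processors:

	speedup(N) = T(1) / T(N) = 1 / ((1 - p) + p/N)

E.g., p = 0.9 on N = 16 processors gives 1 / (0.1 + 0.9/16) = 6.4,
nowhere near 16 -- which is presumably the point of the question.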
casey@gauss.llnl.gov (Casey Leedom) (06/16/90)
| From: mash@mips.COM (John Mashey)
|
| The worst-case performance is not all that interesting: for two
| cached machines with different cache organizations, you can usually
| "prove" different ratios of relative performance by careful
| selection of the most relevant cache-busting code.
|
| [A good example] on a direct-mapped, virtual-cache machine, is to
| copy, 1 byte at a time, between two areas that collide in the cache.
|
| (I.e., if you want to artificially show off a SPARC 490 at its
| worst, you can probably prove it's slower than a 68020 with such a
| benchmark.)  Of course, any given machine can be done in this way.

  While I agree with you that one can always come up with
cache-busting code, I think that you picked a particularly bad
example, as I think the cache design in question is just brain dead.
If you design a direct-mapped cache you should have at least two
buckets for each cache line.  Linearly doing things to two different
arrays is so common that you're bound to run into the problem you
mention.

  (As proof, I was called in on a problem with a Sun 3/280 that was
bought for image processing.  Part of their processing involved,
essentially, copying a 1/4 Mb array 30 times a second.  The group had
justified buying the 280 on the grounds that a 180 just wouldn't be
fast enough.  Imagine their horror when they ran their code on their
brand new 280 and found that it ran 3 times slower than on a 180!  The
problem turned out to be that the two arrays were an exact multiple of
64 Kb away from each other -- the size of the 280's cache.  Eventually
I was able to bring the 280 up to the speed of a 180 by offsetting the
arrays by 24 bytes from 64 Kb.  (There were actually a bunch of fast
nodal points, but you get the idea.)  A sketch of that fix follows
this post.)

  A cache shouldn't break on common operations.

Casey
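The fix described above amounts to padding the second array so the two
no longer sit an exact multiple of the cache size apart.  A sketch:
the 1/4 Mb array, the 64 Kb cache, and the 24-byte offset come from
the story, while the single-arena allocation is illustrative:

========================================
/* offsetting two arrays to break cache aliasing (illustrative sketch) */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define IMAGE_SIZE (256 * 1024)   /* 1/4 Mb: an exact multiple of 64 Kb */
#define PAD        24             /* the offset that fixed it */

int main(void)
{
    /* Laid end to end, src and dst would sit an exact multiple of
       the cache size apart and collide line for line in the 280's
       64 Kb direct-mapped virtual cache.  PAD breaks the alignment. */
    char *arena = malloc(2 * IMAGE_SIZE + PAD);
    char *src, *dst;
    int frame;

    if (arena == NULL)
        return 1;
    src = arena;
    dst = arena + IMAGE_SIZE + PAD;

    memset(src, 0, IMAGE_SIZE);
    for (frame = 0; frame < 30; frame++)  /* one second's worth of copies */
        memcpy(dst, src, IMAGE_SIZE);

    printf("%d\n", dst[0]);
    free(arena);
    return 0;
}
========================================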
mash@mips.COM (John Mashey) (06/18/90)
In article <137441@sun.Eng.Sun.COM> lm@sun.UUCP (Larry McVoy) writes:
>>| Of course, any given machine can be done in this way.
>>
>>  While I agree with you that one can always come up with
>>cache-busting code, I think that you picked a particularly bad
>>example, as I think the cache design in question is just brain dead.
>>If you design a direct-mapped cache you should have at least two
>>buckets for each cache line.
>
>Excuse me, but isn't what you are describing a two-way set-associative
>cache?  Furthermore, I believe that John's point was that you can take
>*any* caching scheme and break it.  Suppose that we had a cache like
>you described.  It [...]

Yes, this was not picking on Sun (all of MIPS' caches are
direct-mapped, too, except the RC6280's second-level cache, although
we do tend to use physically-tagged caches most of the time, for other
reasons not relevant to the single-threaded discussion of arrays and
such).

>Finally, you may or may not be correct in your assessment that
>direct-mapped caching schemes are "brain dead", but you ought to
>consider what is involved in making a cache N-way set-associative.
>It takes up a lot of real estate, and if N is even slightly large it
>can add extra gate delays in the critical path of looking for a hit
>(last I looked, fan-in on AND gates wasn't that large, but maybe
>that's changed).

Yes.  Look at Steve Przybylski's thesis and/or the Morgan Kaufmann
book; also, there have been several papers analyzing direct-mapped
caches versus set-associative ones in various design environments, and
finding that they are very competitive in those environments where
set-associativity is not already strongly implied for other reasons.
Set-associativity is always nice when you can get it, and sometimes
it's the right thing, sometimes not.
-- 
-john mashey    DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP:   mash@mips.com OR {ames,decwrl,prls,pyramid}!mips!mash
DDD:    408-524-7015, 524-8253 or (main number) 408-720-1700
USPS:   MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
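For reference, the distinction being argued: Casey's "two buckets for
each cache line" is exactly a 2-way set-associative cache, which needs
two tag compares per lookup where a direct-mapped cache needs one.  A
toy model in C -- the set count, the round-robin victim choice, and
treating addresses as line numbers are all illustrative, not any real
design:

========================================
/* direct-mapped vs. 2-way set-associative lookup (toy model) */
#include <stdio.h>

#define NSETS 1024                         /* total cache lines (illustrative) */

static unsigned long dm_tag[NSETS];        /* direct-mapped: one tag per index */
static unsigned long sa_tag[NSETS / 2][2]; /* 2-way: half the sets, two ways */
static int sa_victim[NSETS / 2];           /* which way to refill on a miss */

static int dm_lookup(unsigned long addr)   /* returns 1 on hit */
{
    unsigned long idx = addr % NSETS, tag = addr / NSETS;
    if (dm_tag[idx] == tag)
        return 1;                          /* one tag compare */
    dm_tag[idx] = tag;                     /* miss: refill the only slot */
    return 0;
}

static int sa_lookup(unsigned long addr)   /* returns 1 on hit */
{
    unsigned long idx = addr % (NSETS / 2), tag = addr / (NSETS / 2);
    if (sa_tag[idx][0] == tag || sa_tag[idx][1] == tag)
        return 1;                          /* two compares (parallel in HW) */
    sa_tag[idx][sa_victim[idx]] = tag;     /* miss: refill one of two ways */
    sa_victim[idx] ^= 1;
    return 0;
}

int main(void)
{
    /* Alternate between two addresses that map to the same index in
       each cache: they fight over the single direct-mapped slot, but
       coexist in the two ways of the set-associative cache. */
    int i, dm_hits = 0, sa_hits = 0;
    for (i = 0; i < 1000; i++) {
        dm_hits += dm_lookup(4096) + dm_lookup(4096 + NSETS);
        sa_hits += sa_lookup(4096) + sa_lookup(4096 + NSETS / 2);
    }
    printf("direct-mapped hits: %d, 2-way hits: %d\n", dm_hits, sa_hits);
    return 0;
}
========================================

The model also shows Larry's cost point in miniature: the 2-way lookup
compares two tags and must choose a victim on a miss, which is extra
real estate and potentially extra delay in real hardware.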