mash@mips.COM (John Mashey) (12/26/90)
Having finally caught up with the net after a long trip, I'm sad to see
that 1 out of 3 postings in this newsgroup concern the "bc" benchmark or
some variety thereof.  I had higher hopes for this, especially as at
least some people have read previous discussions in comp.arch.  This
%#@!$% thing is like a vampire: every time you think you've finally put
a stake thru its heart, it returns one more time.

1. Small benchmarks are very prone to misinterpretation, prone to
compiler gimmickry, and seldom exercise modern machines very well.
About their only even-slightly-rational use is to compare machines with
the same chips running at different clock rates.  Small, synthetic
benchmarks can easily over- or under-emphasize language and/or machine
features out of all proportion to mixtures found in more realistic
benchmarks.  As a matter of faith, I consider small benchmarks guilty
until proven innocent, i.e., if you can prove their results correlate
well, across product lines, with much more substantial real programs,
then maybe you have something.  (And in fact, this is a good thing to
have; for instance, I've often thought of offering a small prize for
anyone who can create a small program that predicts performance on the
10 SPEC benchmarks across machine lines, but I haven't figured out how
to describe this well enough to figure out if someone has achieved it.)

2. Filling the net with timings for a benchmark where no one even
explains what code is being executed, how big it is, whether or not it
correlates with ANYTHING, etc, etc, is like trying to predict the speed
of automobiles by ripping out their steering wheels and seeing how fast
they roll.

3.
NOW, here are SOME FACTS about this benchmark:

1) It is tiny: 99.57% of the instruction cycles (on a MIPS machine) are
accounted for by 10 LINES OF CODE; 71% of the cycles are consumed in 3
LINES OF CODE.  In addition, unlike matrix kernels, whose code is small
but whose data references are big, this doesn't even have that
property: all the code & data fit in tiny caches.

2) Its instruction usage bears little resemblance to much of anything:
see Hennessy & Patterson for typical characteristics of code.  In
particular, this code almost never makes function calls, and (on a MIPS
machine, which HAS integer multiply and divide) spends 50% of the total
cycles doing integer multiply and divide.  I assure you, this is
typical of very few programs; this is NOT the kind of statistics that
any computer architect I know designs machines around, etc, etc.  (Of
course, I should love this benchmark, as it REALLY hurts machines with
no integer multiply.)  At the end of this posting are the slices of
prof & pixstats output.

4. PLEASE STOP WASTING TIME WITH THIS BENCHMARK
(Please, let this be the last stake in its heart :-)

5. ABOUT THE ONLY USEFUL THING I CAN THINK OF TO DO WITH THIS is for
somebody to run this benchmark on many of the machines for which SPEC
integer benchmarks exist, plot the two together, and compute a
correlation for them; or even, pick any one of the SPEC integer
benchmarks and do it for that.  (Or pick some other realistic integer
benchmark for which well-controlled results exist.)

----------
Profile listing generated Tue Dec 25 13:42:35 1990 with: prof -pixie dc

*  -p[rocedures] using basic-block counts;                  *
*  sorted in descending order by the number of cycles       *
*  executed in each procedure; unexecuted procedures are    *
*  excluded                                                 *

84303520 cycles

    cycles  %cycles   cum %   cycles  bytes  procedure (file)
                               /call  /line
  84058950    99.71   99.71  1827369     36  mult (dc.c)
    132423     0.16   99.87     4905     37  div (dc.c)
     31153     0.04   99.90      538     21  nalloc (dc.c)
.....
OH GOOD: it spends 99.7% of its time in one function...

IN FACT, going to the next level of detail, where we see the number of
cycles spent in the statements that consumed the time, we discover that
83.7% of the instruction cycles are spent IN JUST 4 LINES OF C....:

*  -h[eavy] using basic-block counts;                       *
*  sorted in descending order by the number of cycles       *
*  executed in each line; unexecuted lines are excluded     *

procedure (file)       line  bytes    cycles      %   cum %

mult (dc.c)            1097    100  22754044  26.99   26.99
mult (dc.c)            1094     96  20317562  24.10   51.09
mult (dc.c)            1093     68  16755620  19.88   70.97
mult (dc.c)            1095     36  10771470  12.78   83.74
mult (dc.c)            1098     40   8383670   9.94   93.69
mult (dc.c)            1096     16   4787320   5.68   99.37
mult (dc.c)            1084     80     83600   0.10   99.47
mult (dc.c)            1102     96     45066   0.05   99.52
mult (dc.c)            1087     68     41076   0.05   99.57
nalloc (dc.c)          1974     36     29529   0.04   99.60
div (dc.c)              665    144     24070   0.03   99.63
mult (dc.c)            1101     96     23606   0.03   99.66
div (dc.c)              657    124     22139   0.03   99.69
mult (dc.c)            1104     40     20630   0.02   99.71
......

------------
Following is an analysis of instruction usage, on a MIPS R3000-based
machine: pixstats dc:

174126742 (2.065) cycles (6.97s @ 25.0MHz)
 84303520 (1.000) instructions  [# instructions]
     1283 (0.000) calls         [basically: never does function calls]
 28881440 (0.343) loads         [a little high]
  8458964 (0.100) stores
 89823222 (1.065) multiply/divide interlock cycles (12/35 cycles)
          [amazingly high: 50% of the time in this code is doing
          integer multiply/divide.  Real programs do exist like this,
          but this is completely unrepresentative of the vast bulk of
          integer code....]
1.36e+05 cycles per call ...
like I said: hardly ever does function calls

6.57e+04 instructions per call

Instruction concentration:
      1    1.4%
      2    2.8%
      4    5.7%
      8   11.4%
     16   22.7%
     32   45.4%
     64   90.8%
    128   99.6%
    256   99.8%
    512   99.9%
   1024  100.0%
   2048  100.0%
   3697  100.0%

THIS SAYS: in a perfect fully-associative cache, 90.8% of the
instruction cycles would be spent in only 64 words (64 instructions),
and 99.9% would fit into 1024 words.... i.e., it fits into almost any
machine's cache...

opcode distribution: [dynamic]
    div     2395317   2.84%
    multu   1197623   1.42%

A PROGRAM WITH TWICE AS MANY INTEGER DIVIDES AS MULTIPLIES....
--
-john mashey    DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP:   mash@mips.com OR {ames,decwrl,prls,pyramid}!mips!mash
DDD:    408-524-7015, 524-8253 or (main number) 408-720-1700
USPS:   MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
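[Editorial aside: the "instruction concentration" table above can be
reproduced from per-instruction-word cycle counts with nothing more
than a sort and a running sum.  A minimal sketch of that computation --
my reconstruction for illustration, not pixstats' actual code:]

```c
#include <stdlib.h>

/* Comparator for qsort: sort cycle counts in DESCENDING order. */
static int cmp_desc(const void *a, const void *b)
{
    long x = *(const long *)a, y = *(const long *)b;
    return (x < y) - (x > y);
}

/* Fraction of all cycles covered by the 'top' hottest instruction
 * words.  'counts' holds cycles charged to each word and is sorted
 * in place.  pixstats-style concentration is this, evaluated at
 * top = 1, 2, 4, 8, ... */
double concentration(long *counts, size_t n, size_t top)
{
    long total = 0, covered = 0;
    size_t i;

    qsort(counts, n, sizeof counts[0], cmp_desc);
    for (i = 0; i < n; i++)
        total += counts[i];
    for (i = 0; i < n && i < top; i++)
        covered += counts[i];
    return total ? (double)covered / (double)total : 0.0;
}
```

For dc, evaluating this at 64 words would yield the 90.8% figure quoted
above: almost the whole benchmark lives in a handful of cache lines.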
borasky@ogicse.ogi.edu (M. Edward Borasky) (12/27/90)
Thank you for at least driving another stake into the "bc benchmark"'s
heart.  However, as you and I know, there is a tremendous need out
there for [sigh] [gasp] A SINGLE NUMBER to characterize JUST EXACTLY
HOW FAST ANY GIVEN COMPUTER IS.  I have my own personal favorite which
I will not belabor because everyone has his own personal favorite.  My
question is this: just as you and I believe that vampires don't exist,
do you believe that a single number that measures a computer's speed
doesn't exist?  I won't state MY belief to avoid bias in the
discussion.  My use of the word "bias" in the preceding sentence is a
HINT at my belief!
pbickers@tamaluit.phys.uidaho.edu (Paul Bickerstaff) (12/27/90)
In article <15379@ogicse.ogi.edu>, borasky@ogicse.ogi.edu (M. Edward Borasky) writes:
> My question is this: just as you and I believe that vampires don't
> exist, do you believe that a single-number that measures a computer's
> speed doesn't exist?  I won't state MY belief to avoid bias in the
> discussion.

There is NO such number!!  It does not take an expert to dig up two
programs: #1 runs faster on machine A than on machine B, but #2 runs
faster on machine B.  Both programs could, e.g., be in Fortran.
Further, a suite of "representative" programs could run faster on A
when the load factor is 1.0, but faster on B when the load factor is,
say, 8.  The list of possibilities goes on.

Paul Bickerstaff                 Internet: pbickers@tamaluit.phys.uidaho.edu
Physics Dept., Univ. of Idaho    Phone:    (208) 885 6809
Moscow ID 83843, USA             FAX:      (208) 885 6173
choll@telesoft.com (Chris Holl @adonna) (12/28/90)
In article <15379@ogicse.ogi.edu>, borasky@ogicse.ogi.edu (M. Edward Borasky) writes:
> ...do you believe that a single-number that measures a computer's
> speed doesn't exist?

When I worked at Boeing Computer Services we typically looked for one
number to compare two machines.  The comparison was very narrow in
scope, however: we compared one vendor's computer to their next box.
As long as the architecture stays the same, such comparisons are valid.
This was needed because, as a computer service, there had to be a way
to consistently charge customers independent of which box their job
actually ran on.  The CPU times needed to be normalized so it wouldn't
matter if a job ran on a Cyber 175 or 760, a Cray 1S or X-MP.  In fact,
we had to guarantee this for our government customers, who insisted
that their bill for the same job should always be within some
percentage (5%, I think).

The CPU ratios were determined by running 10 to 14 CPU kernels such as
linear code, loops, subroutine calls, memory fetches (in and out of
stride), matrix reductions, etc.  Again, as long as the architecture
was the same (or close enough) the ratios stayed pretty constant (and
typically close to the ratio of the clocks, which was usually the
biggest difference).

Where the architecture changed we had trouble justifying one number.
For example, we compared a Cray 1 to a Cray X-MP.  The clocks were 12.5
and 9.5 nanoseconds.  All the ratios looked fine (1.32 give or take a
bit) except scatter/gather, which was 10 to 14 times faster on the X!
(Hardware scatter/gather - architecture change.)  A job that performed
a lot of scatter/gather would burn different amounts of CPU seconds on
the different Crays.

The other "one number" we used was for capacity planning.  After
maturing through many yardsticks of throughput, one of my fellows
(Dr. Howard "Doc" Schemising - wonderful guy) developed a capacity test
that would precisely model the current workload on a machine.
This was used with great accuracy to measure the capacity of different
machines (for that workload).

Anyway, summing up my ramblings: one number is okay for a given
architecture or a given application.  Unfortunately that is not what
most people are looking for.  They want you to tell them how fast their
jobs are going to be on machine X if they are this fast on Y.  My
answer has always been "Depends what you're doing," which rarely
satisfies 'em. :-)

Chris Holl                      TeleSoft (formerly of BCS)
5959 Cornerstone Ct. W.
San Diego, CA 92121
borasky@ogicse.ogi.edu (M. Edward Borasky) (12/28/90)
In article <1142@telesoft.com> choll@telesoft.com (Chris Holl @adonna) writes:
>When I worked at Boeing Computer Services we typically looked for one
>number to compare two machines. [...]
>This was needed because as a computer service, there had to be a way
>to consistently charge customers independent of which box their job
>actually ran on. [...]
>In fact, we had to guarantee this for our government customers who
>insisted that their bill for the same job should always be within some
>percentage (5%, I think).

I was hoping for a response like this.  There are two types of computer
users: those like you and me, who realize that computing costs money
and is a resource that must and can be managed, and those like
students, computer science faculty, and dreamer/architects, who think
that computing should be, can be, and often is essentially free.
Granted, you can pick YOUR favorite speed number (let's say SPECmarks)
and come up with a very low cost box that sits on your desk and
delivers it, complete with stunning 3D graphics and UNIX and some kind
of windowing.  But although the user of this box may think of $10K as
very little money, the company or university that bought 100 of them
(now we're talking a million) PLUS the Ethernet PLUS the guy who comes
and bails you out when you delete your whole directory accidentally
PLUS the guy who backs your files up once a week so you CAN get bailed
out, etc. -- the company/university has a large investment here.

>For example, we compared a Cray 1 to a Cray X-MP.  The clocks were
>12.5 to 9.5 nanoseconds.  All the ratios looked fine (1.32 give or
>take a bit) except scatter/gather which was 10 to 14 times faster on
>the X!  (Hardware scatter/gather - architecture change.)  A job that
>performed a lot of scatter/gather would burn different amounts of CPU
>seconds on the different Crays.
I'll bet that the COST difference between the two machines was such
that you could afford to give away the extra speed on the X-MP from the
hardware scatter/gather -- bill the X-MP as if it were strictly 1.32
times the Cray 1.

>The other "one number" we used was for capacity planning.

You just said the secret word -- "capacity planning"!  I wish the duck
were still around to drop down and give you fifty dollars!

>(Dr. Howard "Doc" Schemising - wonderful guy) developed a capacity
>test that would precisely model the current workload on a machine.
>This was used with great accuracy to measure the capacity of different
>machine (for that workload).

Is this published?  Could you post it?  The guys here and in
"comp.arch" would LOVE to see it!

>[...] Unfortunately that is not what
>most people are looking for.  They want you to tell them how fast
>their jobs are going to be on machine X if they are this fast on Y.

Yes, for that you DO need more than ONE number.  But for supercomputers
and supercomputer applications, you can do a damn fine job with THREE
numbers!  Two numbers to describe the computer and one for the
application.
mash@mips.COM (John Mashey) (12/29/90)
In article <15424@ogicse.ogi.edu> borasky@ogicse.ogi.edu (M. Edward Borasky) writes: >>The other "one number" we used was for capacity planning. After >You just said the secret word -- "capacity planning"! I wish the duck >were still around to drop down and give you fifty dollars! >>(Dr. Howard "Doc" Schemising - wonderful guy) developed a capacity >>test that would precisely model the current workload on a machine. >>This was used with great accuracy to measure the capacity of different >>machine (for that workload). >Is this published? Could you post it? The guys here and in "comp.arch" >would LOVE to see it! Yes, it would be good to see this. Note the important fact that there are two steps: a) Characterizing the workload b) Predicting the performance on that workload Part a) is why SPEC advises people to try to correlate their own workloads with some subset of SPEC benchmarks, and then ignore the other SPEC benchmarks, and in fact, I've started to see users doing this already. Also, I've seen some pretty good benchmarks, with workloads tailored to different departments within a company ... unfortunately, the best ones I've seen were all proprietary... -- -john mashey DISCLAIMER: <generic disclaimer, I speak for me only, etc> UUCP: mash@mips.com OR {ames,decwrl,prls,pyramid}!mips!mash DDD: 408-524-7015, 524-8253 or (main number) 408-720-1700 USPS: MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
alan@shodha.enet.dec.com ( Alan's Home for Wayward Notes File.) (12/31/90)
In article <15379@ogicse.ogi.edu>, borasky@ogicse.ogi.edu (M. Edward Borasky) writes:
> My question is this: just as you and I believe that vampires don't
> exist, do you believe that a single-number that measures a computer's
> speed doesn't exist?

I do not believe that a SINGLE (atomic) number exists which usefully
measures a computer's speed.  The problem with a single number is that
it may measure only one or a few aspects of the computer's speed.  For
example: how long it takes to execute an instruction stream which
contains 60% integer divides and 39% integer multiplies and which fits
into the system data and instruction caches.  For some small population
that aspect may be interesting.  For many others it is not a reflection
of reality.  Some people will be interested in how fast the computer
does integer adds and subtracts.  Others, floating point arithmetic.
Still others, byte copies and compares.  And still others will have
applications that are dominated by I/O.

What is required are MANY (atomic) numbers, each of which measures one
or more of the interesting aspects of a computer's speed.  To go along
with these numbers are reasonably detailed descriptions of what each
number measures.  That way you can examine your application to
characterize its use of the system and find the number which is the
best match.  Only then is it safe to compare single numbers.

Now, one thing that can be done is to take these MANY (atomic) numbers
and run them through some mathematical function to get a single number
that represents the combination of all of the numbers.  The function
has to be carefully constructed so that one exceptional number (good or
bad) doesn't dominate the final answer.

If all bc(1)'s are created equal and you know that its use of the
system reflects how you use the system, then it may well be a good
single benchmark.  I believe that using bc(1) is only safe as a
benchmark if your primary application is in fact bc(1).
If a vendor decides to make bc(1) go very fast at the expense of
improving the performance of their compilers and libraries, then only
the people that use bc(1) will win.

My application: Read 24 bytes and extract four non-well-aligned fields:
a read/write flag, LBN (logical block number), transfer size and unit
number.  If this record is interesting, perform a variety of
operations, some of which are:

Look up the transfer size in a list based on the unit number to count
how many transfers of that size there are (pointer chasing ending with
an integer increment).  Increment a counter (based on the unit number)
for each read or write.  Determine the absolute distance from the
previous LBN to the current LBN (for this unit number).  Determine what
logical partition this LBN was in to increment another set of counters
(still more pointer chasing for the transfer size).  So far the
application has stressed integer adds, compares and moving bytes
around.

Look up the LBN in a database to determine whether or not the LBN is
file system metadata and what kind.  It just became seriously I/O bound
(a large buffer cache helps A LOT).

Once all this has been done for every 24-byte record in the input
stream, print a summary of the results.  Most of the math is done in
floating point (adds with a small number of multiplies and divides).

If you like them, vampires are a nice fantasy.  As are single number
benchmarks.
--
Alan Rollow                     alan@nabeth.enet.dec.com
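[Editorial aside: the front end of the trace reduction Alan describes
-- pull fields out of each 24-byte record, bump per-unit read/write
counters, accumulate seek distance -- might look something like the
sketch below.  The record layout (flag at byte 0, LBN at byte 4, unit
at byte 20) is invented for illustration; his real fields were not well
aligned:]

```c
#include <string.h>

/* Per-unit counters accumulated over the trace. */
struct io_stats {
    long reads, writes;
    long total_seek;    /* running sum of |LBN - previous LBN| */
    long prev_lbn;
};

/* Extract fields from one raw 24-byte trace record and update the
 * counters for its unit.  'units' is an array indexed by unit number.
 * The byte offsets here are hypothetical. */
void count_record(const unsigned char *raw, struct io_stats *units)
{
    long lbn, d;
    int is_write = raw[0] & 1;           /* read/write flag byte */
    int unit     = raw[20];              /* unit number byte */
    struct io_stats *s;

    memcpy(&lbn, raw + 4, sizeof lbn);   /* LBN field */

    s = &units[unit];
    if (is_write)
        s->writes++;
    else
        s->reads++;
    d = lbn - s->prev_lbn;
    s->total_seek += (d < 0) ? -d : d;   /* absolute seek distance */
    s->prev_lbn = lbn;
}
```

True to Alan's point, this inner loop is all integer adds, compares,
and byte moves -- and not a single integer multiply or divide in sight.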
eachus@linus.mitre.org (Robert I. Eachus) (01/01/91)
There is a way to have and use a single meaningful (balanced) benchmark
number, but only to do initial selection.  Say you choose SPECmarks.
(I use Dhrystones, but that is a minor detail.)  What you end up with
are two things: one, a standard single-number rating, and the other, a
set of comments or annotations telling the particular strengths and
weaknesses of various machines.

First of all, realize that there are really only three shades of gray
where buying hardware is concerned: more than fast enough, very tight,
and no way Jose.  If you pick a machine in the gray area, there is a
major tradeoff between the cost to optimize code for the machine
selected and the cost of more than adequate hardware.  Unfortunately,
supercomputer users and real-time people sometimes find that there IS
no other choice, but I digress.

If your single number correlates well with price, then choosing the
right machine for the job entails first benchmarking your application
on whatever machine you have around.  (I often see "estimates" of code
size and run-time that are off by two orders of magnitude.  If you
can't get within a factor of two or three, why bother spending time
benchmarking the hardware?)  At this point we now have an estimate
that, say, application Y will require a 50 MIPS VAX.  You can now
characterize the application in terms such as integer vs. floating,
vectorizable vs. scalar, I/O intensive vs. compute bound, or
single-task intensive vs. multitasking, and check out the machines in
the range you need with the strengths you want.  This will often come
down to a single choice, or at least a single processor family, that
you need to run your application specific benchmark on.

This method of choosing hardware DOES require running application
specific benchmarks twice.  But, at least in the applications that I
care about, you are fooling yourself if you don't write and run a good
application specific benchmark to start with.
Using the SPECmark suite and tailoring it to your particular
application might give a better fit, but the amount of work required to
determine the coefficients is usually more than that involved in this
approach.
--
Robert I. Eachus

with STANDARD_DISCLAIMER;
use  STANDARD_DISCLAIMER;
function MESSAGE (TEXT: in CLEVER_IDEAS) return BETTER_IDEAS is...
choll@telesoft.com (Chris Holl @adonna) (01/01/91)
>From: borasky@ogicse.ogi.edu (M. Edward Borasky)
> I was hoping for a response like this.  There are two types of
> computer users -- those like you and me who realize that computing
> costs money, is a resource that must and can be managed, and those
> like students, computer science faculty, dreamer/architects who think
> that computing should be, can be and often is essentially free.

Boeing Computer Services occasionally received criticism of their
"high" bills for computing services.  In response they produced a paper
called "The Real Cost of Computing" to educate their customers.  It
described many of the things you mentioned, including support of the
hardware, software, configuration, backups, etc.  I don't know if the
paper is available, but I could find out if there is interest.

> I'll bet that the COST difference between the two machines was such
> that you could afford to give away the extra speed on the X-MP from
> the hardware scatter/gather -- bill the X-MP as if it were strictly
> 1.32 times the Cray 1.

That's exactly what we wound up doing.  We couldn't justify a larger
figure because if a job ran on the X that didn't use scatter/gather, it
would get a higher bill than it would have on the 1-S, and that
wouldn't do.  Using 1.32, a job that used scatter/gather simply got a
better deal on the X.  We made a point of telling users this (so they
got the proper perspective :-) and to encourage them to take advantage
of the hardware.  If their jobs became more efficient, throughput would
go up.

The algorithm for billing was slightly different, however.  The 1-S had
a single processor and 2 Meg while the X-MP had 2 processors and 4 Meg.
We billed for CPU seconds and memory residency.  When a job started
using more than 2 Meg it started to pay a percentage of the other
processor, even if it wasn't using it.  If a job used the entire 4 Meg
it paid for both processors, because no one else could use the other
processor without occupying memory.
BCS took the 1-S out of the configuration after users had migrated to
the X, so after the overlap period everyone got a better deal.  Now
they have a Y-MP.  I wasn't there for that transition, so I don't know
exactly how they managed it.

> You just said the secret word -- "capacity planning"!  I wish the
> duck were still around to drop down and give you fifty dollars!

Duck?  $50?

> >(Dr. Howard "Doc" Schemising - wonderful guy) developed a capacity
> >test that would precisely model the current workload on a machine.
> >This was used with great accuracy to measure the capacity of
> >different machine (for that workload).
>
> Is this published?  Could you post it?  The guys here and in
> "comp.arch" would LOVE to see it!

In article <44371@mips.mips.COM>, mash@mips.COM (John Mashey) writes:
> Yes, it would be good to see this.  Note the important fact that
> there are two steps:
>   a) Characterizing the workload
>   b) Predicting the performance on that workload
> Also, I've seen some pretty good benchmarks, with workloads tailored
> to different departments within a company ... unfortunately, the best
> ones I've seen were all proprietary...

Yes, step a) is critical.  And yes, Doc Schmeising's benchmark is
proprietary.  (Sorry I typo-ed his name the first time.)  I have had a
few requests for more information on this capacity benchmark, and I
don't think the following violates any of BCS' rights.

Doc Schmeising's benchmark was called QBM (Quick BenchMark) and is
owned by Boeing Computer Services (BCS).  There was talk at one time of
marketing it, but they never did.  A shame, because it is a great tool.
Doc has retired, and I'm not there any more, so I don't even know if
they are still using it.

The basic premise is to take a "slice" of your system's workload that
runs in some fixed period of time (we used 10 minutes - Quick) and dump
it into another system to see how long it takes.
If the work can't complete in 10 minutes, the target system has less
throughput than the base system.  If it can complete in 10 minutes, the
target system has equal or greater throughput.  The tricky part is to
quantify throughput, and this was QBM's real strength.

BCS collected data on a variety of resources used by jobs.  This
included things like:

  . CPU seconds burned
  . Amount of memory used
  . Duration of memory residency
  . Disk blocks transferred
  . Disk accesses

and a few others.  This data was stored on tapes and went back years.
It was collected for CDC Cybers and Crays (the two main workhorses at
BCS).

You, as the performance guru and benchmarker, would have to select a 10
minute window where your machine was "full."  Full does not mean the
system was on its knees and response time was horrible.  It means
processing a reasonable workload with acceptable response time, good
CPU usage, no thrashing, etc.  I would pick a period of a busy day --
say, a busy hour, or 2 hours, or 30 minutes, or whatever -- and feed it
to QBM.  QBM would filter out noise and select 10 minute samples from
the time period.  It would select as many as you asked for (since the
10 minute slices could start on any fraction of a second) and print out
a variety of statistics about each sample.

A GREAT DEAL OF CARE WAS TAKEN TO PICK GOOD DATA.  A lot of accounting
and performance data was reviewed, and eventually one sample was picked
and called your baseline.  That represented the "full" capacity of your
base machine.  This selection process was done once or twice a year,
only when you felt the work profile (type of work being done) had
changed enough to affect your results.  This would happen.  For
example, when users migrated from a Cyber to a Cray, eventually they
would start writing code to take better advantage of vectorization.
QBM would then take the data from that sample and produce a synthetic
workload consisting of the same number of jobs, starting and stopping
at the same times during the 10 minute window (very important) and
using the same resources.  Some assumptions were made about how the CPU
seconds were used (matrix reductions? straight-line code? etc.) and how
they were distributed during the job.  You can't use all the CPU
seconds and then do all the I/O; on the other hand, an even
distribution is not representative either.  A variety of CPU kernels
were used (the 10 to 14 I mentioned in my first posting) and given
different weightings depending on what we thought our machines were
being used for.

Now here's the real beauty of QBM: not only could it create a workload
that was the same as your real-life sample, it could create a workload
that was 1.5 times that sample.  Or 5 times, or 0.5 times.  You could
now create any multiple of that workload.

A lot of testing went into QBM to be sure that the jobs it created
would actually run in exactly 10 minutes on the base machine.  It is
much to Doc's credit that they did.  A QBM workload of 1.0 ran in 10
minutes.  If 1.1 ran in 10 minutes, then the sample you picked wasn't
really when your machine was full.  After some experience with it, we
had samples where 0.9 would complete in 10 minutes, 1.0 would also, but
1.1 would not.

We would then take these samples to a new machine and scale the load up
or down until the jobs JUST ran in 10 minutes.  In this way we could
report that "For our current workload AND configuration, the X-MP will
provide 3.85 times the throughput of the 1-S."  (This was our actual
result.)  Of course the configurations of both base and target system
had to be taken into account.  This is significant also, because we
could benchmark different configurations on the same machine.  Suppose
we had more channels?  Fewer channels and faster disks?  More disks?
All these questions could be answered by setting up test configurations
and measuring the throughput.

No papers were written about QBM (unfortunately), although I did
present benchmarking results at a couple of CUGs (Cray User Group
meetings) and those papers are available.

Hope this has helped,
Chris.

Christopher Holl                (619) 457-2700
TeleSoft
5959 Cornerstone Ct. W.
San Diego, CA 92121
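[Editorial aside: the "scale the load up or down until the jobs JUST
run in 10 minutes" step Chris describes is a classic bisection search
on the workload multiplier.  A sketch of that search logic, where
runs_in_window() stands in for actually generating and running a scaled
QBM job mix on the target machine -- everything here is a
reconstruction from the description, not BCS code:]

```c
/* Example stand-in for a real measurement: a target machine that can
 * absorb at most 3.85x the base load (the X-MP vs. 1-S result quoted
 * above) before the jobs no longer finish in the 10-minute window. */
static int example_fits(double scale) { return scale <= 3.85; }

/* Find the largest workload multiplier that still completes within
 * the window: grow geometrically until the scaled workload fails,
 * then bisect down to the break-even point.  The result is the target
 * machine's throughput relative to the base machine. */
double capacity_ratio(int (*runs_in_window)(double))
{
    double lo = 0.0, hi = 1.0;
    int i;

    /* grow until the scaled workload no longer fits (capped for safety) */
    while (hi < 1e6 && runs_in_window(hi)) {
        lo = hi;
        hi *= 2.0;
    }
    /* bisect to the largest multiplier that still fits */
    for (i = 0; i < 40; i++) {
        double mid = (lo + hi) / 2.0;
        if (runs_in_window(mid))
            lo = mid;
        else
            hi = mid;
    }
    return lo;
}
```

With the stand-in above, the search converges to 3.85: the same kind of
answer QBM reported, "the X-MP will provide 3.85 times the throughput
of the 1-S" -- except that each probe in real life cost a 10-minute
benchmark run rather than a function call.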