johnw@astroatc.UUCP (John F. Wardale) (05/13/87)
In article <28200036@ccvaxa> preece@ccvaxa.UUCP writes:
>
>grenley@nsc.nsc.com:
>> How about, instead, compiles?  They are usually CPU intense (unless you
>----------
>I don't think compilers are sufficiently comparable to make good
>benchmarks, unless you wanted to specify the compiler, too (say,

Ok, so here I am, a new developer....  I need to buy a unix box.
(I'm developing code for some flavor of unix.)  So I select about
a dozen "likely" candidates, put my sources on each, then run the
following on each:

	time "touch types.h; make unix"

or some other, similar or reasonable thing.  I compare the speeds
and costs, and buy the one that's most effective for me.

While this is the best approach for selecting a box to compile
kernels on, it has the following problems:

1) It's an expensive (time-consuming) exercise.
2) One's actual uses for a system are *LIKELY* to change in the
   future.

-----------------------------

Grenley is right!  A lot of people want/need a "system performance"
benchmark!  I wish that benchmarks like dhrystone *INCLUDED* the
time-to-compile-link-etc. in the time for dhry's per second!  This
would make (generally slow) super-optimizing compilers look less
good, while improving the lightning-fast (direct to memory -- a la
turbo-pascal) compilers that may generate slightly poorer than
average code.  What difference does it make to me if super-O's `C'
runs 15k dhry/sec if it takes 2 or 3 times longer to compile than
speedy's `C', which only gets 8k dhry/sec?  Given numbers like
this, I would REALLY want both, but then I like to use interpreters
(fast, threaded beasts, *NOT* like BASIC -- yuk!) to develop code,
and only "compile" it once.

-----------------------------

Picking a machine is ***A LOT*** more than finding one with the
highest number on the XYZ benchmark that you can afford!  Anyone
care to write an AI-ish program that collects prices, sw,
reliability, etc. etc. etc. (Ok, gobs of benchmarks too) and
several formulas for calculating a single figure of merit, and
helps in matching a formula to your expected needs?  I would expect
results like: "The following 10-30 machines would be good for you;
I rate them as follows on a 1-5 scale (maybe a 1-10 scale)."  (I
assume that compressing to a single figure would be questionable
for even one significant digit, but I also realize I'm lazy and
need help picking the machine for me (or my company, etc.).)

				John W

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Name:	John F. Wardale
UUCP:	... {seismo | harvard | ihnp4} !uwvax!astroatc!johnw
arpa:	astroatc!johnw@rsch.wisc.edu
snail:	5800 Cottage Gr. Rd. ;;; Madison WI 53716
audio:	608-221-9001 eXt 110

To err is human, to really foul up world news requires the net!
grenley@nsc.nsc.com (George Grenley) (05/13/87)
It's nice to see some discussion on this issue.  I think, based on
the response so far, that we need to fork this discussion into two
categories.  One would be "CPU Benchmarks", which us chip peddlers
would like, and the other a "System Benchmark", which is what real
users want.

In article <272@astroatc.UUCP> johnw@astroatc.UUCP (John F. Wardale) writes:
>Ok, so here I am, a new developer....  I need to buy a unix box.
>(I'm developing code for some flavor of unix.)  So I select about
>a dozen "likely" candidates; put my sources on each; then run the
>following on each:
>	time "touch types.h;make unix"
>or some other, similar or reasonable thing.  I compare the speeds,
>and costs, and buy the one that's most effective for me.
>
>While this is the best approach for selecting a box to compile
>kernels on, it has the following problems:
>
>1) It's an expensive (time consuming) exercise.

Why?  Assuming media portability, one ought to be able to run it in
a reasonable amount of time -- either don't compile ALL of Unix, or
just start it running and go on about your other duties.

>Grenley is right!  A lot of people want/need a "system
>performance" benchmark!  I wish that benchmarks like dhrystone
>*INCLUDED* the time-to-compile-link-etc. in the time for dhry's
>per second!  This would make (generally slow) super-optimizing
>compilers look less good, while improving the lightning-fast
>(direct to memory -- a la turbo-pascal) compilers that may generate
>slightly poorer than average code.

The answer, of course, depends on whether you're a programmer or a
user.  Most machines compile a program once and run it many, many
times, so efficiency of compiled code is more important than
compile time.  Unless, of course, you are the programmer...

In general, though, if machine A runs a dumb compiler twice as fast
as machine B, it will run a smart compiler pretty close to twice as
fast, too.  So, for benchmarking, we can use any compiler.
My employer is justifiably proud of its new optimizing compiler,
which gets about 20% faster code than the old one -- so the system
performance goes up 20% with the same H/W.  A 20% performance bump
for free!

But you have to ask yourself: is it `fair' to include compiler
improvements in CPU benchmarks?  Some say not; I disagree.  We are
interested in the H/W that produces the most overall performance.
After all, the whole idea behind RISC is that such machines are
easy to write optimizing compilers for.  It wouldn't be fair not to
allow them to use them.
eugene@pioneer.UUCP (05/20/87)
Gee whiz!  I go to a conference and take a little vacation, and
then there are 150 architecture articles, with lots on
benchmarking, all bad.  I've 35 to go and have yet to see a
completely scientific one.  No wonder physicists don't think of
computing "science" as a science {Sorry, John, nothing personal
against you}.  You guys should go out and get books on experiment
design: Campbell and Stanley, Cochran and Cox, or find one of Phil
Heidelberger's (IBM, IEEE) articles.  No sense in trying to knock
down what's wrong with your various horse races {Note I just made a
video tape about this for a supercomputer class at U of Idaho
yesterday}.  More shortly; we will publish a bibliography for PER.

From the Rock of Ages Home for Retired Hackers:

--eugene miya
  NASA Ames Research Center
  eugene@ames-aurora.ARPA
  "You trust the `reply' command with all those different mailers
  out there?"
  "Send mail, avoid follow-ups.  If enough, I'll summarize."
  {hplabs,hao,ihnp4,decwrl,allegra,tektronix,menlo70}!ames!aurora!eugene
reiter@endor.harvard.edu (Ehud Reiter) (05/20/87)
In article <1589@ames.UUCP> eugene@pioneer.arpa (Eugene Miya N.) writes:
>150 architecture articles with lots on benchmarking all bad ...
>You guys should go out and get books on
>experiment design: Campbell and Stanley, Cochran and Cox, or find
>one of Phil Heidelberger's (IBM, IEEE) articles.

I've seen Eugene Miya say many times that good scientific
experimental design should be used for benchmarking, but I'm still
a bit puzzled as to how this should be done.

Ideally, if we had good data on exactly what programs were
typically run by each class of user, then we could measure the
performance of a machine on a carefully chosen set of "benchmark"
programs, and then use the above data to extrapolate the machine's
performance for each user class.  I assume this is what Eugene
means by a good experiment.  However, I've never seen good data on
what programs typical users run, and without this data we cannot
perform the above "experiment".

Perhaps this just means we should try to gather good data on what
programs users run -- I think this is a great idea, as long as
someone else does it!

					Ehud Reiter
					reiter@harvard  (ARPA,BITNET,UUCP)
					reiter@harvard.harvard.EDU  (new ARPA)
eugene@pioneer.arpa (Eugene Miya N.) (05/21/87)
In article <6024@steinmetz.steinmetz.UUCP> William E. Davidsen Jr writes:
>
>After doing benchmarks for about 15 years now, I will assure everyone
>that the hard part is not getting reproducible results, but in (a)
>deciding how these relate to the problem you want to solve, and (b) getting
>people to believe that there is no "one number" which can be used to
>characterize performance.  If pressed I use the reciprocal of the total
>real time to run the suite.  It's as good as any other voodoo number...

Yes, I agree, and I have not had to do it that long.  Let's take a
moment to study ways to relate or characterize end-users'
applications: 1) without gross generalizations, but with real
quantitative data, and 2) using common ideas and tools.  Okay?
Static as well as dynamic tools.  What can we tell independent of
machines and languages?

Second: there are lots of disciplines which use and abuse single
figures of merit and get away with them.  Consider: earlier in the
season (end of ski season, really), the base of NS was a sea of
mud, while 2/3 of the way up the mountain, in a sheltered area, the
snow gauge read 5.5 feet.  You think we have problems with
measurement?  Is an average (the depth function integrated over the
whole area of the ski resort, divided by that area) a reasonable
way to characterize resort coverage?  Do we buy cars on single
figures of merit?  If not, then how many?  Consider cardiology:
heart function.  Single figures are used -- heart rates -- but EKGs
are much better; they portray more.  A picture worth a thousand
words?  Try embedding one on the net with any good resolution.
Yes, we can get away with it, but we have to take others with us.

I better stop before Alan Smith totally loses respect (probably has
already).

From the Rock of Ages Home for Retired Hackers:

--eugene miya
  NASA Ames Research Center
  eugene@ames-aurora.ARPA
  "You trust the `reply' command with all those different mailers
  out there?"
  "Send mail, avoid follow-ups.  If enough, I'll summarize."
{hplabs,hao,ihnp4,decwrl,allegra,tektronix,menlo70}!ames!aurora!eugene
nerd@percival.UUCP (Michael Galassi) (05/25/87)
In article <415@winchester.UUCP> mash@winchester.UUCP (John Mashey) writes:
>As larry says, real page-thrashers are highly dependent on a lot of attributes.
>That doesn't mean they're bad tests, merely that they're extremely hard
>to do in a controlled way.  In particular, you often see radically different
>results according to buffer cache sizes, for example.
>
>-john mashey

DISCLAIMER: <generic disclaimer, I speak for me only, etc>

I've not seen this stated around here, so I'll do it.  Benchmarks
can be divided into two major categories: those which exercise the
processor (CPU, FPU, MMU, etc...) and those which exercise the
WHOLE computer (i.e. the i/o system too).  For the person who is
evaluating a CPU family for a new design I can see where the first
class of benchmarks comes in VERY handy, but for the rest of us
(those who want to buy a computer, install UNIX, and generate
accounts) the MIPS, FLOPS, *stones, etc. that the cpu will do are
rarely of much interest.  I care much more about how the system
will handle a dozen users all doing real tasks (vi, cc, f77, rn,
rogue, or whatever) than I do about the time it takes the cpu to
find the first X primes when it is not installed in its cardcage
where god wanted it to be.

I guess I don't care much about the "a lot of attributes"
individually, but rather how they all work together.  Give me
anything that overall performs well (so long as there is no intel
cpu in it) and I'll be pleased as pie.

-michael
-- 
If my employer knew my opinions he would probably look for another engineer.

	Michael Galassi, Frye Electronics, Tigard, OR
	..!{decvax,ucbvax,ihnp4,seismo}!tektronix!reed!percival!nerd
mash@mips.UUCP (05/26/87)
In article <642@percival.UUCP> nerd@percival.UUCP (Michael Galassi) writes:
>In article <415@winchester.UUCP> mash@winchester.UUCP (John Mashey) writes:
>>That doesn't mean they're bad tests, merely that they're extremely hard
>>to do in a controlled way.  In particular, you often see radically different
>>results according to buffer cache sizes, for example.
>Benchmarks can be divided into two major categories:
>Those which exercise the processor (CPU FPU MMU etc...) and those which
>exercise the WHOLE computer (i.e. i/o system too).  For the person who
>is evaluating a CPU family for a new design I can see where the first
>class of benchmarks comes in VERY handy, but for the rest of us (those who
>want to buy a computer, install UNIX, and generate accounts) the MIPS,
>FLOPS, *stones, etc that the cpu will do are rarely of much interest.
>I care much more about how the system will handle a dozen users
>all doing real tasks (vi, cc, f77, rn, rogue, or whatever) than I do...
>I guess I don't care much about the "a lot of attributes" individually,
>but rather how they all work together.  Give me anything that overall
>performs well (so long as there is no intel cpu in it) and I'll be
>pleased as pie.

1) There ARE people who mostly care about computational benchmarks;
some of the CAD folks are perfect examples, as are those who run
troff, etc.  But that's not the point.

2) I think most people in this newsgroup understand that system
benchmarks are important.  I'll try one more time: THEY'RE JUST
HARD TO DO.  That doesn't stop people from doing them, which makes
especially good sense if they have some job streams that really
represent their loads.  We do these sorts of benchmarks all the
time; I've been doing UNIX system-type benchmarks of one ilk or
another for a lot of years.  The trouble is, it's going to be hard
enough to agree on some compute-bound benchmarks, without the
hassle of trying to normalize all the rest of the stuff.
For example, do you normalize on system cost?  Do you normalize
memory sizes?  Do you normalize on disk number and type?  All that
we're saying is that system benchmarks are painfully hard to make
representative; there are many pitfalls and benchmarking
weirdnesses to look out for; "overall performs well" is a REAL hard
metric, for example.

Note: I don't yet see a strong sense of agreement on a set of CPU
benchmarks that we believe.  From past experience, getting a set of
system benchmarks that people agree on will be much harder.
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{decvax,ucbvax,ihnp4}!decwrl!mips!mash, DDD:  	408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
eugene@pioneer.arpa (Eugene Miya N.) (05/26/87)
I am completing another iteration of one of my prototype test
programs.  The test is written in FORTRAN, but could easily be
written in C or (yes) Pascal [in fact I would encourage writing a
Pascal version].  It is a preliminary program to be run prior to a
CPU/memory test (no floating point; that's third).

But back to this first test.  It's very simple, but also appears a
little dumb.  It's designed to test for one type of optimization as
well as to test the quality of a system clock.  Just give me a week
or so.  If there is enough interest I will post it; otherwise I
will take selected interested parties.  John can be certain he will
get a copy.  Remember, it's simple, obvious, and appears somewhat
stupid.  But it tests clock quality, which is certainly important
to any subsequent tests.

From the Rock of Ages Home for Retired Hackers:

--eugene miya
  NASA Ames Research Center
  eugene@ames-aurora.ARPA
  "You trust the `reply' command with all those different mailers
  out there?"
  "Send mail, avoid follow-ups.  If enough, I'll summarize."
  {hplabs,hao,ihnp4,decwrl,allegra,tektronix,menlo70}!ames!aurora!eugene
eugene@pioneer.UUCP (05/27/87)
In article <3490003@wdl1.UUCP> bobw@wdl1.UUCP (Robert Lee Wilson Jr.) writes:
>
>I've never been quite sure what that accomplishes.  To put it another
>way, what is the benchmark supposed to be measuring: SYSTEM
>performance, or HARDWARE performance?
>
>-----------------------------------------------------------------
>I disclaim almost everything, probably including this line.

Let me ask: HOW DO YOU SEPARATE THEM?  [I think it's possible.]
People talk about CPU and memory performance benchmarks: how do you
separate these?  Can you tell me when something is hardware bound
or software bound?  What does it mean when you say "system"?  Is
the WHOLE [another poster's term] of a system equal to the sum of
its parts?  [Take optimizers into account.]  Or do we have to say:
nope, we can't separate them, there is a Gestalt working here, and
we have to assume the application and the machine are ATOMIC
(indivisible), for to divide the problem into parts would destroy
the character of the problem (benchmarking the machine)?

For those people only concerned about running their applications:
while you have valid concerns [i.e., getting the job done], there
are a few people who seek progress.  They seek to understand where
their problems run, and to look to the future to improve their
performance rather than treat their work solely like a black box.
Give these architects, engineers, and scientists some credit some
time, for they are the ones who look to the future (to
improvements).  Sure, computers are a tool, but you have to hone
your tools.  Thank God for Seymour Cray.

--eugene
reiter@endor.harvard.edu (Ehud Reiter) (05/27/87)
I think some people are perhaps missing the point.  Of course, we
would all like system benchmarks which accurately predict the
performance of our workloads.  But such benchmarks are usually
impossible, because performance varies quite a bit depending on
workload, and most users just don't have a very good idea of what
their workload is and will evolve into.  Even when the workload is
accurately known, this kind of benchmarking is expensive and
time-consuming.

The point is, there is a great demand out there for simple,
single-figure performance numbers which are in the public domain.
No matter how much we complain that single figures are meaningless,
people out there in the real world are going to continue using
them.  There's a reason why MIPS and Dhrystones are so often
quoted.

And we can do better than Dhrystone!  We all know what the problems
with Dhrystone are -- can't be globally optimized, too much string
handling, too small, etc.  We can certainly write a benchmark
which, although still "bad", will be much better than Dhrystone.

I think we can even get away with replacing single-number
benchmarks by two-number benchmarks, which would give a high and a
low performance figure instead of just a single performance figure.
(That is, the benchmark would consist of lots of programs.  The
performance numbers would be normalized against some standard (good
old 4.2BSD VAX-11/780?), and the summary statistics would be the
highest and lowest of the normalized numbers.)

In summary, we can't write a perfect benchmark, but we can write a
better benchmark.

					Ehud Reiter
					reiter@harvard  (ARPA,BITNET,UUCP)
					reiter@harvard.harvard.EDU  (new ARPA)
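Reiter's two-number summary is simple enough to pin down in code.  A
sketch, assuming per-program times in seconds for the reference
machine (the 4.2BSD VAX-11/780 he suggests) and for the system under
test; the function name and calling convention are mine:

```c
/* Two-number benchmark summary: normalize each program's run time
 * against a reference machine and report only the best and worst
 * normalized ratios.  A ratio > 1.0 means faster than the reference.
 * Assumes n >= 1 and all times positive. */
void high_low(const double *ref_secs, const double *sys_secs, int n,
              double *lo, double *hi)
{
    int i;
    *lo = *hi = ref_secs[0] / sys_secs[0];
    for (i = 1; i < n; i++) {
        double r = ref_secs[i] / sys_secs[i];
        if (r < *lo) *lo = r;
        if (r > *hi) *hi = r;
    }
}
```

A wide [lo, hi] spread is itself informative: it says the machine's
relative speed depends strongly on which program you run, which is
exactly what a single number hides.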
ps@celerity.UUCP (Pat Shanahan) (05/29/87)
In article <2100@husc6.UUCP> reiter@endor.UUCP (Ehud Reiter) writes:
>...
>The point is, there is a great demand out there for simple, single figure
>performance numbers which are in the public domain.  No matter how much we
>complain that single figures are meaningless, people out there in the real
>world are going to continue using them.  There's a reason why MIPS and
>Dhrystones are so often quoted.

This is very unfortunate, if true.  People who believe simple,
single-figure performance numbers are doomed to be surprised by
reality.

>And, we can do better than Dhrystone!  We all know what the problems with
>Dhrystone are - can't be globally optimized, too much string handling,
>too small, etc.  We can certainly write a benchmark which, although still
>"bad", will be much better than Dhrystone.

I agree.  I don't know of any real C program that does as much
structure assignment as the C Dhrystone.  I think that C
performance is important enough to justify a benchmark that
reflects how the language is actually used.

>I think we can even get away with replacing single-number benchmarks by
>two number benchmarks, which would give a high and low performance figure
>instead of just a single performance figure (that is, the benchmark would
>consist of lots of programs.  The performance numbers would be normalized
>against some standard (good old 4.2BSD VAX-11/780?), and the summary
>statistics would be the highest and lowest of the normalized numbers).

I think a better approach would be the one taken in the Livermore
loops benchmark.  The report includes the performance for the
individual loops, as well as summary information such as the
harmonic mean.  I am not sure if high and low would really help
much, except in convincing people that single numbers are
meaningless.  The extreme outliers can be due to architectural
choices that are good for most programs but bad for certain
exceptional programs.
For example, pipelining may be good for real programs but bad for
an artificial test of jump performance.  If you are going to report
high and low, it is very important to make all the benchmark
programs reasonably mixed.  If you are going to report individual
results, this is less critical.

>In summary, we can't write a perfect benchmark, but we can write a better
>benchmark.
>
>	Ehud Reiter
>	reiter@harvard (ARPA,BITNET,UUCP)
>	reiter@harvard.harvard.EDU (new ARPA)

It should certainly be possible to write a better benchmark of C
performance than the Dhrystone.
-- 
	ps
	(Pat Shanahan)
	uucp : {decvax!ucbvax || ihnp4 || philabs}!sdcsvax!celerity!ps
	arpa : sdcsvax!celerity!ps@nosc
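The Livermore-style summary Pat prefers reports every loop's rate
plus a harmonic mean.  A sketch of the harmonic mean of per-loop
rates (the function name is mine); note how it differs from an
arithmetic mean in being dominated by the slowest loops:

```c
#include <assert.h>

/* Harmonic mean of per-loop rates (e.g. MFLOPS), as reported in
 * Livermore loops summaries.  Because it averages reciprocals, one
 * slow outlier drags the figure down far more than it would drag an
 * arithmetic mean -- which mirrors total elapsed time over the suite. */
double harmonic_mean(const double *rates, int n)
{
    double recip_sum = 0.0;
    int i;
    for (i = 0; i < n; i++) {
        assert(rates[i] > 0.0);     /* a zero rate makes no sense */
        recip_sum += 1.0 / rates[i];
    }
    return n / recip_sum;
}
```

For rates of 1 and 4, the arithmetic mean is 2.5 but the harmonic
mean is only 1.6: the slow loop dominates, just as it dominates the
wall-clock time of the whole run.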
mdr@reed.UUCP (Mike Rutenberg) (10/04/88)
In article <6729@nsc.nsc.com> grenley@nsc.UUCP (George Grenley) writes:
>some official number, rather than measured data.  Also, there are frequently
>variations in supposedly standard code.  I have different versions of Dhry1.1
>which vary over 40%, even though they are supposedly the same code.

But it is so hard to make it run and yet be "the same code."  The
problem is that to get good results with a given benchmark on a
given system, you often do have to tweak things to get comparable
numbers, often holding your breath that it all works out.

In compiling the dhrystone benchmark, a C compiler I use will
remove the loop-overhead calculation, since it is simply a loop
that has no body and a trivial side effect on the iteration
variable.  Even procedure calls are eliminated by the compiler if
it is clear the called procedure does not do anything.

@BEGIN(Black Magic)
You can do things to trick the compiler into keeping the loop.  A
null procedure the loop calls will do the trick if compiled
separately.  But then you have to also put a call to this null
procedure in the main dhrystone loop, and that may do bad things to
your numbers, especially if it affects your cache hit rate.
@END(Black Magic)

I wish benchmarks would be rewritten to be ultimately portable and
really smart about outwitting too-smart compilers.  It would be
nice to be able to run a benchmark program totally unchanged.  This
would avoid the temptation or need to modify the tests.

Mike
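Mike's empty-loop problem can be made concrete.  The sketch below is
not his separately-compiled null-procedure trick but an alternative
with the same intent: give the loop a side effect (here a volatile
store) that the compiler must not assume away:

```c
/* A body-less timing loop like
 *     for (i = 0; i < n; i++) ;
 * is exactly what an optimizer may delete outright.  One way to keep
 * it alive (besides the separately-compiled null procedure Mike
 * describes) is to have each iteration touch a volatile object, whose
 * accesses the compiler must preserve. */
volatile long sink;

long loop_overhead(long n)
{
    long i;
    for (i = 0; i < n; i++)
        sink = i;       /* volatile store: loop cannot be eliminated */
    return n;           /* iteration count, for the caller's records */
}
```

The cost is that you are now timing n volatile stores rather than a
truly empty loop, so the "overhead" figure is itself perturbed --
which is Mike's point about the trick doing bad things to the
numbers.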
mash@mips.COM (John Mashey) (10/07/88)
In article <10498@reed.UUCP> mdr@reed.UUCP (Mike Rutenberg) writes:
...
>But it is so hard to make it run and yet be "the same code."
>The problem is that to get good results with a given benchmark on a
>given system, you often do have to tweak things to get comparable
>numbers, often holding your breath that it all works out.
>In compiling the dhrystone benchmark, a C compiler I use will remove
>the loop overhead calculation since it is simply a loop that has
>no body and a trivial side effect on the iteration variable.  Even
>procedure calls are eliminated by the compiler if it is clear the
>called procedure does not do anything.
>@BEGIN(Black Magic)
>You can do things to trick the compiler into keeping the loop.  A null
>procedure the loop calls will do the trick if compiled separately.  But
>then you have to also put a call to this null procedure in the main
>dhrystone loop.  But this may do bad things to your numbers, especially
>if it affects your cache hit-rate.
>@END(Black Magic)
>I wish benchmarks would be rewritten to be ultimately portable and
>really smart about outwitting too-smart compilers.  It would be
>nice to be able to run a benchmark program totally unchanged.  This
>would avoid the temptation or need to modify the tests.

(back from 3 weeks' Down Under; it will take a while to catch up!)

ONE MORE TIME:
	use large, real programs as benchmarks.
	do NOT use small programs as benchmarks.
	be especially careful of small synthetic benchmarks.

Two of the most counterproductive things people can be doing are:
	a) Tuning compilers to optimize small benchmarks, especially
	   with optimizations that don't really matter much on real
	   programs.  (Optimizations that actually matter elsewhere
	   are fine.)
	b) Continually reworking synthetic benchmarks to stay ahead
	   of advances in compiler optimization.

It is sad how much effort across this business has gone down the
rat-holes.
-- -john mashey DISCLAIMER: <generic disclaimer, I speak for me only, etc> UUCP: {ames,decwrl,prls,pyramid}!mips!mash OR mash@mips.com DDD: 408-991-0253 or 408-720-1700, x253 USPS: MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
grenley@nsc.nsc.com (George Grenley) (10/08/88)
The following discussion started with a posting of mine about
organizing some head-to-head benchmark comparisons.  I wanted to
give all interested parties a chance to look at one another's
hardware.  The primary reason is that it is difficult if not
impossible to reproduce most vendors' benchmark numbers -- and I
specifically include my employer, NSC, in this category.  We
publish 16600 for Dhry1.1 at 30 MHz, no wait state, but no '532 has
ever run exactly that number (it came from the simulator).  One
reason is simply that most companies won't spend the money to go
out and get other companies' hardware to test.

In article <4655@winchester.mips.COM> mash@winchester.UUCP (John Mashey) writes:
>In article <10498@reed.UUCP> mdr@reed.UUCP (Mike Rutenberg) writes:
...
>>But it is so hard to make it run and yet be "the same code."

(deleted, reference to why Dhry is susceptible to over-optimization...)

>>@BEGIN(Black Magic)
>>You can do things to trick the compiler into keeping the loop.  A null
>>procedure the loop calls will do the trick if compiled separately.  But
>>then you have to also put a call to this null procedure in the main
>>dhrystone loop.  But this may do bad things to your numbers, especially
>>if it affects your cache hit-rate.
>>@END(Black Magic)
>
>>I wish benchmarks would be rewritten to be ultimately portable and
>>really smart about outwitting too-smart compilers.  It would be
>>nice to be able to run a benchmark program totally unchanged.  This
>>would avoid the temptation or need to modify the tests.

AGREED!  SO LET'S DO IT!  Time for Dhry 3.0, or whatever.  It seems
to me the easiest way to tackle the loop-that-does-nothing problem
is to have it do something, preferably process a variable that is
supplied at run time, so the compiler cannot know what it is going
to do.  But in any case, some new CPU benchmarks need to be
developed.
Perhaps we can all agree that an existing one is suitable, or
perhaps we need to create a new one.

>(back from 3 weeks' Down Under; it will take a while to catch up!)
>
>ONE MORE TIME:
>	use large, real programs as benchmarks.
>	do NOT use small programs as benchmarks.
>	be especially careful of small synthetic benchmarks.
>
>Two of the most counterproductive things people can be doing are:
>	a) Tuning compilers to optimize small benchmarks, especially
>	   with optimizations that don't really matter much on real
>	   programs.  (Optimizations that actually matter elsewhere are fine.)
>	b) Continually reworking synthetic benchmarks to stay ahead
>	   of advances in compiler optimization.
>It is sad how much effort across this business has gone down the rat-holes.

Agreed on all counts, especially regarding wasted effort (quadbyte
string compare on a certain new processor -- can you say "dhrystone
in microcode"?).

The only drawback to John's generally correct suggestion is the
lack of any standards for larger integer benchmark programs.  Whet
and Linpack seem pretty well established as FP b'marks, although I
wonder whether they're not a bit cooked sometimes...  (I heard once
that a Fortran compiler was released which SPECIFICALLY checked the
source to see if it was Whet, and if it was, stuck in a VERY fast
routine.)

Someone on this net suggested using GNU's public domain versions of
various Unix utilities (grep, nroff, etc).  Sounds like a good plan
to me -- it doesn't matter if they're unix compatible, just so they
compile and run.

Perhaps the first step is to convene a working group to standardize
this stuff and promote its use.  I volunteer.  Any others?

George Grenley
NSC
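George's fix for the loop-that-does-nothing -- make the loop process
a value the compiler cannot see at compile time -- might look like
the following sketch.  The particular arithmetic is invented; any
cheap, seed-dependent update would serve the same purpose:

```c
/* Fold a run-time value (say, from the command line) into the
 * benchmark loop and have the caller print the result.  The compiler
 * can then neither precompute the answer nor discard the loop as dead
 * code, since the final value depends on the seed and is used. */
unsigned long checksum_loop(unsigned long seed, long iters)
{
    unsigned long x = seed;
    long i;
    for (i = 0; i < iters; i++)
        x = x * 69069UL + 1UL;  /* cheap LCG step; depends on seed */
    return x;                   /* caller prints this as a checksum */
}
```

A driver would read the seed and iteration count from argv, time the
call, and print the returned checksum; printing it is what keeps the
whole computation observably live.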
bcase@cup.portal.com (10/09/88)
George Grenley says:
>Agreed on all counts, especially regarding wasted effort (quadbyte string
>compare on a certain new processor -- can you say "dhrystone in microcode"?)

I don't think you really want to start this line of argument; I
mean, if we are talking about wasted effort, there are many who
would agree that pursuing aggressive implementations of certain
processor architectures is plenty wasteful (please believe me, I
don't necessarily mean the 32000).  Besides, that quadbyte string
compare isn't dhrystone in microcode -- there ain't no microcode!
:-) :-) :-)  (By traditional definitions, anyway.)  If SPARC or
MIPS or somebody takes over the market completely, then what effort
was wasted and what wasn't?

Let's all have a nice day.
henry@utzoo.uucp (Henry Spencer) (10/09/88)
In article <6868@nsc.nsc.com> grenley@nsc.nsc.com.UUCP (George Grenley) writes:
>Someone on this net suggested using GNU's public domain versions of various
>Unix utilities (grep, nroff, etc).  Sounds like a good plan to me...

Note three complications:

	1. The GNU stuff is ***NOT*** public domain.
	2. It tends to be huge.
	3. It really knows it's on a 32-bit machine.

These are not necessarily fatal problems, although #3 may be a
problem for wider-word machines, but they are worth concern.
-- 
The meek can have the Earth;    |    Henry Spencer at U of Toronto Zoology
the rest of us have other plans.|uunet!attcan!utzoo!henry henry@zoo.toronto.edu
rik@june.cs.washington.edu (Rik Littlefield) (10/09/88)
Many postings in this stream seem to assume that "large, real"
programs are somehow the most fair to use for benchmarking.  That's
not necessarily true.  Any program that has had all or most of its
development on a single system has undoubtedly been tuned for best
performance ON THAT SYSTEM.  Look at the series of postings on
"Duff's device" (an unrolled loop) -- systems without instruction
caches (or with large ones :-) tend to produce programs that use
Duff's device, while those with small caches encourage using tight
loops instead.  If somebody's compiler doesn't do induction on
array index expressions, they tend to write critical loops using
pointers.  Etc., etc.  I'd guess that an awful lot of Unix programs
have been tuned to whatever it is that pcc does or doesn't do.  The
point is, large real programs tend to have long histories that bias
them in favor of old compiler technology and architectures.

Another problem with large real programs is that it's often very
difficult to tell what the benchmark results mean.  Does nroff run
fast on system Q because Q does stream I/O especially well, or
because Q is really good at optimizing some 10-line inner loop that
shoves around characters?  If I can't read the code or tell where
it's spending its time, how can I possibly relate a benchmark
result to some different program or application?  Personally, I get
a lot more insight out of a few hundred lines of good test cases
that I can understand in detail.

Now, I'm all in favor of benchmarking large real programs,
particularly the ones that *I* like to run.  They also make a very
nice sanity check to guard against silly benchmark deficiencies
like do-nothing loops and results that can be determined at compile
time.  But if cost constraints make me pick one or the other, I'll
take the suite of synthetic tests any day.

--Rik
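For readers who missed the postings Rik mentions, the classic Duff's
device is an eight-way unrolled copy loop that jumps into the middle
of its own body through a switch to dispose of the leftover count:

```c
/* Duff's device: the switch computes count % 8 and enters the
 * unrolled do-while at the matching case label, so the remainder
 * bytes are copied on the first (partial) pass and every later pass
 * copies a full eight.  Classic form; assumes count > 0. */
void duff_copy(char *to, const char *from, int count)
{
    int n = (count + 7) / 8;
    switch (count % 8) {
    case 0: do { *to++ = *from++;
    case 7:      *to++ = *from++;
    case 6:      *to++ = *from++;
    case 5:      *to++ = *from++;
    case 4:      *to++ = *from++;
    case 3:      *to++ = *from++;
    case 2:      *to++ = *from++;
    case 1:      *to++ = *from++;
            } while (--n > 0);
    }
}
```

The tight-loop alternative Rik contrasts it with is simply
`while (count--) *to++ = *from++;` -- fewer instruction bytes, which
is exactly why small instruction caches favor it.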
pardo@june.cs.washington.edu (David Keppel) (10/10/88)
rik@june.cs.washington.edu (Rik Littlefield) writes: >[ large "real" program benchmarks vs. synthetic benchmarks ] Oh, gee, an opportunity to apply the scientific method :-) (a) Benchmark a bunch of computer systems (hardware/os/compiler) using synthetic benchmarks. (b) Compare the benchmark performance to observations in the "real" world. (c) Learn something about benchmarks, refine your synthetic benchmarks. (d) go to (a) (Oh no, not a GOTO!) People like Kahan bitched to other computer scientists about floating point inconsistency. Now people formally study floating point numbers as a subject. Performance modelling is a formal area, but I don't know anybody studying "benchmarks" as a formal subject. When people do, benchmarks may get much better. ;-D on ( We're in the dark about benchmurking ) Pardo -- pardo@cs.washington.edu {rutgers,cornell,ucsd,ubc-cs,tektronix}!uw-beaver!june!pardo
mash@mips.COM (John Mashey) (10/10/88)
In article <1988Oct9.011633.13259@utzoo.uucp> henry@utzoo.uucp (Henry Spencer) writes: >Note three complications: > 1. The GNU stuff is ***NOT*** public domain. > 2. It tends to be huge. Reasons why people have liked these as potential benchmarks are: 1. Although not public domain, "generally available" is much better than "proprietary", which is unfortunately true for many otherwise desirable benchmarks. 2. Tends to be huge. GOOD! Most of the common integer benchmarks are toys or near-toys, unlike the floating-point ones. -- -john mashey DISCLAIMER: <generic disclaimer, I speak for me only, etc> UUCP: {ames,decwrl,prls,pyramid}!mips!mash OR mash@mips.com DDD: 408-991-0253 or 408-720-1700, x253 USPS: MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
grenley@nsc.nsc.com (George Grenley) (10/10/88)
In article <4853@winchester.mips.COM> mash@winchester.UUCP (John Mashey) writes: >In article <1988Oct9.011633.13259@utzoo.uucp> henry@utzoo.uucp (Henry Spencer) writes: > >>Note three complications: > >> 1. The GNU stuff is ***NOT*** public domain. Let me apologize to the nice folks (creatures?) at GNU. I meant that GNU source was available & standard, but I in no way meant to imply it was in the public domain. Thanx to Henry at UT for pointing this out... >> 2. It tends to be huge. > >Reasons why people have liked these as potential benchmarks are: > 1. Although not public domain, "generally available" is much better > than "proprietary", which is unfortunately true for many otherwise > desirable benchmarks. > 2. Tends to be huge. GOOD! Most of the common integer benchmarks are > toys or near-toys, unlike the floating-point ones. >-- >-john mashey DISCLAIMER: <generic disclaimer, I speak for me only, etc> I want to clarify my earlier posting about b'marks - I am primarily interested in CPU benchmarks, not system b'marks. This is partly because my employer makes chips, not systems (we tried that once, and we still get mail about it), and partly because, as a hardware engineer, I am interested in providing the kind of info that will assist other h/w engineers in making a CPU selection on the merits, not based on whimsy/guesswork etc. Unfortunately, this augurs against large, OS dependent benchmark programs. I imagine that trying to make grep run "standalone" is not trivial.... I've received a lot of email on b'marking; one individual pointed out that the database community "scales" the size of the b'mark (i.e., size of dbase) to the size of machine. An interesting idea. I think we should consider taking some of the small integer b'marks, and "enlarge" them by having the program call itself recursively in a non-trivial way. Then, the test would consist of running the program at, say, 1 through 1000 levels of recursion, or whenever you run out of RAM. 
Then, publish the performance numbers. Comments? I am willing to volunteer to drive this if anyone (like, f'rinstance, someone who can code better than me) wants to help. Maybe we'll even get it O-f'shally blessed by IEEEEEEEE. Regards, George Grenley NSC ps to Henry Spencer - is UT still using any series 32000 iron? If so, drop me a note - I may have some news of interest to you.
mash@mips.COM (John Mashey) (10/12/88)
In article <6001@june.cs.washington.edu> rik@june.cs.washington.edu (Rik Littlefield) writes: >Many postings in this stream seem to assume that "large, real" programs are >somehow the most fair to use for benchmarking. That's not necessarily true. As we've said numerous times, the best benchmark for anybody is for them to run their own real applications, because such applications obviously have the highest correlation with what they'll see in real use. When I keep saying "use large, real programs", it's because I usually have in front of me numerous statistics about the behavior of programs that show that most of the toy benchmarks aren't very good predictors of the real applications, especially when applied to the higher-performance designs. Why is this? a) Toys don't stress cache designs, so that small caches and large ones act about the same, which is simply untrue for many real programs. (in this case, "cache" includes any place in the memory hierarchy, including registers, stack caches, register windows, 1-to-n-level of memory caches, disk caches in main memory, etc.) b) Toys don't stress limits. For example, consider the performance differences attributable to the different X86 memory models. c) Toys don't stress software. Anybody can compile Dhrystone or Whetstone, and many can optimize them. Compiling/optimizing Spice tells you a lot more. >Any program that has had all or most of its development on a single system >has undoubtedly been tuned for best performance ON THAT SYSTEM. Look at the >series of postings on "Duff's device" (an unrolled loop) -- systems without >instruction caches (or with large ones :-) tend to produce programs that use >Duff's device, those with small caches encourage using tight loops instead. >If somebody's compiler doesn't do induction on array index expressions, they >tend to write critical loops using pointers. Etc, etc. I'd guess that an >awful lot of Unix programs have been tuned to whatever it is that pcc does >or doesn't do. 
The point is, large real programs tend to have long >histories that bias them in favor of old compiler technology and >architectures. Most application software doesn't worry about this kind of thing very much: the 3rd-party folks worry most about making things work across lots of machines. > >Another problem with large real programs is that it's often very difficult >to tell what the benchmark results mean. Does nroff run fast on system Q >because Q does stream I/O especially well, or because Q is really good at >optimizing some 10-line inner loop that shoves around characters? If I >can't read the code or tell where it's spending its time, how can I possibly >relate a benchmark result to some different program or application? >Personally, I get a lot more insight out of a few hundred lines of good test >cases that I can understand in detail. This is certainly true, although good measurement tools help you figure out where the time is going. Of course, if you have small benchmarks that give you good correlation with what you actually use, then you're OK, and there's nothing wrong with using them, i.e., by definition, you're using something correlated with your real applications. One of the points we've tried to make is that one must be very careful when using simple benchmarks to predict the performance across wider ranges of architecture and software. For example, simple benchmarks used to analyze PC-class machines don't necessarily work very well for larger ones. (For PC-class machines, you can probably get a first-order prediction by knowing clock-rate, CPU type, and memory-latency). > >Now, I'm all in favor of benchmarking large real programs, particularly the >ones that *I* like to run. They also make a very nice sanity check to guard >against silly benchmark deficiencies like do-nothing loops and results that >can be determined at compile time. But if cost constraints make me pick one >or the other, I'll take the suite of synthetic tests any day. 
It is, of course, a goal for many people in this to create small synthetic benchmarks that accurately predict the behavior on large real applications, and this is a very desirable goal. It's merely hard! -- -john mashey DISCLAIMER: <generic disclaimer, I speak for me only, etc> UUCP: {ames,decwrl,prls,pyramid}!mips!mash OR mash@mips.com DDD: 408-991-0253 or 408-720-1700, x253 USPS: MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
lindsay@k.gp.cs.cmu.edu (Donald Lindsay) (10/12/88)
In article <6899@nsc.nsc.com> grenley@nsc.nsc.com.UUCP (George Grenley) writes: >I've received a lot of email on b'marking; one individual pointed out that >the database community "scales" the size of the b'mark (i.e., size of dbase) >to the size of machine. An interesting idea. I think we should consider >taking some of the small integer b'marks, and "enlarge"them by having the >program call itself recursively in a non-trivial way. Then, the test would >consist of running the program at, say, 1 through 1000 levels of recursion, >or whenver you run out of RAM. Then, publish the performance numbers. >Comments? I am willing to volunteer to drive this if anyone (like, f'rinstance, >someone who can code better than me) wants to help. First, I am solidly behind the idea that the best benchmark is the user's application. That said, synthetic benchmarks might as well be as good as they can be. So, some guidelines: - the code working set must be adjustable, without upper bound. - the data working set, likewise. - the compiler must be prevented from inlining. - the compiler must be prevented from eliminating dead code. - the benchmark must be small, so that it can be presented in full in reports. (This avoids the "slight change" problem, as well as permitting easy shipment.) There is a fairly simple way to achieve these ends. Do not write a benchmark program: write a program which writes out the benchmark program. A simple loop in the Generator program allows the creation of arbitrarily large source files. (Since compilers can get bent by this, the Generator should also generate multiple source files.) The procedure names will be somewhat unimaginative: f0001, f0002, and so on. If the source files are in C, then it's fair to generate macros and macro calls, simply to reduce the file space requirements. Next, the Generator should write the code to fill an array with pointers to these functions. Similarly, we need a data array. 
Next, we need a portable routine which generates pseudo-random numbers. (Portable mostly means that it avoids arithmetic overflow.) The quality of the randomness is unimportant, as long as it doesn't get stuck at 0 or other such silliness. The generated program will use the randoms to form subscripts, either into the data array, or into the function pointer array. In this way, we may control the size of the working sets. Since the functions should (largely) be accessed via the array, inlining is defeated. Avoid dead code. I have no comment concerning the contents of the routines: the Generator is independent of this, and should be able to generate several benchmarks (for instance, an integer one, and a float one). Since the benchmarks must be told how "big" to be, the benchmark report form should be written as part of the benchmark. This must specify how many runs must be made, and with exactly what parameters. -- Don lindsay@k.gp.cs.cmu.edu CMU Computer Science
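One concrete candidate for the portable pseudo-random routine the post asks for, offered as an assumption rather than the poster's own choice: Park and Miller's minimal-standard generator computed with Schrage's factorization, so no intermediate product overflows a 32-bit signed long and it never gets stuck at 0.

```c
/* Park-Miller generator: seed' = 16807 * seed mod (2^31 - 1),
 * computed via Schrage's trick (m = a*q + r, q = 127773, r = 2836)
 * so every intermediate fits in 32 signed bits. */
static long seed = 1;

long prng(void)
{
    seed = 16807 * (seed % 127773) - 2836 * (seed / 127773);
    if (seed <= 0)
        seed += 2147483647;     /* wrap back into 1 .. 2^31 - 2 */
    return seed;
}
```

The generated program would then form its subscripts as `data[prng() % ndata]` and `ftab[prng() % nfuncs]`, which is how the working-set sizes get controlled.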
tim@crackle.amd.com (Tim Olson) (10/13/88)
In article <3285@pt.cs.cmu.edu> lindsay@k.gp.cs.cmu.edu (Donald Lindsay) writes: | Next, the Generator should write the code to fill an array with pointers | to these functions. Similarly, we need a data array. | | Next, we need a portable routine which generates pseudo-random numbers. | (Portable mostly means that it avoids arithmetic overflow.) The quality of | the randomness is unimportant, as long as it doesn't get stuck at 0 or | other such silliness. The generated program will use the randoms to form | subscripts, either into the data array, or into the function pointer array. | In this way, we may control the size of the working sets. I once received a benchmark program that was similar in nature to this. It used a very large number of functions in separate source files, and a big switch statement that selected a function to call based upon a random number. I ran this benchmark, then I profiled it. It turned out that 30% of the runtime was in calculating the random number to use for the function selection! -- Tim Olson Advanced Micro Devices (tim@crackle.amd.com)
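One hedged way around the trap Olson describes: draw the entire selection sequence before starting the clock, so the timed loop pays a single array load per call instead of a PRNG evaluation. `NCALLS`, `NFUNCS`, and `precompute_order` are invented names for the sketch.

```c
#include <stdlib.h>

/* Precompute the random call order OUTSIDE the timed region. */
enum { NCALLS = 100000, NFUNCS = 8 };
static int order[NCALLS];

void precompute_order(unsigned seedval)
{
    int i;
    srand(seedval);            /* all selection cost lands here, untimed */
    for (i = 0; i < NCALLS; i++)
        order[i] = rand() % NFUNCS;
}

/* the timed loop then does:  acc += ftab[order[i]](acc);  */
```

The trade-off: the precomputed table occupies memory of its own, so for a benchmark that deliberately sizes its working sets (as in the Generator proposal), the table's footprint has to be counted too.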
jon@jim.odr.oz (Jon Wells) (10/13/88)
From article <4655@winchester.mips.COM>, by mash@mips.COM (John Mashey): > In article <10498@reed.UUCP> mdr@reed.UUCP (Mike Rutenberg) writes: > ... >>But it is so hard to make it run and yet be "the same code." > >>The problem is that to get good results with a given benchmark within a >>given system, you often do have to tweak things to get comparable >>numbers, often holding your breath that it all works out. >> [ stuff deleted ...] > > ONE MORE TIME: > use large, real programs as benchmarks. > do NOT use small programs as benchmarks > be especially careful of small synthetic benchmarks >[ stuff deleted... ] Seems to me that there are two quite distinct classes of benchmarks required (maybe three). There are at least three different levels of information required.... A) The raw execution rate of a processor under ideal conditions. B) How fast does the thing go in a particular system (memory config etc.). C) How fast does the system, as a whole, go. The following comments do not apply to number crunchers, which run under `ideal' conditions most of the time, so A is perhaps the most important thing. A and B could be solved either by simulation or benchmarking, they are both *very* processor specific things, and as such the same `code' can and must be used for both. By the same code I mean the same *instructions*; if you have to write it in assembler because your compiler optimizes it out, then do it. What you're trying to find are the performance limits of a particular processor/configuration, that is, A gives you the upper limit of B, and B tells you how well we're doing. Both these things are *only* of interest to the architects of processors and systems, and C is the *only* thing that the people buying these systems are interested in. I neither know nor care how many Whetstones this machine does, I do know that it takes a *very* long time to walk the directory hierarchy, so long that I've never bothered waiting for such a program to complete. 
C can only be found by running large complex things that approximate the systems' end use. I can see no better benchmark than the software that the system will run. Ken MacDonald's unix benchmark suite, MUSBUS, is one such example. The suite is floating around on various servers but you'll need a complete and well debugged unix system before you'll be able to use it. The point is, that both types of benchmarks are useful and of interest, just to different classes of people. jon. --
eugene@eos.UUCP (Eugene Miya) (10/13/88)
Oh no benchmarking wars again...... (sigh) In article <6868@nsc.nsc.com> grenley@nsc.nsc.com.UUCP (George Grenley) writes: >... (I heard once that a Fortran compiler was released >which SPECIFICALLY checked the source to see if it was Whet, and if it was, >stuck in a VERY fast routine). I checked this story out (months ago). Without mentioning specific names within a VERY large computer company, I discovered it was an APL compiler, not a Fortran compiler. The benchmark was a simple Gaussian sum (3 APL characters). The benchmark adds 1 thru n; the compiler did what Gauss did: you know, n(n+1)/2. It was placed there by the compiler writer, who knew the person in the APL community who did this as a benchmark. Serves the benchmarker right. Another gross generalization from --eugene miya, NASA Ames Research Center, eugene@aurora.arc.nasa.gov resident cynic at the Rock of Ages Home for Retired Hackers: "Mailers?! HA!", "If my mail does not reach you, please accept my apology." {uunet,hplabs,ncar,decwrl,allegra,tektronix}!ames!aurora!eugene "Send mail, avoid follow-ups. If enough, I'll summarize."
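The substitution in that APL story, spelled out in C (`sum_loop`/`sum_gauss` are illustrative names): a benchmark that sums 1 thru n measures nothing once the closed form is known.

```c
/* What the benchmark timed: */
long sum_loop(long n)
{
    long i, s = 0;
    for (i = 1; i <= n; i++)
        s += i;
    return s;
}

/* What the compiler writer quietly substituted: Gauss's closed form. */
long sum_gauss(long n)
{
    return n * (n + 1) / 2;
}
```

The two agree on every input, so the "benchmark" runs in constant time and reports nothing about the machine, which is exactly the hazard of results a compiler can determine at compile time.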
eugene@eos.UUCP (Eugene Miya) (10/14/88)
In article <3285@pt.cs.cmu.edu> lindsay@k.gp.cs.cmu.edu (Donald Lindsay) writes: >First, I am solidly behind the idea that the best benchmark is the user's >application. This is fine if you are Livermore and devote two Crays to running many runs of a single program. You have problems if you run a diversity of codes. Big programs (sorry, John, I can't completely agree) can be as deceptive as small. Big programs have more paths to test. It's just as blind (but in different ways) as small programs. Now using both you might get more. >There is a fairly simple way to achieve these ends. Do not write a >benchmark program: write a program which writes out the benchmark program. > ... much deleted I am working on this (with a company). How much are you willing to pay? Naw, just kidding, our ideas are too crude for a product. We are just writing tools to help. Another gross generalization from --eugene miya, NASA Ames Research Center, eugene@aurora.arc.nasa.gov resident cynic at the Rock of Ages Home for Retired Hackers: "Mailers?! HA!", "If my mail does not reach you, please accept my apology." {uunet,hplabs,ncar,decwrl,allegra,tektronix}!ames!aurora!eugene "Send mail, avoid follow-ups. If enough, I'll summarize."
eugene@eos.UUCP (Eugene Miya) (10/15/88)
In article <6005@june.cs.washington.edu> pardo@cs.washington.edu (David Keppel) writes: >rik@june.cs.washington.edu (Rik Littlefield) writes: >>[ large "real" program benchmarks vs. synthetic benchmarks ] > >Oh, gee, an opportunity to apply the scientific method :-) > >(a) Benchmark a bunch of computer systems (hardware/os/compiler) > using synthetic benchmarks. >(b) Compare the benchmark performance to observations in the > "real" world. >(c) Learn something about benchmarks, refine your synthetic > benchmarks. >(d) go to (a) (Oh no, not a GOTO!) I am sorry. I don't see the scientific method in this. I don't see a theory, a hypothesis, a controlled experiment, nor even a control. 8-) Actually, don't worry, I get this all the time from the other "real" sciences myself. I do see the beginnings of empirical work. Better luck next time. Another gross generalization from --eugene miya, NASA Ames Research Center, eugene@aurora.arc.nasa.gov resident cynic at the Rock of Ages Home for Retired Hackers: "Mailers?! HA!", "If my mail does not reach you, please accept my apology." {uunet,hplabs,ncar,decwrl,allegra,tektronix}!ames!aurora!eugene "Send mail, avoid follow-ups. If enough, I'll summarize."
eugene@eos.UUCP (Eugene Miya) (10/18/88)
In article <5356@winchester.mips.COM> mash@winchester.UUCP (John Mashey) writes: >In article <6001@june.cs.washington.edu> rik@june.cs.washington.edu (Rik Littlefield) writes: >>Many postings in this stream seem to assume that "large, real" programs are >>somehow the most fair to use for benchmarking. That's not necessarily true. >As we've said numerous times, the best benchmark for anybody is for them >to run their own real applications, because such applications obviously >have the highest correlation with what they'll see in real use. >When I keep saying "use large, real programs", it's because I usually >have in front of me numerous statistics about the behavior of programs >that show that most of the toy benchmarks aren't very good predictors of >the real applications, especially when applied to the higher-performance >designs. Why is this? > a) Toys don't stress cache designs, > b) Toys don't stress limits. > c) Toys don't stress software. REAL programs have certain biases and shortcomings, but I am unable to come up with an eloquent example. The problem comes with "large, real." Does large mean memory requirements? (Crank the arrays bigger; ever see an array proposed with 1 Tera word of memory? Read Cray Channels). Does large mean computationally complex? Each of these is true to a degree. Then there is the question of what constitutes "real," and I don't mean in the metaphysical sense. I call this "the tension of simplicity." It affects all we do with measurement: portability, interpretability, and how we run. I believe we have to play with some toys before jumping into "real" programs. We have to find out what makes them "real." [Many have ideas, but few are good.] I think if it weren't for toys, we wouldn't have things like the NeXT, the Mac, the Apple II. Computers would be big boxes behind glass windows, and we would be hung up on whose card deck would be submitted next. 
Another gross generalization from --eugene miya, NASA Ames Research Center, eugene@aurora.arc.nasa.gov resident cynic at the Rock of Ages Home for Retired Hackers: "Mailers?! HA!", "If my mail does not reach you, please accept my apology." {uunet,hplabs,ncar,decwrl,allegra,tektronix}!ames!aurora!eugene "Send mail, avoid follow-ups. If enough, I'll summarize." Actually, I can think of one example at LLNL, but it's classified.
peter@stca77.stc.oz (Peter Jeremy) (10/19/88)
My comments in the following are very C orientated. I realise this is not very portable, but most of you will be familiar with C, most other languages are (or could be) capable of doing the same thing, and I am not familiar with recent compiler capabilities in other languages. In article <3285@pt.cs.cmu.edu> lindsay@k.gp.cs.cmu.edu (Donald Lindsay) writes: >In article <6899@nsc.nsc.com> grenley@nsc.nsc.com.UUCP (George Grenley) writes: >> [ Offers to write scalable synthetic benchmark, if no-one else wants to ] > >First, I am solidly behind the idea that the best benchmark is the user's >application. I think we can all take this as read. Unfortunately in most cases it is impractical. Synthetic benchmarks are our best substitute, as long as we know what we are doing (marketroid "benchmark" results are a glaring example of not knowing what they are doing :-). >That said, synthetic benchmarks might as well be as good as they can be. >So, some guidelines: > [ code and data working sets fully adjustable, small benchmark presented > in full in the report ] > - the compiler must be prevented from inlining. I think this statement may need some more thought. putc() is a 'function' that has been 'inlined' since the beginning of C - it was implemented as a macro because, until very recently, C compilers didn't allow function inlining. Inlining small functions makes sense because the function call overhead (both size and time) is a significant portion of the cost of the function. 
What is needed is a way to differentiate between the following classes of functions: 1) small library routines (eg strcpy, strcmp) 2) large very general library routines (printf, scanf) 3) other library routines 4) small synthetic routines simulating small routines 5) small synthetic routines simulating large routines Some recent C compilers are capable of inlining functions in group 1, and analysing parameters to functions in group 2 to possibly replace them with less generalised (and smaller) library routines. I see no reason to stop the compiler doing this (although it is generally possible via compiler switches or include-file changes) because it will do the same to _all_ code and a synthetic benchmark should be "typical" in this regard. Small routines that are simulating large routines must not be inlined. I think this is what Donald was talking about. Small routines that are simulating small routines are a grey area. In a typical large application that was written knowing that inlining functions was an option, the author might choose to inline some functions. Thus inline functions could be used in application programs and a benchmark should take this into account. I believe that a benchmark should take into account the capabilities of the software development environment since having a system that can execute "good" code (eg hand-crafted assembler) blindingly fast is not much good if the only compilers available generate atrocious code. This means that the benchmark should attempt to use all the compiler's capabilities, whilst preventing the compiler from mangling those routines that are simulating large blocks of application code. > - the compiler must be prevented from eliminating dead code. Why? If code is dead, it stays dead whether it is a synthetic benchmark or an application. 
What is needed is some way of differentiating between compilers that are capable of detecting (and removing) dead code in a large application, and those that are only capable of detecting dead code in "toy" situations (ie synthetic benchmarks). The problem with this requirement is that many (most?) compilers don't have the switches to allow this - they always remove the dead code they find. And a compiler that does support this switch probably does a better job of dead code detection. > [ Write a program to generate the benchmark program. Description of what the > generator program should do mostly deleted. ] > >Next, we need a portable routine which generates pseudo-random numbers. >(Portable mostly means that it avoids arithmetic overflow.) The quality of >the randomness is unimportant, as long as it doesn't get stuck at 0 or >other such silliness. It needs to be sufficiently random that the OS/hardware memory management and caching routines can't take advantage of the number or order of references. >Since the functions should (largely) be accessed via the array, inlining is >defeated. Avoid dead code. This automatically biases the result. Whilst I don't have figures, I suspect that very _few_ function calls in a typical application are indirect. Whilst this does prevent a compiler from using any global optimization tricks it might know, it also provides an unfair advantage to processors that can efficiently execute indirect function calls. -- Peter Jeremy (VK2PJ) peter@stca77.stc.oz Alcatel-STC Australia ...!munnari!stca77.stc.oz!peter 41 Mandible St peter%stca77.stc.oz@uunet.UU.NET ALEXANDRIA NSW 2015
bzs@xenna (Barry Shein) (10/19/88)
The short story on benchmarks is: Make a careful hypothesis and design an experiment which will provide relevant data relating to that hypothesis. Hypothesis: This processor/memory combination is faster than that one on simple integer operations. Experiment: Design and run a small benchmark which will allow you to time each. Hypothesis: This system is faster at running large programs which stress the virtual memory system. Experiment: Design and run a benchmark which will allow you to time each. Unfortunately one has to know how to relate their hypothesis to the benchmark design. Instrumentation and careful refinement of the hypothesis (eg. define "stress the virtual memory system") helps. Then there's the old rule of thumb that the speed of a computer system is measured from the moment you get an idea in your head to the moment you have the answer in your hands, any other measure is superfluous. I knew a scientist who insisted his Vax730 was many times faster than the huge campus IBM mainframe based on that rule. He could usually go from conception to answer on the 730 in less time than it took standing on line waiting for a user services person to explain what IEH700104 meant. I think he was right. -Barry Shein, ||Encore||
eugene@eos.UUCP (Eugene Miya) (10/19/88)
Barry (bless his soul!) gave us hypothesis (null) and experiment in benchmarking. Now, suggested exercise to reader. Where is the control? Not the control as in flow control or control of flow, but experiment control? (8-) for you Barry!) Another gross generalization from --eugene miya, NASA Ames Research Center, eugene@aurora.arc.nasa.gov resident cynic at the Rock of Ages Home for Retired Hackers: "Mailers?! HA!", "If my mail does not reach you, please accept my apology." {uunet,hplabs,ncar,decwrl,allegra,tektronix}!ames!aurora!eugene "Send mail, avoid follow-ups. If enough, I'll summarize."
peter@stca77.stc.oz (Peter Jeremy) (10/20/88)
In article <1710@eos.UUCP> eugene@eos.UUCP (Eugene Miya) writes: >In article <6868@nsc.nsc.com> grenley@nsc.nsc.com.UUCP (George Grenley) writes: >>... (I heard once that a Fortran compiler was released >>which SPECIFICALLY checked the source to see if it was Whet, and if it was, >>stuck in a VERY fast routine). > >I checked this story out (months ago). Without mentioning specific >names within a VERY large computer company I discovered it was an >APL compiler not a Fortran compiler. The benchmark was a simple >Gaussian sum (3 APL characters). The benchmark adds 1 thru n, the compiler >did what Gauss did: you know n(n+1)/2. It was placed there by the compiler >writer who knew the person in the APL community who did this as a >benchmark. Serves the benchmarker right. Presumably the benchmark was +/{iota}n. At least one APL _interpreter_ that I am aware of (IBM VSAPL) has an internal representation format designed to efficiently handle arithmetic progression vectors. All it stores is the tag, number of elements, value of first element and increment. This makes simple arithmetic and indexing into and by the array very efficient. Whether the extra code in the interpreter necessary to support this 'type' is justified on typical applications, or whether it was just put in for sales reasons, I don't know. Given that this is documented (in the manual on writing VSAPL Auxiliary Processors), it is hardly a great secret. For that matter +/{iota}n is hardly a great benchmark. I far prefer things like {domino}?100 100{rho}1E6; it might not be a good benchmark, but it's sure good for soaking up CPU time (and sure beats trying to do it in any other language :-). And whilst the rumour mill is running: Rumour has it that at least one PClown C compiler recognizes the Sieve of Eratosthenes Benchmark (so beloved by BYTE magazine) and spits out special code, or at least the optimiser was written with the Sieve in mind. 
-- Peter Jeremy (VK2PJ) peter@stca77.stc.oz Alcatel-STC Australia ...!munnari!stca77.stc.oz!peter 41 Mandible St peter%stca77.stc.oz@uunet.UU.NET ALEXANDRIA NSW 2015
tbray@watsol.waterloo.edu (Tim Bray) (10/20/88)
In article <3913@encore.UUCP> bzs@xenna (Barry Shein) writes: >Hypothesis: This processor/memory combination is faster than that >one on simple integer operations. >Hypothesis: This system is faster at running large programs which >stress the virtual memory system. There is an implicit claim here that one can divide programs into equivalence classes with names such as 'simple integer operations' and 'virtual memory stressers'. I don't believe that. Benchmarks are like trying to count feathers with boxing gloves on, but they won't go away, sigh... >Then there's the old rule of thumb that the speed of a computer system >is measured from the moment you get an idea in your head to the moment >you have the answer in your hands, any other measure is superfluous. Hear, hear. Tim Bray, New Oxford English Dictionary Project, U of Waterloo
pardo@june.cs.washington.edu (David Keppel) (10/23/88)
peter@stca77.stc.oz (Peter Jeremy) writes: >[ prevent/classify function inlining ] Inlining a given function on a given machine may speed execution, while inlining the same function on another machine -- or even the same machine, but with a different level of optimization -- may *slow* the observed performance. In particular, I can hypothesize machines/situations in which calling getc() as a function is *faster* because the whole thing stays in-cache and lets other code stay in-cache too. In running a "real" system it is the effective speed that bothers me, not fast hardware/slow software or slow hardware/fast software. Conclusions: * Ultimately, benchmarking is very hard. * Even with a benchmark that tests a particular feature and a very good description of the workload, it may not be possible to extrapolate the combined performance from the individual performance. -- pardo@cs.washington.edu {rutgers,cornell,ucsd,ubc-cs,tektronix}!uw-beaver!june!pardo
eugene@eos.UUCP (Eugene Miya) (02/28/90)
In article <132232@sun.Eng.Sun.COM> lm@sun.UUCP (Larry McVoy) writes:
>>In article <3300102@m.cs.uiuc.edu> gillies@m.cs.uiuc.edu writes:
>> [doesn't like SPEC]
>
>In article <36438@mips.mips.COM> mash@mips.COM (John Mashey) writes:
>>I'm sad to hear that what we've done so far is "no better than Dhrystone",
>>because if that's true, a whole bunch of us have wasted, in toto, at
>>least several million $ to try to do something better....
>
>I, for one, think SPEC is great.

Oh well. Too bad.

>On the other hand, SPEC is not the end all to beat all.  No benchmark
>is.  If I could design the ideal benchmark, I'd design something that
>had a bunch of knobs that I could turn, like an I/O knob, a CPU knob, a
>memory knob, etc.  I don't have this, so I run several different
>benchmarks that measure these sorts of things.  SPEC is one, Musbus is
>another, and we have several internal/proprietary benchmarks as well.
>Some people don't like you to quote one figure from one benchmark - I
>like to see all the figures from all the benchmarks.  The more data you
>have the easier it is to weed out the spikes.

Sorry, John, I tend to suspect SPEC spent a lot of money. Larry is not talking about a single program. This is something I am working on in parts, when I get tiny bits of time, and like most research, 90% of it is failure.

I do not believe the future lies in simply having more numbers. More numbers can just be more confusing. You want a number? Try 42. Douglas Adams published that.

The fundamental idea which separates people is whether or not you believe the whole of a benchmark equals or exceeds the sum of its parts. If you attribute wholes being greater than parts to "magic" rather than to known optimizations, features, etc., then you aren't being scientific about the problem. Such a person won't get anywhere, and you can posit little green men who only come on Tuesdays as the reason your code runs fast.

I am not saying timings of parts should sum to a whole code, but as you work at higher and higher conceptual levels of programs, you can factor these optimizations, etc. into performance. Users simply concerned with pure speed will inevitably be disappointed.

I can point to analogies of performance in other areas: the idea of placing a VAX under a bell jar, gold-plating a code, etc. That's all covered in an article I read after visiting the NBS, entitled "Foundations of Metrology," in an NBS journal. There are ways of doing this, but just like the platinum bar, there are limits to usefulness: hence why we use other measuring tools, why we refine atomic clocks, etc. Until we are willing to do that with computers, benchmarking won't get far. I don't get any warm fuzzy feeling from the Nelson, the Loops, Dongarra, etc. Sure, there's a bit of truth in them, but you have to be willing to consider surrogates. We want to run (with benchmarks), but we have to crawl before walking and playing. We are going to need a progression of research. But most of you don't have the time or inclination to listen, so I will go back to my hacking.

Another gross generalization from

--eugene miya, NASA Ames Research Center, eugene@aurora.arc.nasa.gov
  resident cynic at the Rock of Ages Home for Retired Hackers:
  "You trust the `reply' command with all those different mailers out there?"
  "If my mail does not reach you, please accept my apology."
  {ncar,decwrl,hplabs,uunet}!ames!eugene
  Do you expect anything BUT generalizations on the net?
  [If it ain't source, it ain't software -- D. Tweten]
lamaster@ames.arc.nasa.gov (Hugh LaMaster) (03/01/90)
In article <6336@eos.UUCP> eugene@eos.UUCP (Eugene Miya) writes:
>In article <132232@sun.Eng.Sun.COM> lm@sun.UUCP (Larry McVoy) writes:
>>I, for one, think SPEC is great.
>Oh well. Too bad.
 :
>Sorry, John, I tend to suspect SPEC spent a lot of money.

I, for another, like SPEC reasonably well. It does a good job of balancing integer and floating point requirements, and it matches reasonably well what some vendors use as a definition of "MIPS". Overall, I think it is a good job.

>You want a number? Try 42. Douglas Adams published that.
 :
>The fundamental idea which separates people is whether or not you believe
>the whole of a benchmark equals or exceeds the sum of its parts.

Of course, every techie realizes that benchmark numbers are just numbers. I can give you a list of 100 things that no existing benchmark measures well. But, as a defense against marketing droids, it is a reasonable first line of defense.

>Users simply concerned with pure speed will inevitably be disappointed.

Those looking for a single number to characterize speed will inevitably be disappointed. Those looking for a number to demonstrate that certain kinds of programs will not experience bottlenecks may be much better served. The purpose of rating systems with numbers like SPECmark is not to say, "My system is better than yours, because it is 17.6 SPECmarks and yours is only 16.9." The purpose is to eliminate systems from the solution space because they won't be fast enough. Marketing types will misuse it all the same, but so what? They have freedom of speech too.

>I don't get any warm fuzzy feeling from the Nelson, the Loops, Dongarra, etc.

Most people don't get warm fuzzies from benchmark programs. But they can narrow your solution space if used correctly: you don't have to consider systems which are too slow for your job. You still need to apply other measures to make sure that the system meets all your requirements.

The main problem that I see with using benchmark programs is that some *marketING*-driven companies have a tendency to neglect things which aren't being measured -- for example, context-switching speed. I see some evidence that after the initial euphoria over faster CPU speeds on micros, people are beginning to go back to fundamentals and build more balanced systems. Of course, *MARKET*-driven companies have been doing it all along.

>[If it ain't source, it ain't software -- D. Tweten]

Agreed.

  Hugh LaMaster, m/s 233-9,   UUCP:  ames!lamaster
  NASA Ames Research Center   ARPA:  lamaster@ames.arc.nasa.gov
  Moffett Field, CA 94035     Phone: (415)604-6117