mark@mips.com (Mark G. Johnson) (04/20/91)
In article <1991Apr20.083301.28886@ux1.cso.uiuc.edu> andreess@mrlaxa.mrl.uiuc.edu (Marc Andreessen) writes:
>I've been reading this newsgroup since its formation.  It spends
>about 95% of its time spewing out a plethora of meaningless numbers
>on meaningless and trivial little pseudo-hacks which don't deserve
>to be called benchmarks by any stretch of the imagination.
>
>These numbers are useless.

Useless?  Meaningless?  I would suggest that the words "useless" and/or
"meaningless" be reserved for benchmarks that produce results that are
uncorrelated (absolute value of correlation coefficient < 0.2) with
"correct benchmark results".  Whatever "correct benchmark results" are.

Since Andreessen is at the University of Illinois, let's pretend,
temporarily, that the Illinois Perfect Club is the definition of a
correct benchmark.  Then a useless benchmark is one which, after being
run on a large subset of the same machines as have run the Illinois
Perfect Club, produces a correlation coefficient r in the range
(-0.2 < r < 0.2) -- that is, the candidate benchmark's results are
uncorrelated with the Illinois Perfect Club results.

We could go further and define a "misleading" benchmark as one which
produces a correlation coefficient that is large and negative, i.e.
ranks machines in the wrong order (compared to the Illinois Perfect
Club, our temporary definition of a "correct benchmark").

Unfortunately, the "Dates Per Second" benchmark is attempting to
investigate OS behavior, something the Illinois Perfect Club doesn't
measure directly.  So before declaring the DatesPerSecond benchmark to
be "useless", we need to define what the correct results are (measuring
OS behavior), and then we can correlate the DatesPerSecond results to
the correct results.  Lacking such reference data, it is premature and
perhaps a bit unscientific to assert the DatesPerSecond results are
"useless".
--
Mark Johnson
MIPS Computer Systems, 930 E. Arques M/S 2-02, Sunnyvale, CA 94088-3650
(408) 524-8308    mark@mips.com  {or ...!decwrl!mips!mark}
andreess@mrlaxs.mrl.uiuc.edu (Marc Andreessen) (04/21/91)
In article <2502@spim.mips.COM> mark@mips.com (Mark G. Johnson) writes:
>Since Andreessen is at the University of Illinois, let's pretend,
>temporarily, that the Illinois Perfect Club is the definition of a
>correct benchmark.  Then a useless benchmark is one which, after being
>run on a large subset of the same machines as have run the Illinois
>Perfect Club, produces a correlation coefficient r in the range
>(-0.2 < r < 0.2) -- that is, the candidate benchmark's results are
>uncorrelated with the Illinois Perfect Club results.

You're joking.  I hope.

The Perfect Club (which I have nothing to do with; my address does not
include 'csrd.uiuc.edu') benchmarks only one thing: execution time of a
given subset of commonly-used scientific applications.  The hope is that
by examining the Perfect Club numbers, one might obtain an estimate of
how fast one's own similarly-coded application would execute on a given
machine.  This says nothing about a broad range of interesting and
not-so-interesting facets of computation (e.g., dates per second).

If you really want to know how many dates per second your machine can
generate, that's fine, but the numbers are still useless.  I'm a little
bit more concerned with how fast I can execute a given scientific
application than with how many dates I can spew out.  Curiously enough,
those concerns are shared by most of the scientific community; thus
Linpack, the Livermore Loops, and the Perfect Club.

Having said that, I might further humbly submit that dates/second is
also a useless measure of OS performance.  This should be obvious; if
not, go back and read the original posting, wherein the author lists a
wide range of reasons this hack is useless as a benchmark.

Marc
--
Marc Andreessen___________University of Illinois Materials Research Laboratory
Internet: andreessen@uimrl7.mrl.uiuc.edu____________Bitnet: andreessen@uiucmrl
auvsaff@auvsun1.tamu.edu (David Safford) (04/22/91)
In article <2502@spim.mips.COM>, mark@mips.com (Mark G. Johnson) writes:
|>In article <1991Apr20.083301.28886@ux1.cso.uiuc.edu> andreess@mrlaxa.mrl.uiuc.edu (Marc Andreessen) writes:
|> >I've been reading this newsgroup since its formation.  It spends
|> >about 95% of its time spewing out a plethora of meaningless numbers
|> >on meaningless and trivial little pseudo-hacks which don't deserve
|> >to be called benchmarks by any stretch of the imagination.
|> >
|> >These numbers are useless.
|> >
|>
|>Useless?  Meaningless?  I would suggest that the words "useless" and/or
|>"meaningless" be reserved for benchmarks that produce results that are
|>uncorrelated (absolute value of correlation coefficient < 0.2) with
|>"correct benchmark results".
|> --- stuff deleted
|> -- Mark Johnson
|> MIPS Computer Systems, 930 E. Arques M/S 2-02, Sunnyvale, CA 94088-3650
|> (408) 524-8308    mark@mips.com  {or ...!decwrl!mips!mark}

This is a joke, right?  You are not seriously saying that statistical
correlation signifies meaning, are you???

A statistical correlation is necessary, but certainly not sufficient,
to indicate a meaningful relationship.  Just because the per capita
consumption of M&M's correlates with the rate of bank failures does not
indicate any meaningful relationship.  This is an apples and oranges
comparison.

The "date" "benchmark" is a meaningful measure of only 2 things:
(1) whether the given system uses dynamic linking, and
(2) how fast date runs, given (1).

(1) is more easily determined other ways
(2) is not important to me

Let's please drop these meaningless tests, and look at more important
problems, such as network, graphics, and multiprocessing benchmark
approaches.

dave safford
auvsaff@auvsun1.tamu.edu
Texas A&M University
rbart@shakespyr.Pyramid.COM (Bob Bart) (04/23/91)
In article <15080@helios.TAMU.EDU>, auvsaff@auvsun1.tamu.edu (David Safford) writes:
|> Let's please drop these meaningless tests, and look at more important
|> problems, such as network, graphics, and multiprocessing benchmark
|> approaches.

Bravo, I second the motion.  I'm tired of all the nonsense benchmark
discussions.

-#-------        Bob Bart - Performance Analyst        415-335-8101
---###-----      Pyramid Technology Corporation        pyramid!pyrnova!rbart
-----#####---    Mountain View, CA 94039               rbart@pyrnova.pyramid.com
-------#######-  U.S.A.
terry@venus.sunquest.com (Terry R. Friedrichsen) (04/23/91)
auvsaff@auvsun1.tamu.edu (David Safford) writes:
>mark@mips.com (Mark G. Johnson) writes:
>|>I would suggest that the words "useless" and/or
>|>"meaningless" be reserved for benchmarks that produce results that are
>|>uncorrelated (absolute value of correlation coefficient < 0.2) with
>|>"correct benchmark results".
>
>You are not seriously saying that statistical correlation signifies meaning,
>are you???
>
>A statistical correlation is necessary, but certainly not sufficient, to
>indicate a meaningful relationship.  Just because the per capita consumption
>of M&M's correlates with the rate of bank failures does not indicate any
>meaningful relationship.  This is an apples and oranges comparison.

No, and he didn't say that.  Read what he wrote: "useless == uncorrelated".
He did NOT say "correlated == useful".  You yourself admit that correlation
is a necessary, though insufficient, condition.

One of the real problems I note in reading this group is that there are a
LOT of folks who are weak on logic, in interpretation of benchmarks as
well as of other postings.

>The "date" "benchmark" is a meaningful measure of only 2 things: (1) whether
>the given system uses dynamic linking, and (2) how fast date runs, given (1).
>
>(1) is more easily determined other ways
>(2) is not important to me

There are a couple of problems here.  First, the benchmark also indicates
the speed of the operating system implementation's fork()/exec() services.
That SHOULD be important to a lot of folks, with the exception of pure
number-crunchers.  Take John Hascall's suggestion and replace date(1) with
true(1), and you'll do even better at measuring fork()/exec() (modulo the
dynamic linking issue, of course).

Second, since date(1) must do integer calculations, how fast it runs MUST
be important to you, unless you do nothing but floating-point arithmetic.
If I handed you a machine that took 47 seconds to run date, I suspect that
you would find other aspects of its performance equally repulsive.  Weak
logic here again.

On the other hand, the matrix300 benchmark is of absolutely no importance
to me, since my matrices are a different size ;-).

Terry R. Friedrichsen
terry@venus.sunquest.com   (Internet)
uunet!sunquest!terry       (Usenet)
terry@sds.sdsc.edu         (alternate address; I live in Tucson)

Quote:  "Do, or do not.  There is no 'try'."  - Yoda, The Empire Strikes Back
auvsaff@auvsun1.tamu.edu (David Safford) (04/23/91)
In article <18049@sunquest.UUCP>, terry@venus.sunquest.com (Terry R. Friedrichsen) writes:
|>auvsaff@auvsun1.tamu.edu (David Safford) writes:
|>>mark@mips.com (Mark G. Johnson) writes:
|>>
|>>|>I would suggest that the words "useless" and/or
|>>|>"meaningless" be reserved for benchmarks that produce results that are
|>>|>uncorrelated (absolute value of correlation coefficient < 0.2) with
|>>|>"correct benchmark results".
|>>
|>>You are not seriously saying that statistical correlation signifies
|>>meaning, are you???
|>>
|>>A statistical correlation is necessary, but certainly not sufficient, to
|>>indicate a meaningful relationship.  Just because the per capita
|>>consumption of M&M's correlates with the rate of bank failures does not
|>>indicate any meaningful relationship.  This is an apples and oranges
|>>comparison.
|>
|>No, and he didn't say that.  Read what he wrote: "useless == uncorrelated".
|>He did NOT say "correlated == useful".  You yourself admit that correlation
|>is a necessary, though insufficient, condition.
|>

YOU read it again, slowly and carefully.  He said that "useless" should be
RESERVED for uncorrelated benchmarks.  This means that if a benchmark does
correlate, it cannot be called useless, as this label is RESERVED for the
uncorrelated.  Using your nomenclature, he said "correlated != useless".

Using the reasonable assumption that the sets of "useful" and "useless"
benchmarks are disjoint and complementary, we can say that "!= useless" ==
"useful", and thus his statement reduces to "correlated == useful".

To support the assumption that "useful" and "useless" are disjoint and
complementary, attempt a contradiction.  Suppose that the sets are not
disjoint.  This means that there must exist at least one benchmark that is
both "useful" and "useless", clearly a contradiction.  Now suppose that
there exists a benchmark that is neither "useful" nor "useless".  But by
definition, since this benchmark is not "useful", it must be "useless".
Thus the reduction "!= useless" == "useful" must hold.  QED.

So he DID say, using your nomenclature, "correlated == useful".

|>One of the real problems I note in reading this group is that there are a
|>LOT of folks who are weak on logic, in interpretation of benchmarks as well
|>as other postings.
|>

You provided an excellent example!

Beam me up, Scotty, there's no intelligent life in THIS newsgroup.
aburto@marlin.NOSC.MIL (Alfred A. Aburto) (04/26/91)
A question I have (for a test, instead of that 'useless', 'meaningless'
discussion) is: what is the correlation between Dhrystone 2.1 results and
Integer SPECmarks?  How 'bad' is Dhrystone really compared to Integer
SPECmarks?  We don't really need to compute a correlation; just show a
table of comparable results (Integer SPECmark results vs. Dhrystone
results relative to the VAX-11/780).

What if I took 4 (or N) integer programs (different from those used by
SPEC) and ran them on various systems and computed performance relative
to the VAX-11/780?  Would these integer results agree with the integer
SPECmark results for the same systems?  Would they even be close?

Al Aburto
aburto@marlin.nosc.mil
mash@mips.com (John Mashey) (04/26/91)
In article <1749@marlin.NOSC.MIL> aburto@marlin.NOSC.MIL (Alfred A. Aburto) writes:
>What is the correlation between Dhrystone 2.1 results and Integer
>SPECmarks?  How 'bad' is Dhrystone really compared to Integer SPECmarks?
>Don't really need to compute a correlation, but just show a table of
>comparable results (Integer SPECmark results vs Dhrystone results relative
>to VAX-11/780).

I don't have the numbers handy, and am about to go out of town again.
However, there are a number of combinations where Dhrystone would predict
that machine A is 25% faster than machine B, but on SPEC integer, machine
B is 25% faster than machine A, or equivalent combinations where the
prediction is 50% off.  Combinations like this include RS/6000 vs MIPS,
or Intel i860 vs MIPS, at appropriate clock rates.  A particular case is
the RS/6000 Model 320, which SPECints around 16, but whose Dhrystone (1.1)
rating is around 27.5, versus the MIPS Magnum (25MHz, not the newer 33s),
which has a SPECint of 19.5, but a lower Dhrystone than the RS/6000.

If I find time, I'll dig out the numbers, but I've seen enough data over
the years to have stopped collecting it.  What it said was:
  a) Dhrystone ALWAYS gives a higher VAX-mips rating than SPECint
     (except maybe on the VAX-11/780 :-).  1.1 is worse (higher) than 2.1,
     but 2.1 is high also.  The ratio ranges from about 1.1 up to at
     least 1.6, maybe even as high as 2X.
  b) The Dhrystone:SPECint ratios grossly track within a single
     product line, except that small-cache machines of a family look
     better on Dhrystone than on SPECint.

>What if I took 4 (or N) integer programs (different than used by SPEC)
>and ran them on various systems and computed performance relative to
>the VAX-11/780.  Would these integer results agree with the integer
>SPECmark results for the same systems?  Would they even be close?

Depends on the benchmarks.  If you look at the data, you find that MIPS's
mips-ratings are rather close to SPECint, and the reason is that the set
of benchmarks we used internally for the integer side (which actually
include much worse cache-busters than SPECint, and account for a few
billion cycles of execution) correlate with SPECint to within 10% or
closer ... and they existed BEFORE SPEC.  Of course, one of the benchmarks
(espresso) was included in both.

Anyway, the answer is: if you run substantive integer benchmarks,
single-user, I think SPECint is a pretty good predictor.
--
-john mashey    DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP:   mash@mips.com OR {ames,decwrl,prls,pyramid}!mips!mash
DDD:    408-524-7015, 524-8253 or (main number) 408-720-1700
USPS:   MIPS Computer Systems MS 1/05, 930 E. Arques, Sunnyvale, CA 94088-3650
aburto@marlin.NOSC.MIL (Alfred A. Aburto) (04/30/91)
In article <2717@spim.mips.COM> mash@mips.com (John Mashey) writes:
>I don't have the numbers handy, and am about to go out of town again.
>However, there are a number of combinations where Dhrystone would predict
>that machine A is 25% faster than machine B, but on SPEC integer, machine
>B is 25% faster than machine A, or equivalent combinations where
>the prediction is 50% off.  Combinations like this include RS/6000 vs
>MIPS, or Intel i860 vs MIPS, at appropriate clock rates.  A particular
>case is RS/6000 Model 320, which SPECints around 16, but Dhrystone (1.1)
>is around 27.5, versus MIPS Magnum (25MHz, not the newer 33s), which
>has SPECint at 19.5, but has a lower Dhrystone than the RS/6000.
>If I find time, I'll dig out the numbers, but I've seen enough data over
>the years to have stopped collecting it.  What it said was:
> a) Dhrystone ALWAYS gives a higher VAX-mips rating than SPECint
>    (except maybe the VAX-11/780 :-).  1.1 is worse (higher) than 2.1,
>    but 2.1 is high also.  The ratio ranges from about 1.1 up to at
>    least 1.6, maybe even as high as 2X.
> b) The Dhrystone:SPECint ratios grossly track within a single
>    product line, except that small-cache machines of a family look
>    better on Dhrystone than on SPECint.

Here are some Dhrystone 1.1 and integer SPEC program comparison results
I gathered.

The Dhrystone 1.1 results came from an article by Walter Price of
Motorola ('A Benchmark Tutorial', IEEE MICRO, Oct 1989, page 28).  In
the table below, 'D/S' are the Dhrystones(1.1)/sec results.  I used the
PEAK Dhrystone 1.1 results from Price's article for each system in the
table.  I used the peak numbers because I didn't want problems that might
happen by posting the low numbers.  Also, the peak numbers were more
consistent.  That is, people tended to report the peak number for both
the low AND high Dhrystone results in Price's article.  So the low
numbers tend to be doo-doo and the high numbers more reasonably
consistent.
As indicated in the table, Dhrystone 1.1 ratio results are greater than
the Integer SPEC ratio results by 14% to 24%, with an average of 21%
greater.  This is pretty much as you indicated, but I didn't find any
really abnormal results (probably due to a lack of enough data).
Dhrystone 2.1 results would be useful too, but I don't have a data-base
.....

An interesting result is the set of correlation coefficients across the
various systems.  The Dhrystone 1.1 ratios correlate rather well (0.90
to 0.99) with all 4 SPECratios and with SPECint (the geometric mean of
the SPECratio results).  What this indicates, relative to the results in
the table below, is that Dhrystone 1.1 predicts RELATIVE PERFORMANCE
across the 10 systems examined just as well as GCC, espresso, li,
eqntott, and SPECint.

The correlation in performance prediction between these various programs
is quite strong despite the fact that they are all really quite different
programs with different instruction mixes.  I suppose this makes some
sense, though, because a CPU's performance (relative to other CPUs) is
generally improved not for a few instructions but for all instructions.
This would tend to make the correlation of performance ratios somewhat
(there are no absolutes) independent of the instruction mix and thus of
the type of program.
                     Dhrystone 1.1           SPECratio             SPECint
                     -------------    ------------------------     -------
System               MHz      D/S   Ratio    GCC    ESP     LI    EQN
DEC VAX 11/780      5.00     1870     1.0    1.0    1.0    1.0    1.0     1.0
HP 9000/340        16.67     6536     3.5    3.1    2.3    3.3    2.2     2.7
Sun 4/260          16.67    19900    10.6    9.9    7.8    9.1    8.3     8.7
Sun SPARCstation 1 20.00    22049    11.8   10.7    8.9    9.0    9.7     9.5
HP 9000/834        15.00    23441    12.5   10.2    8.9   11.7   10.1    10.2
MIPS RC2030        16.67    31200    16.7    8.6   11.8   14.2   11.5    11.3
DECstation 3100    16.67    26600    14.2   10.9   12.0   13.1   11.2    11.8
HP Apollo 10000    18.20    27000    14.4   12.8   12.9   11.1   11.1    11.9
SPARCstation 330   25.00    27777    14.9   13.8   11.6   11.2   12.6    12.3
MIPS M/120-5       16.67    31000    16.6   12.5   12.2   15.4   12.0    13.0
MIPS M/2000        25.00    47400    25.3   19.0   18.3   23.8   18.4    19.8
-----------------------------------------------------------------------------
Arithmetic Mean                      14.1   11.1   10.7   12.2   10.7    11.1
Standard Deviation                    5.2    3.8    3.9    5.0    3.8     4.0
Correlation Coef WRT Dhry ratio      ----   0.90   0.98   0.98   0.98    0.99
Correlation Coef WRT GCC ratio              ----   0.92   0.85   0.95    ----
Correlation Coef WRT ESP ratio                     ----   0.93   0.98    ----
Correlation Coef WRT LI ratio                             ----   0.94    ----
Percent 'Error' by Dhrystone         ----   21.3   24.1   13.5   24.1    ----
  Relative to SPEC Integer Programs

Al Aburto
aburto@marlin.nosc.mil
patrick@convex.COM (Patrick F. McGehearty) (04/30/91)
In article <1751@marlin.NOSC.MIL> aburto@marlin.NOSC.MIL (Alfred A. Aburto) writes:
>
>As indicated in the table Dhrystone 1.1 ratio results are greater than the
>Integer SPEC ratio results by 14% to 24% with an average of 21% greater.
 ...

Of the systems listed, I believe all but the DEC VAX 11/780 came after
Dhrystone started to be a widely quoted benchmark.  Because it is a fairly
small code, it is easy to use as a tool for focusing compiler tuning
efforts.  [After all, if you only have limited resources for tuning, you
might as well use them where they can be shown to make a difference.]  I
suspect that if DEC were to spend a few months tuning their C compiler (or
maybe just rerun their latest release on an 11/780), they also could get
another 20% out of the Dhrystone benchmarks.  I also would not be
surprised to see the SPEC numbers improve in the next few years for
existing hardware with better compilers.  What gets measured gets worked
on.

The advantage of benchmark suites like SPEC (and, for you number
crunchers, the Perfect Club and Slalom benchmarks) is that there is such
a variety of coding styles and usages that the improvements are likely to
benefit many real codes.  In some cases, special tricks will be found
that only benefit those codes, but for the most part, improvements will
be made that help many programs run faster.

Back to the issue that started this discussion: when you propose a new
benchmark, consider how vendors will respond if people start using it for
serious competitive evaluation.  [If you don't want people to use it, why
are you proposing it??]  Will it encourage the vendors to improve the
things you want improved?  If not, can it be changed to do so?  Show it
to a few people (with DRAFT, DO NOT DUPLICATE marked all over it), and
get their feedback.  Then ask yourself again if it is useful.

The 'date' benchmark has a number of serious flaws.  A key one is that if
the date operation were added to the command shell as a builtin, it would
go many times faster.  Since the intent is to measure process spawning
time, ... well, you get the point.

A similar thing happened to the getpid system call.  Some people at
Berkeley wanted to know how fast a trivial system call was, so they could
tune the syscall interface.  They wrote a loop to call getpid() many
times.  This test was appropriate for their purposes.  Later, this test
(and many others) was made generally available.  Some vendors chose to
speed up this test by caching the process id in user space on the first
getpid(), and avoiding the system call overhead for subsequent getpid()'s.
There is nothing wrong with that optimization, just that it does nothing
for real user programs.

The main reason I don't like the date benchmark is that it encourages me
(as a vendor) to fix the wrong things.  In addition, as a user, the
benchmark does me little good, because I have little confidence that it
will measure the same things I care about (at least it won't after the
vendors start working on it, if they take it seriously).  The same
reasoning applies to the 'bc' benchmark which ran through this news
stream a while ago.
mash@mips.com (John Mashey) (04/30/91)
In article <1751@marlin.NOSC.MIL> aburto@marlin.NOSC.MIL (Alfred A. Aburto) writes:
>In article <2717@spim.mips.COM> mash@mips.com (John Mashey) writes:
>>I don't have the numbers handy, and am about to go out of town again.
>>However, there are a number of combinations where Dhrystone would predict
>>that machine A is 25% faster than machine B, but on SPEC integer, machine
>>B is 25% faster than machine A, or equivalent combinations where
>>the prediction is 50% off.  Combinations like this include RS/6000 vs
>>MIPS, or Intel i860 vs MIPS, at appropriate clock rates.  A particular
>>case is RS/6000 Model 320, which SPECints around 16, but Dhrystone (1.1)
>>is around 27.5, versus MIPS Magnum (25MHz, not the newer 33s), which
>>has SPECint at 19.5, but has a lower Dhrystone than the RS/6000.
>>If I find time, I'll dig out the numbers, but I've seen enough data over
>>the years to have stopped collecting it.  What it said was:
>> a) Dhrystone ALWAYS gives a higher VAX-mips rating than SPECint
>>    (except maybe the VAX-11/780 :-).  1.1 is worse (higher) than 2.1,
>>    but 2.1 is high also.  The ratio ranges from about 1.1 up to at
>>    least 1.6, maybe even as high as 2X.
>> b) The Dhrystone:SPECint ratios grossly track within a single
>>    product line, except that small-cache machines of a family look
>>    better on Dhrystone than on SPECint.
>
>Here are some Dhrystone 1.1 and integer SPEC program comparison results
>I gathered.
>
>The Dhrystone 1.1 results came from an article by Walter Price of Motorola.
>
>As indicated in the table Dhrystone 1.1 ratio results are greater than the
>Integer SPEC ratio results by 14% to 24% with an average of 21% greater.
>This is pretty much as you indicated, but I didn't find any really abnormal
>results (probably due to a lack of enough data).  Dhrystone 2.1 results
>would be useful too, but I don't have a data-base .....

For the high ones, look at IBM RS/6000, i860, or maybe Motorola 88K.
AMD would probably be high, but doesn't have SPEC results published, to
my knowledge.  Note that IBM's 27.5/15.8 = 1.7+ ... and I think you'll
find the i860 is probably up there as well ... and the DG workstation
labeled 17 mips gets around 10 on SPECint ...

I don't think we disagree, except in choice of data.  Of the data points,
the VAX is = 1 by definition.
  4 of them are MIPS machines
  1 is an HP [which is fairly similar to MIPS, and shares some roots in
    similar compiler technology]
  3 are SPARCs
I.e., as I said, within product lines you expect that the major
determiner of speed is clock rate, and Dhrystone will show you that.

As a minor point, for whatever reason, most of the Dhrystone-vax-mips
ratings in the world assume VAX-11/780 = 1,757 1.1 Dhrystones, which
slightly raises the numbers everywhere.

The major issue is (just to make sure people aren't confused by the
posted table): IF you pick two machines at random, A and B:

  a) Dhry(A) and Dhry(B) will both give vax-mips ratings that are high.
  b) Dhry(A)/Dhry(B) will give reasonably good correlations with
     SPEC(A)/SPEC(B), especially if A and B are from the same family or
     are related.

Unfortunately, there are also plenty of data points, specifically with
machines that included instructions to help strcpy, or that have done
certain optimizations, where you can easily pick points where:
  Dhry(A) > Dhry(B) and SPEC(A) < SPEC(B), by a substantial margin.

Thus, one must be careful to distinguish between the 2 statements:

  a) There is a good correlation between Dhrystone and SPEC.  TRUE, in
     general, especially if you include the vast numbers of X86-based
     products usually listed.
  b) Dhrystone is a good enough predictor of SPEC that you don't need to
     run SPEC.  FALSE, in practice, because you can get terribly
     surprised: a significant number of recent machines and architectures
     are OUTLIERS when you do a scatter plot of this kind of data.
Now, of course, whether this matters or not depends on whether or not you
think SPECint correlates with anything else :-)  (As it happens, at least
some of us think it does, because it correlates more closely than
Dhrystone does with numerous large internal single-user benchmarks.)

Anyway, maybe someone can put together a table with
  1 MIPS
  1 SPARC
  1 RS/6000
  1 i860
  1 HP
  1 88K
  1 68K
  1 486
and using the more common 1757 number...
--
-john mashey    DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP:   mash@mips.com OR {ames,decwrl,prls,pyramid}!mips!mash
DDD:    408-524-7015, 524-8253 or (main number) 408-720-1700
USPS:   MIPS Computer Systems MS 1/05, 930 E. Arques, Sunnyvale, CA 94088-3650
jvm@hpfcso.FC.HP.COM (Jack McClurg) (04/30/91)
/ hpfcso:comp.benchmarks / patrick@convex.COM (Patrick F. McGehearty) / 4:51 pm  Apr 29, 1991 /
> A similar thing happened to the getpid system call.  Some people at
> Berkeley wanted to know how fast a trivial system call was, so they could
> tune the syscall interface.  They wrote a loop to call getpid() many
> times.  This test was appropriate for their purposes.  Later, this test
> (and many others) was made generally available.  Some vendors chose to
> speed up this test by caching the process id in user space on the first
> getpid(), and avoiding the system call overhead for subsequent
> getpid()'s.  There is nothing wrong with that optimization, just that it
> does nothing for real user programs.

While I agree with just about everything else that Mr. McGehearty says in
his response, the statement above is (as far as I know about HP) just
speculation.  The getpid system call is not cached on HP-UX; rather,
getpid performance was so much (anomalously) better than expected that
the trick above was assumed by the author of a paper comparing operating
system performance with performance on common benchmarks.  I cannot
remember the name of the paper, but it pointed out some disconcerting
(to me) areas where OS performance does not keep up with what should be
expected from benchmarks.

Jack McClurg
aburto@marlin.NOSC.MIL (Alfred A. Aburto) (05/02/91)
In article <2800@spim.mips.COM> mash@mips.com (John Mashey) writes:
>In article <1751@marlin.NOSC.MIL> aburto@marlin.NOSC.MIL (Alfred A. Aburto) writes:
>>In article <2717@spim.mips.COM> mash@mips.com (John Mashey) writes:
>
>For the high ones, look at IBM RS/6000, i860, or maybe Motorola 88K.
>AMD would probably be high, but doesn't have SPEC results published,
>to my knowledge.
>Note that IBM's 27.5/15.8 = 1.7+ ... and I think you'll find the i860 is
>probably up there as well ... and the DG workstation labeled 17 mips
>gets around 10 on SPECint ...

Sorry about the delayed response, but I was out of town too.
Unfortunately, I don't have all that information.  I did see i860
Dhrystone 2.1 and SPECmark numbers posted in this news group, but they
were not broken down like the other information I had, so I didn't use
that i860 information (I needed Dhrystone 1.1 and the SPECratios for each
of the integer SPEC programs).  Anyone have this information for the i860
and RS/6000 systems?

>I don't think we disagree, except in choice of data.  Of the data points,
>the VAX is = 1 by definition.
>  4 of them are MIPS machines
>  1 is an HP [which is fairly similar to MIPS, and shares some roots in
>    similar compiler technology]
>  3 are SPARCs
>
>I.e., as I said, within product lines you expect that the major
>determiner of speed is clock rate, and Dhrystone will show you that.

There were 3 HP systems, and one was a 68030 type.  I thought I went
across product lines reasonably well, but it was not complete of course
(things never are, but that's good I guess).

I did the correlation with respect to clock rate across the 10 systems:

                               Dhrystone 1.1       SPECratio        SPECint
                                   Ratio       GCC   ESP    LI   EQN
  Correlation WRT Clock Speed:      0.54      0.69  0.54  0.45  0.60   0.58

The correlation with clock speed appears marginal, which is interesting.
There appear to be other things going on besides just clock speed
increases.
>As a minor point, for whatever reason, most of the dhrystone-vax-mips
>ratings in the world assume VAX-11/780 = 1,757 1.1 Dhrystones,
>which slightly raises the numbers everywhere.

Yes, I was aware of that, but I felt constrained to use the peak numbers
as given in that article, and the article indicated 1870 Dhrys/sec (1.1)
peak for the VAX 11/780.  I've seen the 1757 Dhrys/sec (V1.1) referenced
in IBM advertisements for their POWERstations, but that is all I know
about that number.

>The major issue is (just to make sure people aren't confused by the
>posted table):
>IF you pick two machines at random, A and B:
>
> a) Dhry(A) and Dhry(B) will both give vax-mips ratings that are high.
> b) Dhry(A)/Dhry(B) will give reasonably good correlations with
>    SPEC(A)/SPEC(B), especially if A and B are from the same family or
>    are related.

Based on the results I'd say Dhry(A) and Dhry(B) yield VAX-MIPS ratings
that are 14% to 24% high WHEN COMPARED to SPECint(A) and SPECint(B)
VAX-MIPS ratings.  I'd hesitate to infer anything beyond that, as I'm
still seeking more information.

The results indicate that Dhry(A)/Dhry(B) ratios correlate strongly (I
would say 'strongly' rather than 'reasonably good') with SPEC(A)/SPEC(B)
ratios.  The correlations were rather high after all, with a minimum of
0.90 and a max of 0.98.  The average was 0.96, and the correlation with
SPECint was 0.99.  These high correlations may not hold up, though, if
we had a larger data base to examine.  I think we still need to sift
through more data.  The correlations were across several different CPUs,
so I don't agree with the 'especially ...' part of b) above as part of
the results.

>Unfortunately, there are also plenty of data points, specifically with
>machines that included instructions to help strcpy, or that have done
>certain optimizations, where you can easily pick points where:
> Dhry(A) > Dhry(B) and SPEC(A) < SPEC(B), by a substantial margin.
Yes, this is very true, but please note that there are also cases where
Dhry(A) < Dhry(B) AND SPEC(A) > SPEC(B). There is an example of this in the
table of results I posted. It is not a substantial difference, but still a
difference.

By the way, I'm not 'down on SPEC', or 'up on Dhrystone'. I think SPEC is
the best thing that has happened to benchmarking recently. SPEC is
certainly developing an excellent data base on system performance. A
verifiable, repeatable, and solid data base. Something we've needed for a
long time.

Dhrystone results, however, are confusing, mostly because it is a small
program, cache sensitive, and it can be optimized to such a large extent.
So, yes, I agree one can get surprised and confused with the Dhrystone
results. One needs to be very careful in using those numbers. I picked the
peak numbers only because they appeared to be more consistent. If I had
used the low numbers, or an average of the low and high, then I don't think
the results would have been anywhere near the same.

I was interested in the correlation of OTHER integer programs too with
SPECint. I used Dhrystone because results were readily available.

>Anyway, maybe someone can put together a table with
> 1 MIPS
> 1 SPARC
> 1 RS/6000
> 1 i860
> 1 HP
> 1 88K
> 1 68K
> 1 486
>and using the more common 1757 number...

Yes, the data must be available here and there, and it would be good to get
it all together in one place (here I hope) .....

Al Aburto aburto@marlin.nosc.mil
wjb@cogsci.cog.jhu.edu (05/03/91)
In article <21720006@hpfcso.FC.HP.COM> jvm@hpfcso.FC.HP.COM (Jack McClurg) writes:
>patrick@convex.COM (Patrick F. McGehearty) / 4:51 pm Apr 29, 1991 / wrote:
>> [benchmarks and caching of process id for getpid() calls]
>
>[disavows any such caching in HPUX]

I've heard this "rumor" as well. I always wondered why people didn't try to
test it out by comparing getpid() and getppid() times. Your PID may not
ever change, but your PPID will change to 1 if your parent exits. This
would appear to require that getppid() enter the kernel.

Actually, the real way to speed up both of these system calls is to map a
page of memory read-only into the data space of the process and have the
program read this data directly from where the kernel stores it.

I guess this is another example of the dangers of micro benchmarks and
making assumptions about how different system functions are implemented.

Bill Bogstad
mash@mips.com (John Mashey) (05/03/91)
In article <1756@marlin.NOSC.MIL> aburto@marlin.NOSC.MIL (Alfred A. Aburto) writes:
>>As a minor point, for whatever reason, most of the dhrystone-vax-mips
>>ratings in the world assume VAX-11/780 = 1,757 1.1 Dhrystones,
>>which slightly raises the numbers everywhere.
>Yes, I was aware of that, but I felt constrained to use the peak numbers
>as given in that article and the article indicated 1870 Dhrys/sec (1.1)
>peak for the VAX 11/780. I've seen the 1757 Dhrys/sec (V1.1)
>referenced in IBM advertisements for their POWERstations, but that is all
>I know about that number.

That's the problem, of course, with using base numbers that can change
around on you [which is why the SPEC base numbers are frozen for eternity].
Using 1757 as the base are {Sun, HP, IBM, Motorola, and many others}.

So, I'll list here some various combinations, expressed both as SPECint and
as vax-mips-using-dhrystone-1.1, assuming 1757 is the base (because all of
the following assume that, I think):

SPECint  dhry-mips  Ratio  System            Notes
  16.4     47.8     2.7X   i860 @ 33MHz      SPECint published for Intel
                                             Star860, Summer 90; Dhrys from
                                             March 1989 i860 Performance
                                             Brief
  15.8     27.5     1.7    RS/6000 320       SPEC #s and ads
  12.9     20+      1.6    HP 9000/425s      68040 @ 25MHz (I may be wrong
                                             on the 20, although that's what
                                             Moto usually calls a 25MHz
                                             68040, and the HP published
                                             Dhry-mips number might actually
                                             be higher, but I can't seem to
                                             find it handy.)
  26.4     41.5     1.6    MIPS Magnum/33    33MHz R3000A (NOTE: just to be
                                             perfectly clear, MIPSco has
                                             NEVER used a Dhrystone-mips
                                             rating as "the mips-rating" for
                                             a system. On our internal scale
                                             (our own set of benchmarks)
                                             this rates at 27, i.e., very
                                             close to SPECint.)
  38.1     57       1.5    HP Snakes, 50MHz  [57 is from all of the ads...]
  17.7     27.0     1.5    i486 @ 33MHz      Intel 486 Performance Brief,
                                             1Q91
  20.7     28.5     1.4    Sun SS2           40MHz Cypress SPARC

>Based on the results I'd say Dhry(A) and Dhry(B) yield VAX-MIPS ratings
>that are 14% to 24% high WHEN COMPARED to SPECint(A) and SPECint(B)
>VAX-MIPS ratings.
>I'd hesitate to infer anything beyond that as I'm still seeking
>more information.

Well, above we have:
 1 each of {IBM, MIPS, HP PA, SPARC}
 1 each of {i486, 68040}
which, I think, covers a fair chunk of current computers..... And the
LOWEST expansion was 40%.....

Now, Dhrystone 2.1 generally would deflate these numbers by approx 10%;
getting rid of strcpy-inlining would deflate by another 20-30%, more in
some cases.

[For example, over the last few years, the best way to boost your Dhrystone
was to put in strcpy/strcmp inlining, even though that has seldom been
found to help realistic programs very much. The fiercest example of this is
for the i860, whose code does the following:
 - starts with a strcpy of a 30-byte constant string
 - pads the string and the target to 32 bytes, which happens to be
   possible in this case
 - aligns the string on 8-byte or better boundaries
 - inlines the strcpy
 - generates 2 16-byte loads and 2 16-byte stores, which, being to and
   from a writeback cache, with zero cache misses, go fast.
Of course, this bears no resemblance to any realistic character-pushing,
and that fact rapidly shows up when you see the SPEC integer numbers...]

Fortunately, (except in some ads in the Wall Street Journal :-), companies
are emphasizing SPEC more and more these days.

-- -john mashey DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: mash@mips.com OR {ames,decwrl,prls,pyramid}!mips!mash
DDD: 408-524-7015, 524-8253 or (main number) 408-720-1700
USPS: MIPS Computer Systems MS 1/05, 930 E. Arques, Sunnyvale, CA 94088-3650
meissner@osf.org (Michael Meissner) (05/04/91)
In article <02.May.91.132611.54@cogsci.cog.jhu.edu> wjb@cogsci.cog.jhu.edu writes: | I've heard this "rumor" as well. I always wondered why people | didn't try to test it out by comparing getpid() and getppid() times. Your | PID may not ever change, but your PPID will change to 1 if your parent exits. | This would appear to require that getppid() enter the kernel. Actually, the | real way to speed up both of these system calls is to map a page of memory | read-only into the data space of the process and have the program read this | data directly from where the kernel stores it. I guess this is another | example of the dangers of micro benchmarks and making assumptions about how | different system functions are implemented. But of course if you fork (), your pid also changes...... -- Michael Meissner email: meissner@osf.org phone: 617-621-8861 Open Software Foundation, 11 Cambridge Center, Cambridge, MA, 02142 Considering the flames and intolerance, shouldn't USENET be spelled ABUSENET?
wjb@cogsci.cog.jhu.edu (Bill Bogstad) (05/04/91)
In article <MEISSNER.91May3135345@curley.osf.org> meissner@osf.org (Michael Meissner) writes:
>>[I (Bill Bogstad) wrote about timing getppid() vs. getpid()]
>But of course if you fork (), your pid also changes......

Only sort of. The parent's PID doesn't change. The child just gets a new
one.

The idea behind this micro benchmark was to measure the overhead of a
system call. The getpid() and getppid() calls don't do any significant work
in the kernel and therefore would appear to be a good choice for measuring
this. The problem is that getpid() COULD be written to cache the PID, since
it only changes when you do a fork() (under the control of the program
itself). A PID cache would work in the presence of fork() by having the
fork() run-time library invalidate the PID cache within the context of the
process.

The parent process ID of a process can change completely asynchronously
with respect to the process itself. There is no way for a cache of the PPID
to be maintained in the program's data space without reference to
information that only the kernel knows about. You would have to use
something like a shared page of memory which the kernel sets and the
process reads. (As mentioned in my last message.)

Bill Bogstad
mash@mips.com (John Mashey) (05/04/91)
In article <03.May.91.174910.55@cogsci.cog.jhu.edu> wjb@cogsci.cog.jhu.edu (Bill Bogstad) writes:
...
> The idea behind this micro benchmark was to measure the overhead of
>a system call. The getpid() and getppid() calls don't do any significant

As usual, micro-level benchmarks can surprise you, and as architectural
variations arise, you can really get surprised. As an example,
register-window machines can be quite sensitive to the depth of function
calls, unlike most machines, which are more sensitive to the number of
function calls. For example, it is quite possible for SPARC to look real
good on getpid, but incur substantial overhead on system calls that make
many levels of calls into the kernel....

This is one more case where a micro-level benchmark gets zapped by
architectural changes. Besides the register-windows thing, one can be
surprised by:
 1) MMU subtleties
 2) Differences in cache design
 3) Differences in memory system design.

-- -john mashey DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: mash@mips.com OR {ames,decwrl,prls,pyramid}!mips!mash
DDD: 408-524-7015, 524-8253 or (main number) 408-720-1700
USPS: MIPS Computer Systems MS 1/05, 930 E. Arques, Sunnyvale, CA 94088-3650
aburto@marlin.NOSC.MIL (Alfred A. Aburto) (05/10/91)
In article <3001@spim.mips.COM> mash@mips.com (John Mashey) writes:
>In article <1756@marlin.NOSC.MIL> aburto@marlin.NOSC.MIL (Alfred A. Aburto) writes:

Thanks for that additional information. I used it and SPEC data held at
perelandra.cms.udel.edu (spec.sc in directory bench) to revise the table I
posted (thanks to John McCalpin for the pointer to the SPEC data). I also
corrected the percent mean error results (they were wrong --- incorrectly
calculated!). I also changed to the 1757 Dhrys/sec figure for the VAX
11/780, although I believe a more correct 'peak' value is required here! We
need to use numbers OF THE SAME TYPE when comparing performance! That is,
we need highly optimized Dhrystones/sec numbers in this case for ALL the
systems, including the VAX 11/780.

The sensitivity of Dhrystone to optimization is probably the main reason
the Dhrystone ratio deviates so widely from the SPECratios and SPECint. As
a test of this I ran Dhrystone 1.1 on a Sun 4/260 system. The low number
without optimization ('cc') was 8900 Dhrys/sec, while the high number with
optimization ('cc -O4 -DREG=register') was 20000 Dhrys/sec. These yield
VAX-MIPS ratings of 5.1 and 11.4 respectively. To quantify this type of
variability with respect to optimization, one could for example take the
average of the low and high numbers to give 14450 Dhrys/sec and a VAX-MIPS
rating of 8.2. Compare this to the SPECint rating of 8.7 for the Sun 4/260
in the table below. The comparison is a lot more reasonable now! Other
measures of performance, such as the median over the results using
different compiler options, may be more appropriate of course.

Doing all this is a lot of trouble though, and almost no one does it. Most
people naturally want to report the 'best' numbers, the peak numbers, and
so we wind up with highly biased and sometimes confusing Dhrystone results.
The easiest solution to all this is to let Dhrystone rest in peace and just
use the more reliable SPEC numbers.
But I'm not happy with this, as I want benchmarking to cover a much wider
territory than SPEC now covers.

I don't have a lot more to add relative to the results below, except that
the correlation between ALL the programs is still very high across ALL 18
systems, EVEN with the IBM POWERstation 320 and Intel i860 (systems 11 and
12) greatly distorting the Dhrystone V1.1 results. Poor Dhrystone --- the
IBM and Intel compilers seem to have chewed it up and spit out Dhrys/sec
and MIPS ratings bearing little relation to other integer program results
and the geometric mean of those results (SPECint). I wonder what the
results (Dhrys/sec) would be if different compilers were used on systems 11
and 12 and if optimization was disabled (if at all possible)? The results,
I'm sure, would be quite different. Also, I wonder how sensitive the
SPECratio results are relative to different compilers and different
compiler options?

Another interesting result in the table below is the consistent and (now)
relatively high correlation with clock speed for all the programs
(Dhrystone 1.1, GCC, Espresso, Lisp Interpreter, Eqntott, and SPECint).

Another thing, since I'm being so mouthy anyway :-), what if GCC was ported
to run on non-UNIX systems (to the vast world of microcomputers)? Maybe
then we could arrive at a worthy test program of integer performance on
these 'small systems'? A test program not so sensitive to optimization as
the Dhrystone.
                            Dhrystone1.1          SPECratio         SPECint
                            -------------   ----------------------  -------
   System            MHz      D/S   Ratio    GCC   ESP    LI   EQN

00 DEC VAX 11/780    5.00    1757    1.0     1.0   1.0   1.0   1.0    1.0
01 HP 9000/340      16.67    6536    3.7     3.1   2.3   3.3   2.2    2.7
02 Sun 4/260        16.67   19900   11.3     9.9   7.8   9.1   8.3    8.7
03 Sun SPARCstn 1   20.00   22049   12.5    10.7   8.9   9.0   9.7    9.5
04 HP 9000/834      15.00   23441   13.3    10.2   8.9  11.7  10.1   10.2
05 MIPS RC2030      16.67   31200   17.8     8.6  11.8  14.2  11.5   11.3
06 DECstn 3100      16.67   26600   15.1    10.9  12.0  13.1  11.2   11.8
07 HP Apollo 10000  18.20   27000   15.4    12.8  12.9  11.1  11.1   11.9
08 Sun SPARCstn 330 25.00   27777   15.8    13.8  11.6  11.2  12.6   12.3
09 HP 9000/425s     25.00   35140?  20.0?   13.8  13.4  15.5   9.7   12.9
10 MIPS M/120-5     16.67   31000   17.6    12.5  12.2  15.4  12.0   13.0
11 IBM POWERstn 320 20.00   51832   29.3    13.7  16.3  15.6  17.7   15.8
12 Intel Star860    33.00   83985   47.8    12.4  20.1  17.7  17.8   16.7
13 AT&T Starsrvr E  33.00   47439   27.0    16.2  16.6  22.2  14.5   17.2
14 DECstn 5000/200  25.00   42519   24.2    17.3  18.5  21.8  18.4   18.9
15 MIPS M/2000      25.00   47400   27.0    19.0  18.3  23.8  18.4   19.8
16 Sun SPARCstn 2   40.00   50075   27.5    19.6  17.6  22.7  21.4   20.2
17 HP 9000/720      50.00  100149   57.0    35.2  42.5  36.1  40.6   38.5
18 HP 9000/730      66.00  133532   76.0    46.5  55.2  50.3  52.6   51.0
-------------------------------------------------------------------------
Arithmetic Mean                     25.5    15.9  17.1  18.0  16.7   16.8
Standard Deviation                  17.5     9.8  12.2  10.6  11.7   10.9
Correlation Coef WRT Clock Speed    0.92    0.93  0.93  0.92  0.93   0.94
Correlation Coef WRT Dhry ratio     ----    0.90  0.96  0.93  0.95   0.94
Correlation Coef WRT GCC  ratio             ----  0.98  0.97  0.98   0.99
Correlation Coef WRT ESP  ratio                   ----  0.97  0.98   0.99
Correlation Coef WRT LI   ratio                         ----  0.97   0.99
Correlation Coef WRT EQN  ratio                               ----   0.99
Percent Mean 'Error' by Dhrystone   ----    60.4  49.1  41.7  52.7   51.8
Relative to SPEC Integer Programs.

Al Aburto aburto@marlin.nosc.mil