amos@taux01.nsc.com (Amos Shapir) (12/02/90)
[Quoted from the referenced article by ethan@thinc.UUCP (Ethan Lish of THINC)]
>
>Greetings -
>
> This _benchmark_ does *NOT* have a legitimate value!
>

Sure it doesn't; I wonder how no one else noted this yet: "bc" is probably
the worst choice of a utility to benchmark by.  On most UNIX systems, it
just parses expressions, and forks "dc" to execute them ("dc" is a reverse-
polish string based numeric interpreter).  So the results depend on how
fast your system forks, and how "bc" and "dc" communicate.

Besides, there are several versions of "bc" (some of which do not fork
"dc"), and since the original version of "dc" was rather buggy, there are
several versions of it too, some of which are major rewrites.

The bottom line is: comparing "bc" runs on different systems is necessarily
comparing apples and oranges (or at least plums & prunes) unless you're
sure you have the same version of "bc", "dc", and UNIX.  Results posted
here so far indicate most comparisons are indeed meaningless.
--
	Amos Shapir		amos@taux01.nsc.com, amos@nsc.nsc.com
	National Semiconductor (Israel)
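A quick way to separate the two costs described above (a sketch, assuming a
BSD-style bc that hands dc the reverse-polish program "2 5000^ 2 5000^/ps.",
as quoted later in this thread): the first line is the usual benchmark,
process creation and pipe included; the second times dc's arithmetic alone.

	echo "2^5000/2^5000" | /bin/time bc
	echo "2 5000^ 2 5000^/ps." | /bin/time dc

The gap between the two elapsed times is a rough measure of the parse,
fork, and pipe overhead that varies from system to system.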
mccalpin@perelandra.cms.udel.edu (John D. McCalpin) (12/02/90)
>>>>> On 1 Dec 90 21:20:17 GMT, amos@taux01.nsc.com (Amos Shapir) said:

[.... commenting on the "echo 2^5000/2^5000 | /bin/time bc" benchmark ...]

> This benchmark has no validity!

Amos> Sure it doesn't; I wonder how no one else noted this yet: "bc"
Amos> is probably the worst choice of a utility to benchmark by.  On
Amos> most UNIX systems, it just parses expressions, and forks "dc" to
Amos> execute them [...]

Of course the biggest problem is that almost no one actually *uses* `bc'
for any large amount of computation, so no vendor has any incentive to
optimize its performance.

A secondary problem is that one could trivially optimize the benchmark away
by adding a constant-expression simplifier to `bc' before it calls `dc',
but everyone already knew that....

(Maple evaluated the expression on my SGI 4D/25 in 0.4 seconds wall time.)
--
John D. McCalpin                        mccalpin@perelandra.cms.udel.edu
Assistant Professor                     mccalpin@brahms.udel.edu
College of Marine Studies, U. Del.      J.MCCALPIN/OMNET
dale@convex.com (Dale Lancaster) (12/04/90)
>Amos> Sure it doesn't; I wonder how no one else noted this yet: "bc"
>Amos> is probably the worst choice of a utility to benchmark by.  On
>Amos> most UNIX systems, it just parses expressions, and forks "dc" to
>Amos> execute them [...]

>Of course the biggest problem is that almost no one actually *uses*
>`bc' for any large amount of computation, so no vendor has any
>incentive to optimize its performance.

For this reason it makes as good a benchmark as any.  I suspect that most
Unix-based bc's/dc's work the same.  This is a typical dusty-deck,
no-optimization piece of code that shows the performance of the machine not
just from the hardware side but from the software (compiler) side as well.
Today you have to benchmark the compiler as much as the hardware.  But of
course the only true measure of performance is my own application :-)

dml
de5@ornl.gov (Dave Sill) (12/04/90)
In article <5042@taux01.nsc.com>, amos@taux01.nsc.com (Amos Shapir) writes:
>[Quoted from the referenced article by ethan@thinc.UUCP (Ethan Lish of THINC)]
>>
>>Greetings -
>>
>> This _benchmark_ does *NOT* have a legitimate value!
>>
>
>Sure it doesn't; I wonder how no one else noted this yet: "bc" is probably
>the worst choice of a utility to benchmark by.

It may not be rigorous, but it does have value.  For one thing, it's short
enough to be memorized and easily typed at a box at, say, an expo.

>On most UNIX systems, it
>just parses expressions, and forks "dc" to execute them ("dc" is a reverse-
>polish string based numeric interpreter).  So the results depend on how
>fast your system forks, and how "bc" and "dc" communicate.

How does that invalidate the results?  That's like penalizing an optimizing
compiler for taking shortcuts the other one didn't.  If the bc benchmark
runs faster on system A than it does on B because vendor A took the time to
optimize bc, then good for them!  The danger is not some inherent
unreliability in the benchmark, it's in incorrectly interpreting the
results.

This highlights very well the fundamental danger of benchmarking:
generalization.  Just because one system outperforms another on, say, a
floating point benchmark doesn't mean that it will *always* outperform it
on all floating point code.

>The bottom line is: comparing "bc" runs on different systems is necessarily
>comparing apples and oranges (or at least plums & prunes) unless you're
>sure you have the same version of "bc", "dc", and UNIX.  Results posted
>here so far indicate most comparisons are indeed meaningless.

Bc is bc.  If it takes 2^5000/2^5000 and correctly calculates the result,
what does it matter how it gets there?  I.e., this benchmark measures bc's
performance.  Interpreting it as a hardware benchmark is fallacious, since
hardware performance is only one factor in the result.

mccalpin@perelandra.cms.udel.edu (John D. McCalpin) replies to the same
article:

>Of course the biggest problem is that almost no one actually *uses*
>`bc' for any large amount of computation, so no vendor has any
>incentive to optimize its performance.

Ah, but would it be better to benchmark something the vendor has expected
and optimized?  If you're looking for actual performance instead of
theoretical peak performance, perhaps it's better to throw something
unexpected at them.

>A secondary problem is that one could trivially optimize the benchmark
>away by adding a constant-expression simplifier to `bc' before it
>calls `dc', but everyone already knew that....

Yes, but that would be readily apparent, wouldn't it?  And it wouldn't
invalidate the test.  You just need to keep in mind that you're testing bc,
and its dependence on hardware performance is only indirect.

>(Maple evaluated the expression on my SGI 4D/25 in 0.4 seconds wall
>time).

Exactly, so Maple is faster than bc.  You can't interpret this to mean that
the SGI is faster than all the other systems that take longer to do it with
bc.
--
Dave Sill (de5@ornl.gov)
Martin Marietta Energy Systems
Workstation Support
jbuck@galileo.berkeley.edu (Joe Buck) (12/04/90)
In article <1990Dec3.191756.15280@cs.utk.edu>, de5@ornl.gov (Dave Sill) writes:

> It [the bc bmark] may not be rigorous, but it does have value.
> For one thing, it's short enough to be memorized and easily typed
> at a box at, say, an expo.

Surely you've heard the one about the drunk looking for his keys under the
lamppost, even though he'd lost them in an alley.  He explains that he's
looking under the lamppost because the light is better.

The bc "benchmark" is easy to measure, but the number is worthless.  The
drunk finds it easy to look under the lamppost, but he's not going to find
what he's looking for.

> >Of course the biggest problem is that almost no one actually *uses*
> >`bc' for any large amount of computation, so no vendor has any
> >incentive to optimize its performance.

> Ah, but would it be better to benchmark something the vendor has
> expected and optimized?  If you're looking for actual performance
> instead of theoretical peak performance, perhaps it's better to throw
> something unexpected at them.

Even if your argument were true, I've seen the bc "benchmark" proposed
years ago.  Vendors already know about it.

Here's a valid benchmark: how fast does a standard compiler run, with a
standard input, compiling for a standard architecture?  (This is the "gcc"
test in SPEC.)  Other valid tests are: how fast does a widely used
application program, like TeX, troff, SPICE, etc., run given a fairly
complex input?  The idea is to see how fast the computer runs programs that
people actually use heavily.  If a vendor's machine manages to do all these
things fast, it will probably be fast on your real workload.
--
Joe Buck
jbuck@galileo.berkeley.edu       {uunet,ucbvax}!galileo.berkeley.edu!jbuck
de5@ornl.gov (Dave Sill) (12/04/90)
In article <39871@ucbvax.BERKELEY.EDU>, jbuck@galileo.berkeley.edu (Joe Buck) writes:
>In article <1990Dec3.191756.15280@cs.utk.edu>, de5@ornl.gov (Dave Sill) writes:
>> It [the bc bmark] may not be rigorous, but it does have value.
>> For one thing, it's short enough to be memorized and easily typed
>> at a box at, say, an expo.
>
>[Amusing story about a drunk looking for his keys under a lamppost,
>even though he'd lost them in an alley, because the light was
>better.]
>
>The bc "benchmark" is easy to measure, but the number is worthless.

Why do you say that?  It's using a known tool to perform a known task,
right?  No, it won't tell whether machine X is faster than machine Y with
absolute certainty.  But it will tell me *something*.

>> >Of course the biggest problem is that almost no one actually *uses*
>> >`bc' for any large amount of computation, so no vendor has any
>> >incentive to optimize its performance.
>
>> Ah, but would it be better to benchmark something the vendor has
>> expected and optimized?  If you're looking for actual performance
>> instead of theoretical peak performance, perhaps it's better to throw
>> something unexpected at them.
>
>Even if your argument were true, I've seen the bc "benchmark" proposed
>years ago.  Vendors already know about it.

Do they know *my* version of it?  Maybe I've changed the numbers around,
maybe I've changed the operations...

>Here's a valid benchmark: how fast does a standard compiler run, with
>a standard input, compiling for a standard architecture?  (This is the
>"gcc" test in SPEC.)  Other valid tests are: how fast does a widely
>used application program, like TeX, troff, SPICE, etc., run given a
>fairly complex input?

I agree that both are valid tests.  I don't agree that the bc test is
fundamentally different.  It's a smaller version of the same idea.

>The idea is to see how fast the computer runs
>programs that people actually use heavily.  If a vendor's machine
>manages to do all these things fast, it will probably be fast on your
>real workload.

Exactly, but you can't carry a SPEC tape with you wherever you go.  I *can*
carry the bc test with me and very quickly determine which end of the
spectrum an unknown machine falls in.  Will I base purchase decisions on
such an admittedly trivial test?  Of course not.

Benchmarks are tools.  Some are higher quality than others.  Some take more
skill to use than others.  The existence of $12,000 table saws doesn't
negate the ability of a $15 hand saw to cut a limb off a tree.  They're
different tools for different tasks.  Our job as computing professionals is
to select the appropriate tool for the task.
--
Dave Sill (de5@ornl.gov)
Martin Marietta Energy Systems
Workstation Support
amos@taux01.nsc.com (Amos Shapir) (12/04/90)
[Quoted from the referenced article by Dave Sill <de5@ornl.gov>]
|In article <5042@taux01.nsc.com>, amos@taux01.nsc.com (Amos Shapir) writes:
|
|>On most UNIX systems, it
|>just parses expressions, and forks "dc" to execute them ("dc" is a reverse-
|>polish string based numeric interpreter).  So the results depend on how
|>fast your system forks, and how "bc" and "dc" communicate.
|
|How does that invalidate the results?  That's like penalizing an
|optimizing compiler for taking shortcuts the other one didn't.  If the
|bc benchmark runs faster on system A than it does on B because vendor
|A took the time to optimize bc, then good for them!  The danger is not
|some inherent unreliability in the benchmark, it's in incorrectly
|interpreting the results.

But all I see in this group is articles posting *one number* for this test;
it may be OK to use it for a CPU-bound test (e.g., running "dc" alone), but
running "bc" means creating two processes, and also measuring the overhead
of context switches, etc.  The problem is that most people do not know that
this is happening, and use the results as if "bc" were doing what actually
only "dc" does.

|Bc is bc.  If it takes 2^5000/2^5000 and correctly calculates the
|result, what does it matter how it gets there?  I.e., this benchmark
|measures bc's performance.  Interpreting it as a hardware benchmark is
|fallacious, since hardware performance is only one factor in the
|result.

But show me one article in this thread that does *not* treat this as a
hardware benchmark; after all, that's exactly the charter of this group!
"Bc" alone is not useful enough to warrant such attention, and even if it
were, benchmarks measuring its performance for its own sake would be posted
in comp.bc, not here.
--
	Amos Shapir		amos@taux01.nsc.com, amos@nsc.nsc.com
	National Semiconductor (Israel)
de5@ornl.gov (Dave Sill) (12/05/90)
In article <5045@taux01.nsc.com>, amos@taux01.nsc.com (Amos Shapir) writes:
>[Quoted from the referenced article by Dave Sill <de5@ornl.gov>]
>|
>|How does that invalidate the results?  That's like penalizing an
>|optimizing compiler for taking shortcuts the other one didn't.  If the
>|bc benchmark runs faster on system A than it does on B because vendor
>|A took the time to optimize bc, then good for them!  The danger is not
>|some inherent unreliability in the benchmark, it's in incorrectly
>|interpreting the results.
>
>But all I see in this group is articles posting *one number* for this test;
>it may be OK to use it for a CPU-bound test (e.g., running "dc" alone), but
>running "bc" means creating two processes, and also measuring the overhead
>of context switches, etc.  The problem is that most people do not know that
>this is happening, and use the results as if "bc" were doing what actually
>only "dc" does.

The man on the street could not care less how many context switches are
involved--or even what a context switch *is*--all he cares about is how
long it takes the computer to perform his task.  That's one number: elapsed
time.  *You* might be extremely concerned about what's happening behind the
scenes, but not everyone is.

>|Bc is bc.  If it takes 2^5000/2^5000 and correctly calculates the
>|result, what does it matter how it gets there?  I.e., this benchmark
>|measures bc's performance.  Interpreting it as a hardware benchmark is
>|fallacious, since hardware performance is only one factor in the
>|result.
>
>But show me one article in this thread that does *not* treat this as a
>hardware benchmark;

Show me one that does.  I haven't seen *any* conclusions drawn from these
results, except that one SPARCstation SLC seemed to take entirely too long.
(Which points out another use of this trivial benchmark: a quick diagnostic
to compare identical systems or to track a system through time.)

>after all, that's exactly the charter of this group!

No it isn't.  I specifically included *all* types of benchmarking.
Otherwise it would've been called comp.arch.benchmarks or
comp.hw.benchmarks.

>"Bc" alone is not useful enough to warrant such attention, and even if it
>were, benchmarks measuring its performance for its own sake would be
>posted in comp.bc, not here.

First of all, there is no comp.bc.  Even if there were, this would be a
very appropriate place.  Just because the bc benchmark doesn't suit your
needs doesn't mean it's worthless.
--
Dave Sill (de5@ornl.gov)
Martin Marietta Energy Systems
Workstation Support
patrick@convex.COM (Patrick F. McGehearty) (12/06/90)
In article <1990Dec3.204027.16794@cs.utk.edu> Dave Sill <de5@ornl.gov> writes:
>In article <39871@ucbvax.BERKELEY.EDU>, jbuck@galileo.berkeley.edu (Joe Buck) writes:
>>In article <1990Dec3.191756.15280@cs.utk.edu>, de5@ornl.gov (Dave Sill) writes:
>>> It [the bc bmark] may not be rigorous, but it does have value.
>>> For one thing, it's short enough to be memorized and easily typed
>>> at a box at, say, an expo.
>>
>>The bc "benchmark" is easy to measure, but the number is worthless.
>
>Why do you say that?  It's using a known tool to perform a known task,
>right?  No, it won't tell whether machine X is faster than machine Y
>with absolute certainty.  But it will tell me *something*.
>
>I agree that both are valid tests.  I don't agree that the bc test is
>fundamentally different.  It's a smaller version of the same idea.
>
>>The idea is to see how fast the computer runs
>>programs that people actually use heavily.  If a vendor's machine
>>manages to do all these things fast, it will probably be fast on your
>>real workload.
>
>Exactly, but you can't carry a SPEC tape with you wherever you go.  I
>*can* carry the bc test with me and very quickly determine which end
>of the spectrum an unknown machine falls in.  Will I base purchase
>decisions on such an admittedly trivial test?  Of course not.

I suggest that the bc benchmark is worse than worthless for several
reasons.

First, as pointed out, it is not measuring the raw add/multiply rate of the
machine.  It measures the "multi-precision arithmetic" capabilities as
implemented by dc, which is mostly subroutine call/returns.  Further, I
have never seen a system where bc/dc is a significant user of cycles.
Thus, the less than expert user will believe the measurements represent
something different from reality.

Second, for most machines, little architecture or compiler work has been
(or should be) done to optimize this application.  So you will not be able
to tell the difference between those machines which have features useful to
your application and those which do not.

Third, widespread reporting of such a benchmark will encourage other, less
knowledgeable buyers to read more into the numbers than should be read.

Fourth, if buyers use the benchmark, then vendors will be encouraged to put
resources into enhancing their performance on it instead of enhancing
something useful.  This is a bad thing and the primary reason why I am
posting.  Bad benchmarks lead to lots of wasted effort.

I use the Whetstone benchmark as a "proof by example".  I know of several
vendor development efforts (going back much more than 10 years; this is not
a new phenomenon) which went to extreme lengths to improve their Whetstone
results, including adding special microcode for certain instructions which
the compiler only generated for the Whetstone benchmark.  Obviously, this
particular trick only makes sense for the old-style CISCy architectures,
but you get the idea of what vendors will do to improve their benchmark
results.  There are similar stories for the Dhrystone benchmark.

In these cases, the development efforts were not totally wasted.  Efforts
to speed up the transcendental functions (SIN, COS, etc.) used in the
Whetstones helped those applications which used the transcendentals.  I see
no value to most users of general purpose computing (scientific or
business) in optimizing bc/dc.

Many procurements require some minimum rate on some well-known benchmark
for a vendor to be even allowed to bid.  If you can't make this number, you
don't get a chance to show how good your architecture and compilers are for
executing the customer's real application.  There are even a significant
number of customers who do not run benchmarks before purchase.  They just
rely on quoted numbers for well-known benchmarks.  It is our duty as
responsible professionals to develop and measure benchmarks that mean
something and which explain what they mean.

For workstations, the SPECmark benchmarks provide programs which are
sufficiently complex as to avoid the trivial trick optimizations.  If an
optimization can make that set run faster, it will probably also apply to
real application code.  For scientific computers, the Perfect Club
benchmarks serve the same purpose.  They represent a dozen scientific
applications with inner loops which cover a variety of code patterns which
are found in real applications.  The Livermore loops also have codes which
are representative of inner loops of real applications.  Improving these
codes will improve real application performance.  In a few years, their
solution times will become so short as to require new problem definitions
or data sets, but meanwhile, we in system development will have some
meaningful metrics to work towards improving.

If you really must have a "quick and dirty" benchmark, how about the
following:

      program main
      real*8 a(256,256),b(256,256),c(256,256)
      call matmul(a,b,c,256)
      end

      subroutine matmul(a,b,c,n)
      real*8 a(n,n),b(n,n),c(n,n)
      do i = 1, n
        do j = 1, n
          c(i,j) = 0.0
          do k = 1, n
            c(i,j) = c(i,j) + a(i,k)*b(k,j)
          enddo
        enddo
      enddo
      return
      end

This is a basic matrix multiply loop which takes less than a second on a
Convex C210.  If you are running on a fast machine you might want to change
256 to 1024 (for 64 times more work).  The matmul routine is separate from
the main routine so that the optimizer cannot eliminate the work unless it
performs interprocedural optimization or routine inlining.  Be sure not to
invoke any such options.

This toy benchmark focuses on the floating point performance of the
machine.  It should show the architecture in a relatively favorable light
if floating point is an important part of its product segment.  It is large
enough to blow most current cache systems if there is too great a disparity
between cache and non-cache processor performance.  It is not hard to
memorize or carry a copy around and type in.

On a Convex, execute with

      fc -O2 test.f -o test; /bin/time -e test

-O2 requests the vectorization optimization.  -O3 requests the
parallel/vectorization optimization.  The -e switch is a Convex extension
to /bin/time to provide extended accuracy to the microsecond level;
otherwise timing is only recorded to the nearest 100th of a second for
compatibility with previous releases.
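For systems without the Convex compiler, something along these lines should
do (a sketch, assuming a garden-variety f77 and that the program above was
saved as test.f; optimization flags vary by vendor):

      f77 -O test.f -o test
      /bin/time ./test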
djh@xipe.osc.edu (David Heisterberg) (12/06/90)
In article <109872@convex.convex.com> patrick@convex.COM (Patrick F. McGehearty) writes:
>      program main
>      real*8 a(256,256),b(256,256),c(256,256)
>      call matmul(a,b,c,256)
>...
>          do k = 1, n
>            c(i,j) = c(i,j) + a(i,k)*b(k,j)
>          enddo

Won't Convex performance suffer here due to bank conflicts?  At least on a
CRAY, the above algorithm would not (or should not) find a place in any
production code.  The inner loop would be run over "i".
--
David J. Heisterberg		djh@osc.edu
The Ohio Supercomputer Center	djh@ohstpy.bitnet
Columbus, Ohio  43212		ohstpy::djh
mhcoffin@watmsg.uwaterloo.ca (Michael Coffin) (12/06/90)
In article <109872@convex.convex.com> patrick@convex.COM (Patrick F. McGehearty) writes:
>I suggest that the bc benchmark is worse than worthless for several reasons.
>[ ... ]
>For workstations, the SPECmark benchmarks provide programs which are
>sufficiently complex as to avoid the trivial trick optimizations.
>[...]

It is only worse than worthless if it doesn't correlate well with real
performance.  Does anyone have SPECmark numbers for these machines?  It
would be interesting to run correlations between SPECmark and BCmark.  I'll
bet the correlation is high.

-mike
lars@spectrum.CMC.COM (Lars Poulsen) (12/06/90)
I can hear Eugene chuckling.  Most of the objections to the "bc benchmark"
are valid, but they apply in some form or another to any benchmark.  It's
just that this one is so simple that the issues become obvious.
--
/ Lars Poulsen, SMTS Software Engineer   CMC Rockwell   lars@CMC.COM
de5@ornl.gov (Dave Sill) (12/06/90)
In article <109872@convex.convex.com>, patrick@convex.COM (Patrick F. McGehearty) writes:
>
>I suggest that the bc benchmark is worse than worthless for several reasons.
>
>First, as pointed out, it is not measuring the raw add/multiply rate
>of the machine.

Thus it is worthless for measuring raw arithmetic speed.  Granted.

>It measures the "multi-precision arithmetic"
>capabilities as implemented by dc, which is mostly subroutine call/returns.

So how does that make it worthless?  E.g., assume I want to measure an
inexact mix of arithmetic, function calls, context switches, etc.  Does the
bc test not depend on these types of tasks?

>Further, I have never seen a system where bc/dc is a significant user of
>cycles.  Thus, the less than expert user will believe the measurements
>represent something different from reality.

I don't see how you got from the premise to your conclusion.  As for the
"less than expert user", if we fall into the trap of making twit-proof
benchmarks and outlawing anything that's not obvious and general, we're not
going to get anywhere.

>Second, for most machines, little architecture or compiler work has
>been (or should be) done to optimize this application.

Lack of optimization does not invalidate a benchmark.

>So you will not be
>able to tell the difference between those machines which have features
>useful to your application and those which do not.

The bc test shouldn't be used to attempt to predict the performance of
one's application unless one has specifically determined that such a
comparison is valid.  Granted.

>Third, widespread reporting of such a benchmark will encourage other,
>less knowledgeable buyers to read more into the numbers than should be
>read.

The twit-proof trap, again.  In this case, though, it's not really a big
issue because we're not proposing the bc test be added to the SPEC suite.
It's a limited-usefulness, extremely trivial benchmark.

>Fourth, if buyers use the benchmark, then vendors will be encouraged
>to put resources into enhancing their performance on it instead of
>enhancing something useful.  This is a bad thing and the primary reason
>why I am posting.  Bad benchmarks lead to lots of wasted effort.

Oh come on!  This is the twit-proof argument again.  If someone's stupid
enough to use bc to specify minimum performance in a procurement, then they
absolutely deserve what they get.  I'm certainly not going to lose any
sleep over the possibility.

>I use the Whetstone benchmark as a "proof by example".

It seems like you're using it as a proof by counterexample.

> :
>In these cases, the development efforts were not totally wasted.  Efforts to
>speed up the transcendental functions (SIN, COS, etc.) used in the Whetstones
>helped those applications which used the transcendentals.  I see no value
>to most users of general purpose computing (scientific or business) in
>optimizing bc/dc.

No vendor I know is stupid enough to devote their resources to optimizing
bc just because a handful of people use it as a trivial benchmark.  Even if
it happened, though, what harm would there be in that?

>Many procurements require some minimum rate on some well-known benchmark
>for a vendor to be even allowed to bid.  If you can't make this number, you
>don't get a chance to show how good your architecture and compilers are for
>executing the customer's real application.  There are even a significant
>number of customers who do not run benchmarks before purchase.  They just
>rely on quoted numbers for well-known benchmarks.  It is our duty as
>responsible professionals to develop and measure benchmarks that mean
>something and which explain what they mean.

Exactly.  This is what the major commercial suites are for.  As I've
already said, though, one doesn't want to have to carry a SPEC tape along
wherever one goes.  It's not always feasible to run a major, rigorous
suite, and in many cases it's overkill.

For example, and I've already pointed out this use of the bc test, let's
say I'm working at my DECstation one day and it seems to be sluggish.  Is
it just my imagination, or is it really slower?  Should I devote a day or
so to running the SPEC suite just to find out, or should I type "echo
2^5000/2^5000 | /bin/time bc" and compare it to previous runs on the same
machine or to a similarly configured system?

>If you really must have a "quick and dirty" benchmark, how about the
>following:
>
> [FORTRAN program deleted]

Some of my systems don't have FORTRAN compilers.  It's too long to easily
remember and type in.  It doesn't do a mix of function calls, arithmetic,
forks, context switches, etc.

As I've already said: benchmarks are tools.  They come in all sizes,
perform many different tasks, vary greatly in quality, etc.  The bc
benchmark doesn't do everything, and, like any tool, can be used for things
it's not intended to be used for, but it *does* have its niche.
--
Dave Sill (de5@ornl.gov)
Martin Marietta Energy Systems
Workstation Support
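A minimal sketch of that kind of before-and-after check (the log file name
is made up; /bin/time reports on stderr, so append that to the log and
compare against earlier runs):

      date >> $HOME/bc.times
      echo "2^5000/2^5000" | /bin/time bc 2>> $HOME/bc.times
      tail $HOME/bc.times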
jbuck@galileo.berkeley.edu (Joe Buck) (12/07/90)
In article <1990Dec6.000702.16354@watmath.waterloo.edu>, mhcoffin@watmsg.uwaterloo.ca (Michael Coffin) writes:
> It [bc benchmark] is only worse than worthless if it doesn't
> correlate well with real
> performance.  Does anyone have SPECmark numbers for these machines?
> It would be interesting to run correlations between SPECmark and
> BCmark.  I'll bet the correlation is high.

SPECmark is actually 10 numbers, not one.  It might be interesting to see
which of the 10 SPEC numbers correlates most strongly with the bc number,
and what the coefficient of correlation is.  Anyone have SPEC numbers to do
the experiment with?
--
Joe Buck
jbuck@galileo.berkeley.edu       {uunet,ucbvax}!galileo.berkeley.edu!jbuck
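For anyone who does have the numbers, the arithmetic is easy enough with
standard tools.  A sketch (assuming pairs of bc time and SPECmark on stdin,
one machine per line; it correlates SPECmark against 1/time, since a faster
machine gives a smaller time):

      awk '{ x = 1/$1; y = $2
             n++; sx += x; sy += y; sxx += x*x; syy += y*y; sxy += x*y }
           END { r = (n*sxy - sx*sy) / sqrt((n*sxx - sx*sx) * (n*syy - sy*sy))
                 print "r =", r }'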
ciemo@bananapc.wpd.sgi.com (Dave Ciemiewicz) (12/07/90)
In article <dale.660251271@convex.convex.com>, dale@convex.com (Dale Lancaster) writes:
|> >Amos> Sure it doesn't; I wonder how no one else noted this yet: "bc"
|> >Amos> is probably the worst choice of a utility to benchmark by.  On
|> >Amos> most UNIX systems, it just parses expressions, and forks "dc" to
|> >Amos> execute them [...]
|>
|> >Of course the biggest problem is that almost no one actually *uses*
|> >`bc' for any large amount of computation, so no vendor has any
|> >incentive to optimize its performance.
|>
|> For this reason it makes as good a benchmark as any.  I suspect that most
|> Unix-based bc's/dc's work the same.  This is a typical dusty-deck,
|> no-optimization piece of code that shows the performance of the machine
|> not just from the hardware side but from the software (compiler) side as
|> well.  Today you have to benchmark the compiler as much as the hardware.

I really can't believe that this joke is being perpetuated.

The idea of benchmarking is to create a reference standard by which
different machines may *MEANINGFULLY* be compared.  Since bc may actually
be coded differently between systems, the comparisons become weaker.

I just diff'ed the sources between BSD and SYSV versions of dc, which is
the compute engine for bc.  There are changes to the SYSV version for
robustness that may sway results one way or the other.  For all intents and
purposes, you might as well be comparing the execution of bc using
different stop watches.

Since it is purported that AIX is a complete rewrite of UNIX, they may have
even rewritten something as simple as dc to avoid licensing, which would
also be a plausible explanation for the faster performance on the RS6000.
It may not have anything to do with the IBM compilers.  (Please correct me
if I'm wrong.)

This mockery of a benchmark does not fall into the category of "unchanged,
dusty deck" code as many would believe.  We are measuring numbers with
different rulers.  Not only that, the rulers are the cheap plastic kind we
used in grade school.  When you put two of them side by side, you find out
they're different lengths.  Meaningless for precise comparison.  Possibly
meaningless for gross comparisons.

|> But of course the only true measure of performance
|> is my own application :-)
|>
|> dml

These are the best kind of comparisons of all.  When I run my application
on this system, what is the performance difference when I try to accomplish
some real task?  The ANSYS people port their product to a number of
architectures and then run the products through scripted sets of tasks for
comparisons.  This may not be as swell or buzzwordy as MIPS or Dhrystones;
yet these comparisons are better for an individual trying to choose between
systems for running the ANSYS products than any of the other "metrics".

Imagine buying a car based only on the horsepower, torque, and G ratings
without actually taking the car for a test drive.  The differences can be
astounding.
---
						David Ciemiewicz

P.S.  I am speaking for myself.  I like to live the delusion that my
company supports my opinions; however, it really depends on what I'm
talking about.
patrick@convex.COM (Patrick F. McGehearty) (12/07/90)
In article <1211@sunc.osc.edu> djh@xipe.osc.edu (David Heisterberg) writes:
>In article <109872@convex.convex.com> patrick@convex.COM (Patrick F. McGehearty) writes:
>>      program main
>>      real*8 a(256,256),b(256,256),c(256,256)
>>      call matmul(a,b,c,256)
>>...
>>          do k = 1, n
>>            c(i,j) = c(i,j) + a(i,k)*b(k,j)
>>          enddo
>
>Won't Convex performance suffer here due to bank conflicts?  At least on
>a CRAY, the above algorithm would not (or should not) find a place in any
>production code.  The inner loop would be run over "i".

Just goes to show you (and me) how easy it is to not measure what you think
you are measuring.  There are many valid matrix-multiply patterns, but
different patterns give different performance levels, with some better on
some machines than others.

The one I listed was on the top of my memory because it tests several of
the optimization features of the Convex fc compiler.  Thus, I use it
occasionally to be sure we don't lose these features.  In particular, it
tests "loop splitting" and "loop interchange".  Thus, the optimized version
of the loop (which is generated invisibly to the user) is:

      do i = 1,n          ! split off to allow interchange
        do j = 1,n
          c(i,j) = 0.0
        enddo
      enddo
      do k = 1,n
        do j = 1,n
          do i = 1,n      ! interchanged to avoid bank conflicts
            c(i,j) = c(i,j) + a(i,k)*b(k,j)
          enddo
        enddo
      enddo

This version would be better for testing compute capabilities on most
vector machines than the original.  The following version would be better
for testing compute capabilities on some machines with weak optimizers, but
worse for vector machines:

      do i = 1,n
        do j = 1,n
          sum = 0.0
          do k = 1,n
            sum = sum + a(i,k)*b(k,j)
          enddo
          c(i,j) = sum
        enddo
      enddo

The weak optimizers repeatedly compute the address of c(i,j) in the first
version, but not the second.  Unfortunately, the scalar sum causes less
efficient code to be generated for many vector machines.  So the same
problem (matrix multiply) can show either of two machines to be better
depending on small details of the problem statement.

Which reemphasizes the point that trivial benchmarks mislead the less than
expert.  Note that I carefully do not say 'twit'.  It is very easy to make
errors in benchmark development and analysis.  I have seen all sorts of
errors made in these areas by people who are competent in their own
specialties.  Most Computer Science and Computer Engineering curricula
provide very little training in measurement methodology.  I learned much
more about measurement from my Physics and Experimental Psychology courses
than I did from my CS training.  Physics labs teach about experimental
error, and Psychology teaches about experimental design.  The CS I was
exposed to focused on modeling and analysis, with some discussion of
modeling errors.  Given this lack of training in measurement, specialists
in the field need to be aware of the naivety of the users of our results.

A benchmark should have some well-understood relation to the real purposes
for which a machine is to be used in order to have value.  If the machine
is for program development, then measure compilation of non-trivial
programs.  If the machine is for numerical work, measure some non-trivial
application kernels.  If databases, then run some db ops.  Etc.  Any single
performance number without a definition of the proposed workload is of
little value.
jrbd@craycos.com (James Davies) (12/07/90)
In article <dale.660251271@convex.convex.com> dale@convex.com (Dale Lancaster) writes:
>
>For this reason it makes as good a benchmark as any.  I suspect that most
>Unix-based bc's/dc's work the same.  This is a typical dusty-deck,
>no-optimization piece of code that shows the performance of the machine
>not just from the hardware side but from the software (compiler) side as
>well.

I suspect it would take only a few hours of work on bc to add the peephole
optimization X/X ==> 1.  This would be totally pointless in practical
terms, but it might sell machines if everyone keeps taking this "benchmark"
so seriously :-)

>Today you have to benchmark the compiler as much as the hardware.

Unfortunately, this benchmark is too easy to cheat on (the above change
relies on neither the compiler nor the hardware, just bc itself).

>But of course the only true measure of performance is my own application :-)

That's the rationale for benchmarks using more realistic programs, like
SPEC and the Perfect Club.  I think the bc "benchmark" points up a major
reason that people like SPEC more than Perfect: it sums up a machine's
performance in a single number.  If the Perfect people started computing a
"PerfectMark" by averaging their times, they might get more press coverage,
but it wouldn't necessarily mean they were measuring anything more useful.
People like easy answers to hard questions, I guess.
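For what it's worth, such a rewrite is also trivial to dodge; variants
along these lines (a sketch -- the amount of work is not identical to the
original 2^5000/2^5000 form, so only compare like with like) still exercise
the multi-precision arithmetic while defeating a literal X/X ==> 1 pattern
match:

      echo "2^5000/(2^2500*2^2500)" | /bin/time bc
      echo "(2^5000+1)/2^5000" | /bin/time bc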
tim@proton.amd.com (Tim Olson) (12/07/90)
In article <1990Dec6.000702.16354@watmath.waterloo.edu> mhcoffin@watmsg.uwaterloo.ca (Michael Coffin) writes:
| It is only worse than worthless if it doesn't correlate well with real
| performance.  Does anyone have SPECmark numbers for these machines?
| It would be interesting to run correlations between SPECmark and
| BCmark.  I'll bet the correlation is high.

Well, it is fairly high, but that may be due to the limited intersection of
machines which have SPECMarks and machines listed on the bc benchmark.
Here's what I have found so far:

  SYSTEM                         TIME    SPEC    comp   ratio
  ------                         ----    ----    ----   -----
  MIPS RC6280                     2.8    44.0    44.0    1.00
  IBM RS/6000 Model 530           3.4    28.9    36.2    1.25
  IBM RS/6000 Model 730           3.5    29.0    35.1    1.21
  IBM RS/6000 320                 4.4    22.3    27.9    1.25
  Motorola 8612 33MHz 88k         5.2    17.8    23.6    1.33
  SGI 4D25                        8.1    12.2    15.2    1.25
  SPARCstation 1+                12.2    11.8    10.1    0.86
  VAX 6410 (4.0)                 12.8     6.8     9.6    1.41
  SPARCstation 1                 16.0     8.4     7.7    0.92

The "comp" column is a computed SPECMark based upon the bc time (1/t * 123)
-- the 123 was chosen as the scale factor (assuming a linear relationship
of 1/t with SPECMark) so that the first machine (MIPS RC6280) matched.  The
"ratio" column is comp divided by SPEC.

Note that the only CISC machine in the listing, a VAX 6410, has a high
ratio of "computed" SPECMark to real SPECMark, compared to the other
machines.  Also, the SPARCstations are predicted ~10% low, while the
MIPS-based machines are predicted 25% high.

*** PLEASE *** don't try to turn this into a quick way to guess a SPECMark;
numbers are for informational purposes, only!  ;-)
--
	-- Tim Olson
	Advanced Micro Devices
	(tim@amd.com)
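As a check on the "comp" column, the scale factor can be applied with bc
itself (a sketch; note that bc truncates rather than rounds at the
requested scale):

      echo "scale=2; 123/3.4" | bc    # 36.17 -- IBM RS/6000 530, listed as 36.2
      echo "scale=2; 123/12.8" | bc   # 9.60  -- VAX 6410, listed as 9.6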
lindsay@gandalf.cs.cmu.edu (Donald Lindsay) (12/07/90)
In article <1990Dec6.221422.5676@mozart.amd.com> tim@amd.com (Tim Olson) writes:
>The "comp" column is a computed SPECMark based upon the bc time (1/t *
>123) -- the 123 was chosen as the scale factor (assuming a linear
>relationship of 1/t with SPECMark) so that the first machine (MIPS
>RC6280) matched.

The Number You've All Been Waiting For:

	VAX 11/780 = 73.3

Evidently BCmarks come out smaller than SPECmarks.  But then, we already
knew that it was a budget benchmark.
--
Don		D.C.Lindsay
hooft@ruunsa.fys.ruu.nl (Rob Hooft) (12/07/90)
In <1211@sunc.osc.edu> djh@xipe.osc.edu (David Heisterberg) writes:
>In article <109872@convex.convex.com> patrick@convex.COM (Patrick F. McGehearty) writes:
>>      program main
>>      call matmul(a,b,c,256)
>>...
>>          do k = 1, n
>>            c(i,j) = c(i,j) + a(i,k)*b(k,j)
>>          enddo

>Won't Convex performance suffer here due to bank conflicts?

Convex is good!  See below for the compiler results.

                 Optimization by Loop for Routine MATMUL

  Line   Iter.      Reordering        Optimizing / Special      Exec.
  Num.   Var.     Transformation         Transformation         Mode
-----------------------------------------------------------------------------
     5     I         Dist
   5-1     I         Inter                                   FULL VECTOR
   5-2     I         Inter                                   FULL VECTOR
   6-1     J                                                 Scalar
   6-2     J                                                 Scalar
   8-2     K                                                 Scalar

  Line   Iter.    Analysis
  Num.   Var.
-----------------------------------------------------------------------------
   5-1     I      Interchanged to innermost
   5-2     I      Interchanged to innermost
--
Rob Hooft, Chemistry department, University of Utrecht.
hooft@hutruu54.bitnet   hooft@chem.ruu.nl   hooft@fys.ruu.nl
zeeff@b-tech.ann-arbor.mi.us (Jon Zeeff) (12/07/90)
>>I suggest that the bc benchmark is worse than worthless for several reasons.
>>[ ... ]
>
>It is only worse than worthless if it doesn't correlate well with real
>performance.  Does anyone have SPECmark numbers for these machines?
>It would be interesting to run correlations between SPECmark and
>BCmark.  I'll bet the correlation is high.

I agree.  Nobody claims it's a great benchmark, but I'd like to see more
proof (possibly in the form of comparisons to SPEC) instead of so much
speculation.  I'm surprised at how well it agreed with the rankings I
expected.
--
Jon Zeeff (NIC handle JZ)	 zeeff@b-tech.ann-arbor.mi.us
jfc@athena.mit.edu (John F Carr) (12/08/90)
>Today you have to benchmark the compiler as much as the hardware
True: I noticed that dc for the IBM RT was compiled with pcc (the portable C
compiler, which does only peephole optimization). I recompiled with a
better compiler and the time for the test dropped from 45 seconds to 25
seconds. On a VAX, using gcc instead of pcc made about a 10% difference.
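Reproducing that sort of comparison looks something like this (a sketch
with made-up file names; it assumes the dc source is at hand as dc.c, that
both compilers are installed, and uses the reverse-polish program bc feeds
to dc):

	cc  -O -o dc.pcc dc.c		# system compiler (pcc-derived on the RT)
	gcc -O -o dc.gcc dc.c
	echo "2 5000^ 2 5000^/ps." | /bin/time ./dc.pcc
	echo "2 5000^ 2 5000^/ps." | /bin/time ./dc.gcc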
The VAX 3 series and the IBM RT make an interesting comparison. DEC claims
4.9 MIPS for the faster versions of the VAX 3 (VAX 3100 model 38, VAX 3900).
The RT is generally considered to be around 3-4 MIPS (Dhrystone v1 says
4.4). The "dc" benchmark (bc 2^5000/2^5000 pipes "2 5000^ 2 5000^/ps." to
dc on BSD UNIX) runs twice as fast on a VAX 3900 when the best compilers
available are used. On the other hand, a big number math package I've been
using runs faster on the RT even though the VAX version is written in
assembly and the RT version in C.  Floating point is usually faster on the
VAX, but because the RT uses a 68881 coprocessor it does better on
transcendental functions. Let me pick the benchmark and I'll demonstrate
that my RT is faster than a DECstation 3100 (~12 MIPS).
--
John Carr (jfc@athena.mit.edu)