khb@chiba.kbierman@sun.com (Keith Bierman - SPD Advanced Languages) (01/03/90)
Tim Olson comments:
> .... integer mult in hw
> If they do, then they are not Instruction Set compatible. My copy of
> the SPARC architecture manual lists only a MULScc (Multiply Step and
> modify icc) instruction.

I don't recall supersets being outlawed in the v7 (currently
distributed) spec document.
Let's assume some random company wants to add integer multiply to
their SPARC box. Say some mythical company SolB.
SolB spins their own chip. With their own compilers they can obviously
generate whatever they want ... but what about all that Sun code ?
What about binary compatibility ?
When the Sun compilers can't optimize away an integer multiply they
generate a subroutine call (typically .mul as suggested in the v7
document).
So replacing .mul in the local versions of libc.so and libc.a would do
the trick nicely. Many (most ?) codes are dynamically linked, so they
get the speedup automatically. Those that are statically linked will
have to be relinked for speedup.
What about that nasty subroutine call overhead ? Well, if you don't
care about your code running on those evil Sun boxes, you can provide
(or your users can crack open their copies of the Floating Point
Programmer's Guide) a .il template which will cause the .mul
subroutine call to be replaced with the native instruction.
--
Keith H. Bierman |*My thoughts are my own. !! kbierman@sun.com
It's Not My Fault | MTS --Only my work belongs to Sun*
I Voted for Bill & | Advanced Languages/Floating Point Group
Opus | "When the going gets Weird .. the Weird turn PRO"
"There is NO defense against the attack of the KILLER MICROS!"
Eugene Brooks
thor@stout.ucar.edu (Rich Neitzel) (01/03/90)
With all the talk about this subject I do not recall seeing any
benchmarking of a SPARC, or of any other system for that matter. The
following table lists times generated by the Plum-Hall benchmark
routines. (They were posted a while back to comp.misc.sources.) There
are three things that really stand out to my mind:

1> Integer multiplication on the SPARC is horrible. However, even the
   HP RISC is bad. Since much of the code in the real world does more
   integer than floating point work, it appears that CISC can still
   more than hold its own.

2> In general, there is little difference between the RISC and CISC
   machines. For example, the 68030-based 9000/370, the 9000/835 and
   the SS1 all have very close timings. In fact, except for call+ret
   and floating point, the 68030 has better timings. Even among the
   "1st" generation machines, the 68020-based 3/260 beats the 4/260 in
   the same areas. The floating point timings are no real credit to
   the RISC systems, since they are measures of the FPA, not the RISC
   CPU.

3> Sun seemingly is playing fast and loose with users by claiming
   major improvements over their 680x0 line of machines. Worse,
   supposed upgrades are not living up to what one might expect. Note
   that the 3/80 is slower than the 3/260. Compare this to the HP
   68030 machine and one wonders if Sun is purposely limiting the
   performance of their CISC machines - doubtless because their profit
   margin is lower on these.
                      register auto  auto  int      function auto
                      int      short long  multiply call+ret double
-------------------------------------------------------------------------------
cc:
MVME-133 ()            .43   .74   .74  2.41   6.17   5.28
MVME-133 (-O4)         .43   .53   .43  2.25   5.66   2.04
Sun-3/260 ()          0.34  0.55  0.56  1.93   2.13   5.70
Sun-3/260 (-O4)       0.34  0.41  0.34  1.83   1.62   2.20
Sun-3/80 ()           0.47  0.68  0.70  2.57   2.87   4.37
Sun-3/80 (-O4)        0.44  0.56  0.45  2.42   2.20   1.90
Sun-3E ()             0.45  0.76  0.75  2.47   2.82   5.33
Sun-3E (-O4)          0.44  0.54  0.45  2.27   2.22   2.07
Sun-4 ()              0.54  0.55  0.48  4.80   0.72   1.20
Sun-4 (-O4)           0.41  0.44  0.40  4.45   0.67   1.00
HP9000/370 (fpa -O)   0.22  0.26  0.22  1.35   3.96   0.62
HP9000/370 (-O)       0.21  0.26  0.22  1.35   3.08   1.21
HP9000/370 (fpa no -O)0.26  0.40  0.36  1.44   4.42   1.56
HP9000/370 (no -O)    0.26  0.40  0.37  1.45   3.38   2.72
HP9000/835 (-O)       0.27  0.29  0.27  5.49   0.31   0.27
HP9000/835 (no -O)    0.29  0.53  0.45  5.62   0.31   0.59
Sun SS1 (no -O)       0.38  0.40  0.35  19.7   0.51   0.72
Sun SS1 (-O)          0.29  0.33  0.30  19.5   0.49   0.59
-------------------------------------------------------------------------------
Richard Neitzel
National Center For Atmospheric Research
Box 3000
Boulder, CO 80307-3000
303-497-2057
thor@thor.ucar.edu

Torren med sitt skjegg		Thor with the beard
lokkar borni under sole-vegg	calls the children to the sunny wall
Gjo'i med sitt shinn		Gjo with the pelts
jagar borni inn.		chases the children in.
-------------------------------------------------------------------------------
levisonm@qucis.queensu.CA (Mark Levison) (01/03/90)
Rich Neitzel posted an article suggesting that the timings he gave
using the Plum Hall benchmark show that RISC is often slower than
CISC. Although I don't have my C Users Journal handy, I recall that
the original article said that these benchmarks were designed to be
small and easily typed in at trade shows. They were also designed to
stop good optimising compilers from doing as well as they might. This
code is certainly not representative of typical code. A better place
to look might be the SPEC benchmarks.

Mark Levison
levisonm@qucis.queensu.ca
#include <std_disclaimer.h>
---------------------A man too cheap to buy a real signature-------------
mash@mips.COM (John Mashey) (01/03/90)
In article <5842@ncar.ucar.edu> thor@stout.UCAR.EDU (Rich Neitzel) writes:
>
>With all the talk about this subject I do not recall seeing any benchmarking
>of a sparc or any other system for that matter. The following table lists times
>generated by the Plum-Hall benchmark routines. (They were posted a while back
>to comp.misc.sources). There are three things that really stand out to my

Could somebody post the critical parts of this again so we can look at
it? Although I have high respect for Plum-Hall in general, I'm always
nervous about micro-level benchmarks. Now, I hate to have to defend
SPARC :-), but I must: realistic integer benchmarks that I know [like
the SPEC ones] simply don't correlate with the results claimed below,
at least not very much. The RISC machines are noticeably faster on
actual integer programs....

>	2> In general, there is little difference between the RISC and
>	CISC machines. For example the 68030 9000/370, 9000/835 and the SS1
>	all have very close timings. In fact, except for call+ret and
>	floating point, the 68030 has better timing. Even among the "1st"
>	generation machines the 68020 based 3/260 beats the 4/260 in the
>	same areas.

Again, this is why it would be nice to post the benchmark; one must
always be very careful of micro-level benchmarks: I just don't believe
that one can generalize from these results into thinking that a 33MHz
68030 has faster integer performance overall than the RISCs...
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
tim@nucleus.amd.com (Tim Olson) (01/04/90)
In article <34058@mips.mips.COM> mash@mips.COM (John Mashey) writes:
| In article <5842@ncar.ucar.edu> thor@stout.UCAR.EDU (Rich Neitzel) writes:
| >
| >With all the talk about this subject I do not recall seeing any benchmarking
| >of a sparc or any other system for that matter. The following table lists times
| >generated by the Plum-Hall benchmark routines. (They were posted a while back
| >to comp.misc.sources). There are three things that really stand out to my
| Could somebody post the critical parts of this again so we can
| look at it? Although I have high respect for Plum-Hall in general,
| I'm always nervous about micro-level benchmarks. Now, I hate to have
| to defend SPARC :-), but I must: realistic integer benchmarks
| that I know [like the SPEC ones] simply don't correlate with
| the results claimed below, at least not very much.
| The RISC machines are noticably faster on actual integer programs....

The benchmarks over-emphasize integer modulus. For example, the
benchmark that reportedly tests register-integer variables looks like:

/* benchreg - benchmark for register integers
 * Thomas Plum, Plum Hall Inc, 609-927-3770
 * If machine traps overflow, use an unsigned type
 * Let T be the execution time in milliseconds
 * Then average time per operator = T/major usec
 * (Because the inner loop has exactly 1000 operations)
 */
#define STOR_CL register
#define TYPE int
#include <stdio.h>
main(ac, av)
	int ac;
	char *av[];
	{
	STOR_CL TYPE a, b, c;
	long d, major, atol();
	static TYPE m[10] = {0};

	major = atol(av[1]);
	printf("executing %ld iterations\n", major);
	a = b = (av[1][0] - '0');
	for (d = 1; d <= major; ++d)
		{
		/* inner loop executes 1000 selected operations */
		for (c = 1; c <= 40; ++c)
			{
			a = a + b + c;
			b = a >> 1;
			a = b % 10;
			m[a] = a;
			b = m[a] - b - c;
			a = b == c;
			b = a | c;
			a = !b;
			b = a + c;
			a = b > c;
			}
		}
	printf("a=%d\n", a);
	}

and spends roughly 75% of its time performing the "%" operation.

-- 
	Tim Olson
	Advanced Micro Devices
	(tim@amd.com)
barnett@grymoire.crd.ge.com (Bruce Barnett) (01/04/90)
In article <5842@ncar.ucar.edu> thor@stout.UCAR.EDU (Rich Neitzel) writes:
|                      register auto  auto  int      function auto
|                      int      short long  multiply call+ret double
|cc:
|Sun-4 ()              0.54  0.55  0.48  4.80   0.72   1.20
|Sun-4 (-O4)           0.41  0.44  0.40  4.45   0.67   1.00
|Sun SS1 (no -O)       0.38  0.40  0.35  19.7   0.51   0.72
|Sun SS1 (-O)          0.29  0.33  0.30  19.5   0.49   0.59

I get:
-------
Sun4/110 (no -O)      0.37  0.69  0.62  5.90   1.11   1.07
Sun4/110 (-O4)        0.26  0.34  0.26  5.30   0.73   0.83
SS1 (no -O)           0.38  0.40  0.36  3.43   0.51   0.72
SS1 (-O4)             0.30  0.33  0.30  3.30   0.49   0.60

The Sun 4/110 has the new FPU. There are several different FPU units
around - Weitek and TI, I believe. I remember something about early
SparcStations being shipped with different FPUs.

Not all SparcStations are created equal?
--
Bruce G. Barnett	barnett@crd.ge.com	uunet!crdgw1!barnett
mash@mips.COM (John Mashey) (01/04/90)
In article <28594@amdcad.AMD.COM> tim@amd.com (Tim Olson) writes:
>In article <34058@mips.mips.COM> mash@mips.COM (John Mashey) writes:
>| Could somebody post the critical parts of this again so we can
>| look at it? Although I have high respect for Plum-Hall in general,
>| I'm always nervous about micro-level benchmarks. Now, I hate to have
>| to defend SPARC :-), but I must: realistic integer benchmarks
>| that I know [like the SPEC ones] simply don't correlate with
>| the results claimed below, at least not very much.
>| The RISC machines are noticably faster on actual integer programs....
>The benchmarks over-emphasize integer modulus. For example, the
>benchmark that reportedly tests register-integer variables looks like:
......
>and spends roughly 75% of its time performing the "%" operation.

Like I said, I'm always nervous about micro-level benchmarks, even
when done by smart people. Here is the summary, followed by details:

SUMMARY OF MY ADVICE:
1) Do NOT EVER use this benchmark to believe it means anything; if you
have a copy, throw it away.
2) FORGET any conclusions that anyone has posted about relative
performance of machines based on this benchmark, other than the
possible thought that integer multiply/divide don't happen to be done
in hardware on SPARC and HP PA.
3) If you've ever told anyone this means much, please tell them you're
sorry.

DETAILS:
Bo Thide kindly sent me a copy, and I took a quick look, finding
similar results to Tim's: optimized R3000 code spent 60% of the time
doing % (and remember, we have one of those in hardware...)

Tables were given that looked like this:

                      register auto  auto  int      function auto
                      int      short long  multiply call+ret double
-------------------------------------------------------------------------------
cc:
Sun-4 (-O4)           0.41  0.44  0.40  4.45   0.67   1.00
......

WHAT'S RIGHT:
1) The code is very carefully done to eliminate surprise optimization.

WHAT'S WRONG:
1,2,3) Columns 1, 2, and 3, which purport to measure the performance
of various integer code, are completely dominated by the modulus
operation, which is simply contrary to the statistics of the
overpowering bulk of code out there. It would be plausible to generate
something that had a mix of +, -, *, /, % and logic ops, using
carefully chosen frequencies from a number of real programs (and even
there, there are pitfalls), but something that does no * or /, and %
way out of proportion, is guaranteed to blast a SPARC about as badly
as it can be, relative to almost anything else. It won't help HP PAs
much either.... Also, for column 1, optimized, I got 0% loads and 5%
stores, rather than the more typical 20% and 10%.
4) This column indeed measures the speed of integer multiply, in such
a way that no compiler can do anything but do real multiplies with it.
5) This column measures the speed of function call/return with zero
arguments. Unfortunately, different programs have different
distributions of numbers and types of arguments, and many functions
have arguments. Different machines differ greatly in the cost of
passing arguments, and in the costs of passing different numbers of
arguments....
6) I haven't looked at the statistics of this much, except to notice
there are equal numbers of FP * and /, which is also atypical.
----
7) In general, although it's been said before in this newsgroup:
a) People design computers using the statistics of real programs.
b) The statistics of real programs differ, hence the tradeoffs you
make depend on the benchmarks chosen.
c) There are certain classes of codes for which at least one of
integer *, /, or % is important enough, and cannot be gotten rid of
even by a perfect compiler, where having these in hardware will help a
lot. Over many realistic programs, hardware helps about enough that
some people chose to include it, and some didn't. There's no way in
the world that having it makes a 2-3X performance difference, overall,
although you can find some real programs where it does.
d) Like many synthetic benchmarks, it simply doesn't have a mixture of
expressions that relates well to what real compilers see, i.e., there
is little that an optimizing compiler can do with this code, and a
small number of registers is completely adequate. Neither of these two
is generically true for real code.

Anyway, these benchmarks mostly measure integer multiply and divide;
these operations are where most RISCs have the least advantage over
most CISCs; these operations are definitely what anybody would use to
show that some CISC is faster than SPARC or PA; but it just doesn't
correlate very well with the speeds on real programs.
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD: 	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
irf@kuling.UUCP (Bo Thide') (01/05/90)
In article <4411@crdgw1.crd.ge.com> barnett@grymoire.crd.ge.com (Bruce Barnett) writes:
>I get:
>-------
>Sun4/110 (no -O)      0.37  0.69  0.62  5.90   1.11   1.07
>Sun4/110 (-O4)        0.26  0.34  0.26  5.30   0.73   0.83
>SS1 (no -O)           0.38  0.40  0.36  3.43   0.51   0.72
>SS1 (-O4)             0.30  0.33  0.30  3.30   0.49   0.60

I also get this now, but only if I use the inline library
/usr/lib/libm.il. Without it, integer multiplication takes 6 times
longer! Odd.

-Bo
bs@linus.UUCP (Robert D. Silverman) (01/05/90)
In article <28594@amdcad.AMD.COM> tim@amd.com (Tim Olson) writes:
:In article <34058@mips.mips.COM> mash@mips.COM (John Mashey) writes:
:| I'm always nervous about micro-level benchmarks. Now, I hate to have
:| to defend SPARC :-), but I must: realistic integer benchmarks
:| that I know [like the SPEC ones] simply don't correlate with
:| the results claimed below, at least not very much.
:| The RISC machines are noticably faster on actual integer programs....
:
:The benchmarks over-emphasize integer modulus. For example, the

Huh? I don't see multiple modulus operations in the loop below. I see
ONE. How can one modulus operation inside a loop "over-emphasize"
integer modulus?

:benchmark that reportedly tests register-integer variables looks like:
:
:/* benchreg - benchmark for register integers

Some code deleted. It contains a loop with perhaps 20 arithmetic
operations inside it. Only 1 involves division (and/or remainder).
Here's the loop contents:

:	for (c = 1; c <= 40; ++c)
:		{
:		a = a + b + c;
:		b = a >> 1;
:		a = b % 10;
:		m[a] = a;
:		b = m[a] - b - c;
:		a = b == c;
:		b = a | c;
:		a = !b;
:		b = a + c;
:		a = b > c;
:		}

:and spends roughly 75% of its time performing the "%" operation.

This is exactly my point! The fact that one operation takes 75% of the
run time for a loop with about 20 operations indicates how badly SPARC
does division. Most programs may not do a lot of division, but when a
program DOES require it, the performance of the SPARC is a joke.
--
Bob Silverman
#include <std.disclaimer>
Internet: bs@linus.mitre.org; UUCP: {decvax,philabs}!linus!bs
Mitre Corporation, Bedford, MA 01730
tim@nucleus.amd.com (Tim Olson) (01/08/90)
In article <85593@linus.UUCP> bs@gauss.UUCP (Robert D. Silverman) writes:
| Huh? I don't see multiple modulus operations in the loop below. I see ONE.
| How can one modulus operation inside a loop "over-emphasize" integer
| modulus?

Because it occurs at a much higher frequency in this loop than is
"normally" found in most programs. In our collection of benchmark
programs, out of 50K C lines (85K assembly lines), there were 104
integer division/modulo operations (most were integer division) --
about 0.1%. This is a static measurement -- I don't have the dynamic
frequency handy, but it is also small.

| :and spends roughly 75% of its time performing the "%" operation.
|
| This is exactly my point! The fact that one operation takes 75% of the run
| time for a loop with about 20 operations indicates how badly SPARC does
| division. Most programs may not do a lot of division, but when a program
| DOES require it, the performance of the SPARC is a joke.

Not just on SPARC -- it spends this amount of time on many
architectures. It is very hard to greatly speed up division (although
the TI guys seem to have done it). Division will usually be slower
than other arithmetic operations by an order of magnitude.

-- 
	Tim Olson
	Advanced Micro Devices
	(tim@amd.com)
mash@mips.COM (John Mashey) (01/08/90)
In article <85593@linus.UUCP> bs@gauss.UUCP (Robert D. Silverman) writes:
>In article <28594@amdcad.AMD.COM> tim@amd.com (Tim Olson) writes:
>:In article <34058@mips.mips.COM> mash@mips.COM (John Mashey) writes:
>
>:| I'm always nervous about micro-level benchmarks. Now, I hate to have
>:| to defend SPARC :-), but I must: realistic integer benchmarks
>:| that I know [like the SPEC ones] simply don't correlate with
>:| the results claimed below, at least not very much.
>:| The RISC machines are noticably faster on actual integer programs....
>:
>:The benchmarks over-emphasize integer modulus. For example, the
>
>Huh? I don't see multiple modulus operations in the loop below. I see ONE.
>How can one modulus operation inside a loop "over-emphasize" integer
>modulus?

Sorry, I should have been clearer, and said: "The benchmarks
over-emphasize integer modulus compared with the dynamic frequencies
of most programs."

Consider 3 kinds of benchmarks:
1) Real programs, of substantial size.
2) Micro-level benchmarks that measure specific operations (like +, -,
*, /, %, etc) and label their results that way.
3) Small synthetic mixes (which is what the benchmark under discussion
is doing).

I prefer 1), but 2) is OK if it's carefully labeled, and can provide
useful information, although one must be careful to avoid compiler
effects. For example, if one has those numbers, and if one has
applications that are known to use integer *, /, % (as you do!), it
would be pretty clear which machines would be good or bad, in this
case.

However, I don't like 3) when the results are misinterpreted (and they
usually are). It picks a specific relative frequency of operations,
and then people run around claiming that this predicts integer
performance. This is nonsense:
1) Programs differ widely in their frequencies of operations.
2) If one wanted a simple loop that approximated the frequencies of
UNIX C programs (like nroff/diff/grep....) and CASE/CAD integer tools
(like ccom, as, ...espresso, etc), you'd look at the statistics of
these things, and make up a loop that approximated this. You'd want
something whose relative frequencies were:
	+ -
	* (some mixture of * by constants & by variables)
	  (and definitely many fewer *'s than +'s)
	/ (less than *)
	% (less than /)
3) Now, for a variety of reasons, I wouldn't myself make up such a
benchmark as a predictor, but it would certainly be better than the
statistics of the benchmark being discussed.

Anyway, I agree that SPARC performs poorly if you have the kind of
program (like multiple-precision integer work, especially, or various
others we've run across) where integer mul/div/modulus are
inescapable, and where it doesn't even have a divide-step to help out.
This does illustrate the nature of tradeoffs, and the care needed when
selecting instructions: you should use the statistics of real
programs, as many as possible, and you have to be careful that you
don't leave out something that can make a huge difference, even if
many programs don't need it.

Remember: we put *, /, % in hardware, even though our predecessor
Stanford MIPS and many other RISCs didn't, so I'm hardly arguing
against them :-) I'm delighted to see SPARC dinged for not having them
:-) But it's not fair to claim that an arbitrary mix of operations
proves that SPARC (and then other RISCs, as though SPARC = RISC) has
bad overall integer performance.
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD: 	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086