[comp.arch] Integer/Multiply/Divide on Sparc

khb@chiba.kbierman@sun.com (Keith Bierman - SPD Advanced Languages) (01/03/90)

-- Tim Olson comments:

.... integer mult in hw

>If they do, then they are not Instruction Set compatible.  My copy of
>the SPARC architecture manual lists only a MULScc (Multiply Step and
>modify icc) instruction.

I don't recall supersets being outlawed in the v7 (currently
distributed spec) document.

Let's assume some random company wants to add integer multiply to
their SPARC box. Say some mythical company SolB.

SolB spins their own chip. With their own compilers they can obviously
generate whatever they want ... but what about all that Sun code ?
What about binary compatibility ?

When the Sun compilers can't optimize away an integer multiply, they
generate a subroutine call (typically .mul, as suggested in the v7
document).

So replacing .mul in the local versions of libc.so and libc.a would do
the trick nicely. Many (most?) codes are dynamically linked, so they
get the speedup automatically. Those that are statically linked will
have to be relinked for the speedup.

What about that nasty subroutine call overhead ? Well if you don't
care about your code running on those evil Sun boxes, you can provide
(or your users can crack open their copies of the Floating Point
Programmer's Guide) a .il template which will cause the .mul
subroutine call to be replaced with the native instruction.
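
For anyone who has never looked inside such a routine, here is a minimal
C sketch of the shift-and-add idea a software multiply helper is built
on.  It is an illustration only: the real .mul is hand-coded SPARC
assembly (typically built around the MULScc step instruction), and
soft_umul is just a name invented for this sketch.

/* Minimal shift-and-add multiply sketch -- illustration only, not
 * the actual SunOS .mul routine (which is hand-written SPARC
 * assembly using MULScc steps).
 */
unsigned long
soft_umul(unsigned long x, unsigned long y)
{
        unsigned long result = 0;

        while (y != 0) {
                if (y & 1)              /* low multiplier bit set: add in x */
                        result += x;
                x <<= 1;                /* shift multiplicand for the next bit */
                y >>= 1;                /* consume one multiplier bit */
        }
        return result;
}

Each pass consumes one multiplier bit, which is why a library multiply
costs tens of instructions on top of the call overhead, and why an
inline template or a native multiply instruction is worth having.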





	
--
Keith H. Bierman    |*My thoughts are my own. !! kbierman@sun.com
It's Not My Fault   |	MTS --Only my work belongs to Sun* 
I Voted for Bill &  | Advanced Languages/Floating Point Group            
Opus                | "When the going gets Weird .. the Weird turn PRO"

"There is NO defense against the attack of the KILLER MICROS!"
			Eugene Brooks

thor@stout.ucar.edu (Rich Neitzel) (01/03/90)

With all the talk about this subject I do not recall seeing any benchmarking
of a sparc or any other system for that matter. The following table lists times
generated by the Plum-Hall benchmark routines. (They were posted a while back
to comp.misc.sources). There are three things that really stand out to my
mind:

	1> Integer multiplication on the SPARC is horrible. However, even
	   the HP RISC is bad. Since much of the code in the real world
	   does more integer than floating point work, it appears that CISC
	   can still more than hold its own.

	2> In general, there is little difference between the RISC and
	   CISC machines. For example, the 68030-based 9000/370, the
	   9000/835 and the SS1 all have very close timings. In fact,
	   except for call+ret and floating point, the 68030 has better
	   timings. Even among the "1st" generation machines, the
	   68020-based 3/260 beats the 4/260 in the same areas. The
	   floating point timings are no real win for the RISC systems,
	   since they are measures of the FPA, not the RISC CPU.

	3> Sun seemingly is playing fast and loose with users by claiming 
	   major improvements over their 680x0 line of machines. Worse, 
	   supposed upgrades are not living up to what one might expect.
	   Note that the 3/80 is slower than the 3/260. Compare this
	   to the HP 68030 machine and one wonders if Sun is purposely 
	   limiting the performance of their CISC machines - doubtless 
	   because their profit margin is lower on these.

                     register  auto      auto       int    function    auto
                       int     short     long    multiply  call+ret    double
-------------------------------------------------------------------------------
cc:
MVME-133  ()          .43       .74       .74      2.41      6.17      5.28    
MVME-133  (-O4)       .43       .53       .43      2.25      5.66      2.04    
Sun-3/260 ()         0.34      0.55      0.56      1.93      2.13      5.70    
Sun-3/260 (-O4)      0.34      0.41      0.34      1.83      1.62      2.20    
Sun-3/80  ()         0.47      0.68      0.70      2.57      2.87      4.37    
Sun-3/80  (-O4)      0.44      0.56      0.45      2.42      2.20      1.90    
Sun-3E    ()         0.45      0.76      0.75      2.47      2.82      5.33    
Sun-3E    (-O4)      0.44      0.54      0.45      2.27      2.22      2.07    
Sun-4     ()         0.54      0.55      0.48      4.80      0.72      1.20    
Sun-4     (-O4)      0.41      0.44      0.40      4.45      0.67      1.00    
HP9000/370 (fpa -O)  0.22      0.26      0.22      1.35      3.96      0.62
HP9000/370 (-O)      0.21      0.26      0.22      1.35      3.08      1.21
HP9000/370(fpa no -O)0.26      0.40      0.36      1.44      4.42      1.56
HP9000/370 (no -O)   0.26      0.40      0.37      1.45      3.38      2.72
HP9000/835 (-O)      0.27      0.29      0.27      5.49      0.31      0.27 
HP9000/835 (no -O)   0.29      0.53      0.45      5.62      0.31      0.59
Sun SS1 (no -O)      0.38      0.40      0.35     19.7       0.51      0.72
Sun SS1 (-O)         0.29      0.33      0.30     19.5       0.49      0.59
-------------------------------------------------------------------------------
			Richard Neitzel
			National Center For Atmospheric Research
			Box 3000
			Boulder, CO 80307-3000
			303-497-2057

			thor@thor.ucar.edu

    	Torren med sitt skjegg		Thor with the beard
    	lokkar borni under sole-vegg	calls the children to the sunny wall
    	Gjo'i med sitt shinn		Gjo with the pelts
    	jagar borni inn.		chases the children in.




-------------------------------------------------------------------------------

levisonm@qucis.queensu.CA (Mark Levison) (01/03/90)

  Rich Neitzel posted an article suggesting that the timings he gave
using the Plum Hall benchmark show that RISC is often slower than
CISC. Although I don't have my C Users Journal handy, I recall that
the original article said that these benchmarks were designed to be
small and easily typed in at trade shows. They were also designed to
stop good optimising compilers from doing as well as they might. This
code is certainly not representative of typical code. A better place
to look might be the SPEC benchmarks.

Mark Levison
levisonm@qucis.queensu.ca
#include <std_disclaimer.h>
---------------------A man too cheap to buy a real signature-------------

mash@mips.COM (John Mashey) (01/03/90)

In article <5842@ncar.ucar.edu> thor@stout.UCAR.EDU (Rich Neitzel) writes:
>
>With all the talk about this subject I do not recall seeing any benchmarking
>of a sparc or any other system for that matter. The following table lists times
>generated by the Plum-Hall benchmark routines. (They were posted a while back
>to comp.misc.sources). There are three things that really stand out to my
	Could somebody post the critical parts of this again so we can
	look at it?  Although I have high respect for Plum-Hall in general,
	I'm always nervous about micro-level benchmarks.  Now, I hate to have
	to defend SPARC :-), but I must: realistic integer benchmarks
	that I know [like the SPEC ones] simply don't correlate with
	the results claimed below, at least not very much.
	The RISC machines are noticeably faster on actual integer programs....

>	2> In general, there is little difference between the RISC and 
>	   CISC machines. For example the 68030 9000/370, 9000/835 and the SS1
>	   all have very close timings. In fact, except for call+ret and 
>	   floating point, the 68030 has better timing. Even among the "1st"
>	   generation machines the 68020 based 3/260 beats the 4/260 in the
>	   same areas.

Again, this is why it would be nice to post the benchmark; one must
always be very careful of micro-level benchmarks: I just don't believe
that one can generalize from these results into thinking that a
33 MHz 68030 has faster integer performance overall than the RISCs...
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

tim@nucleus.amd.com (Tim Olson) (01/04/90)

In article <34058@mips.mips.COM> mash@mips.COM (John Mashey) writes:
| In article <5842@ncar.ucar.edu> thor@stout.UCAR.EDU (Rich Neitzel) writes:
| >
| >With all the talk about this subject I do not recall seeing any benchmarking
| >of a sparc or any other system for that matter. The following table lists times
| >generated by the Plum-Hall benchmark routines. (They were posted a while back
| >to comp.misc.sources). There are three things that really stand out to my
| 	Could somebody post the critical parts of this again so we can
| 	look at it?  Although I have high respect for Plum-Hall in general,
| 	I'm always nervous about micro-level benchmarks.  Now, I hate to have
| 	to defend SPARC :-), but I must: realistic integer benchmarks
| 	that I know [like the SPEC ones] simply don't correlate with
| 	the results claimed below, at least not very much.
| 	The RISC machines are noticeably faster on actual integer programs....

The benchmarks over-emphasize integer modulus.  For example, the
benchmark that reportedly tests register-integer variables looks like:

/* benchreg - benchmark for  register  integers 
 * Thomas Plum, Plum Hall Inc, 609-927-3770
 * If machine traps overflow, use an  unsigned  type 
 * Let  T  be the execution time in milliseconds
 * Then  average time per operator  =  T/major  usec
 * (Because the inner loop has exactly 1000 operations)
 */
#define STOR_CL register
#define TYPE int
#include <stdio.h>
main(ac, av)
        int ac;
        char *av[];
        {
        STOR_CL TYPE a, b, c;
        long d, major, atol();
        static TYPE m[10] = {0};

        major = atol(av[1]);
        printf("executing %ld iterations\n", major);
        a = b = (av[1][0] - '0');
        for (d = 1; d <= major; ++d)
                {
                /* inner loop executes 1000 selected operations */
                for (c = 1; c <= 40; ++c)
                        {
                        a = a + b + c;
                        b = a >> 1;
                        a = b % 10;
                        m[a] = a;
                        b = m[a] - b - c;
                        a = b == c;
                        b = a | c;
                        a = !b;
                        b = a + c;
                        a = b > c;
                        }
                }
        printf("a=%d\n", a);
        }

and spends roughly 75% of its time performing the "%" operation.
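
(If you don't have a profiler handy, a rough way to check a figure like
that is to time the loop as written and again with the "%" swapped for
a cheap masking operation, then compare.  The C sketch below does that
with clock(); it is only an approximation, and is not how the figure
above was obtained.)

/* Rough timing sketch -- an approximation, not the measurement
 * quoted above.  Runs the inner loop once with the "%" and once
 * with a cheap "&" standing in for it, and compares the times.
 */
#include <stdio.h>
#include <time.h>

static int m[10];

static long
run(long major, int use_mod)
{
        register int a, b, c;
        long d;
        clock_t t0 = clock();

        a = b = 1;
        for (d = 1; d <= major; ++d)
                for (c = 1; c <= 40; ++c) {
                        a = a + b + c;
                        b = a >> 1;
                        a = use_mod ? b % 10 : (b & 7); /* "&" stands in for "%" */
                        m[a] = a;
                        b = m[a] - b - c;
                        a = b == c;
                        b = a | c;
                        a = !b;
                        b = a + c;
                        a = b > c;
                }
        printf("a=%d\n", a);            /* keep the work live */
        return (long)(clock() - t0);
}

int
main(void)
{
        long full = run(200000L, 1);
        long nomod = run(200000L, 0);

        printf("time attributable to %%: about %.0f%%\n",
               full > 0 ? 100.0 * (full - nomod) / full : 0.0);
        return 0;
}
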
	-- Tim Olson
	Advanced Micro Devices
	(tim@amd.com)

barnett@grymoire.crd.ge.com (Bruce Barnett) (01/04/90)

In article <5842@ncar.ucar.edu> thor@stout.UCAR.EDU (Rich Neitzel) writes:
|                     register  auto      auto       int    function    auto
|                       int     short     long    multiply  call+ret    double
|cc:
|Sun-4     ()         0.54      0.55      0.48      4.80      0.72      1.20    
|Sun-4     (-O4)      0.41      0.44      0.40      4.45      0.67      1.00    
|Sun SS1 (no -O)      0.38      0.40      0.35     19.7       0.51      0.72
|Sun SS1 (-O)         0.29      0.33      0.30     19.5       0.49      0.59

I get:
-------
Sun4/110 (no -O)     0.37      0.69      0.62      5.90      1.11      1.07    
Sun4/110 (-O4)       0.26      0.34      0.26      5.30      0.73      0.83    
SS1 (no -O)          0.38      0.40      0.36      3.43      0.51      0.72    
SS1 (-O4)            0.30      0.33      0.30      3.30      0.49      0.60    

The Sun 4/110 has the new FPU. 

There are several different FPUs around -- Weitek and TI, I believe.
I remember something about early SparcStations being shipped with different FPUs.

Not all SparcStations are created equal?
-- 
Bruce G. Barnett	barnett@crd.ge.com	uunet!crdgw1!barnett

mash@mips.COM (John Mashey) (01/04/90)

In article <28594@amdcad.AMD.COM> tim@amd.com (Tim Olson) writes:
>In article <34058@mips.mips.COM> mash@mips.COM (John Mashey) writes:
>| 	Could somebody post the critical parts of this again so we can
>| 	look at it?  Although I have high respect for Plum-Hall in general,
>| 	I'm always nervous about micro-level benchmarks.  Now, I hate to have
>| 	to defend SPARC :-), but I must: realistic integer benchmarks
>| 	that I know [like the SPEC ones] simply don't correlate with
>| 	the results claimed below, at least not very much.
>| 	The RISC machines are noticeably faster on actual integer programs....

>The benchmarks over-emphasize integer modulus.  For example, the
>benchmark that reportedly tests register-integer variables looks like:
......
>and spends roughly 75% of its time performing the "%" operation.

Like I said, I'm always nervous about micro-level benchmarks, even when done
by smart people.  Here is the summary, followed by details:

SUMMARY OF MY ADVICE:
	1) do NOT EVER use this benchmark to believe it means anything;
	if you have a copy, throw it away.
	2) FORGET any conclusions that anyone has posted about relative
	performance of machines, based on this benchmark, other than
	the observation that integer multiply/divide don't happen
	to be done in hardware on SPARC and HP PA.
	3) If you've ever told anyone this means much, please tell them
	you're sorry.
DETAILS:
Bo Thide kindly sent me a copy, and I took a quick look, finding similar
results to Tim's:
	optimized R3000 code spent 60% of the time doing % (and remember, we
		have one of those in hardware...)

Tables were given that looked like this:
                     register  auto      auto       int    function    auto
                       int     short     long    multiply  call+ret    double
-------------------------------------------------------------------------------
cc:
Sun-4     (-O4)      0.41      0.44      0.40      4.45      0.67      1.00    
......

WHAT'S RIGHT:
1) The code is very carefully done to eliminate surprise optimization.

WHAT'S WRONG:
1,2,3) Columns 1, 2, and 3, which purport to measure the performance of various
integer code, are completely dominated by the modulus operation, which is
simply contrary to the statistics of the overpowering bulk of code out there.
It would be plausible to generate something that had a mix of +, -, *, /, %
and logic ops, using carefully chosen frequencies from a number of real
programs (and even there, there are pitfalls), but something that does no *
or /, and % way out of proportion, is guaranteed to blast a SPARC about
as badly as it can be, relative to almost anything else.  It won't help
HP PAs much either....
Also, for column 1, optimized, I got 0% loads, and 5% stores, rather than
the more typical 20% and 10%.

4) This column indeed measures the speed of integer multiply, in such a way
that no compiler can do anything but do real multiplies with it.

5) This column measures the speed of function call/return with zero arguments.
Unfortunately, different programs have different distributions of numbers and
types of arguments, and many functions have arguments.  Different machines
differ greatly in the cost of passing arguments, and in the costs of passing
different numbers of arguments....

6) I haven't looked at the statistics of this much, except to notice there are
equal numbers of FP * and /, which is also atypical.
----
7) In general, although it's been said before in this newsgroup:
	a) People design computers using the statistics of real programs.
	b) The statistics of real programs differ, hence the tradeoffs you
	make depend on the benchmarks chosen.
	c) There are certain classes of code for which at least one of
	integer *, /, or % is important enough, and cannot be gotten rid
	of even by a perfect compiler, such that having these in hardware
	will help a lot.  Over many realistic programs, hardware helps
	about enough that some people chose to include it, and some didn't.
	There's no way in the world that having it makes a 2-3X performance
	difference, overall, although you can find some real programs where it
	does.
	d) Like many synthetic benchmarks, it simply doesn't have a mixture
	of expressions that relates well to what real compilers see, i.e., there
	is little that an optimizing compiler can do with this code, and a small
	number of registers is completely adequate.  Neither of these two
	is generically true for real code.

Anyway, these benchmarks mostly measure integer multiply and divide;
these operations are where most RISCs have the least advantage over
most CISCs; these operations are definitely what anybody would use to
show that some CISC is faster than SPARC or PA; but they just don't
correlate very well with the speeds of real programs.
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

irf@kuling.UUCP (Bo Thide') (01/05/90)

In article <4411@crdgw1.crd.ge.com> barnett@grymoire.crd.ge.com (Bruce Barnett) writes:
>I get:
>-------
>Sun4/110 (no -O)     0.37      0.69      0.62      5.90      1.11      1.07    
>Sun4/110 (-O4)       0.26      0.34      0.26      5.30      0.73      0.83    
>SS1 (no -O)          0.38      0.40      0.36      3.43      0.51      0.72    
>SS1 (-O4)            0.30      0.33      0.30      3.30      0.49      0.60    

I also get this now, but only if I use the inline library /usr/lib/libm.il.
Without it, integer multiplication takes 6 times longer!  Odd.


-Bo

bs@linus.UUCP (Robert D. Silverman) (01/05/90)

In article <28594@amdcad.AMD.COM> tim@amd.com (Tim Olson) writes:
:In article <34058@mips.mips.COM> mash@mips.COM (John Mashey) writes:
 
:| 	I'm always nervous about micro-level benchmarks.  Now, I hate to have
:| 	to defend SPARC :-), but I must: realistic integer benchmarks
:| 	that I know [like the SPEC ones] simply don't correlate with
:| 	the results claimed below, at least not very much.
:| 	The RISC machines are noticeably faster on actual integer programs....
:
:The benchmarks over-emphasize integer modulus.  For example, the
 
Huh? I don't see multiple modulus operations in the loop below. I see ONE.
How can one modulus operation inside a loop "over-emphasize" integer
modulus?

:benchmark that reportedly tests register-integer variables looks like:
:
:/* benchreg - benchmark for  register  integers 
 
Some code deleted. It contains a loop with perhaps 20 arithmetic operations
inside it. Only 1 involves division (and/or remainder). Here's the loop contents:

:                for (c = 1; c <= 40; ++c)
:                        {
:                        a = a + b + c;
:                        b = a >> 1;
:                        a = b % 10;
:                        m[a] = a;
:                        b = m[a] - b - c;
:                        a = b == c;
:                        b = a | c;
:                        a = !b;
:                        b = a + c;
:                        a = b > c;
:                        }
:and spends roughly 75% of its time performing the "%" operation.
 
This is exactly my point! The fact that one operation takes 75% of the run
time for a loop with about 20 operations indicates how badly SPARC does
division. Most programs may not do a lot of division, but when a program
DOES require it, the performance of the SPARC is a joke.
-- 
Bob Silverman
#include <std.disclaimer>
Internet: bs@linus.mitre.org; UUCP: {decvax,philabs}!linus!bs
Mitre Corporation, Bedford, MA 01730

tim@nucleus.amd.com (Tim Olson) (01/08/90)

In article <85593@linus.UUCP> bs@gauss.UUCP (Robert D. Silverman) writes:
| Huh? I don't see multiple modulus operations in the loop below. I see ONE.
| How can one modulus operation inside a loop "over-emphasize" integer
| modulus?

Because it occurs at a much higher frequency in this loop than is
"normally" found in most programs.  In our collection of benchmark
programs, out of 50K C lines (85K assembly lines) there were 104
integer division/modulo operations (most were integer division) --
about 0.1%.  This is a static measurement -- I don't have the dynamic
frequency handy, but it is also small.

| :and spends roughly 75% of its time performing the "%" operation.
|  
| This is exactly my point! The fact that one operation takes 75% of the run
| time for a loop with about 20 operations indicates how badly SPARC does
| division. Most programs may not do a lot of division, but when a program
| DOES require it, the performance of the SPARC is a joke.

Not just on SPARC -- it spends this amount of time on many
architectures.  It is very hard to greatly speed up division (although
the TI guys seem to have done it).  Division will usually be slower
than other arithmetic operations by an order of magnitude.
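
A small C sketch makes the order-of-magnitude point concrete: the
textbook shift-and-subtract (restoring) algorithm below resolves one
quotient bit per pass, so a 32-bit divide is on the order of 32
dependent steps whether it runs in microcode, a hardware sequencer, or
a library helper.  It assumes 32-bit unsigned longs and is an
illustration only, not any particular vendor's .div or .rem routine.

/* Restoring (shift-and-subtract) division sketch: one quotient bit
 * per iteration.  Assumes 32-bit unsigned longs; illustration only.
 */
unsigned long
soft_udiv(unsigned long num, unsigned long den, unsigned long *remp)
{
        unsigned long quot = 0, rem = 0;
        int i;

        for (i = 31; i >= 0; --i) {     /* one quotient bit per pass */
                rem = (rem << 1) | ((num >> i) & 1);
                if (rem >= den) {       /* the divisor fits: subtract */
                        rem -= den;
                        quot |= 1UL << i;
                }
        }
        if (remp)
                *remp = rem;
        return quot;
}

Hardware dividers mostly win by retiring more than one quotient bit
per cycle, not by escaping that serial dependence entirely.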

	-- Tim Olson
	Advanced Micro Devices
	(tim@amd.com)

mash@mips.COM (John Mashey) (01/08/90)

In article <85593@linus.UUCP> bs@gauss.UUCP (Robert D. Silverman) writes:
>In article <28594@amdcad.AMD.COM> tim@amd.com (Tim Olson) writes:
>:In article <34058@mips.mips.COM> mash@mips.COM (John Mashey) writes:
> 
>:| 	I'm always nervous about micro-level benchmarks.  Now, I hate to have
>:| 	to defend SPARC :-), but I must: realistic integer benchmarks
>:| 	that I know [like the SPEC ones] simply don't correlate with
>:| 	the results claimed below, at least not very much.
>:| 	The RISC machines are noticeably faster on actual integer programs....
>:
>:The benchmarks over-emphasize integer modulus.  For example, the
> 
>Huh? I don't see multiple modulus operations in the loop below. I see ONE.
>How can one modulus operation inside a loop "over-emphasize" integer
>modulus?

Sorry, I should have been clearer, and said:
"The benchmarks over-emphasize integer modulus compared with the
dynamic frequencies of most programs."

Consider 3 kinds of benchmarks:
	1) Real programs, of substantial size.
	2) Micro-level benchmarks that measure specific operations (like
	+, -, *, /, %, etc.), and label their results that way.
	3) Small synthetic mixes (which is what the benchmark under discussion
	is doing).

I prefer 1), but 2) is OK if it's carefully labeled, and it can provide useful
information, although one must be careful to avoid compiler effects.  For
example, if one has those numbers, and if one has applications that are known
to use integer *, /, % (as you do!), it would be pretty clear which machines
would be good or bad in this case.

However, I don't like 3), when the results are misinterpreted (and they
usually are).  It picks a specific relative frequency of operations,
and then people run around claiming that this predicts integer performance.
This is nonsense:
	1) Programs differ widely in their frequencies of operations.
	2) If one wanted a simple loop that approximated the frequencies
	of UNIX C programs (like nroff/diff/grep....) and CASE/CAD integer
	tools (like ccom, as, ...espresso, etc), you'd look at the statistics
	of these things, and make up a loop that approximated this (a rough
	sketch of such a loop appears after this list).
	You'd want something whose relative frequencies were:
		+
		-
		* (some mixture of * by constants & by variables)
		  (and definitely many fewer *'s than +'s)
		/ (less than *)
		% (less than /)
	3) Now, for a variety of reasons, I wouldn't myself make up such a
	benchmark as a predictor, but it would certainly be better than
	the statistics of the benchmark being discussed.
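
A rough C sketch of that kind of loop follows.  The particular operator
frequencies in it are invented purely for illustration; a usable
predictor would take them from measured statistics of real programs, as
described above.

/* Invented operator mix, illustration only: adds/subtracts dominate,
 * multiplies are rarer (some by a constant, some by a variable),
 * divides rarer still, modulus rarest.  Unsigned arithmetic keeps
 * wraparound well defined.
 */
#include <stdio.h>

int
main(void)
{
        register unsigned int a = 3, b = 7;
        register int c;
        long i;

        for (i = 0; i < 1000000L; ++i)
                for (c = 1; c <= 20; ++c) {
                        a = a + b + c;          /* adds and subtracts dominate */
                        b = a - c;
                        a = a + b;
                        b = b - 1;
                        a = a + c;
                        b = a * 3;              /* multiply by a constant */
                        a = b - c;
                        b = a + c;
                        if (c == 10)
                                a = a * b;      /* occasional multiply by a variable */
                        if (c == 20)
                                b = a / (unsigned)(c + 1);  /* divide, rarer */
                        if (c == 20 && (i & 1))
                                a = b % 7;      /* modulus, rarest of all */
                }
        printf("a=%u b=%u\n", a, b);            /* keep the work live */
        return 0;
}
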
Anyway, I agree that SPARC performs poorly if you have the kind of program
(like multiple-precision integer work, especially, or various others
we've run across) where integer mul/div/modulus are inescapable,
and where it doesn't even have a divide-step to help out.  This does
illustrate the nature of tradeoffs, and the care needed when selecting 
instructions: you should use the statistics of real programs, as many as
possible, and you have to be careful that you don't leave out something
that can make a huge difference, even if many programs don't need that.
Remember: we put *, /, % in hardware,
even though our predecessor Stanford MIPS and many other RISCs didn't,
so I'm hardly arguing against them :-)  I'm delighted to see SPARC dinged
for not having them :-)  But it's not fair to claim that an arbitrary mix
of operations proves that SPARC (and then other RISCs, as though SPARC = RISC)
has bad overall integer performance.
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086