[comp.arch] Questions on SparcStation 1 performance

jps@cs.brown.edu (John Shewchuk) (05/10/89)

I want to do the following many times:

   x = f * y;

Currently x and y are integers and f is floating point number.
In an attempt to speed up a program I tried replacing this with

   x = (a * y)/b;

where a and b are integers.  After compiling a loop that has 100,000
iterations of the code described above I timed the two different
versions.

To my surprise I found that the floating point version returned
quickly - it used about 1.6 seconds of user time - while the integer
version took much longer- about 43 seconds.

After looking at the code with adb it was noticed that the integer
multiply and divide were calls to routines like .mul but were using
the dynamic loading stubs.  We relinked the code using -Bstatic and
running time for the integer version went down to about 10 seconds.
However this still much slower than the floating point version.

What is going on here?  Intuitively, an integer multiply followed by
an integer divide should be faster than a convert int to float,
floating multiply, and convert float to int but this does not seem to
be the case.

Any suggestions on a way to increase performance?  

Is it possible to inline the calls to the library routines?

Do the new sparc chips have integer multiply and divide instructions?

Thanks,


John Shewchuk                                     jps@cs.brown.edu



John Shewchuk                                                jps@cs.brown.edu

schwartz@shire.cs.psu.edu (Scott Schwartz) (05/10/89)

In article <6008@brunix.UUCP>, jps@cs (John Shewchuk) writes:
>   x = (a * y)/b;  [100,000 iterations]

>To my surprise I found that the floating point version returned
>quickly - it used about 1.6 seconds of user time - while the integer
>version took much longer- about 43 seconds.

Could you give us some more information?  I just tried the following
on a Sun4/260, sans -O; it took 0.4 user, 0.0 sys. 

main()
{
	int x = 1, y = 9, a = 3, b = 4;
	int i = 100000;

	&x; &y; &a; &b;
	while (--i) {
		x = (a * y)/b;
		&x;
	}
}

Hey, the Sun4/110 does fp via kernel emulation: maybe the SparcStations
do integer that way!  :-) :-) 
-- 
Scott Schwartz		<schwartz@shire.cs.psu.edu>

suitti@haddock.ima.isc.com (Stephen Uitti) (05/10/89)

In article <6008@brunix.UUCP> jps@cs.brown.edu (John Shewchuk) writes:
>I want to do the following many times:
>   x = f * y;
>Currently x and y are integers and f is floating point number.
>
>In an attempt to speed up a program I tried replacing this with
>   x = (a * y)/b;
>where a and b are integers.  After compiling a loop that has 100,000
>iterations...[it was slower]...
>
>What is going on here?  Intuitively, an integer multiply followed by
>an integer divide should be faster than a convert int to float,
>floating multiply, and convert float to int but this does not seem to
>be the case.

On many architectures with integer & floating hardware support, integer
divide is slower than floating divide.  A float represented by 32 bits
has 24 bits of mantissa.  A 32 bit integer has (surprise) 32 bits.  It
takes less time to divide 24 bit quantities (and do add/subract type things
to the exponent) than to divide 32 bit type things.

It is often the case that divides take lots longer than conversions, and
multiplies don't take time.

On the other hand, if SPARC doesn't have the integer divide in hardware, but
does have hardware floating point, and has to call a routine for the integer
stuff, i'm surprised that its only 5 times slower.

On the PDP-11, one version of the ldiv (might have made it into 2.10 BSD)
routine was coded to use floating point divide for those models of PDP that
had floating hardware.  It was quicker.  There was a snafu about divide
by zero exceptions in an early version.

Stephen.

slackey@bbn.com (Stan Lackey) (05/11/89)

In article <13024@haddock.ima.isc.com> suitti@haddock.ima.isc.com (Stephen Uitti) writes:
>In article <6008@brunix.UUCP> jps@cs.brown.edu (John Shewchuk) writes:
>>What is going on here?  Intuitively, an integer multiply followed by
>>an integer divide should be faster than a convert int to float,
>>floating multiply, and convert float to int but this does not seem to
>>be the case.

>On the other hand, if SPARC doesn't have the integer divide in hardware, but
>does have hardware floating point, and has to call a routine for the integer
>stuff, i'm surprised that its only 5 times slower.

I believe the first SPARC had the microprocessor (incl. integer
processing) in a 20K gate gate array, and Weitek for FP.  Integer
divide had no hardware help (unlike multiply), so was probably like 3
or 4 cycles per bit.

The Weitek FP chips I think they used also didn't have FP divide, but
some hardware to support a reciprocal-divide algorithm.  The FP
multiplier would be involved in the loop.

Each iteration would do like 6 or 8 bits, but with the multiply would
take 5 or 6 cycles (for single precision).  So, given this, and that
integer divide generates 32 bits while FP does 24, FP divide looks
around 5 times faster than integer.

Note: The Weitek's only take a few cycles to do format conversions.

Could someone who knows clear up the "I think"'s and the "probably"'s?
Thanks.  -Stan

neideck@nestvx.dec.com (Burkhard Neidecker-Lutz) (05/12/89)

All Sparc implementations do not have any hardware support for integer
divide and limited support for integer multiply in the form of the MULS
multiply step instruction. There should never be any support beyond this,
as this partitioning is part of the Sparc ARCHITECTURE. For multiply, the
support is really sufficient, as most multiplications can be catched at 
compile time (read a paper about HP Precision Architecture and their measure-
ments) and the speed of a software routine for the other cases with
support from the MULS instruction is quite good. Quoted from the Cypress
CY7C601 Sparc Architecture manual:
 
  ... Code for general mul subroutine ...
 
  This code has an optimization built in for short (less than 13-bit)
  multiplies. Short multiplies require 26 or 27 instruction cycles, long
  ones require 47 to 51 instruction cycles. For two positive numbers (the
  most common case), the cycle count is 47.
 
They go on and claim 25 and 46 to 48 cycles respectively for unsigned
multiplication. The integer divide code is much harder to understand and
the only give an estimate for the most common case, worst case being up to
4 times slower (under strange circumstances). The formula is
 
	CEIL(log2(dividend/divisor)/3) x ( 21.5 )  + some setup cycles
 
which translates into something from 30 to 260 cycles  if I understand
the formula correctly (which I probably don't).
 
	Burkhard Neidecker-Lutz, Digital CEC Karlsruhe, Project NESTOR
	neideck@nestvx.dec.com
 
PS: How comes that John Mashey always answers my questions in my "notes read
    delay slot" so that I'm looking plain stupid when my question appears
    AFTER his answer... :-)

mo@prisma (05/14/89)

"Never" is a dangerous word.  There may well be instructions added
to SPARC for integer multiply and divide.  This has been discussed,
but I don'th know the status of the issue.

rec@dg.dg.com (Robert Cousins) (05/15/89)

In article <13024@haddock.ima.isc.com> suitti@haddock.ima.isc.com (Stephen Uitti) writes:
>In article <6008@brunix.UUCP> jps@cs.brown.edu (John Shewchuk) writes:
>>I want to do the following many times:
>>   x = f * y;
>>Currently x and y are integers and f is floating point number.
>>
>>In an attempt to speed up a program I tried replacing this with
>>   x = (a * y)/b;
>>where a and b are integers.  After compiling a loop that has 100,000
>>iterations...[it was slower]...
>>
>>What is going on here?  Intuitively, an integer multiply followed by
>>an integer divide should be faster than a convert int to float,
>>floating multiply, and convert float to int but this does not seem to
>>be the case.
>
>On many architectures with integer & floating hardware support, integer
>divide is slower than floating divide.  
>
>Stephen.

There are numerous examples of "hot" processors which have little or no
support for integer divide.  If memory serves me correctly, the CDC 6xxx,
7xxx and 17x did not have integer divide instructions.  It was common for
a special instruction sequence to be generated instead which used a floating
point divide.  This make much sense when you realize two trends in computing:

	1.	In Fortran, there are relatively few integer divides in
		the FP heavy code these machines were designed to speed
		up.  The trend to Pascal, C and Ada has changed all of that,
		but many of today's older architectures still have their
		roots in 1966's Fortran IV standard.

	2.	Division/Modulo operations have been the source of numerous
		arguements from the early days.  The early GE computers
		had hardware divide but were forced to modify the fortran
		compiler to use a subroutine because it gave the wrong
		result ("wrong" as in different than software expected but
		numerically correct).  This is reflected today in the
		various language standards which provide differing 
		standards.  Ada provides both the REMAINDER and MOD
		operations which are almost (but not quite) identical.

Just as several RISC processors do not provide direct support for integer
multiply but instead depend upon a multiply iterate instruction sequence,
a similar divide iterate instruction makes sense for some applications.
Since it is possible to factor some constant divisions into faster
sequences of shifts, a smart compiler can still succeed in making
these divide weak machines perform well in some cases.

It has been my personal hobby to work on numerical applications such
as primes and factorials.  In my work, I have found it important that
both results of a division be available:  the quotient and remainder.
(I have historically used machines with EXPENSIVE division operations
making it important to avoid repeating the division.)  Careful management
of arguments and algorithms can bring about substantial performance differences.

Even today, high performance processors such as the 88000 use the floating
point units for integer division operations.  If you disable the floating
point unit on an 88100 and then attempt a division operation, you will 
trap!  (Since the FP unit is built into the CPU, this is a truly artificial
situation.)

Robert Cousins

Dept. Mgr, Workstation Dev't.
Data General Corp.

Speaking for myself alone.

seanf@sco.COM (Sean Fagan) (05/17/89)

In article <172@dg.dg.com> rec@dg.UUCP (Robert Cousins) writes:
>There are numerous examples of "hot" processors which have little or no
>support for integer divide.  If memory serves me correctly, the CDC 6xxx,
>7xxx and 17x did not have integer divide instructions.  

They also only supported integer multiply in a funky way:  if the exponent
field was 0, it did an integer 48-bit multiply.  Used the same unit as the
fp mult, of course.

>Even today, high performance processors such as the 88000 use the floating
>point units for integer division operations.

Well, actually, the 88000 (and the i860, and the Cray's) don't have divide.
They have reciporcate (actually, I belive they call it reciporocal
approximation, and, at least on the i860, you have to go through a couple of
iterations to get the "perfect" result).

-- 
Sean Eric Fagan  |    "With friends like these, who need hallucinations?"
seanf@sco.UUCP   |             -- Buddy, "Night Court"
(408) 458-1422   | Any opinions expressed are my own, not my employers'.

mpogue@dg.dg.com (Mike Pogue) (05/18/89)

In article <2727@scolex.sco.COM> seanf@scolex.UUCP (Sean Fagan) writes:
>
>Well, actually, the 88000 (and the i860, and the Cray's) don't have divide.
>They have reciporcate (actually, I belive they call it reciporocal
>approximation, and, at least on the i860, you have to go through a couple of
>iterations to get the "perfect" result).
>

   Quote from the MC88100 Risc Microprocessor User's Manual:

	page 3-48, description of the FDIV instruction.

	Description:  The contents of the rS1 and rS2 registers [any g.p. source
		register] are checked for reserved operands.  If no reserved operands
	 	are found, the rS1 operand is divided by the rS2 operand according
		to the IEEE 754.  The result is placed in the rD register.  Any
		combination of single- and double-precision operands can be specified.

  Obviously, the 88K DOES do divide.  There are also instructions for signed and unsigned
integer divide.

Mike Pogue
Data General Corp.

My opinions are my own....