jps@cs.brown.edu (John Shewchuk) (05/10/89)
I want to do the following many times:

	x = f * y;

Currently x and y are integers and f is a floating point number.  In
an attempt to speed up a program I tried replacing this with

	x = (a * y)/b;

where a and b are integers.  After compiling a loop that has 100,000
iterations of the code described above I timed the two different
versions.  To my surprise I found that the floating point version
returned quickly - it used about 1.6 seconds of user time - while the
integer version took much longer - about 43 seconds.

After looking at the code with adb I noticed that the integer multiply
and divide were calls to routines like .mul, but were going through
the dynamic loading stubs.  We relinked the code using -Bstatic and
the running time for the integer version went down to about 10
seconds.  However, this is still much slower than the floating point
version.

What is going on here?  Intuitively, an integer multiply followed by
an integer divide should be faster than a convert int to float,
floating multiply, and convert float to int, but this does not seem to
be the case.  Any suggestions on a way to increase performance?  Is it
possible to inline the calls to the library routines?  Do the new
SPARC chips have integer multiply and divide instructions?

Thanks,
John Shewchuk
jps@cs.brown.edu
schwartz@shire.cs.psu.edu (Scott Schwartz) (05/10/89)
In article <6008@brunix.UUCP>, jps@cs (John Shewchuk) writes:
> x = (a * y)/b;
[100,000 iterations]
>To my surprise I found that the floating point version returned
>quickly - it used about 1.6 seconds of user time - while the integer
>version took much longer - about 43 seconds.

Could you give us some more information?  I just tried the following
on a Sun4/260, sans -O; it took 0.4 user, 0.0 sys.

main()
{
	int x = 1, y = 9, a = 3, b = 4;
	int i = 100000;
	&x; &y; &a; &b;
	while (--i) {
		x = (a * y)/b;
		&x;
	}
}

Hey, the Sun4/110 does fp via kernel emulation: maybe the
SparcStations do integer that way!  :-) :-)
--
Scott Schwartz <schwartz@shire.cs.psu.edu>
suitti@haddock.ima.isc.com (Stephen Uitti) (05/10/89)
In article <6008@brunix.UUCP> jps@cs.brown.edu (John Shewchuk) writes:
>I want to do the following many times:
>	x = f * y;
>Currently x and y are integers and f is a floating point number.
>
>In an attempt to speed up a program I tried replacing this with
>	x = (a * y)/b;
>where a and b are integers.  After compiling a loop that has 100,000
>iterations...[it was slower]...
>
>What is going on here?  Intuitively, an integer multiply followed by
>an integer divide should be faster than a convert int to float,
>floating multiply, and convert float to int but this does not seem to
>be the case.

On many architectures with integer & floating hardware support,
integer divide is slower than floating divide.  A float represented by
32 bits has 24 bits of mantissa.  A 32 bit integer has (surprise) 32
bits.  It takes less time to divide 24 bit quantities (and do
add/subtract type things to the exponent) than to divide 32 bit
quantities.  It is often the case that divides take lots longer than
conversions, and multiplies don't take much time.

On the other hand, if SPARC doesn't have integer divide in hardware,
but does have hardware floating point, and has to call a routine for
the integer stuff, I'm surprised that it's only 5 times slower.

On the PDP-11, one version of the ldiv routine (it might have made it
into 2.10 BSD) was coded to use floating point divide on those models
of PDP that had floating hardware.  It was quicker.  There was a snafu
about divide by zero exceptions in an early version.

Stephen.
slackey@bbn.com (Stan Lackey) (05/11/89)
In article <13024@haddock.ima.isc.com> suitti@haddock.ima.isc.com (Stephen Uitti) writes:
>In article <6008@brunix.UUCP> jps@cs.brown.edu (John Shewchuk) writes:
>>What is going on here?  Intuitively, an integer multiply followed by
>>an integer divide should be faster than a convert int to float,
>>floating multiply, and convert float to int but this does not seem to
>>be the case.
>On the other hand, if SPARC doesn't have the integer divide in
>hardware, but does have hardware floating point, and has to call a
>routine for the integer stuff, I'm surprised that it's only 5 times
>slower.

I believe the first SPARC had the microprocessor (including integer
processing) in a 20K-gate gate array, with Weitek chips for FP.
Integer divide had no hardware help (unlike multiply), so it probably
ran at something like 3 or 4 cycles per bit.  The Weitek FP chips I
think they used also didn't have FP divide, but had some hardware to
support a reciprocal-divide algorithm.  The FP multiplier would be
involved in the loop.  Each iteration would produce something like 6
or 8 bits, but with the multiply would take 5 or 6 cycles (for single
precision).

So, given this, and that integer divide generates 32 bits while FP
does 24, FP divide looks around 5 times faster than integer.  Note:
the Weiteks only take a few cycles to do format conversions.

Could someone who knows clear up the "I think"s and the "probably"s?
Thanks.
-Stan
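[A C model of the "few cycles per bit" software divide Stan estimates:
a restoring shift-and-subtract loop that produces one quotient bit per
iteration, 32 iterations for 32-bit operands.  The function is a
hypothetical illustration, not the actual SPARC library routine.]

```c
#include <assert.h>
#include <stdint.h>

/* One-bit-per-step restoring division, unsigned 32-bit: the kind of
 * loop a machine without a hardware divider runs in software.  Each
 * of the 32 iterations shifts and conditionally subtracts -- a few
 * cycles per quotient bit.  divisor must be nonzero. */
uint32_t restoring_div(uint32_t dividend, uint32_t divisor,
                       uint32_t *remainder)
{
    uint64_t rem = 0;   /* partial remainder; needs 33 bits after the shift */
    uint32_t quot = 0;
    int i;

    for (i = 31; i >= 0; i--) {
        rem = (rem << 1) | ((dividend >> i) & 1);  /* bring down a bit */
        if (rem >= divisor) {                      /* trial subtract   */
            rem -= divisor;
            quot |= 1u << i;                       /* quotient bit set */
        }
    }
    *remainder = (uint32_t)rem;
    return quot;
}
```

The FP route wins partly because it only has to produce 24 result
bits, as Stan notes, and partly because a hardware reciprocal loop
retires several bits per iteration instead of one.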
neideck@nestvx.dec.com (Burkhard Neidecker-Lutz) (05/12/89)
No SPARC implementation has any hardware support for integer divide,
and only limited support for integer multiply in the form of the MULS
multiply-step instruction.  There should never be any support beyond
this, as this partitioning is part of the SPARC ARCHITECTURE.

For multiply, the support is really sufficient, as most
multiplications can be caught at compile time (read a paper about HP
Precision Architecture and their measurements), and the speed of a
software routine for the other cases, with support from the MULS
instruction, is quite good.  Quoted from the Cypress CY7C601 SPARC
Architecture manual:

	... Code for general mul subroutine ...

	This code has an optimization built in for short (less than
	13-bit) multiplies.  Short multiplies require 26 or 27
	instruction cycles, long ones require 47 to 51 instruction
	cycles.  For two positive numbers (the most common case), the
	cycle count is 47.

They go on to claim 25 and 46 to 48 cycles respectively for unsigned
multiplication.  The integer divide code is much harder to understand,
and they only give an estimate for the most common case, the worst
case being up to 4 times slower (under strange circumstances).  The
formula is

	CEIL(log2(dividend/divisor)/3) x 21.5 + some setup cycles

which translates into something from 30 to 260 cycles if I understand
the formula correctly (which I probably don't).

Burkhard Neidecker-Lutz, Digital CEC Karlsruhe, Project NESTOR
neideck@nestvx.dec.com

PS: How come John Mashey always answers my questions in my "notes read
delay slot", so that I look plain stupid when my question appears
AFTER his answer...  :-)
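[A simplified C analogue of the shift-and-add loop a SPARC mul routine
builds around MULS.  The real instruction steps through the Y register
and condition codes one bit at a time; this sketch only models the
arithmetic, including the early exit that makes short (sub-13-bit)
multipliers cheaper, as in the 26- vs 47-cycle counts quoted above.]

```c
#include <assert.h>
#include <stdint.h>

/* Shift-and-add multiplication, one multiplier bit per step -- a
 * software model of what a MULS-based mul subroutine computes.
 * Result is mod 2^32, like a 32-bit hardware low-word multiply. */
uint32_t mul_by_steps(uint32_t a, uint32_t b)
{
    uint32_t product = 0;

    while (b != 0) {           /* short multipliers exit early --
                                  cf. the short-multiply optimization */
        if (b & 1)
            product += a;      /* add the current shifted multiplicand */
        a <<= 1;               /* next partial product position */
        b >>= 1;               /* next multiplier bit */
    }
    return product;
}
```

At roughly one cycle-pair per bit plus setup, a loop like this lands
in the same few-dozen-cycle range the Cypress manual quotes.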
mo@prisma (05/14/89)
"Never" is a dangerous word.  There may well be instructions added to
SPARC for integer multiply and divide.  This has been discussed, but I
don't know the status of the issue.
rec@dg.dg.com (Robert Cousins) (05/15/89)
In article <13024@haddock.ima.isc.com> suitti@haddock.ima.isc.com (Stephen Uitti) writes:
>In article <6008@brunix.UUCP> jps@cs.brown.edu (John Shewchuk) writes:
>>I want to do the following many times:
>>	x = f * y;
>>Currently x and y are integers and f is a floating point number.
>>
>>In an attempt to speed up a program I tried replacing this with
>>	x = (a * y)/b;
>>where a and b are integers.  After compiling a loop that has 100,000
>>iterations...[it was slower]...
>>
>>What is going on here?  Intuitively, an integer multiply followed by
>>an integer divide should be faster than a convert int to float,
>>floating multiply, and convert float to int but this does not seem to
>>be the case.
>
>On many architectures with integer & floating hardware support,
>integer divide is slower than floating divide.
>
>Stephen.

There are numerous examples of "hot" processors which have little or
no support for integer divide.  If memory serves me correctly, the CDC
6xxx, 7xxx and 17x did not have integer divide instructions.  It was
common for a special instruction sequence to be generated instead
which used a floating point divide.  This makes much sense when you
realize two trends in computing:

1.  In Fortran, there are relatively few integer divides in the
    FP-heavy code these machines were designed to speed up.  The trend
    to Pascal, C and Ada has changed all of that, but many of today's
    older architectures still have their roots in 1966's Fortran IV
    standard.

2.  Division/modulo operations have been the source of numerous
    arguments from the early days.  The early GE computers had
    hardware divide but were forced to modify the Fortran compiler to
    use a subroutine because the hardware gave the wrong result
    ("wrong" as in different from what software expected, but
    numerically correct).  This is reflected today in the various
    language standards, which specify differing behavior.  Ada
    provides both the REM and MOD operations, which are almost (but
    not quite) identical.
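[A small C illustration of the remainder-versus-modulo distinction
behind point 2.  C's % is a remainder whose sign follows the dividend
(like Ada's REM under truncating division); a floored modulus whose
sign follows the divisor (like Ada's MOD) has to be built by hand.
The helper name is hypothetical.]

```c
#include <assert.h>

/* Floored modulus: result takes the sign of the divisor.  It differs
 * from C's truncating % exactly when the operand signs differ --
 * the "almost (but not quite) identical" case mentioned above. */
int mod_floor(int a, int b)
{
    int r = a % b;             /* truncating remainder (sign of a) */
    if (r != 0 && ((r < 0) != (b < 0)))
        r += b;                /* shift into the divisor's sign range */
    return r;
}
```

For example, -7 % 2 is -1 in (modern, truncating) C, while the floored
modulus of -7 and 2 is 1; the two agree whenever the signs match.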
Just as several RISC processors do not provide direct support for
integer multiply but instead depend upon a multiply-iterate
instruction sequence, a similar divide-iterate instruction makes sense
for some applications.  Since it is possible to factor some constant
divisions into faster sequences of shifts, a smart compiler can still
succeed in making these divide-weak machines perform well in some
cases.

It has been my personal hobby to work on numerical applications such
as primes and factorials.  In my work, I have found it important that
both results of a division be available: the quotient and the
remainder.  (I have historically used machines with EXPENSIVE division
operations, making it important to avoid repeating the division.)
Careful management of arguments and algorithms can bring about
substantial performance differences.

Even today, high performance processors such as the 88000 use the
floating point unit for integer division operations.  If you disable
the floating point unit on an 88100 and then attempt a division
operation, you will trap!  (Since the FP unit is built into the CPU,
this is a truly artificial situation.)

Robert Cousins
Dept. Mgr, Workstation Dev't.
Data General Corp.

Speaking for myself alone.
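[A sketch of the constant-divisor strength reduction Cousins mentions,
for the unsigned power-of-two case a compiler handles most easily.
The function name is hypothetical; note it also delivers both the
quotient and the remainder for one division's worth of work, the
property he values.]

```c
#include <assert.h>
#include <stdint.h>

/* Division by a constant power of two, reduced to a shift; the
 * remainder falls out of a mask.  For unsigned operands this is
 * exact; signed operands need a rounding correction, which is why
 * compilers treat them separately. */
uint32_t div8(uint32_t x, uint32_t *rem)
{
    *rem = x & 7;      /* x % 8 */
    return x >> 3;     /* x / 8 */
}
```

Non-power-of-two constants can be handled similarly with a
multiply-by-reciprocal sequence, trading the expensive divide for a
multiply and some shifts.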
seanf@sco.COM (Sean Fagan) (05/17/89)
In article <172@dg.dg.com> rec@dg.UUCP (Robert Cousins) writes:
>There are numerous examples of "hot" processors which have little or
>no support for integer divide.  If memory serves me correctly, the
>CDC 6xxx, 7xxx and 17x did not have integer divide instructions.

They also only supported integer multiply in a funky way: if the
exponent field was 0, it did an integer 48-bit multiply.  Used the
same unit as the fp multiply, of course.

>Even today, high performance processors such as the 88000 use the
>floating point unit for integer division operations.

Well, actually, the 88000 (and the i860, and the Crays) don't have
divide.  They have reciprocate (actually, I believe they call it
reciprocal approximation, and, at least on the i860, you have to go
through a couple of iterations to get the "perfect" result).
--
Sean Eric Fagan  | "With friends like these, who need hallucinations?"
seanf@sco.UUCP   |            -- Buddy, "Night Court"
(408) 458-1422   | Any opinions expressed are my own, not my employers'.
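[A sketch of the reciprocal-approximation divide Sean describes: start
from a rough guess at 1/b and refine it with Newton-Raphson steps,
x' = x * (2 - b*x), each of which roughly doubles the number of
correct bits -- hence the "couple of iterations" on the i860.  Here
the hardware seed is faked with a deliberately degraded estimate; real
chips supply it from a lookup instruction or table.]

```c
#include <assert.h>
#include <math.h>

/* Division via reciprocal approximation plus Newton-Raphson
 * refinement.  With the seed good to ~3 decimal digits, two
 * iterations bring the reciprocal to roughly double precision.
 * b must be nonzero; the seed here is only a stand-in. */
double recip_divide(double a, double b)
{
    double x = 0.999 * (1.0 / b);  /* fake seed, deliberately inexact */
    x = x * (2.0 - b * x);         /* iteration 1: error squared */
    x = x * (2.0 - b * x);         /* iteration 2: error squared again */
    return a * x;                  /* divide = multiply by reciprocal */
}
```

This is also why such machines report only an approximate reciprocal
instruction in their manuals: the multiplier does the rest of the
work, one squaring of the error per pass.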
mpogue@dg.dg.com (Mike Pogue) (05/18/89)
In article <2727@scolex.sco.COM> seanf@scolex.UUCP (Sean Fagan) writes:
>
>Well, actually, the 88000 (and the i860, and the Crays) don't have
>divide.  They have reciprocate (actually, I believe they call it
>reciprocal approximation, and, at least on the i860, you have to go
>through a couple of iterations to get the "perfect" result).
>

Quote from the MC88100 RISC Microprocessor User's Manual, page 3-48,
description of the FDIV instruction:

	Description: The contents of the rS1 and rS2 registers [any
	g.p. source register] are checked for reserved operands.  If
	no reserved operands are found, the rS1 operand is divided by
	the rS2 operand according to IEEE 754.  The result is placed
	in the rD register.  Any combination of single- and
	double-precision operands can be specified.

Obviously, the 88K DOES do divide.  There are also instructions for
signed and unsigned integer divide.

Mike Pogue
Data General Corp.
My opinions are my own....