ward@cfa.UUCP (Steve Ward) (03/27/85)
The purpose of providing this information is to inform, not to promote a
particular manufacturer, computer, or microprocessor.  I am currently
designing a VMEbus CPU card using the MC68000 and the NS32081, among other
components.  I am a Vax computer user.  The Weitek information turned up
as I was typing this message, so it is included, though it is very
incomplete.

Here are some timing specifications for register-to-register floating
point instructions on a variety of computers and microprocessors.  All
specifications are for hardware-assisted floating point operations using
the specified hardware floating point coprocessor.  The Vax timings are
for the standard DEC floating point accelerator hardware.  These
specifications are taken from literature provided by the vendors.

The Weitek chip set is supposed to be an MC68020 coprocessor
implementation.  However, it is not known whether the timings given here
are actually register-to-register timings with the Weitek chip set
interfaced as a coprocessor to an MC68020, or whether they are timings of
operations internal to the Weitek chip set.  The literature is not clear,
mainly because it is taken from trade journal press releases.  The Weitek
chip set is so fast that I opted to include it here, even though I do not
have data sheets for the parts as yet.  All other specifications are taken
directly from the vendors' data sheets and other vendor technical
documents.

It is not yet clear to many scientific minicomputer users whether the
NS320xx/NS32081 and MC680xx/MC68881 combinations will be viable
substitutes for the 11/750 and 11/780 Vaxes.  The rumored
announcement/arrival of the so-called MicroVax II, which is alleged to
provide 11/780 performance (90% or better for floating point, up to 105%
for other instructions), should also address the financial arguments by
placing an 11/780 class computer (including floating point) into the
microcomputer workstation market.  The MC68020/WTL1164,WTL1165 combination
looks very interesting.  As of the date of this posting, the MC68881 has
been distributed only as samples in the 12.5 MHZ version.

FADD = 32 bit single precision register-to-register floating point add.
FSUB = 32 bit single precision register-to-register floating point subtract.
FMUL = 32 bit single precision register-to-register floating point multiply.
FDIV = 32 bit single precision register-to-register floating point divide.
DADD = 64 bit double precision register-to-register floating point add.
DSUB = 64 bit double precision register-to-register floating point subtract.
DMUL = 64 bit double precision register-to-register floating point multiply.
DDIV = 64 bit double precision register-to-register floating point divide.

        =====================================
    * * * ALL TIMINGS ARE IN MICROSECONDS * * *
        =====================================

Computer
or
Microprocessor             FADD  FSUB  FMUL  FDIV  DADD  DSUB  DMUL  DDIV
=========================  ====  ====  ====  ====  ====  ====  ====  ====

11/780 Vax w/FPA           1.19  1.19  1.19  4.60  2.39  2.59  3.40  8.82

11/750 Vax w/FPA           1.76  1.75  2.27  6.44  2.63  2.63  4.69 12.80

11/730 Vax w/FPA           4.81  7.85  9.88 11.56  9.85 13.27 23.07 23.34

MC68020/MC68881 16.67 MHZ  2.80  2.80  3.10  3.80  2.80  2.80  4.00  5.90

MC68020/MC68881 12.5 MHZ   3.73  3.73  4.13  5.07  3.73  3.73  5.33  7.87

NS32016/NS32081 10 MHZ     7.40  7.40  4.80  8.90  7.40  7.40  6.20 11.90

NS32016/NS32081 8 MHZ      9.25  9.25  6.00 11.13  9.25  9.25  7.75 14.86

Weitek WTL1164,WTL1165      -     -    0.360  -     -     -    0.600  -

The Weitek line is very bare.
The fact that an MC68020 floating point coprocessor is being manufactured
by Weitek is interesting, and it at least promises to be very fast.
Perhaps someone else can look into the Weitek situation.  The trade
journal "new product" information was pretty flaky.


Steven M. Ward
Center for Astrophysics
60 Garden Street
Cambridge, MA 02138
(617) 495-7466
{genrad, allegra, ihnp4, harvard}!wjh12!cfa!ward
dgh@sun.uucp (David Hough) (03/29/85)
In article <133@cfa.UUCP> ward@cfa.UUCP (Steve Ward) writes:
>
>Here are some timing specifications for register-to-register floating
>point instructions on a variety of computers and microprocessors.
>These specifications are taken from literature provided by the vendors.

Unfortunately, such comparisons don't convey very much useful information.
What is more relevant is a comparison of execution times for a real
program compiled by a real compiler running on a real system.
Unfortunately, marketeers are seldom interested in providing useful,
verifiable information.

The only benchmark I know of that is widely cited and realistic for a
problem of actual computational interest is the "Linpack" benchmark
published by Dongarra and others.  The benchmark solves 100x100 systems of
linear equations using routines from the Linpack library.  32 bit floating
point times range from 35 milliseconds to 2813 seconds.  64 bit floating
point times range from 21 milliseconds to 149 seconds.  Detailed results
are available from Jack Dongarra at Argonne.

David Hough
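For concreteness, most of the time in that benchmark goes into a single
kernel: the Linpack factorization routines repeatedly call the DAXPY
operation y := a*x + y, which pairs one multiply with one add per element.
A minimal C sketch of that inner loop follows; the vector length and the
names used are chosen only for illustration, and this is not the benchmark
itself.

#include <stdio.h>

#define N 100                         /* order of the benchmark matrix */

/* y := a*x + y over n elements -- the Linpack inner loop */
static void daxpy(int n, double a, const double *x, double *y)
{
    for (int i = 0; i < n; i++)
        y[i] += a * x[i];             /* one multiply and one add each */
}

int main(void)
{
    static double x[N], y[N];
    for (int i = 0; i < N; i++) { x[i] = i + 1.0; y[i] = 1.0; }

    daxpy(N, 2.0, x, y);              /* one column update out of the
                                         roughly N^3/3 such operations
                                         in a full factorization       */
    printf("y[0]=%g y[99]=%g\n", y[0], y[99]);
    return 0;
}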
jack@boring.UUCP (04/03/85)
In article <133@cfa.UUCP> ward@cfa.UUCP (Steve Ward) writes:
> Computer
> or
> Microprocessor             FADD  FSUB  FMUL  FDIV  DADD  DSUB  DMUL  DDIV
>=========================  ====  ====  ====  ====  ====  ====  ====  ====
> ...
>NS32016/NS32081 10 MHZ     7.40  7.40  4.80  8.90  7.40  7.40  6.20 11.90
>
>NS32016/NS32081 8 MHZ      9.25  9.25  6.00 11.13  9.25  9.25  7.75 14.86
>
Is this true?  It *does* sound funny to me that additions and
subtractions take longer than multiplies....

The only thing I can imagine from the table is that *ADD/*SUB are always
done in double mode, but this doesn't seem to make sense if there is
hardware to do multiplies/divides in single precision.

Can anyone enlighten me on this???

>Weitek WTL1164,WTL1165      -     -    0.360  -     -     -    0.600  -
>
>The Weitek line is very bare.  The fact that an MC68020 floating point
>coprocessor is being manufactured by Weitek is interesting, and it at
>least promises to be very fast.  Perhaps someone else can look into the
>Weitek situation.  The trade journal "new product" information was
>pretty flaky.
>
>
>Steven M. Ward
>Center for Astrophysics
>60 Garden Street
>Cambridge, MA 02138
>(617) 495-7466
>{genrad, allegra, ihnp4, harvard}!wjh12!cfa!ward
--
	Jack Jansen, {decvax|philabs|seismo}!mcvax!jack
	It's wrong to wish on space hardware.
brooks@lll-crg.ARPA (Eugene D. Brooks III) (04/05/85)
> In article <133@cfa.UUCP> ward@cfa.UUCP (Steve Ward) writes:
> > Computer
> > or
> > Microprocessor             FADD  FSUB  FMUL  FDIV  DADD  DSUB  DMUL  DDIV
> >=========================  ====  ====  ====  ====  ====  ====  ====  ====
> > ...
> >NS32016/NS32081 10 MHZ     7.40  7.40  4.80  8.90  7.40  7.40  6.20 11.90
> >
> >NS32016/NS32081 8 MHZ      9.25  9.25  6.00 11.13  9.25  9.25  7.75 14.86
> >
> Is this true?  It *does* sound funny to me that additions and
> subtractions take longer than multiplies....
> The only thing I can imagine from the table is that *ADD/*SUB are always
> done in double mode, but this doesn't seem to make sense if there is
> hardware to do multiplies/divides in single precision.
>
> Can anyone enlighten me on this???

Yes, it's true that an fadd takes longer than an fmult on the NS32032,
according to the NS manual.  I assume that NS knows what the timing is for
their HW.  A floating point add or subtract requires a shift of the
mantissa to get the exponents to agree before adding.  The multiply does
not require this.  Perhaps the 16081 has to do the shift a bit at a time
in microcode.  Anyone from National on the net to answer this one?
doug@terak.UUCP (Doug Pardee) (04/05/85)
> > Microprocessor             FADD  FSUB  FMUL  FDIV  DADD  DSUB  DMUL  DDIV
> >NS32016/NS32081 10 MHZ     7.40  7.40  4.80  8.90  7.40  7.40  6.20 11.90
> >NS32016/NS32081 8 MHZ      9.25  9.25  6.00 11.13  9.25  9.25  7.75 14.86
>
> Is this true?  It *does* sound funny to me that additions and
> subtractions take longer than multiplies....

I don't know about the 32081 specifically, but the usual reason for this
kind of behavior is that before a floating point addition or subtraction
can be performed, the operand with the lesser exponent must be
"de-normalized" to have the same exponent as the other operand.  Since the
'081 operates on 53-bit fractions, this might take up to 53 shift
operations, at one bit per clock cycle.  Let's see, 53 * 100 ns = 5.3
microseconds just for pre-normalization.
--
Doug Pardee -- Terak Corp. -- !{hao,ihnp4,decvax}!noao!terak!doug
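A small C sketch of that arithmetic, using the 53-bit fraction and 100 ns
cycle quoted above.  The function names and the barrel-shifter comparison
are illustrative assumptions, not a description of the '081.

#include <stdio.h>

#define FRAC_BITS 53        /* fraction width assumed above            */
#define CLOCK_NS  100.0     /* 10 MHz part => 100 ns per cycle         */

/* cycles to align the smaller operand, shifting one bit per cycle     */
static int align_cycles_serial(int exp_a, int exp_b)
{
    int diff = exp_a > exp_b ? exp_a - exp_b : exp_b - exp_a;
    return diff > FRAC_BITS ? FRAC_BITS : diff;   /* past 53 bits the
                                                     small operand is gone */
}

/* a barrel shifter covers any shift distance in a single cycle        */
static int align_cycles_barrel(int exp_a, int exp_b)
{
    return (exp_a != exp_b) ? 1 : 0;
}

int main(void)
{
    int worst = align_cycles_serial(0, FRAC_BITS);     /* worst case: 53 */
    printf("serial shifter worst case: %d cycles = %.1f us\n",
           worst, worst * CLOCK_NS / 1000.0);
    printf("barrel shifter worst case: %d cycle\n",
           align_cycles_barrel(0, FRAC_BITS));
    return 0;
}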
cdl@mplvax.UUCP (Carl Lowenstein) (04/05/85)
In article <6370@boring.UUCP> jack@boring.UUCP (Jack Jansen) writes:
>In article <133@cfa.UUCP> ward@cfa.UUCP (Steve Ward) writes:
>> Microprocessor             FADD  FSUB  FMUL  FDIV  DADD  DSUB  DMUL  DDIV
>>NS32016/NS32081 10 MHZ     7.40  7.40  4.80  8.90  7.40  7.40  6.20 11.90
>>
>Is this true?  It *does* sound funny to me that additions and
>subtractions take longer than multiplies....
>	Jack Jansen, {decvax|philabs|seismo}!mcvax!jack

If you think about it, floating-point addition and subtraction are much
more difficult than multiplication.

Multiplication involves only extraction and addition of the exponents, and
multiplication of the fractions, with about a 25% probability of a
single-bit renormalization.

Addition requires the same extraction of exponent and fraction, followed
by a variable shift to align binary points, the actual addition of the
fractions, and then possibly a lot of renormalization, since the sum might
overflow by one bit position, or underflow a lot due to cancellation
between positive and negative operands.  All this shifting for normalizing
takes time.
--
	carl lowenstein	   marine physical lab	 u.c. san diego
	{ihnp4|decvax|akgua|dcdwest|ucbvax}  !sdcsvax!mplvax!cdl
srm@nsc.UUCP (Richard Mateosian) (04/07/85)
In article <6370@boring.UUCP> jack@boring.UUCP (Jack Jansen) writes:
>> Microprocessor             FADD  FSUB  FMUL  FDIV  DADD  DSUB  DMUL  DDIV
>>=========================  ====  ====  ====  ====  ====  ====  ====  ====
>>NS32016/NS32081 10 MHZ     7.40  7.40  4.80  8.90  7.40  7.40  6.20 11.90
>Is this true?  It *does* sound funny to me that additions and
>subtractions take longer than multiplies....

It's true that adds take longer than multiplies on the NS32081.  It's
because adds are done by repeated multiplications ( :-) ).

The real reason is that there is fancier multiplication hardware on chip
than addition hardware.  Besides, additions require a normalization step
before the operation as well as after.
--
Richard Mateosian
{allegra,cbosgd,decwrl,hplabs,ihnp4,seismo}!nsc!srm    nsc!srm@decwrl.ARPA
phil@osiris.UUCP (Philip Kos) (04/08/85)
Into the fray!

By my own reasoning (which can be pretty convoluted and obscure, when it's
not just plain perverted, so feel free to correct me), the slower
FADD/FSUB times are strange but not inexplicable.  Several people have
already posted a reasonable (and, I assume, correct) explanation of this
phenomenon.  The required denormalization of the smaller addend and
ultimate normalization of the sum are obviously the culprits here.  But
the question remains: is this reasonable?

Here, for your edification, is a review of how floating point arithmetic
actually works.  I haven't included division because it's an operation of
a different color; algorithms that speed it up quite unreasonably exist
and aren't particularly hard to implement, and I don't know which
algorithms different FP chip makers use.

FADD/FSUB

Arguments:  two 32-bit (or 64-bit for DADD/DSUB) addends.
Result:     one 32-bit (or 64-bit) sum.

Algorithm:  1. Calculate the difference of exponents.
            2. Shift the smaller (in absolute magnitude) addend to the
               right until it aligns with the larger addend (can use the
               exponent difference as a counter preset).
            3. Add.
            4. Truncate and normalize the sum.

Notes:      If the difference of exponents is greater than the mantissa
            resolution (24 bits for single precision, 48 for double), no
            addition or subtraction needs to be done.  The sum in these
            cases is simply the larger addend.  Thus, no more than 24
            (or 48) shifts need be done (allowing for 1 check bit).

Maximum time = (max denormalization time) + (addition time)
               + (max normalization time)
             = ((exponent subtraction time) + (time for 24 (or 48) shifts))
               + (addition time)
               + ((time for 23 (or 47) shifts) + (exponent adjust time))

Using a barrel shifter to align the addends reduces the max
denormalization time by changing the order of the algorithm from O(n) to
O(log2(n)).

FMUL

Arguments:  32-bit (or 64-bit for DMUL) multiplier and multiplicand.
Result:     one 64-bit (or 128-bit) product, only half of which is
            usually used.

Algorithm:  1. Add the multiplier and multiplicand exponents to generate
               the product exponent.
            2. Multiply by repeated addition (there may be special
               hardware for this; otherwise it's shift and conditionally
               add, shift and conditionally add, etc.)
            3. Truncate and normalize the product.

Notes:      Multiplication by repeated addition is an O(n) algorithm.  If
            a special shift/add unit is used, multiply time is reduced
            because the total time is based on gate delays rather than
            bucket shifter clock cycles.

Maximum time = (exponent addition time) + (multiplication time)
               + (normalization time)
             = (exponent addition time)
               + (24 (48) * (time for one add and shift))
               + (one shift and exponent adjust)

Conclusions:

As has been noted, normalizing a product is almost always quicker than
normalizing a sum.  This is because in multiplication you begin with two
normalized numbers, which will yield a product needing at most 1
normalization shift.  FADD, on the other hand, may need 23 (47)
normalization shifts because of leading-bit cancellation.

If a barrel shifter is available, the initial denormalization for FADD
could be reduced significantly.  A barrel shift may be as fast as a
single-bit bucket shift.  This would improve worst case performance by 23
(47) bucket shift clock cycles.  It would probably improve the average
FADD times enough to make it a faster operation than FMUL.  I am surprised
and dismayed that many commercial FPUs do not have barrel shifters.
Granted, it's extra complexity which may not be justified by the *overall*
performance increase.  But we're only talking about a circuit with
48*log2(48) = ~288 multiplexors (not counting the 5 intermediate
registers).  That may be a significant chunk of the available silicon, but
at today's densities it shouldn't add that much complexity to the circuit.
Am I asking for too much?  (I do enjoy having my cake and eating it
too...)

Phil Kos
The Johns Hopkins Hospital
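A minimal C model of the FADD recipe above (align, add, renormalize).
This is only a sketch of the algorithm as described in the thread, not the
microcode of the NS32081, MC68881, or any other part: the 24-bit fraction,
the toy fp24 structure, and the one-bit-per-step loops are assumptions
chosen to mirror the listed steps, and signs, rounding, and special values
are ignored.

#include <stdint.h>
#include <stdio.h>

#define FRAC_BITS 24                        /* single-precision fraction */
#define HIDDEN    (1u << (FRAC_BITS - 1))   /* implicit leading 1 bit    */

struct fp24 {
    int      exp;                   /* unbiased exponent                 */
    uint32_t frac;                  /* normalized: HIDDEN <= frac < 2^24 */
};

static struct fp24 fadd(struct fp24 a, struct fp24 b)
{
    struct fp24 big = (a.exp >= b.exp) ? a : b;
    struct fp24 sml = (a.exp >= b.exp) ? b : a;

    /* 1-2. align: shift the smaller fraction right, one bit per step   */
    int diff = big.exp - sml.exp;
    if (diff >= FRAC_BITS) return big;      /* smaller operand is lost   */
    while (diff--) sml.frac >>= 1;

    /* 3. add the fractions                                              */
    struct fp24 sum = { big.exp, big.frac + sml.frac };

    /* 4. renormalize: a carry out needs one right shift ...             */
    if (sum.frac >> FRAC_BITS) { sum.frac >>= 1; sum.exp++; }
    /*    ... while cancellation (in a true add/sub with signs) could
     *    need up to FRAC_BITS-1 left shifts -- the slow case discussed
     *    in this thread.                                                 */
    while (sum.frac && !(sum.frac & HIDDEN)) { sum.frac <<= 1; sum.exp--; }
    return sum;
}

int main(void)
{
    struct fp24 x = { 10, HIDDEN | 0x1234 };     /* arbitrary operands   */
    struct fp24 y = {  3, HIDDEN | 0x0042 };
    struct fp24 z = fadd(x, y);
    printf("exp=%d frac=0x%06x\n", z.exp, z.frac);
    return 0;
}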
BillW@SU-SCORE.ARPA (William Chops Westfield) (04/12/85)
You are probably asking for too much.  Most of the floating point chips
available use the IEEE floating point format, which means (I think) that
they do all the math with 80 bit operands, and then convert to 32 or 64
bit formats on input and/or output.

In the multiply algorithm, you only have to calculate the most significant
24 bits of the product, right?  This may improve the performance of the
multiply.

Something I've always wondered about is whether any of the chips do an N*N
bit multiply in hardware, and then use fewer iterations to get the final
product - a 4*4 bit (unclocked) multiplier is not very complicated - how
much would it speed up a 64*64 bit multiply?  (I'm too lazy to try to
figure it out...)

BillW
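A C sketch of the idea, purely illustrative and not a description of any
chip: do the multiply four multiplier bits at a time, so a 32x32 bit
shift-and-add loop runs 8 times instead of 32 (and a 64*64 bit multiply
would take 16 iterations instead of 64).  The 4-bit digit product below
stands in for the small unclocked multiplier array.

#include <stdint.h>
#include <stdio.h>

/* 32x32 -> 64-bit multiply, consuming 4 multiplier bits per iteration.
 * The (uint64_t)a * digit term plays the role of the hardware multiplier
 * array; the rest is the usual shift-and-add loop.                      */
static uint64_t mul32_radix16(uint32_t a, uint32_t b)
{
    uint64_t acc = 0;
    for (int i = 0; i < 8; i++) {                 /* 8 iterations, not 32   */
        uint32_t digit = (b >> (4 * i)) & 0xF;    /* next 4 multiplier bits */
        acc += ((uint64_t)a * digit) << (4 * i);  /* shifted partial product */
    }
    return acc;
}

int main(void)
{
    uint32_t a = 0x12345678u, b = 0x9abcdef0u;
    printf("radix-16 result: 0x%016llx\n",
           (unsigned long long)mul32_radix16(a, b));
    printf("reference:       0x%016llx\n",
           (unsigned long long)a * b);            /* should match */
    return 0;
}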
jack@boring.UUCP (04/12/85)
The point made by most of the people replying to my question of why FADD
is slower than FMUL is that the shifting done to align the operands before
the add is what causes it.

If I examine an ADD, I come to the following steps (assume normalized
numbers, and an N bit mantissa):
1. A maximum of N shifts, to make the exponents equal.
2. One addition.
3. A maximum of N shifts, to renormalize the result.

For a multiply, I get
1. A maximum of N times a shift, plus a conditional addition.
2. One addition for the exponents (could probably be done in parallel, so
   let's forget it).

Now, I think that a machine with the add *faster* than the shift is weird.
Notice that I say *faster*.  I can imagine 'just as fast', to simplify
clocking, etc., but if your FMULs are faster than your FADDs, this means
that your addware is faster than your shiftware (are these English words?
No?  Well, now they are).

Unless I completely misunderstand the way multiplies are done (ROM lookup,
maybe?), this reasoning seems to hold to me.
--
	Jack Jansen, {decvax|philabs|seismo}!mcvax!jack
	The shell is my oyster.
solworth@cornell.UUCP (Jon Solworth) (04/14/85)
The NS16XXX's longer time for addition than for multiplication is not
unreasonable from the scientific programmer's point of view.  Multiplies
vastly outnumber adds in scientific code.

Jon A. Solworth
Cornell University
dgh@sun.uucp (David Hough) (04/17/85)
In article <930@cornell.UUCP> solworth@gvax.UUCP (Jon Solworth) writes:
>
> The NS16XXX's longer time for addition than for multiplication is not
>unreasonable from the scientific programmer's point of view.  Multiplies
>vastly outnumber adds in scientific code.
>
Really?  Are there any published papers with this result?

I was under the impression that linear algebra algorithms typically take
essentially equal numbers of floating point additions and multiplications,
with comparatively few other floating point operations.  Most scientific
computation gets reduced to linear algebra sooner or later.

David Hough
dsmith@hplabsc.UUCP (David Smith) (04/17/85)
The PDP-11/60 floating point unit uses a (replicated?) lookup table for
its multiplier.  Several chunks of the result are fed through an adder
tree.  Many of the product chunks can be laid end-to-end, so the adder
tree doesn't require too many inputs.