ward@cfa.UUCP (Steve Ward) (03/27/85)
The purpose of providing this information is to inform, not to promote a
particular manufacturer, computer, or microprocessor.  I am currently
designing a VMEbus CPU card using the MC68000 and the NS32081, among other
components.  I am a Vax computer user.  The Weitek information turned up
as I was typing this message, so it is included, though it is very
incomplete.

Here are some timing specifications for register-to-register floating
point instructions on a variety of computers and microprocessors.  All
specifications are for hardware-assisted floating point operations using
the specified hardware floating point coprocessor.  The Vax timings are
for the standard DEC floating point accelerator hardware.  These
specifications are taken from literature provided by the vendors.

The Weitek chip set is supposed to be an MC68020 coprocessor
implementation.  However, it is not known whether the timings given here
are actually register-to-register timings with the Weitek chip set
interfaced as a coprocessor to an MC68020, or whether they are timings of
operations internal to the Weitek chip set.  The literature is not clear,
mainly because it is taken from trade journal press releases.  The Weitek
chip set is so fast that I opted to include it here, even though I do not
have data sheets for the parts as yet.  All other specifications are taken
directly from the vendors' data sheets and other vendor technical
documents.

It is not yet clear to many scientific minicomputer users whether the
NS320xx/NS32081 and MC680xx/MC68881 combinations will be viable
substitutes for the 11/750 and 11/780 Vaxes.  The rumored
announcement/arrival of the so-called MicroVax II, which is alleged to
provide 11/780 performance (90% or better for floating point, up to 105%
for other instructions), should also address the financial arguments by
placing an 11/780 class computer (including floating point) into the
microcomputer workstation market.  The MC68020/WTL1164,WTL1165 combination
looks very interesting.  As of the date of this posting, the MC68881 has
been distributed only as samples in the 12.5 MHZ version.

FADD = 32 bit single precision register-to-register floating point add.
FSUB = 32 bit single precision register-to-register floating point subtract.
FMUL = 32 bit single precision register-to-register floating point multiply.
FDIV = 32 bit single precision register-to-register floating point divide.
DADD = 64 bit double precision register-to-register floating point add.
DSUB = 64 bit double precision register-to-register floating point subtract.
DMUL = 64 bit double precision register-to-register floating point multiply.
DDIV = 64 bit double precision register-to-register floating point divide.

        =====================================
    * * * ALL TIMINGS ARE IN MICROSECONDS * * *
        =====================================

Computer
or
Microprocessor             FADD  FSUB  FMUL  FDIV  DADD  DSUB  DMUL  DDIV
=========================  ====  ====  ====  ====  ====  ====  ====  ====

11/780 Vax w/FPA           1.19  1.19  1.19  4.60  2.39  2.59  3.40  8.82

11/750 Vax w/FPA           1.76  1.75  2.27  6.44  2.63  2.63  4.69 12.80

11/730 Vax w/FPA           4.81  7.85  9.88 11.56  9.85 13.27 23.07 23.34

MC68020/MC68881 16.67 MHZ  2.80  2.80  3.10  3.80  2.80  2.80  4.00  5.90

MC68020/MC68881 12.5 MHZ   3.73  3.73  4.13  5.07  3.73  3.73  5.33  7.87

NS32016/NS32081 10 MHZ     7.40  7.40  4.80  8.90  7.40  7.40  6.20 11.90

NS32016/NS32081 8 MHZ      9.25  9.25  6.00 11.13  9.25  9.25  7.75 14.86

Weitek WTL1164,WTL1165      -     -    0.360  -     -     -    0.600  -

The Weitek line is very bare.
The fact that an MC68020 floating point coprocessor is being manufactured
by Weitek is interesting, and it at least promises to be very fast.
Perhaps someone else can look into the Weitek situation.  The trade
journal "new product" information was pretty flaky.


Steven M. Ward
Center for Astrophysics
60 Garden Street
Cambridge, MA 02138
(617) 495-7466
{genrad, allegra, ihnp4, harvard}!wjh12!cfa!ward
dgh@sun.uucp (David Hough) (03/29/85)
In article <133@cfa.UUCP> ward@cfa.UUCP (Steve Ward) writes:
>
>Here are some timing specifications for register-to-register floating
>point instructions on a variety of computers and microprocessors.
>These specifications are taken from literature provided by the vendors.

Unfortunately, such comparisons don't convey very much useful information.
What is more relevant is a comparison of execution times for a real
program compiled by a real compiler running on a real system.
Unfortunately, marketeers are seldom interested in providing useful,
verifiable information.

The only benchmark I know of that is widely cited and realistic for a
problem of actual computational interest is the "Linpack" benchmark
published by Dongarra and others.  The benchmark solves 100x100 systems of
linear equations using routines from the Linpack library.  32 bit floating
point times range from 35 milliseconds to 2813 seconds.  64 bit floating
point times range from 21 milliseconds to 149 seconds.  Detailed results
are available from Jack Dongarra at Argonne.

David Hough
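For concreteness, most of the time in that benchmark goes into a single
kernel: the Linpack factorization routines repeatedly call the DAXPY
operation y := a*x + y, which pairs one multiply with one add per element.
A minimal C sketch of that inner loop follows; the vector length and the
names used are chosen only for illustration, and this is not the benchmark
itself.

#include <stdio.h>

#define N 100                         /* order of the benchmark matrix */

/* y := a*x + y over n elements -- the Linpack inner loop */
static void daxpy(int n, double a, const double *x, double *y)
{
    for (int i = 0; i < n; i++)
        y[i] += a * x[i];             /* one multiply and one add each */
}

int main(void)
{
    static double x[N], y[N];
    for (int i = 0; i < N; i++) { x[i] = i + 1.0; y[i] = 1.0; }

    daxpy(N, 2.0, x, y);              /* one column update out of the
                                         roughly N^3/3 such operations
                                         in a full factorization       */
    printf("y[0]=%g y[99]=%g\n", y[0], y[99]);
    return 0;
}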
jack@boring.UUCP (04/03/85)
In article <133@cfa.UUCP> ward@cfa.UUCP (Steve Ward) writes:
> Computer
> or
> Microprocessor             FADD  FSUB  FMUL  FDIV  DADD  DSUB  DMUL  DDIV
>=========================  ====  ====  ====  ====  ====  ====  ====  ====
> ...
>NS32016/NS32081 10 MHZ     7.40  7.40  4.80  8.90  7.40  7.40  6.20 11.90
>
>NS32016/NS32081 8 MHZ      9.25  9.25  6.00 11.13  9.25  9.25  7.75 14.86
>
Is this true?  It *does* sound funny to me that additions and
subtractions take longer than multiplies....

The only thing I can imagine from the table is that *ADD/*SUB are always
done in double mode, but this doesn't seem to make sense if there is
hardware to do multiplies/divides in single precision.

Can anyone enlighten me on this???

>Weitek WTL1164,WTL1165      -     -    0.360  -     -     -    0.600  -
>
>The Weitek line is very bare.  The fact that an MC68020 floating point
>coprocessor is being manufactured by Weitek is interesting, and it at
>least promises to be very fast.  Perhaps someone else can look into the
>Weitek situation.  The trade journal "new product" information was
>pretty flaky.
>
>
>Steven M. Ward
>Center for Astrophysics
>60 Garden Street
>Cambridge, MA 02138
>(617) 495-7466
>{genrad, allegra, ihnp4, harvard}!wjh12!cfa!ward
--
	Jack Jansen, {decvax|philabs|seismo}!mcvax!jack
	It's wrong to wish on space hardware.
brooks@lll-crg.ARPA (Eugene D. Brooks III) (04/05/85)
> In article <133@cfa.UUCP> ward@cfa.UUCP (Steve Ward) writes:
> > Computer
> > or
> > Microprocessor             FADD  FSUB  FMUL  FDIV  DADD  DSUB  DMUL  DDIV
> >=========================  ====  ====  ====  ====  ====  ====  ====  ====
> > ...
> >NS32016/NS32081 10 MHZ     7.40  7.40  4.80  8.90  7.40  7.40  6.20 11.90
> >
> >NS32016/NS32081 8 MHZ      9.25  9.25  6.00 11.13  9.25  9.25  7.75 14.86
> >
> Is this true?  It *does* sound funny to me that additions and
> subtractions take longer than multiplies....
> The only thing I can imagine from the table is that *ADD/*SUB are always
> done in double mode, but this doesn't seem to make sense if there is
> hardware to do multiplies/divides in single precision.
>
> Can anyone enlighten me on this???

Yes, it's true that an fadd takes longer than an fmult on the NS32032,
according to the NS manual.  I assume that NS knows what the timing is for
their HW.  A floating point add or subtract requires a shift of the
mantissa to get the exponents to agree before adding.  The multiply does
not require this.  Perhaps the 16081 has to do the shift a bit at a time
in microcode.  Anyone from National on the net to answer this one?
doug@terak.UUCP (Doug Pardee) (04/05/85)
> > Microprocessor             FADD  FSUB  FMUL  FDIV  DADD  DSUB  DMUL  DDIV
> >NS32016/NS32081 10 MHZ     7.40  7.40  4.80  8.90  7.40  7.40  6.20 11.90
> >NS32016/NS32081 8 MHZ      9.25  9.25  6.00 11.13  9.25  9.25  7.75 14.86
>
> Is this true?  It *does* sound funny to me that additions and
> subtractions take longer than multiplies....

I don't know about the 32081 specifically, but the usual reason for this
kind of behavior is that before a floating point addition or subtraction
can be performed, the operand with the lesser exponent must be
"de-normalized" to have the same exponent as the other operand.  Since the
'081 operates on 53-bit fractions, this might take up to 53 shift
operations, at one bit per clock cycle.  Let's see, 53 * 100 ns = 5.3
microseconds just for pre-normalization.
--
Doug Pardee -- Terak Corp. -- !{hao,ihnp4,decvax}!noao!terak!doug
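A small C sketch of that arithmetic, using the 53-bit fraction and 100 ns
cycle quoted above.  The function names and the barrel-shifter comparison
are illustrative assumptions, not a description of the '081.

#include <stdio.h>

#define FRAC_BITS 53        /* fraction width assumed above            */
#define CLOCK_NS  100.0     /* 10 MHz part => 100 ns per cycle         */

/* cycles to align the smaller operand, shifting one bit per cycle     */
static int align_cycles_serial(int exp_a, int exp_b)
{
    int diff = exp_a > exp_b ? exp_a - exp_b : exp_b - exp_a;
    return diff > FRAC_BITS ? FRAC_BITS : diff;   /* past 53 bits the
                                                     small operand is gone */
}

/* a barrel shifter covers any shift distance in a single cycle        */
static int align_cycles_barrel(int exp_a, int exp_b)
{
    return (exp_a != exp_b) ? 1 : 0;
}

int main(void)
{
    int worst = align_cycles_serial(0, FRAC_BITS);     /* worst case: 53 */
    printf("serial shifter worst case: %d cycles = %.1f us\n",
           worst, worst * CLOCK_NS / 1000.0);
    printf("barrel shifter worst case: %d cycle\n",
           align_cycles_barrel(0, FRAC_BITS));
    return 0;
}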
cdl@mplvax.UUCP (Carl Lowenstein) (04/05/85)
In article <6370@boring.UUCP> jack@boring.UUCP (Jack Jansen) writes:
>In article <133@cfa.UUCP> ward@cfa.UUCP (Steve Ward) writes:
>> Microprocessor             FADD  FSUB  FMUL  FDIV  DADD  DSUB  DMUL  DDIV
>>NS32016/NS32081 10 MHZ     7.40  7.40  4.80  8.90  7.40  7.40  6.20 11.90
>>
>Is this true?  It *does* sound funny to me that additions and
>subtractions take longer than multiplies....
>	Jack Jansen, {decvax|philabs|seismo}!mcvax!jack

If you think about it, floating-point addition and subtraction are much
more difficult than multiplication.

Multiplication involves only extraction and addition of the exponents, and
multiplication of the fractions, with about a 25% probability of a
single-bit renormalization.

Addition requires the same extraction of exponent and fraction, followed
by a variable shift to align binary points, the actual addition of the
fractions, and then possibly a lot of renormalization, since the sum might
overflow by one bit position, or underflow a lot due to cancellation
between positive and negative operands.  All this shifting for normalizing
takes time.
--
	carl lowenstein	   marine physical lab	 u.c. san diego
	{ihnp4|decvax|akgua|dcdwest|ucbvax}  !sdcsvax!mplvax!cdl
srm@nsc.UUCP (Richard Mateosian) (04/07/85)
In article <6370@boring.UUCP> jack@boring.UUCP (Jack Jansen) writes:
>> Microprocessor             FADD  FSUB  FMUL  FDIV  DADD  DSUB  DMUL  DDIV
>>=========================  ====  ====  ====  ====  ====  ====  ====  ====
>>NS32016/NS32081 10 MHZ     7.40  7.40  4.80  8.90  7.40  7.40  6.20 11.90
>Is this true?  It *does* sound funny to me that additions and
>subtractions take longer than multiplies....

It's true that adds take longer than multiplies on the NS32081.  It's
because adds are done by repeated multiplications ( :-) ).

The real reason is that there is fancier multiplication hardware on chip
than addition hardware.  Besides, additions require a normalization step
before the operation as well as after.
--
Richard Mateosian
{allegra,cbosgd,decwrl,hplabs,ihnp4,seismo}!nsc!srm    nsc!srm@decwrl.ARPA
phil@osiris.UUCP (Philip Kos) (04/08/85)
Into the fray!

By my own reasoning (which can be pretty convoluted and obscure, when it's
not just plain perverted, so feel free to correct me), the slower
FADD/FSUB times are strange but not inexplicable.  Several people have
already posted a reasonable (and, I assume, correct) explanation of this
phenomenon.  The required denormalization of the smaller addend and
ultimate normalization of the sum are obviously the culprits here.  But
the question remains: is this reasonable?

Here, for your edification, is a review of how floating point arithmetic
actually works.  I haven't included division because it's an operation of
a different color; algorithms that speed it up quite unreasonably exist
and aren't particularly hard to implement, and I don't know which
algorithms different FP chip makers use.

FADD/FSUB

Arguments:  two 32-bit (or 64-bit for DADD/DSUB) addends.
Result:     one 32-bit (or 64-bit) sum.

Algorithm:  1. Calculate the difference of exponents.
            2. Shift the smaller (in absolute magnitude) addend to the
               right until it aligns with the larger addend (can use the
               exponent difference as a counter preset).
            3. Add.
            4. Truncate and normalize the sum.

Notes:      If the difference of exponents is greater than the mantissa
            resolution (24 bits for single precision, 48 for double), no
            addition or subtraction needs to be done.  The sum in these
            cases is simply the larger addend.  Thus, no more than 24
            (or 48) shifts need be done (allowing for 1 check bit).

Maximum time = (max denormalization time) + (addition time)
               + (max normalization time)
             = ((exponent subtraction time) + (time for 24 (or 48) shifts))
               + (addition time)
               + ((time for 23 (or 47) shifts) + (exponent adjust time))

Using a barrel shifter to align the addends reduces the max
denormalization time by changing the order of the algorithm from O(n) to
O(log2(n)).

FMUL

Arguments:  32-bit (or 64-bit for DMUL) multiplier and multiplicand.
Result:     one 64-bit (or 128-bit) product, only half of which is
            usually used.

Algorithm:  1. Add the multiplier and multiplicand exponents to generate
               the product exponent.
            2. Multiply by repeated addition (there may be special
               hardware for this; otherwise it's shift and conditionally
               add, shift and conditionally add, etc.)
            3. Truncate and normalize the product.

Notes:      Multiplication by repeated addition is an O(n) algorithm.  If
            a special shift/add unit is used, multiply time is reduced
            because the total time is based on gate delays rather than
            bucket shifter clock cycles.

Maximum time = (exponent addition time) + (multiplication time)
               + (normalization time)
             = (exponent addition time)
               + (24 (48) * (time for one add and shift))
               + (one shift and exponent adjust)

Conclusions:

As has been noted, normalizing a product is almost always quicker than
normalizing a sum.  This is because in multiplication you begin with two
normalized numbers, which will yield a product needing at most 1
normalization shift.  FADD, on the other hand, may need 23 (47)
normalization shifts because of leading-bit cancellation.

If a barrel shifter is available, the initial denormalization for FADD
could be reduced significantly.  A barrel shift may be as fast as a
single-bit bucket shift.  This would improve worst case performance by 23
(47) bucket shift clock cycles.  It would probably improve the average
FADD times enough to make it a faster operation than FMUL.  I am surprised
and dismayed that many commercial FPUs do not have barrel shifters.
Granted, it's extra complexity which may not be justified by the *overall*
performance increase.  But we're only talking about a circuit with
48*log2(48) = ~288 multiplexors (not counting the 5 intermediate
registers).  That may be a significant chunk of the available silicon, but
at today's densities it shouldn't add that much complexity to the circuit.
Am I asking for too much?  (I do enjoy having my cake and eating it
too...)

Phil Kos
The Johns Hopkins Hospital
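A minimal C model of the FADD recipe above (align, add, renormalize).
This is only a sketch of the algorithm as described in the thread, not the
microcode of the NS32081, MC68881, or any other part: the 24-bit fraction,
the toy fp24 structure, and the one-bit-per-step loops are assumptions
chosen to mirror the listed steps, and signs, rounding, and special values
are ignored.

#include <stdint.h>
#include <stdio.h>

#define FRAC_BITS 24                        /* single-precision fraction */
#define HIDDEN    (1u << (FRAC_BITS - 1))   /* implicit leading 1 bit    */

struct fp24 {
    int      exp;                   /* unbiased exponent                 */
    uint32_t frac;                  /* normalized: HIDDEN <= frac < 2^24 */
};

static struct fp24 fadd(struct fp24 a, struct fp24 b)
{
    struct fp24 big = (a.exp >= b.exp) ? a : b;
    struct fp24 sml = (a.exp >= b.exp) ? b : a;

    /* 1-2. align: shift the smaller fraction right, one bit per step   */
    int diff = big.exp - sml.exp;
    if (diff >= FRAC_BITS) return big;      /* smaller operand is lost   */
    while (diff--) sml.frac >>= 1;

    /* 3. add the fractions                                              */
    struct fp24 sum = { big.exp, big.frac + sml.frac };

    /* 4. renormalize: a carry out needs one right shift ...             */
    if (sum.frac >> FRAC_BITS) { sum.frac >>= 1; sum.exp++; }
    /*    ... while cancellation (in a true add/sub with signs) could
     *    need up to FRAC_BITS-1 left shifts -- the slow case discussed
     *    in this thread.                                                 */
    while (sum.frac && !(sum.frac & HIDDEN)) { sum.frac <<= 1; sum.exp--; }
    return sum;
}

int main(void)
{
    struct fp24 x = { 10, HIDDEN | 0x1234 };     /* arbitrary operands   */
    struct fp24 y = {  3, HIDDEN | 0x0042 };
    struct fp24 z = fadd(x, y);
    printf("exp=%d frac=0x%06x\n", z.exp, z.frac);
    return 0;
}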
BillW@SU-SCORE.ARPA (William Chops Westfield) (04/12/85)
You are probably asking for too much.  Most of the floating point chips
available use the IEEE floating point format, which means (I think) that
they do all the math with 80 bit operands, and then convert to 32 or 64
bit formats on input and/or output.

In the multiply algorithm, you only have to calculate the most significant
24 bits of the product, right?  This may improve the performance of the
multiply.

Something I've always wondered about is whether any of the chips do an N*N
bit multiply in hardware, and then use fewer iterations to get the final
product - a 4*4 bit (unclocked) multiplier is not very complicated - how
much would it speed up a 64*64 bit multiply?  (I'm too lazy to try to
figure it out...)

BillW
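A C sketch of the idea, purely illustrative and not a description of any
chip: do the multiply four multiplier bits at a time, so a 32x32 bit
shift-and-add loop runs 8 times instead of 32 (and a 64*64 bit multiply
would take 16 iterations instead of 64).  The 4-bit digit product below
stands in for the small unclocked multiplier array.

#include <stdint.h>
#include <stdio.h>

/* 32x32 -> 64-bit multiply, consuming 4 multiplier bits per iteration.
 * The (uint64_t)a * digit term plays the role of the hardware multiplier
 * array; the rest is the usual shift-and-add loop.                      */
static uint64_t mul32_radix16(uint32_t a, uint32_t b)
{
    uint64_t acc = 0;
    for (int i = 0; i < 8; i++) {                 /* 8 iterations, not 32   */
        uint32_t digit = (b >> (4 * i)) & 0xF;    /* next 4 multiplier bits */
        acc += ((uint64_t)a * digit) << (4 * i);  /* shifted partial product */
    }
    return acc;
}

int main(void)
{
    uint32_t a = 0x12345678u, b = 0x9abcdef0u;
    printf("radix-16 result: 0x%016llx\n",
           (unsigned long long)mul32_radix16(a, b));
    printf("reference:       0x%016llx\n",
           (unsigned long long)a * b);            /* should match */
    return 0;
}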
jack@boring.UUCP (04/12/85)
The point made by most of the people replying to my question of why FADD
is slower than FMUL is that the shifting done to align the operands before
the add is what causes it.

If I examine an ADD, I come to the following steps (assume normalized
numbers, and an N bit mantissa):
1. A maximum of N shifts, to make the exponents equal.
2. One addition.
3. A maximum of N shifts, to renormalize the result.

For a multiply, I get
1. A maximum of N times a shift, plus a conditional addition.
2. One addition for the exponents (could probably be done in parallel, so
   let's forget it).

Now, I think that a machine with the add *faster* than the shift is weird.
Notice that I say *faster*.  I can imagine 'just as fast', to simplify
clocking, etc., but if your FMULs are faster than your FADDs, this means
that your addware is faster than your shiftware (are these English words?
No?  Well, now they are).

Unless I completely misunderstand the way multiplies are done (ROM lookup,
maybe?), this reasoning seems to hold to me.
--
	Jack Jansen, {decvax|philabs|seismo}!mcvax!jack
	The shell is my oyster.
solworth@cornell.UUCP (Jon Solworth) (04/14/85)
The NS16XXX's longer time for addition than for multiplication is not
unreasonable from the scientific programmer's point of view.  Multiplies
vastly outnumber adds in scientific code.

Jon A. Solworth
Cornell University
dgh@sun.uucp (David Hough) (04/17/85)
In article <930@cornell.UUCP> solworth@gvax.UUCP (Jon Solworth) writes:
>
> The NS16XXX's longer time for addition than for multiplication is not
>unreasonable from the scientific programmer's point of view.  Multiplies
>vastly outnumber adds in scientific code.
>
Really?  Are there any published papers with this result?

I was under the impression that linear algebra algorithms typically take
essentially equal numbers of floating point additions and multiplications,
with comparatively few other floating point operations.  Most scientific
computation gets reduced to linear algebra sooner or later.

David Hough
dsmith@hplabsc.UUCP (David Smith) (04/17/85)
The PDP-11/60 floating point unit uses a (replicated?) lookup table for
its multiplier.  Several chunks of the result are fed through an adder
tree.  Many of the product chunks can be laid end-to-end, so the adder
tree doesn't require too many inputs.