[net.arch] Floating point performance

ian@loral.UUCP (Ian Kaplan) (09/25/86)

			     ABSTRACT

   The scalar performance of microprocessors is increasing much faster than
   the floating point performance of floating point coprocessors.  
   Floating point performance is essential for many Fortran applications.  
   The purpose of this note is to encourage microprocessor designers to 
   concentrate more on floating point.

   The Current State of the World
   ------------------------------

   The "big three" microprocessor manufacturers (iNTEL, Motorola and
   National Semiconductor) have all announced 32-bit microprocessors.  
   These chips all have high scalar performance (at least for
   microprocessors).  
   
   Unfortunately the speed of the floating point co-processors has not 
   kept up.  Below are some approximate speeds for the various math 
   coprocessors.  The performance listed is "peak" performance, which 
   assumes that the operands are already in the floating point registers.

      Intel   80287   less than 0.1 MFLOP
      National        0.1 MFLOP
      Motorola        0.3 MFLOP

   Naturally these speeds will vary, depending on clock speed, but they
   are not far off.  None of these coprocessors has performance that is 
   close to 1 MFLOP.
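
   (To give a feel for what "peak" means here: assume an add or a multiply
   takes on the order of 80 cycles on one of these coprocessors, which is
   the right order of magnitude for these parts.  At 8 MHz that works out
   to 8,000,000 / 80 = 100,000 operations per second, i.e. 0.1 MFLOP, and
   that is with the operands already sitting in the floating point
   registers.)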

   National has a chip (the 32310) that will allow the 32032 to be
   interfaced with the Weitek floating point ALU and multiply unit.
   This yields 0.8 MFLOPS with math error checking enabled and 1.2 MFLOPS
   with it disabled.  Unfortunately the 32310 solution not only entails a
   significant increase in component price (a 32310, two Weitek chips and
   support logic), but also consumes a lot of board space and power.  Whether 
   the increase in floating point performance justifies these increased 
   costs is arguable.  I would like to get more MFLOPS for my bucks.

   We Need More MFLOPS
   -------------------

   At one time (three years ago) 0.1 MFLOP was considered pretty good
   floating point performance.  This is no longer true.  Floating point
   performance tends to be measured now against the Weitek chips (or AMD or
   Analog Devices etc...) used in bit-slice applications (e.g., 10 MFLOPS).

   The floating point performance available with the Weitek chips would be
   wasted on the current generation of microprocessors.  Even the 
   fastest microprocessors currently available could not feed
   floating point operands to a coprocessor fast enough to keep up with the
   10 MFLOP rate.  What is needed is a coprocessor (or an integrated
   floating point processor) that will yield 1 to 2 MFLOPS.
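
   (A rough calculation shows why: at 10 MFLOPS a new operation must start
   every 100 ns.  With two 32-bit operands in and one 32-bit result out per
   operation, that is on the order of 120 Mbytes per second of operand
   traffic, well beyond the sustained memory bandwidth of today's
   microprocessors.  These figures assume single precision; doubles make
   the gap worse.)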

   So far the only microprocessor that comes close to delivering 1 MFLOP of
   floating point performance is the Fairchild Clipper.  The Clipper has an
   on-chip floating point unit rated at about 1 MFLOP.  

   Fortran, despite its many problems, is one of the most widely used 
   programming languages.  Despite the hopes of many computer scientists, 
   Fortran will never disappear (there will just be new versions of 
   Fortran).  The increasing power of microprocessors means that 
   more Fortran applications will be run on microprocessor based 
   systems.  Many Fortran applications are floating point bound, so the 
   increased scalar performance provided by the current generation of 
   32-bit microprocessors will only speed them up fractionally.  For
   these applications floating point performance will be the deciding
   factor.  The microprocessor vendor that realizes the importance of 
   floating point performance, and develops a solution, will capture a 
   large number of designs where Fortran performance is important.
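
   (To put a number on "fractionally": if a program spends 80% of its time
   in floating point and 20% in scalar code, doubling the scalar speed cuts
   the run time only from 1.0 to 0.8 + 0.1 = 0.9, about an 11% speedup,
   while doubling the floating point speed cuts it to 0.4 + 0.2 = 0.6, a
   67% speedup.  The 80/20 split is only an assumption, but it is typical
   of floating point bound codes.)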


		     Ian Kaplan
		     Loral Dataflow Group
		     Loral Instrumentation
	     USENET: {ucbvax,decvax,ihnp4}!sdcsvax!loral!ian
	     ARPA:   sdcc6!loral!ian@UCSD
	     USPS:   8401 Aero Dr. San Diego, CA 92123

henkp@nikhefk.uucp (Henk Peek) (10/05/86)

In article <340@euroies.UUCP> shepherd@euroies.UUCP (Roger Shepherd) writes:
>I was interested to read Ian Kaplan's (...!loral!ian)
> 
>For example, Ian quotes some performance figures as
> 
>   Intel 80287       < 0.1 MFLOP  (say 0.95 MFLOP at 8Mhz?)
>   National            0.1 MFLOP
>   Motorola            0.3 MFLOP
> 
>To these I'll add the figure for the INMOS transputer (no
>co-processor, floating point done in software)
> 
>   Inmos IMS T414-20   0.09 MFLOP (typical for * and /,
>                                   + and - are slower!)
>Roger Shepherd
>INMOS Limited, 1000 Aztec West, Almondsbury, Bristol, BS12 4SQ, GB
>USENET: ...!euroies!shepherd
>PHONE: +44 454 616616

At the Inmos seminar in the Netherlands, Inmos claimed a floating-
point performance of ~1 Mflop for the T800.  It is a T414 with microcoded
floating point on a single chip.  The chip is pin compatible with
the T414-20.  We haven't received the handout, so I don't have the
exact numbers.  They claimed that you can get samples within a few
weeks.  Has someone worked with this chip?

uucp:  henkp@nikhefk@mcvax   seismo!mcvax!nikhefk!henkp
mail:  Henk Peek, NIKHEF-K, PB 4395, 1009 AJ Amsterdam, Holland

shepherd@euroies.UUCP (Roger Shepherd) (10/08/86)

I was interested to read Ian Kaplan's (...!loral!ian) appeal
for microprocessors with fast floating point.  I am a little
concerned with the use of `peak' performance to characterise
the speed of a part, as I don't think that this necessarily
reflects the USABLE performance of the part.  I think it
is instructive to look at MFLOPs compared with Whetstones (a
good benchmark of performance on `typical' scientific
programs).
 
For example, Ian quotes some performance figures as
 
   Intel 80287       < 0.1 MFLOP  (say 0.95 MFLOP at 8Mhz?)
   National            0.1 MFLOP
   Motorola            0.3 MFLOP
 
To these I'll add the figure for the INMOS transputer (no
co-processor, floating point done in software)
 
   Inmos IMS T414-20   0.09 MFLOP (typical for * and /,
                                   + and - are slower!)
 
According to the figures I have to hand, these processors
compare somewhat differently when the Whetstone figures are
compared.  For example, I have single length Whetstone
figures as follows for these machines
 
                                 kWhets   MWhets/MFLOP
                                                    (normalised)
   Intel 80286/80287    (8 Mhz)   300         3.2      1.0
   NS 32032 & 32081    (10 Mhz)   128         1.3      0.4
   MC 68020 & 68881 (16 & 12.5)   755         2.5      0.8
 
   Inmos IMS T414B-20             663         7.4      2.3
 
The final column gives some feel for how effective these
processor/co-processor (just processor for the T414)
combinations are at turning MFLOPS into usable floating
point performance.
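
The middle column, incidentally, is just the Whetstone figure divided by
the peak MFLOP figure quoted above (e.g. 0.755 MWhets / 0.3 MFLOP is
roughly 2.5 for the 68020 & 68881, and 0.663 / 0.09 is roughly 7.4 for
the T414); the final column then divides each entry by the 80287's 3.2.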
 
Also, I don't quite know why Ian likes the CLIPPER (three
chips on the picture of the (large) module I've seen) but
dislikes the NS 32310 (four chips); they seem to give the
same MFLOP rating. (Does anyone have Whetstone figures for
these two?)
 
Comparisons against Weiteks (or whatever) are also somewhat
suspect.  To use their peak data rate you have to use them in
pipelined mode (their scalar mode tends to be somewhat slower),
and it might be possible to build a microprocessor system
that could feed them data and accept results at that rate.
However, if you're only using the chips in that mode I'm not
convinced that you really want all that silicon to be taken
up with a large pipelineable (?) multiplier; I'd rather have
a processor there!
 
On the same subject (sort of), what measure should be made
of the `goodness' of a floating point microprocessor? How
about MWhetstones per square centimetre? (Or do all you
guys and girls still use inches? :-) ) Or, how about
MWhetstones per milliwatt?
-- 
Roger Shepherd
INMOS Limited, 1000 Aztec West, Almondsbury, Bristol, BS12 4SQ, GB
USENET: ...!euroies!shepherd
PHONE: +44 454 616616

ian@loral.UUCP (Ian Kaplan) (10/08/86)

In article <340@euroies.UUCP> shepherd@euroies.UUCP (Roger Shepherd) writes:
> 
>                                 kWhets   MWhets/MFLOP
>                                                    (normalised)
>   Intel 80286/80287    (8 Mhz)   300         3.2      1.0
>   NS 32032 & 32081    (10 Mhz)   128         1.3      0.4
>   MC 68020 & 68881 (16 & 12.5)   755         2.5      0.8
> 
>   Inmos IMS T414B-20             663         7.4      2.3
> 

  These figures are interesting.  I am surprised at the figure for the
  80287.  Intel uses this processor on the Intel Cube, which has very
  poor floating point performance.  The above table suggests that
  reasonable floating point performance could be achieved by increasing the
  clock rate.  It is not clear to me that this is borne out by reality.
  Does anyone have floating point performance numbers for a 12.5 MHz
  80287?

  
> 
>Also, I don't quite know why Ian likes the CLIPPER (three
>chips on the picture of the (large) module I've seen) but
>dislikes the NS 32310 (four chips); they seem to give the
>same MFLOP rating. (Does anyone have Whetstone figures for
>these two?)
> 

  The Whetstone figure for the 32310 is 1.137 MWhets and 0.8 MFLOP.

  I like the Clipper because I have the impression that it uses less board
  space and power than a 32032, a 32310 and two Weitek chips.
  There are other considerations that must be taken into account also,
  like the history of a product line.

>
>-- 
>Roger Shepherd
>INMOS Limited, 1000 Aztec West, Almondsbury, Bristol, BS12 4SQ, GB
>USENET: ...!euroies!shepherd
>PHONE: +44 454 616616


		     Ian Kaplan
		     Loral Dataflow Group
		     Loral Instrumentation
	     USENET: {ucbvax,decvax,ihnp4}!sdcsvax!loral!ian
	     ARPA:   sdcc6!loral!ian@UCSD
	     USPS:   8401 Aero Dr. San Diego, CA 92123

curry@nsc.UUCP (Ray Curry) (10/08/86)

>Path: nsc!pyramid!decwrl!decvax!ucbvax!ucbcad!nike!lll-crg!seismo!mcvax!euroies!shepherd
>From: shepherd@euroies.UUCP (Roger Shepherd)
>Newsgroups: net.arch
>Subject: Floating point performance
>Message-ID: <340@euroies.UUCP>
 
>dislikes the NS 32310 (four chips); they seem to give the
>same MFLOP rating. (Does anyone have Whetstone figures for
>these two?)
 
>Comparisons against Weiteks (or whatever) are also somewhat
>suspect.  To use their peak data rate you have to use them in
>pipelined mode, their scalar mode tends to be somewhat slower
 
-- 
>Roger Shepherd
>INMOS Limited, 1000 Aztec West, Almondsbury, Bristol, BS12 4SQ, GB
>USENET: ...!euroies!shepherd
>PHONE: +44 454 616616

Just by coincidence, I have been running some floating point benchmarks
on the NS32081 floating point processor and thought I should respond
with some more up-to-date numbers.  I ran the single precision Whetstone 
on the NS32032 and NS32081 at 10MHz on the DB32000 board, and the NS32332
and NS32081 at 15 MHz on the DB332 board.  I don't know where the posted
32032-32081 number came from, but I measure better even using our older
compilers.  Our new compilers show marked improvement.

	32032-32081 (10MHz)		189 Kwhets (old compiler)
	32032-32081 (10MHz)		390 Kwhets (new compiler)
	32332-32081 (15MHz)		728 Kwhets (new compiler)

I used the 32332-32081 numbers to generate instruction counts to project
worst case performance for the NS32310 and the NS32381, worst case being
using the identical math routines and minimizing the pipelining of the
32310.  These project performance for the 32332-32381 (15MHz) at approx-
imately 1100-1200 KWhets and 32332-32310 (15MHz) at 1500-1600 KWhets. 
Since both the 32310 and 32381 will have new instructions that will
impact the math libraries, the real performance could be higher.
Just for interest, preliminary analysis suggests that pipelining should
improve performance by at least 15% overall (30% for the floating point
portion of the instruction mix).

I would like to add my own question about the value of benchmarks.  That
is, what do the people on the net feel about transcendental functions?
The Whetstone seems to me to place more emphasis on them than real life.
One of the reasons for not including them directly in the 32081 was that
it was felt that implementing them in math routines instead of hardware 
was more cost effective.  Is this true or are transcendentals important
enough for the increased cost of implementing them in hardware?

srm@iris.berkeley.edu (Richard Mateosian) (10/09/86)

Just to set the record straight, at 10 MHz the NS32032/32081 achieves
about 400 kWhets/sec.  The NS32016/32081 gives around 300.  These figures
are from memory and may be off a little, but they are vastly better than
the 128 kWhets/sec cited in the referenced articles.

Richard Mateosian    ...ucbvax!ucbiris!srm 	     2919 Forest Avenue     
415/540-7745         srm%ucbiris@Berkeley.EDU        Berkeley, CA  94705    

henry@utzoo.UUCP (Henry Spencer) (10/10/86)

> ...what do the people on the net feel about transcendental functions?
> The Whetstone seems to me to place more emphasis on them than real life.
> One of the reasons for not including them directly in the 32081 was that
> it was felt that implementing them in math routines instead of hardware 
> was more cost effective.  Is this true or are transcendentals important
> enough for the increased cost of implementing them in hardware?

Personally, while I strongly suspect that a software implementation is
more cost-effective than doing them in hardware, putting them on-chip
strikes me as a marvellous way of getting them right once and for all
and encouraging everyone to use the done-right version.  (This does assume,
of course, that the chip-maker spends the necessary money to *get* them
right, which requires high-paid specialists and a lot of work.)  One
could get much the same effect with a bare-bones arithmetic chip and a
ROM chip containing the math routines, except that ROMs are too easy to
copy and you'd never recover the investment needed to do a good job.
-- 
				Henry Spencer @ U of Toronto Zoology
				{allegra,ihnp4,decvax,pyramid}!utzoo!henry

stever@videovax.UUCP (10/11/86)

In article <340@euroies.UUCP>, Roger Shepherd (shepherd@euroies.UUCP)
writes:

> I was interested to read Ian Kaplan's (...!loral!ian) appeal
> for microprocessors with fast floating point.  . . .

> For example, Ian quotes some performance figures as
>  
>    Intel 80287       < 0.1 MFLOP  (say 0.95 MFLOP at 8Mhz?)
>    National            0.1 MFLOP
>    Motorola            0.3 MFLOP

Shouldn't the figure in parentheses for the 80287 be 0.095 MFLOP??

> To these I'll add the figure for the INMOS transputer (no
> co-processor, floating point done in software)
>  
>    Inmos IMS T414-20   0.09 MFLOP (typical for * and /,
>                                    + and - are slower!)
>  
> According to the figures I have to hand, these processors
> compare somewhat differently when the Whetstone figures are
> compared.  For example, I have single length Whetstone
> figures as follows for these machines
>  
>                                  kWhets   MWhets/MFLOP
>                                                     (normalised)
>    Intel 80286/80287    (8 Mhz)   300         3.2      1.0
>    NS 32032 & 32081    (10 Mhz)   128         1.3      0.4
>    MC 68020 & 68881 (16 & 12.5)   755         2.5      0.8
>  
>    Inmos IMS T414B-20             663         7.4      2.3
>  
> The final column gives some feel for how effective these
> processor/co-processor (just processor for the T414)
> combinations are at turning MFLOPS into usable floating
> point performance.

As one who has looked at the relative merits of various processors
and coprocessors before making a selection, I am not at all concerned
about "how effective [a] processor/co-processor combination[] [is] at
turning MFLOPS into usable floating point performance."  The bottom
line for an application is closely tied to the numbers in the "kWhets"
column.  The real question is, "How fast will it run my application?"

					Steve Rice

----------------------------------------------------------------------------
{decvax | hplabs | ihnp4 | uw-beaver}!tektronix!videovax!stever

mash@mips.UUCP (John Mashey) (10/13/86)

In article <1989@videovax.UUCP> stever@videovax.UUCP (Steven E. Rice) writes:
>In article <340@euroies.UUCP>, Roger Shepherd (shepherd@euroies.UUCP)
>writes:
>> .....  For example, I have single length Whetstone
>> figures as follows for these machines
>>                                  kWhets   MWhets/MFLOP
>>                                                     (normalised)
>>    Intel 80286/80287    (8 Mhz)   300         3.2      1.0
>>    NS 32032 & 32081    (10 Mhz)   128         1.3      0.4
>>    MC 68020 & 68881 (16 & 12.5)   755         2.5      0.8
>>  
>>    Inmos IMS T414B-20             663         7.4      2.3
>>  
>> The final column gives some feel for how effective these
>> processor/co-processor (just processor for the T414)
>> combinations are at turning MFLOPS into usable floating
>> point performance.
>
>As one who has looked at the relative merits of various processors
>and coprocessors before making a selection, I am not at all concerned
>about "how effective [a] processor/co-processor combination[] [is] at
>turning MFLOPS into usable floating point performance."  The bottom
>line for an application is closely tied to the numbers in the "kWhets"
>column.  The real question is, "How fast will it run my application?"
--THE RIGHT QUESTION!-----------^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Note that this discussion is very akin to the "peak Mips" versus
"sustained Mips" versus "how fast does it run real programs" argument
in the integer side of the world. I think both Roger and Steven have
some useful points, and, in fact, don't seem to me to disagree very much:
	1) (Roger): MFLOPS don't mean very much. (see (1) below, etc)
	2) (Steven): and neither do Whetstones!
	3) (Roger): propose Whetstones / (peak MFLOPS) as architectural measure.
Note that most vendors spec MFLOPS using cached, back-to-back adds with
both arguments already in registers. For real programs, one also needs to
measure effects of:
	a) coprocessor interaction, i.e., can you load/store directly to
	the coprocessor from memory, or do you need to copy arguments
	thru the CPU? (can make a large difference).
	b) Pipelining/overlap effects?
	c) Number of FP registers.
	d) Compiler effects.
(1) In general, peak MFLOPS don't seem to mean too much.  Whetstones seem to
test the FP libraries more than anything else (although this at least
measures SOMETHING a bit more real). (2) A lot of people like LINPACK
MFLOPS ratings, or Livermore Loops, although the former, at least,
also measures the memory system very strongly, i.e., it's bigger than almost
any cache, and that's quite characteristic of some codes, and totally
uncharacteristic of others.

(3) However, a useful attribute of Roger's measure (or a variant thereof)
is that looking at the measure (units of real performance) per Mhz gives
you some idea of architectural efficiency (the fewer cycles needed per
unit of delivered performance, the better), in that cycle time is likely
to be a property of the technology, and hard to improve, at a given level
of technology. [This is clearly a RISC-style argument of reducing the
cycle count for delivered performance, and then letting technology carry
you forward.]  Using the numbers above, one gets KiloWhets / Mhz, for example:
Machine		Mhz	KWhet	KWhet/Mhz
80287		 8	 300	 40
32332-32081	15	 728	 50		(these from Ray Curry,
32332-32381	15	1200	 80		in <3833@nsc.UUCP>) (projected)
32332-32310	15	1600	100*		"" "" (projected)
Clipper?	33	1200?	 40		guess? anybody know better #?
68881		12.5	 755	 60		(from discussion)
68881		20	1240	 60		claimed by Moto, in SUN3-260
SUN FPA		16.6	1700	100*		DP (from Hough) (in SUN3-160)
MIPS R2360	 8	1160	140*		DP (interim, with restrictions)
MIPS R2010	 8	4500	560		DP (simulated)

The *'d ones are boards / controllers for Weitek parts.
The Kwhet/Mhz numbers were heavily rounded: 1-2 digits accuracy is about 
all you can extract from this, at best.  One can argue about the speed that
should be used for the 68881 systems, since the associated 68020 runs faster.
What you do see is (not surprisingly) that heavily microcoded designs
get less Kwhet/Mhz than those that use either the Weitek parts or are
not microcoded.

As usual, whether you think this means anything or not depends on whether or
not you think Whetstones are a good measure.  If not, it would help to
see other things proposed.  For some reason, Floating Point benchmarks seem
to vary pretty strongly in their behavioral patterns.
Also, if anybody has better numbers, it would be nice to see them.  At least
some of the ones in the list above are of uncertain parentage.
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{decvax,ucbvax,ihnp4}!decwrl!mips!mash, DDD:  	408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

dgh@sun.UUCP (10/15/86)

                            Mflops Per MHz

                             David Hough
			   dhough@sun.com

     I'd like to add to John Mashey's recent posting about floating-
point performance.  In the following table extracted and revised from
that posting, the Sun-3 measurements are mine; the MIPS numbers are
Mashey's.  All KW results indicate thousands of double precision Whet-
stone instructions per second.  Results marked * represent implementa-
tions based on Weitek chips.  As Mashey points out, it's not clear
whether the MHz should refer to the CPU or FPU, so I included both.

Machine         CPU Mhz FPU MHz  KW   KW/CPUMhz KW/FPUMHz

Sun-3/160+68881 16.7    16.7     955     60      60
Sun-3/260+68881 25      20      1240     50      60
Sun-3/160+FPA*  16.7    16.7    1840    100     100
Sun-3/260+FPA*  25      16.7    2600    100     160

MIPS R2360*      8       8      1160    140     140     (interim restrictions)
MIPS R2010       8       8      4500    560     560     (simulated)

     As you puzzle over the meaning of these results, remember that
elementary transcendental function routines have only a minor effect on
Whetstone performance when the hardware is high-performance.  Whetstone
benchmark performance is mostly determined by the following code:

                DO 90 I=1,N8
                CALL P3(X,Y,Z)
   90           CONTINUE


        SUBROUTINE P3(X,Y,Z)
        IMPLICIT REAL (A-H,O-Z)
        COMMON T,T1,T2,E1(4),J,K,L
        X1 = X
        Y1 = Y
        X1 = T * (X1 + Y1)
        Y1 = T * (X1 + Y1)
        Z = (X1 + Y1) / T2
        RETURN
        END

On Weitek 1164/1165-based systems, execution time for the P3 loop is
dominated by the division operation, which is about 6 times slower
than an addition or multiplication and can't be overlapped with any
other operation, inhibiting pipelining.  Furthermore, not only can no
1164 operation overlap any 1165 operation, but parallel invocation of
P3 calls can't be justified without doing enough analysis to discover
something far more interesting: the best way to improve Whetstone per-
formance is to do enough global inter-procedural optimization in your
compiler to determine that P3 only needs to be called once.  This
gives a 2X performance increase with no hardware work at all! One MIPS
paper suggests that the MIPS compiler does this or something similar.
Maybe benchmark performance should be normalized for software as well
as hardware technology.
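
     To make the 2X concrete: P3 never changes X, Y, T or T2, and it simply
overwrites Z on every call, so all N8 calls compute the same value and an
interprocedural optimizer may legally keep just one of them.  A sketch of
the transformation (not necessarily what any particular compiler emits; the
IF merely preserves the N8 = 0 case):

                DO 90 I=1,N8
                CALL P3(X,Y,Z)
   90           CONTINUE

becomes

                IF (N8 .GE. 1) CALL P3(X,Y,Z)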

     I've discussed benchmarking issues at length in the Floating-
Point Programmer's Guide for the Sun Workstation, 3.2 Release, leading
to the recommendation that the nonlinear optimization and zero-finding
that P3 is intended to mimic is better benchmarked by the real thing,
such as the SPICE program.  Of course, SPICE is a complicated real
application and its performance is difficult to predict in advance,
and that makes marketing and management scientists everywhere uneasy.

     Linear problems are usually characterized by large dimension and
therefore memory and bus performance is as important as peak
floating-point performance; a Linpack benchmark with suitably-
dimensioned arrays is appropriate.

     I don't know whether RISC or CISC designs will prove to give the
most bang for the buck, but I do have some philosophical questions for
RISC gurus:  Is hardware floating point faster than software floating
point on RISC systems?  If so, and it is because the FPU technology is
faster than the CPU, then why isn't the CPU fabricated with that tech-
nology?  If it's just a matter of obtaining parallelism, then wouldn't
two identical CPU's work just as well and be more flexible for non-
floating-point applications? If there are functional units on the FPU
that aren't on the CPU, should they be on the CPU so non-floating-
point instructions can use them if desirable? If the CPU and FPU are
one chip, cycle times should be slower, but would the reduced communi-
cation overhead compensate?  If you use separate heterogeneous proces-
sors, don't you end up with ... a CISC?

aglew@ccvaxa.UUCP (10/15/86)

>> The final column gives some feel for how effective these
>> processor/co-processor (just processor for the T414)
>> combinations are at turning MFLOPS into usable floating
>> point performance.
>
>As one who has looked at the relative merits of various processors
>and coprocessors before making a selection, I am not at all concerned
>about "how effective [a] processor/co-processor combination[] [is] at
>turning MFLOPS into usable floating point performance."  The bottom
>line for an application is closely tied to the numbers in the "kWhets"
>column.  The real question is, "How fast will it run my application?"
>
>					Steve Rice

Well, not only that, but perhaps also "How much will it cost to run my
application?" Users don't care how effective an architecture is, they
only care what the final result is. People interested in design, however,
may be interested in the ratios, since a low performance product with
good figures of merit may show an approach that should be turned into
a high performance product.

jlg@lanl.ARPA (Jim Giles) (10/15/86)

In article <8184@sun.uucp> dgh@sun.UUCP writes:
>
>                            Mflops Per MHz
>
>                             David Hough
>			   dhough@sun.com
>
Mflops:(Millions of FLoating point OPerations per Second)
MHz: (Millions of cycles per second)

Therefore 'Mflops per MHz':(Millions^2 FLoating point OPeration cycles per
			    sec^2)

Sounds like an acceleration to me.  Must be a measure of how fast computer
speed is improving.  Still, the choice of units forces this number to
be small.      8-)

J. Giles
Los Alamos

mash@mips.UUCP (John Mashey) (10/16/86)

In article <8184@sun.uucp> dgh@sun.uucp (David Hough) writes:
>
>                            Mflops Per MHz
>     I'd like to add to John Mashey's recent posting about floating-
>point performance....
Thanks; as I'd said, the parentage of the numbers was suspect, so it's good
to see some I trust more.
>
>Machine         CPU Mhz FPU MHz  KW   KW/CPUMhz KW/FPUMHz
>
>Sun-3/160+68881 16.7    16.7     955     60      60
Oops, I'd thought you guys used 12.5Mhz 68881s at one point [but I checked
the current literature and it says no].  Has it changed recently?

> ....  Whetstone
>benchmark performance is mostly determined by the following code:
> (bunch of code) ...
>
>On Weitek 1164/1165-based systems, execution time for the P3 loop is
>dominated by the division operation...
>something far more interesting: the best way to improve Whetstone per-
>formance is to do enough global inter-procedural optimization in your
>compiler to determine that P3 only needs to be called once.  This
>gives a 2X performance increase with no hardware work at all! One MIPS
>paper suggests that the MIPS compiler does this or something similar.
Actually, that's an optional optimizing phase whose heuristics are
still being tuned: we didn't use it on this, and in fact, don't generally
use it on synthetic benchmarks at all: it's too destructive!
(There's nothing like seeing functions being grabbed in-line, discovering
that they don't do anything, and then just optimizing the whole thing away.
At least Whetstone computes and prints some numbers, so some real
work got done.  Nevertheless, David's comments are appropriate, i.e., we
share the same skepticism of Whetstone, as I'd noted in the original
posting).

>Maybe benchmark performance should be normalized for software as well
>as hardware technology.
True! Some interesting work on that line was done over at Stanford by
Fred Chow, who did a machine-independent optimizer with multiple back-ends
to be able to compare machines using the same compiler technology.  That's
probably the best way to factor it out.  The other interesting way is to
be able to turn optimizations on/off and see how much difference they make.
>
>     I've discussed benchmarking issues at length in the Floating-
>Point Programmer's Guide for the Sun Workstation, 3.2 Release, leading
Is this out yet? Sounds good.  Previous memos have been useful.
>to the recommendation that the nonlinear optimization and zero-finding
>that P3 is intended to mimic is better benchmarked by the real thing,
>such as the SPICE program.
Yes, although it would be awfully nice to have smaller hunks of it that
could be turned into reasonable-size benchmarks, especially ones that
could be simulated (in advance of CPU design) a little easier.
>
>     Linear problems are usually characterized by large dimension and
>therefore memory and bus performance is as important as peak
>floating-point performance; a Linpack benchmark with suitably-
>dimensioned arrays is appropriate.
Yes.
>
>     I don't know whether RISC or CISC designs will prove to give the
>most bang for the buck, but I do have some philosophical questions for
>RISC gurus:  Is hardware floating point faster than software floating
>point on RISC systems?  If so, and it is because the FPU technology is
>faster than the CPU, then why isn't the CPU fabricated with that tech-
>nology?  If it's just a matter of obtaining parallelism, then wouldn't
>two identical CPU's work just as well and be more flexible for non-
>floating-point applications? If there are functional units on the FPU
>that aren't on the CPU, should they be on the CPU so non-floating-
>point instructions can use them if desirable? If the CPU and FPU are
>one chip, cycle times should be slower, but would the reduced communi-
>cation overhead compensate?  If you use separate heterogeneous proces-
>sors, don't you end up with ... a CISC?

1) Is hardware FP faster?  Yes.
2) No, technology is the same, at least in our case.  I don't know what
other people do.
3) It's not just parallelism, but dedicating the right kind of hardware.
A 32-bit integer CPU has no particular reason to have the kinds of datapaths
an FPU needs.  There are functional units on the FPU, but they aren't ones
that help the CPU much (or they would have been on the CPU in the first place!)
4) Would reduced communication overhead compensate?  Probably not, at the
current state of technology that is generally available.  Right now, at least
in anything close to 2micron CMOS, if the FPU is part of the CPU chip, it
just has to be heavily microcoded.  It's only when chip shrinkage gets enough
that you can put the fastest FPU together with the CPU on 1 chip, and have
nothing better to put on that chip, that it's worth doing for performance.
(Note: there may be other reasons, or different price/performance aim points
for integrating them, but if you want FP performance, you must dedicate
significant silicon real-estate.)
5) Don't you end up with ... a CISC?  I'm not sure what this means.  RISC
means different things to different people.  What it usually means to us is:
	a) Design approach where hardware resources are concentrated on things
	that are performance-critical and universal.
	b) The belief that in making things fast, instructions and/or
	complex addressing formats drop out, NOT as a GOAL, but as a side-effect.
Thus, in our case, we designed a CPU that would go fast for integer performance,
and have a tightly coupled coprocessor interface that would let FP go fast also.
(Note: integer performance is universal, whereas FP is mostly bimodal:
people either don't care about it at all, or want as much as they can get.)
When you measure integer programs, you make choices to include or delete
features, according to the statistics seen in measuring substantial programs.
You do the same thing for FP-intensive programs.  Guess what! You discover
that FP Adds, Subtracts, Multiplies (and maybe Divides) are:
	a) Good Things
	b) Not simulatable by integer arithmetic very quickly.
However, suppose that we'd discovered that FP Divide happened so seldom
that it could be simulated in software at an adequate performance level,
and that taking that silicon and using it to make FP Mult faster gave better
overall performance.  In that case, we might have done it that way.

In any case, we don't see any conflict in having a RISC with FP,
(or decimal, or ...anything where some important class of application needs
hardware thrown at it and can justify the cost of having it.)
Seymour Cray has been doing fast machines for years with similar design
principles (if at a different cost point!) and FP has certainly been there.

Anyway, thanks for the additional data.  Also, I'd be happy to see more
discussion on what metrics are reasonable [especially since the original
posting invented "Whetstones/MHz" on the spur of the moment, and there
have been some interesting side discussions generated], both on:
	a) Are KWhets a good choice?
	b) What's a MHz?
As can be seen, this business is still clearly in need of benchmarks that:
	a) measure something real.
	b) measure something understandable.
	c) are small enough that they can be run and simulated in reasonable
	time.
	d) predict real performance of adequate-sized classes of programs.
	e) are used by enough people that you can do comparisons.
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{decvax,ucbvax,ihnp4}!decwrl!mips!mash, DDD:  	408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

jbuck@epimass.UUCP (Joe Buck) (10/16/86)

In article <8575@lanl.ARPA> jlg@a.UUCP (Jim Giles) writes:
>Mflops:(Millions of FLoating point OPerations per Second)
>MHz: (Millions of cycles per second)
>
>Therefore 'Mflops per MHz':(Millions^2 FLoating point OPeration cycles per
>			    sec^2)
>
>Sounds like an acceleration to me.  Must be a measure of how fast computer
>speed is improving.  Still, the choice of units forces this number to
>be small.      8-)

You multiplied instead of dividing.  If you had divided, you would
have found that the number measures floating point operations per
cycle.

-- 
- Joe Buck 	{hplabs,fortune}!oliveb!epimass!jbuck, nsc!csi!epimass!jbuck
  Entropic Processing, Inc., Cupertino, California

dennisg@fritz.UUCP (Dennis Griesser) (10/17/86)

In article <8184@sun.uucp> dgh@sun.UUCP writes:
>                            Mflops Per MHz

In article <8575@lanl.ARPA> jlg@a.UUCP (Jim Giles) writes:
>Mflops:(Millions of FLoating point OPerations per Second)
>MHz: (Millions of cycles per second)
>
>Therefore 'Mflops per MHz':(Millions^2 FLoating point OPeration cycles per
>			    sec^2)
>
>Sounds like an acceleration to me.  Must be a measure of how fast computer
>speed is improving.  Still, the choice of units forces this number to
>be small.

You are not factoring the units out correctly...

million x flop      second          flop
-------------- x --------------- = -----
    second       million x cycle   cycle
 
Sounds reasonable to me.

ags@h.cc.purdue.edu (Dave Seaman) (10/17/86)

In article <8575@lanl.ARPA> jlg@a.UUCP (Jim Giles) writes:
>In article <8184@sun.uucp> dgh@sun.UUCP writes:
>>
>>                            Mflops Per MHz
>>
>>                             David Hough
>>			   dhough@sun.com
>>
>Mflops:(Millions of FLoating point OPerations per Second)
>MHz: (Millions of cycles per second)
>
>Therefore 'Mflops per MHz':(Millions^2 FLoating point OPeration cycles per
>			    sec^2)

You didn't divide properly.  "Mflops per MHz" means "Mflops divided by MHz",
which, by the "invert the denominator and multiply" rule, comes out to

	"Floating point operations per cycle"

after cancelling the "millions of ... per second" from numerator and
denominator.

I'm not claiming that this is a particularly useful measure, but that's
what it means.
-- 
Dave Seaman	  					
ags@h.cc.purdue.edu

nather@ut-sally.UUCP (Ed Nather) (10/17/86)

When I first started messing with computers (longer ago than I like to
remember) I was discouraged to learn they could not handle numbers as
long as I sometimes needed.  Then I learned about floating point -- as
a way to get very large numbers into registers (and memory cells) of
limited length.  It sounded great until I learned that you give up
something when you do things that way --simple operations become much
more complex (and slower) using standard hardware.  Also, the aphorism
about using a lot of floating operations was brought home to me:
"Using floating point is like moving piles of sand around.  Every time
you move one you lose a little sand, and pick up a little dirt."

Has hardware technology progressed to the point where we might want to
consider making a VERY LARGE integer machine -- with integers long
enough so floating point operations would be unnecessary?  I'm not
sure how long they would have to be, but 512 bits sounds about right
to start with.  This would allow integers to have values up to about
10E150 or so, large enough for Avagadro's number or, with suitable
scaling, Planck's constant.  It would allow rapid integer operations
in place of floating point operations.  If you could add two 512-bit
integers in a couple of clock cycles, it should be pretty fast.

I guess this would be a somewhat different way of doing parallel
operations rather than serial ones.

Is this crazy?


-- 
Ed Nather
Astronomy Dept, U of Texas @ Austin
{allegra,ihnp4}!{noao,ut-sally}!utastro!nather
nather@astro.AS.UTEXAS.EDU

bobmon@iuvax.UUCP (Robert Montante) (10/17/86)

>>                            Mflops Per MHz
>>   [...]
>Mflops:(Millions of FLoating point OPerations per Second)
>MHz: (Millions of cycles per second)
>
>Therefore 'Mflops per MHz':(Millions^2 FLoating point OPeration cycles per
>			    sec^2)
>
>Sounds like an acceleration to me.  Must be a measure of how fast computer
>speed is improving.  Still, the choice of units forces this number to
>be small.      8-)

I get:     10e6  X  FLoating_point_OPerations / second
         -----------------------------------------------
           10e6  X  Cycles / second

which reduces to
                 FLoating_point_OPerations / Cycle

an apparent measure of instruction complexity.  But then, if you use a Floating
Point Accelerator, perhaps these interpretations are consistent.  8->

*-=-*-=-*-=-*-=-*-=-*-=-*-=-*-=-*-=-*-=-*-=-*-=-*-=-*-=-*-=-*-=-*-=-*-=-*-=-*

RAMontante
Computer Science				"Have you hugged ME today?"
Indiana University

henry@utzoo.UUCP (Henry Spencer) (10/17/86)

> ... marketing and management scientists ...
		    ^	       ^
Syntax error in above line:  incompatible concepts!
-- 
				Henry Spencer @ U of Toronto Zoology
				{allegra,ihnp4,decvax,pyramid}!utzoo!henry

ehj@mordor.ARPA (Eric H Jensen) (10/17/86)

In article <6028@ut-sally.UUCP> nather@ut-sally.UUCP (Ed Nather) writes:
>more complex (and slower) using standard hardware.  Also, the aphorism
>about using a lot of floating operations was brought home to me:
>"Using floating point is like moving piles of sand around.  Every time
>you move one you lose a little sand, and pick up a little dirt."

I thought numerical analysis was the plastic sheet you place your sand
on - with some thought (algorithm changes) you can control your
errors most of the time or at least understand them.  Then of course
there is always an Extended format ...

>Has hardware technology progressed to the point where we might want to
>consider making a VERY LARGE integer machine -- with integers long
>...
>scaling, Planck's constant.  It would allow rapid integer operations
>in place of floating point operations.  If you could add two 512-bit
>integers in a couple of clock cycles, it should be pretty fast.

I would not want to be the one to place and route the carry-lookahead
logic for a VERY fast 512 bit adder (you could avoid this by using the
tidbits approach but that has many other implications).  The real
killers would be multiply and divide.  If you really want large
integers use an efficient bignum package; hardware can help by
providing traps or micro-code support for overflow conditions.
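
To make "efficient bignum package" a little more concrete, here is a minimal
sketch (not from any particular package) of the digit loop a software add
reduces to.  The numbers are held as N limbs of four decimal digits each,
least significant limb first, so an ordinary INTEGER never overflows, and
the carry chain that is painful to build as 512-bit lookahead hardware is
just one variable here:

C       Add two N-limb non-negative integers, base 10000: C = A + B.
        SUBROUTINE BIGADD(A, B, C, N)
        INTEGER N, A(N), B(N), C(N)
        INTEGER I, T, CARRY
        CARRY = 0
        DO 10 I = 1, N
          T = A(I) + B(I) + CARRY
          CARRY = T / 10000
          C(I) = T - CARRY*10000
   10   CONTINUE
C       A nonzero CARRY left over at this point is the overflow condition
C       a hardware trap could report.
        RETURN
        END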

-- 
eric h. jensen        (S1 Project @ Lawrence Livermore National Laboratory)
Phone: (415) 423-0229 USMail: LLNL, P.O. Box 5503, L-276, Livermore, Ca., 94550
ARPA:  ehj@angband    UUCP:   ...!decvax!decwrl!mordor!angband!ehj

josh@polaris.UUCP (Josh Knight) (10/18/86)

In article <16112@mordor.ARPA> ehj@mordor.UUCP (Eric H Jensen) writes:
>In article <6028@ut-sally.UUCP> nather@ut-sally.UUCP (Ed Nather) writes:
>>more complex (and slower) using standard hardware.  Also, the aphorism
>>about using a lot of floating operations was brought home to me:
>>"Using floating point is like moving piles of sand around.  Every time
>>you move one you lose a little sand, and pick up a little dirt."
>
>I thought numerical analysis was the plastic sheet you place your sand
>on - with some thought (algorithm changes) you can control your
>errors most of the time or at least understand them.  Then of course
>there is always an Extended format ...
>

There are two realistic limits to the precision used for a particular
problem, assuming that the time to accomplish the calculation is not
an issue.  The first is the precision of the input data (in astronomy
this tends to be less than single precision, sometimes significantly
less) and the second is the number of intermediate values you are willing
to store in your calculation (i.e. memory).  The number of intermediate
values includes things like the fineness of the grid you use in an
approximation to a continuous problem formulation (although quantum
mechanics makes everything "grainy" at some point, numerical calculations
aren't usually done to that level).  As Eric points out, proper handling
of the calculations and some extended precision (beyond what is kept in long
term storage) will provide all the precision that is available with the
given resources.  Indeed, the proposal to use long integers is wasteful
of the very resource that is usually in short supply in these calculations,
namely memory (reference to all the "no virtual memory on MY Cray!" verbiage).
When Ed stores the mass of a 10 solar mass star in his simulation of the
evolution of an open cluster as a 512 bit integer, approximately 500 of
the bits are wasted on meaningless precision.  The mass of the sun is
of order 10e33 grams, but the precision to which we know the mass is only
five or six decimal digits (limited, I believe, by the precision of G, the
gravitational coupling constant, but the masses of stars other than the
sun are typically much more poorly known), thus storing this number in
a 512 bit integer wastes almost all the bits; only 15-20 of them mean
anything.

I'll admit (before I get flamed) that the IBM 370 floating point format
has some deficiencies when it comes to numerical calculations (truncating
arithmetic and hexadecimal normalization).  I will also disclaim that
I speak only for myself, not my employer.
-- 

	Josh Knight, IBM T.J. Watson Research
 josh@ibm.com, josh@yktvmh.bitnet,  ...!philabs!polaris!josh

bzs@bu-cs.BU.EDU (Barry Shein) (10/19/86)

From: nather@ut-sally.UUCP (Ed Nather)
>Has hardware technology progressed to the point where we might want to
>consider making a VERY LARGE integer machine -- with integers long
>enough so floating point operations would be unnecessary?

Why wouldn't the packed decimal formats of machines like the IBM/370
be sufficient for most uses (31 decimal digits+sign expressed as
nibbles, slightly more complicated size range for multiplication and
division operands, basic arithmetic operations supported.) That's not
a huge range but it's a lot larger than 32-bit binary. I believe it
was developed because a while ago people like the gov't noticed you
can't do anything with 32-bit regs and their budgets, and floating
point was unacceptable for many things. There are no packed decimal
registers so I assume the instructions are basically memory-bandwidth
limited (not unusual.)

The VAX seems to support operand lengths up to 16-bits (65k*2 digits?
I've never tried it.) There is some primitive support for this (ABCD,
SBCD) in the 68K.

	-Barry Shein, Boston University

josh@polaris.UUCP (Josh Knight) (10/19/86)

Sorry about all the typos in <753@polaris>.
-- 

	Josh Knight, IBM T.J. Watson Research
 josh@ibm.com, josh@yktvmh.bitnet,  ...!philabs!polaris!josh

stuart@BMS-AT.UUCP (Stuart D. Gathman) (10/21/86)

In article <6028@ut-sally.UUCP>, nather@ut-sally.UUCP (Ed Nather) writes:

> long as I sometimes needed.  Then I learned about floating point . . .

>        . . . .  It sounded great until I learned that you give up
> something when you do things that way --simple operations become much
> more complex (and slower) using standard hardware. . . .

For problems appropriate to floating point, the input is already
imprecise.  Planck's constant is not known to more than a dozen
digits at most.  Good floating point software keeps track of
the remaining precision as computations proceed.  Even if the 
results were computed precisely using rational arithmetic,
the results would be more imprecise than the input.  Rounding
in floating point hardware contributes only a minor portion of
the imprecision of the result in properly designed software.

For problems unsuited to floating point, e.g. accounting, yes the
floating point hardware gets in the way.  For accounting one should
use large integers: 48 bits is plenty in practice and no special hardware
is needed.  The 'BCD' baloney often advocated is just that.  Monetary
amounts in accounting are integers.  'BCD' is sometimes used so that
decimal fractions round correctly, but the correct method is to use
integers.

Rational arithmetic is another place for large integers.  Numbers
are represented as the quotient of two large integers.  This is where
special hardware might help.  Symbolic math often uses rational 
arithmetic, but the large integers should be variable length.  Numbers
such as '1' and '2' are far more common than 100 digit monsters.
-- 
Stuart D. Gathman	<..!seismo!{vrdxhq|dgis}!BMS-AT!stuart>

gnu@hoptoad.uucp (John Gilmore) (10/22/86)

In article <6028@ut-sally.UUCP> nather@ut-sally.UUCP (Ed Nather) writes:
>"Using floating point is like moving piles of sand around.  Every time
>you move one you lose a little sand, and pick up a little dirt."

IEEE floating point provides an "inexact" exception which can cause a trap any
time the result of an operation is not exact.  This lets your
software know that it has picked up dirt, if it cares, and lets
particularly smart software change to extended precision, long integers,
or whatever.

I was wondering how you represent values <1 in your 512-bit integers...
or are you going to figure out binary points on the fly?  In that case
you might as well let hardware do it -- that's called floating point!
-- 
John Gilmore  {sun,ptsfa,lll-crg,ihnp4}!hoptoad!gnu   jgilmore@lll-crg.arpa
(C) Copyright 1986 by John Gilmore.             May the Source be with you!

rb@cci632.UUCP (Rex Ballard) (10/22/86)

In article <3153@h.cc.purdue.edu> ags@h.cc.purdue.edu.UUCP (Dave Seaman) writes:
>In article <8575@lanl.ARPA> jlg@a.UUCP (Jim Giles) writes:
>>In article <8184@sun.uucp> dgh@sun.UUCP writes:
>>>                            Mflops Per MHz
>	"Floating point operations per cycle"
>
>after cancelling the "millions of ... per second" from numerator and
>denominator.
>
>I'm not claiming that this is a particularly useful measure, but that's
>what it means.

Sounds to me like what is really wanted is the average cycles/flop.
This does give an indication of the micro-code efficiency of the
architecture used, based on averages rather than advertised
figures.

Obviously the more cycles/flop, the less inherently efficient
the chip architecture is.  By figuring these values using
Whetstones or similar benchmarks, additional "off-chip" factors
such as set-up overhead for the FPU calls are correctly included.

This sounds like an interesting measure of CPU/FPU architecture
in the broader sense of the word.

>Dave Seaman	  					
>ags@h.cc.purdue.edu

kissell@garth.UUCP (10/23/86)

In article <725@mips.UUCP> mash@mips.UUCP (John Mashey) writes:
>                                                       Right now, at least
>in anything close to 2micron CMOS, if the FPU is part of the CPU chip, it
>just has to be heavily microcoded.

Oh?  What law of physics are we violating?  ;-)

Kevin D. Kissell
Fairchild Advanced Processor Division

peters@cubsvax.UUCP (Peter S. Shenkin) (10/23/86)

In article <BMS-AT.253> stuart@BMS-AT.UUCP (Stuart D. Gathman) writes:
>
>For problems appropriate to floating point, the input is already
>imprecise.  Planck's constant is not known to more than a dozen
>digits at most.  Good floating point software keeps track of
>the remaining precision as computations proceed.

???  I've never heard of this.  Could you say more?  Until you do, I will....
Read on.

>							...Rounding
>in floating point hardware contributes only a minor portion of
>the imprecision of the result in properly designed software.

I disagree.  Consider taking the average of many floating point numbers
which are read in from a file, and which differ greatly in magnitude.
How many there are to average may not be known until EOF is encountered.
The "obvious" way of doing this is to accumulate the sum, then divide
by n.  But if some numbers are very large, the very small ones will
fall off the low end of the dynamic range, even if there are a lot of
them;  this problem is avoided if one uses higher precision (double
or extended) for the sum.  If declaring things this way is what you mean by 
properly designed software, OK.  But the precision needed for intermediate 
values of a computation may greatly exceed that needed for input and
output variables.  I call this a rounding problem.  I know of no "floating
point software" that will get rid of this.  There are, of course, programming
techniques for handling it, some of which are very clever.  Again, I suppose
you could say that if you don't implement them then you're not using 
properly designed software.  But these techniques are time-consuming to
build into programs, and time-consuming to execute;  therefore, they
should only be used where they're really needed.  But the whole point is that
the precision needed for intermediate results may GREATLY exceed that needed 
for input and output variables, and an important part of numerical analysis
is being able to figure out where that is.
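
A minimal sketch of the higher-precision-sum fix described above: the data
stay REAL, only the running sum is DOUBLE PRECISION, so the small values no
longer vanish against a large partial sum.  Compensated (Kahan) summation is
a well-known example of the cleverer techniques of this sort, for when even
a wider accumulator is unavailable or insufficient.

C       Average of N reals, accumulated in double precision (assumes N >= 1).
        REAL FUNCTION AVG(X, N)
        INTEGER N, I
        REAL X(N)
        DOUBLE PRECISION S
        S = 0.0D0
        DO 10 I = 1, N
          S = S + DBLE(X(I))
   10   CONTINUE
        AVG = REAL(S / DBLE(N))
        RETURN
        END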

Peter S. Shenkin	 Columbia Univ. Biology Dept., NY, NY  10027
{philabs,rna}!cubsvax!peters		cubsvax!peters@columbia.ARPA

ludemann@ubc-cs.UUCP (10/24/86)

In article <253@BMS-AT.UUCP> stuart@BMS-AT.UUCP writes:
>For problems unsuited to floating point, e.g. accounting, yes the
>floating point hardware gets in the way.  For accounting one should
>use large integers: 48 bits is plenty in practice and no special hardware
>is needed.  

As someone who has done accounting using floating point, I wish to 
point out that 8-byte floating point has more 
precision than 15 digits of BCD.  Remembering that the exponent only 
takes a "few" bits, I'll happily use floating point any day instead 
of integers (even 48 bit integers).  Integers work fine for 
accounting as long as one is adding and subtracting but if one has to 
multiply (admittedly, not often), there's big trouble.  After quite a 
number of attempts to make things balance to the penny, I changed to 
floating point and all my problems vanished (the code ran faster, 
too).  

franka@mmintl.UUCP (Frank Adams) (10/28/86)

In article <253@BMS-AT.UUCP> stuart@BMS-AT.UUCP writes:
>For problems unsuited to floating point, e.g. accounting, yes the
>floating point hardware gets in the way.  For accounting one should
>use large integers: 48 bits is plenty in practice and no special hardware
>is needed.

48 bits is not always adequate.  One sometimes has to perform operations
of the form a*(b/c), rounded to the nearest penny (integer).  Doing this
with integer arithmetic requires intermediate results with double the
precision of the final results.  With floating point, this is not necessary.
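
A sketch of the floating point route Frank describes, with hypothetical names
(A an amount in pennies, B and C defining the ratio to apply); the all-integer
alternative (A*B)/C needs an intermediate roughly as wide as A and B put
together, which is why the double-width precision is required there:

C       An amount in pennies, times a ratio, rounded to the nearest penny.
        INTEGER FUNCTION SHARE(A, B, C)
        INTEGER A, B, C
        SHARE = NINT(DBLE(A) * (DBLE(B) / DBLE(C)))
        RETURN
        END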

Frank Adams                           ihnp4!philabs!pwa-b!mmintl!franka
Multimate International    52 Oakland Ave North    E. Hartford, CT 06108

dik@mcvax.uucp (Dik T. Winter) (11/08/86)

In article <570@cubsvax.UUCP> peters@cubsvax.UUCP (Peter S. Shenkin) writes:
>In article <BMS-AT.253> stuart@BMS-AT.UUCP (Stuart D. Gathman) writes:
>>
>>                 Good floating point software keeps track of
>>the remaining precision as computations proceed.
>
>???  I've never heard of this.  Could you say more?  Until you do, I will....
>Read on.
>
>>							...Rounding
>>in floating point hardware contributes only a minor portion of
>>the imprecision of the result in properly designed software.
>
>I disagree.  Consider taking the average of many floating point numbers
>which are read in from a file, and which differ greatly in magnitude.
>How many there are to average may not be known until EOF is encountered.
>The "obvious" way of doing this is to accumulate the sum, then divide
>by n.  But if some numbers are very large, the very small ones will
>fall off the low end of the dynamic range, even if there are a lot of
>them;  this problem is avoided if one uses higher precision (double
>or extended) for the sum.  If declaring things this way is what you mean by 
>properly designed software, OK.  But the precision needed for intermediate 
>values of a computation may greatly exceed that needed for input and
>output variables.  I call this a rounding problem.  I know of no "floating
>point software" that will get rid of this.
>
Well, there are at least three packages dealing with it: ACRITH from IBM
and ARITHMOS from Siemens (they are identical in fact) and a language called
PASCAL-SC on a KWS workstation (a bit obscure I am sure).  They are based
on the work by Kulisch et al. from the University of Karlsruhe.  They
use arithmetic with directed rounding and accumulation of dot products
in long registers (168 bytes on IBM).  On IBM there is microcode support
for this on the 4341 (or 4381 or 43?? or some such beast).

The main purpose is verification of results (at least, that is my opinion).
For instance on a set of linear equations find a solution interval that
contains the true solution with the constraint that the interval is as
small as possible.  They then first find an approximate solution
using standard techniques, followed by an iterative scheme to obtain the
smallest interval using interval arithmetic combined with long registers.
This is superior to standard interval arithmetic because the latter tends
to give much too large intervals.
-- 
dik t. winter, cwi, amsterdam, nederland
UUCP: {seismo,decvax,philabs,okstate,garfield}!mcvax!dik
  or: dik@mcvax.uucp
ARPA: dik%mcvax.uucp@seismo.css.gov