ian@loral.UUCP (Ian Kaplan) (09/25/86)
ABSTRACT

The scalar performance of microprocessors is increasing much faster than
the floating point performance of floating point coprocessors.  Floating
point performance is essential for many Fortran applications.  The
purpose of this note is to encourage microprocessor designers to
concentrate more on floating point.

The Current State of the World
------------------------------

The "big three" microprocessor manufacturers (Intel, Motorola and
National Semiconductor) have all announced 32-bit microprocessors.
These chips all have high scalar performance (at least for
microprocessors).  Unfortunately the speed of the floating point
coprocessors has not kept up.  Below are some approximate speeds for the
various math coprocessors.  The performance listed is "peak"
performance, which assumes that the operands are already in the floating
point registers.

     Intel 80287    less than 0.1 MFLOP
     National       0.1 MFLOP
     Motorola       0.3 MFLOP

Naturally these speeds will vary, depending on clock speed, but they are
not far off.  None of these coprocessors has performance that is close
to 1 MFLOP.

National has a chip (the 32310) that will allow the 32032 to be
interfaced with the Weitek floating point ALU and multiply unit.  This
yields 0.8 MFLOPS with math error checking enabled and 1.2 MFLOPS with
it disabled.  Unfortunately the 32310 solution not only entails a
significant increase in component price (a 32310, two Weitek chips and
support logic), but also consumes a lot of board space and power.
Whether the increase in floating point performance justifies these
increased costs is arguable.  I would like to get more MFLOPS for my
bucks.

We Need More MFLOPS
-------------------

At one time (three years ago) 0.1 MFLOP was considered pretty good
floating point performance.  This is no longer true.  Floating point
performance tends to be measured now against the Weitek chips (or AMD or
Analog Devices, etc.) used in bit-slice applications (e.g., 10 MFLOPS).

The floating point performance available with the Weitek chips would be
wasted on the current generation of microprocessors.  Even the fastest
microprocessors currently available could not feed floating point
operands to a coprocessor fast enough to keep up with the 10 MFLOP rate.
What is needed is a coprocessor (or an integrated floating point
processor) that will yield 1 to 2 MFLOPS.  So far the only
microprocessor that comes close to delivering 1 MFLOP of floating point
performance is the Fairchild Clipper.  The Clipper has an on-chip
floating point unit rated at about 1 MFLOP.

Fortran, despite its many problems, is one of the most widely used
programming languages.  Despite the hopes of many computer scientists,
Fortran will never disappear (there will just be new versions of
Fortran).  The increasing power of microprocessors means that more
Fortran applications will be run on microprocessor based systems.  Many
Fortran applications are floating point bound, so the increased scalar
performance provided by the current generation of 32-bit microprocessors
will only speed them up fractionally.  For these applications floating
point performance will be the deciding factor.  The microprocessor
vendor that realizes the importance of floating point performance, and
develops a solution, will capture a large number of designs where
Fortran performance is important.

              Ian Kaplan
              Loral Dataflow Group
              Loral Instrumentation
    USENET:   {ucbvax,decvax,ihnp4}!sdcsvax!loral!ian
    ARPA:     sdcc6!loral!ian@UCSD
    USPS:     8401 Aero Dr. San Diego, CA 92123
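(For scale: at the 8-10 MHz clocks of these parts, 0.1 MFLOP corresponds
to something like 80-100 processor cycles per floating point operation,
while 1 MFLOP would mean on the order of 10.)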
henkp@nikhefk.uucp (Henk Peek) (10/05/86)
In article <340@euroies.UUCP> shepherd@euroies.UUCP (Roger Shepherd) writes:
>I was interested to read Ian Kaplan's (...!loral!ian)
>
>For example, Ian quotes some performance figures as
>
>    Intel 80287      < 0.1 MFLOP (say 0.95 MFLOP at 8Mhz?)
>    National           0.1 MFLOP
>    Motorola           0.3 MFLOP
>
>To these I'll add the figure for the INMOS transputer (no
>co-processor, floating point done in software)
>
>    Inmos IMS T414-20   0.09 MFLOP (typical for * and /,
>                                    + and - are slower!)
>Roger Shepherd
>INMOS Limited, 1000 Aztec West, Almondsbury, Bristol, BS12 4SQ, GB
>USENET: ...!euroies!shepherd
>PHONE: +44 454 616616

At the Inmos seminar in the Netherlands, Inmos claimed a floating-point
performance of about 1 Mflop for the T800.  It is a T414 with microcoded
floating point on a single chip.  The chip is pin compatible with the
T414-20.

We haven't received the handout, so I don't have the exact numbers.
They claimed that you can get samples within a few weeks.

Has someone worked with this chip?

uucp:  henkp@nikhefk@mcvax    seismo!mcvax!nikhefk!henkp
mail:  Henk Peek, NIKHEF-K, PB 4395, 1009 AJ Amsterdam, Holland
shepherd@euroies.UUCP (Roger Shepherd) (10/08/86)
I was interested to read Ian Kaplan's (...!loral!ian) appeal for
microprocessors with fast floating point.  I am a little concerned with
the use of `peak' performance to characterise the speed of a part, as I
don't think that this necessarily reflects the USABLE performance of the
part.  I think it is instructive to look at MFLOPs compared with
Whetstones (a good benchmark of performance on `typical' scientific
programs).

For example, Ian quotes some performance figures as

    Intel 80287      < 0.1 MFLOP (say 0.95 MFLOP at 8Mhz?)
    National           0.1 MFLOP
    Motorola           0.3 MFLOP

To these I'll add the figure for the INMOS transputer (no co-processor,
floating point done in software)

    Inmos IMS T414-20   0.09 MFLOP (typical for * and /,
                                    + and - are slower!)

According to the figures I have to hand, these processors compare
somewhat differently when the Whetstone figures are compared.  For
example, I have single length Whetstone figures as follows for these
machines

                                   kWhets  MWhets/MFLOP
                                                  (normalised)
    Intel 80286/80287 (8 Mhz)        300     3.2     1.0
    NS 32032 & 32081 (10 Mhz)        128     1.3     0.4
    MC 68020 & 68881 (16 & 12.5)     755     2.5     0.8

    Inmos IMS T414B-20               663     7.4     2.3

The final column gives some feel for how effective these
processor/co-processor (just processor for the T414) combinations are at
turning MFLOPS into usable floating point performance.

Also, I don't quite know why Ian likes the CLIPPER (three chips on the
picture of the (large) module I've seen) but dislikes the NS 32310 (four
chips); they seem to give the same MFLOP rating.  (Does anyone have
Whetstone figures for these two?)

Comparisons against Weiteks (or whatever) are also somewhat suspect.  To
use their peak data rate you have to use them in pipelined mode; their
scalar mode tends to be somewhat slower, and it might be possible to
build a microprocessor system that could feed them data and accept
results at that rate.  However, if you're only using the chips in that
mode, I'm not convinced that you really want all that silicon to be
taken up with a large pipelineable (?) multiplier; I'd rather have a
processor there!

On the same subject (sort of), what measure should be made of the
`goodness' of a floating point micro-processor?  How about MWhetstones
per square centi-metre?  (Or do all you guys and girls still use inches?
:-) )  Or, how about MWhetstones per milliwatt?
-- 
Roger Shepherd
INMOS Limited, 1000 Aztec West, Almondsbury, Bristol, BS12 4SQ, GB
USENET: ...!euroies!shepherd
PHONE: +44 454 616616
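(For the record, the middle column is just the Whetstone rate divided by
the quoted peak MFLOP figure, taking those peaks at face value:
0.300 MWhets / 0.095 MFLOP is about 3.2 for the 80286/80287, and
0.663 MWhets / 0.09 MFLOP about 7.4 for the T414-20; the last column
then normalises to the Intel value, e.g. 7.4 / 3.2 is about 2.3.)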
ian@loral.UUCP (Ian Kaplan) (10/08/86)
In article <340@euroies.UUCP> shepherd@euroies.UUCP (Roger Shepherd) writes:
>
>                                   kWhets  MWhets/MFLOP
>                                                  (normalised)
>    Intel 80286/80287 (8 Mhz)        300     3.2     1.0
>    NS 32032 & 32081 (10 Mhz)        128     1.3     0.4
>    MC 68020 & 68881 (16 & 12.5)     755     2.5     0.8
>
>    Inmos IMS T414B-20               663     7.4     2.3
>

These figures are interesting.  I am surprised at the figure for the
80287.  Intel uses this processor on the Intel Cube, which has very poor
floating point performance.  The above table suggests that reasonable
floating point performance could be achieved by increasing the clock
rate.  It is not clear to me that this is borne out by reality.  Does
anyone have floating point performance numbers for a 12.5 MHz 80287?

>
>Also, I don't quite know why Ian likes the CLIPPER (three
>chips on the picture of the (large) module I've seen) but
>dislikes the NS 32310 (four chips); they seem to give the
>same MFLOP rating.  (Does anyone have Whetstone figures for
>these two?)
>

The Whetstone figure for the 32310 is 1.137 MWhets and 0.8 MFLOP.  I
like the Clipper because I have the impression that it uses less board
space and power than a 32032, a 32310 and two Weitek chips.  There are
other considerations that must be taken into account also, like the
history of a product line.

>
>-- 
>Roger Shepherd
>INMOS Limited, 1000 Aztec West, Almondsbury, Bristol, BS12 4SQ, GB
>USENET: ...!euroies!shepherd
>PHONE: +44 454 616616

              Ian Kaplan
              Loral Dataflow Group
              Loral Instrumentation
    USENET:   {ucbvax,decvax,ihnp4}!sdcsvax!loral!ian
    ARPA:     sdcc6!loral!ian@UCSD
    USPS:     8401 Aero Dr. San Diego, CA 92123
curry@nsc.UUCP (Ray Curry) (10/08/86)
>Path: nsc!pyramid!decwrl!decvax!ucbvax!ucbcad!nike!lll-crg!seismo!mcvax!euroies!shepherd
>From: shepherd@euroies.UUCP (Roger Shepherd)
>Newsgroups: net.arch
>Subject: Floating point performance
>Message-ID: <340@euroies.UUCP>
>dislikes the NS 32310 (four chips); they seem to give the
>same MFLOP rating.  (Does anyone have Whetstone figures for
>these two?)
>Comparisons against Weiteks (or whatever) are also somewhat
>suspect.  To use their peak data rate you have to use them in
>pipelined mode, their scalar mode tends to be somewhat slower
>-- 
>Roger Shepherd
>INMOS Limited, 1000 Aztec West, Almondsbury, Bristol, BS12 4SQ, GB
>USENET: ...!euroies!shepherd
>PHONE: +44 454 616616

Just by coincidence, I have been running some floating point benchmarks
on the NS32081 floating point processor and thought I needed to respond
with some more up-to-date numbers.  I ran the single precision Whetstone
on the NS32032 and NS32081 at 10 MHz on the DB32000 board, and the
NS32332 and NS32081 at 15 MHz on the DB332 board.  I don't know where
the posted 32032-32081 number came from, but I measure better even using
our older compilers.  Our new compilers show marked improvement.

    32032-32081 (10 MHz)    189 Kwhets   (old compiler)
    32032-32081 (10 MHz)    390 Kwhets   (new compiler)
    32332-32081 (15 MHz)    728 Kwhets   (new compiler)

I used the 32332-32081 numbers to generate instruction counts to project
worst case performance for the NS32310 and the NS32381, worst case being
using the identical math routines and minimizing the pipelining of the
32310.  These project performance for the 32332-32381 (15 MHz) at
approximately 1100-1200 KWhets and the 32332-32310 (15 MHz) at 1500-1600
KWhets.  Since both the 32310 and 32381 will have new instructions that
will impact the math libraries, the real performance could be higher.
Just for interest, preliminary analysis is saying pipelining should
improve performance at least 15% overall (30% for the floating point
portion of the instruction mix).

I would like to add my own question about the value of benchmarks: what
do the people on the net feel about transcendental functions?  The
Whetstone seems to me to place more emphasis on them than real life.
One of the reasons for not including them directly in the 32081 was that
it was felt that implementing them in math routines instead of hardware
was more cost effective.  Is this true, or are transcendentals important
enough for the increased cost of implementing them in hardware?
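To put the math-routine side of Curry's question in concrete terms, here
is a toy single-precision sine written the way a library routine might
be in outline: reduce the argument, then evaluate a short polynomial.
The name TSIN and the coding are mine, the polynomial is just a
truncated Taylor series good to roughly five or six decimal digits on
the reduced range, and a real library routine would use a minimax
polynomial and far more careful argument reduction.

      REAL FUNCTION TSIN(X)
C     Toy software sine: reduce X modulo pi, then evaluate a short
C     Taylor polynomial on [-pi/2, pi/2].  Illustrative only.
      REAL X, Y, Y2, P, S, PI
      INTEGER K
      PARAMETER (PI = 3.1415927)
C     reduce modulo pi, remembering the sign flip for odd multiples
      K = NINT(X / PI)
      Y = X - REAL(K) * PI
      Y2 = Y * Y
C     sin y = y(1 - y**2/6(1 - y**2/20(1 - y**2/42(1 - y**2/72))))
      P = 1.0 - Y2 / 42.0 * (1.0 - Y2 / 72.0)
      S = Y * (1.0 - Y2 / 6.0 * (1.0 - Y2 / 20.0 * P))
      IF (MOD(K, 2) .NE. 0) S = -S
      TSIN = S
      RETURN
      END

The trade-off Curry raises is whether cycles spent in routines like this
are a better buy than the silicon it would take to do the same job in
the coprocessor.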
srm@iris.berkeley.edu (Richard Mateosian) (10/09/86)
Just to set the record straight, at 10 MHz the NS32032/32081 achieves
about 400 kWhets/sec.  The NS32016/32081 gives around 300.  These
figures are from memory and may be off a little, but they are vastly
better than the 128 kWhets/sec cited in the referenced articles.

Richard Mateosian    ...ucbvax!ucbiris!srm
2919 Forest Avenue   415/540-7745           srm%ucbiris@Berkeley.EDU
Berkeley, CA  94705
henry@utzoo.UUCP (Henry Spencer) (10/10/86)
> ...what do the people on the net feel about transcendental functions?
> The Whetstone seems to me to place more emphasis on them than real life.
> One of the reasons for not including them directly in the 32081 was that
> it was felt that implementing them in math routines instead of hardware
> was more cost effective.  Is this true or are transcendentals important
> enough for the increased cost of implementing them in hardware?

Personally, while I strongly suspect that a software implementation is
more cost-effective than doing them in hardware, putting them on-chip
strikes me as a marvellous way of getting them right once and for all
and encouraging everyone to use the done-right version.  (This does
assume, of course, that the chip-maker spends the necessary money to
*get* them right, which requires high-paid specialists and a lot of
work.)  One could get much the same effect with a bare-bones arithmetic
chip and a ROM chip containing the math routines, except that ROMs are
too easy to copy and you'd never recover the investment needed to do a
good job.
-- 
				Henry Spencer @ U of Toronto Zoology
				{allegra,ihnp4,decvax,pyramid}!utzoo!henry
stever@videovax.UUCP (10/11/86)
In article <340@euroies.UUCP>, Roger Shepherd (shepherd@euroies.UUCP)
writes:

> I was interested to read Ian Kaplan's (...!loral!ian) appeal
> for microprocessors with fast floating point. . . .

> For example, Ian quotes some performance figures as
>
>     Intel 80287      < 0.1 MFLOP (say 0.95 MFLOP at 8Mhz?)
>     National           0.1 MFLOP
>     Motorola           0.3 MFLOP

Shouldn't the figure in parentheses for the 80287 be 0.095 MFLOP??

> To these I'll add the figure for the INMOS transputer (no
> co-processor, floating point done in software)
>
>     Inmos IMS T414-20   0.09 MFLOP (typical for * and /,
>                                     + and - are slower!)
>
> According to the figures I have to hand, these processors
> compare somewhat differently when the Whetstone figures are
> compared.  For example, I have single length Whetstone
> figures as follows for these machines
>
>                                    kWhets  MWhets/MFLOP
>                                                   (normalised)
>     Intel 80286/80287 (8 Mhz)        300     3.2     1.0
>     NS 32032 & 32081 (10 Mhz)        128     1.3     0.4
>     MC 68020 & 68881 (16 & 12.5)     755     2.5     0.8
>
>     Inmos IMS T414B-20               663     7.4     2.3
>
> The final column gives some feel for how effective these
> processor/co-processor (just processor for the T414)
> combinations are at turning MFLOPS into usable floating
> point performance.

As one who has looked at the relative merits of various processors and
coprocessors before making a selection, I am not at all concerned about
"how effective [a] processor/co-processor combination[] [is] at turning
MFLOPS into usable floating point performance."  The bottom line for an
application is closely tied to the numbers in the "kWhets" column.  The
real question is, "How fast will it run my application?"

					Steve Rice
----------------------------------------------------------------------------
{decvax | hplabs | ihnp4 | uw-beaver}!tektronix!videovax!stever
mash@mips.UUCP (John Mashey) (10/13/86)
In article <1989@videovax.UUCP> stever@videovax.UUCP (Steven E. Rice) writes:
>In article <340@euroies.UUCP>, Roger Shepherd (shepherd@euroies.UUCP)
>writes:
>> .....  For example, I have single length Whetstone
>> figures as follows for these machines
>>
>>                                    kWhets  MWhets/MFLOP
>>                                                   (normalised)
>>     Intel 80286/80287 (8 Mhz)        300     3.2     1.0
>>     NS 32032 & 32081 (10 Mhz)        128     1.3     0.4
>>     MC 68020 & 68881 (16 & 12.5)     755     2.5     0.8
>>
>>     Inmos IMS T414B-20               663     7.4     2.3
>>
>> The final column gives some feel for how effective these
>> processor/co-processor (just processor for the T414)
>> combinations are at turning MFLOPS into usable floating
>> point performance.
>
>As one who has looked at the relative merits of various processors
>and coprocessors before making a selection, I am not at all concerned
>about "how effective [a] processor/co-processor combination[] [is] at
>turning MFLOPS into usable floating point performance."  The bottom
>line for an application is closely tied to the numbers in the "kWhets"
>column.  The real question is, "How fast will it run my application?"
--THE RIGHT QUESTION!-----------^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Note that this discussion is very akin to the "peak Mips" versus
"sustained Mips" versus "how fast does it run real programs" argument on
the integer side of the world.  I think both Roger and Steven have some
useful points, and, in fact, don't seem to me to disagree very much:

1) (Roger): MFLOPS don't mean very much.  (see (1) below, etc)
2) (Steven): and neither do Whetstones!
3) (Roger): propose Whetstones / (peak MFLOPS) as an architectural
   measure.

Note that most vendors spec MFLOPS using cached, back-to-back adds with
both arguments already in registers.  For real programs, one also needs
to measure the effects of:
  a) coprocessor interaction, i.e., can you load/store directly to the
     coprocessor from memory, or do you need to copy arguments thru the
     CPU?  (can make a large difference).
  b) Pipelining/overlap effects?
  c) Number of FP registers.
  d) Compiler effects.

(1) In general, peak MFlops don't seem to mean too much.  Whetstones
seem to test the FP libraries more than anything else (although this at
least measures SOMETHING a bit more real).

(2) A lot of people like LINPACK MFLops ratings, or Livermore Loops,
although the former, at least, also measures the memory system very
strongly, i.e., it's bigger than almost any cache, and that's quite
characteristic of some codes, and totally uncharacteristic of others.

(3) However, a useful attribute of Roger's measure (or a variant
thereof) is that looking at the measure (units of real performance) per
Mhz gives you some idea of architectural efficiency, i.e., fewer cycles
per unit of delivered performance is better, in that (cycle time) is
likely to be a property of the technology, and hard to improve, at a
given level of technology.  [This is clearly a RISC-style argument of
reducing the cycle count for delivered performance, and then letting
technology carry you forward.]

Using the numbers above, one gets KiloWhets / Mhz, for example:

Machine        Mhz    KWhet   KWhet/Mhz
80287            8     300       40
32332-32081     15     728       50     (these from Ray Curry,
32332-32381     15    1200       80      in <3833@nsc.UUCP>)  (projected)
32332-32310     15    1600      100*     ""  ""               (projected)
Clipper?        33    1200?      40     guess? anybody know better #?
68881           12.5   755       60     (from discussion)
68881           20    1240       60     claimed by Moto, in SUN3-260
SUN FPA         16.6  1700      100*    DP (from Hough) (in SUN3-160)
MIPS R2360       8    1160      140*    DP (interim, with restrictions)
MIPS R2010       8    4500      560     DP (simulated)

The *'d ones are boards / controllers for Weitek parts.

The Kwhet/Mhz numbers were heavily rounded: 1-2 digits of accuracy is
about all you can extract from this, at best.  One can argue about the
speed that should be used for the 68881 systems, since the associated
68020 runs faster.  What you do see is (not surprisingly) that heavily
microcoded designs get less Kwhet/Mhz than those that use either the
Weitek parts or are not microcoded.

As usual, whether you think this means anything or not depends on
whether or not you think Whetstones are a good measure.  If not, it
would help to see other things proposed.  For some reason, floating
point benchmarks seem to vary pretty strongly in their behavioral
patterns.  Also, if anybody has better numbers, it would be nice to see
them.  At least some of the ones in the list above are of uncertain
parentage.
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{decvax,ucbvax,ihnp4}!decwrl!mips!mash, DDD:  	408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
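(The last column is just the Whetstone rate divided by the clock: 755
kWhets at 12.5 MHz is about 60 kWhets/MHz, the 8 MHz 80287 at 300 kWhets
about 40, and the simulated 8 MHz R2010 at 4500 kWhets about 560, before
the rounding Mashey mentions.)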
dgh@sun.UUCP (10/15/86)
                              Mflops Per MHz

                               David Hough
                              dhough@sun.com

I'd like to add to John Mashey's recent posting about floating-point
performance.  In the following table extracted and revised from that
posting, the Sun-3 measurements are mine; the MIPS numbers are Mashey's.
All KW results indicate thousands of double precision Whetstone
instructions per second.  Results marked * represent implementations
based on Weitek chips.  As Mashey points out, it's not clear whether the
MHz should refer to the CPU or FPU, so I included both.

Machine           CPU Mhz  FPU MHz    KW   KW/CPUMhz  KW/FPUMHz

Sun-3/160+68881     16.7     16.7     955      60         60
Sun-3/160+68881     25       20      1240      50         60
Sun-3/160+FPA*      16.7     16.7    1840     100        100
Sun-3/260+FPA*      25       16.7    2600     100        160
MIPS R2360*          8        8      1160     140        140  (interim restrictions)
MIPS R2010           8        8      4500     560        560  (simulated)

As you puzzle over the meaning of these results, remember that
elementary transcendental function routines have minor effect on
Whetstone performance when the hardware is high-performance.  Whetstone
benchmark performance is mostly determined by the following code:

      DO 90 I=1,N8
      CALL P3(X,Y,Z)
   90 CONTINUE

      SUBROUTINE P3(X,Y,Z)
      IMPLICIT REAL (A-H,O-Z)
      COMMON T,T1,T2,E1(4),J,K,L
      X1 = X
      Y1 = Y
      X1 = T * (X1 + Y1)
      Y1 = T * (X1 + Y1)
      Z = (X1 + Y1) / T2
      RETURN
      END

On Weitek 1164/1165-based systems, execution time for the P3 loop is
dominated by the division operation, which is about 6 times slower than
an addition or multiplication and can't be overlapped with any other
operation, inhibiting pipelining.  Furthermore, not only can no 1164
operation overlap any 1165 operation, but parallel invocation of P3
calls can't be justified without doing enough analysis to discover
something far more interesting: the best way to improve Whetstone
performance is to do enough global inter-procedural optimization in your
compiler to determine that P3 only needs to be called once.  This gives
a 2X performance increase with no hardware work at all!  One MIPS paper
suggests that the MIPS compiler does this or something similar.  Maybe
benchmark performance should be normalized for software as well as
hardware technology.

I've discussed benchmarking issues at length in the Floating-Point
Programmer's Guide for the Sun Workstation, 3.2 Release, leading to the
recommendation that the nonlinear optimization and zero-finding that P3
is intended to mimic is better benchmarked by the real thing, such as
the SPICE program.  Of course, SPICE is a complicated real application
and its performance is difficult to predict in advance, and that makes
marketing and management scientists everywhere uneasy.

Linear problems are usually characterized by large dimension and
therefore memory and bus performance is as important as peak
floating-point performance; a Linpack benchmark with suitably
dimensioned arrays is appropriate.

I don't know whether RISC or CISC designs will prove to give the most
bang for the buck, but I do have some philosophical questions for RISC
gurus: Is hardware floating point faster than software floating point on
RISC systems?  If so, and it is because the FPU technology is faster
than the CPU, then why isn't the CPU fabricated with that technology?
If it's just a matter of obtaining parallelism, then wouldn't two
identical CPUs work just as well and be more flexible for
non-floating-point applications?  If there are functional units on the
FPU that aren't on the CPU, should they be on the CPU so
non-floating-point instructions can use them if desirable?  If the CPU
and FPU are one chip, cycle times should be slower, but would the
reduced communication overhead compensate?  If you use separate
heterogeneous processors, don't you end up with ... a CISC?
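To make the optimization Hough describes concrete, here is the
hand-derived equivalent of the Whetstone loop above; it is not the
output of any actual compiler, and the subroutine name P3ONCE is
invented.  Since P3 reads only X, Y and the COMMON data and writes only
Z, the N8 identical calls collapse to a single evaluation.

      SUBROUTINE P3ONCE(X, Y, Z)
C     What the N8 calls to P3 in the loop above reduce to once P3 is
C     inlined and the loop body is seen to be idempotent: X and Y are
C     never modified, so Z comes out the same every iteration.
      IMPLICIT REAL (A-H, O-Z)
      COMMON T, T1, T2, E1(4), J, K, L
      X1 = T * (X + Y)
      Y1 = T * (X1 + Y)
      Z = (X1 + Y1) / T2
      RETURN
      END

Whether a benchmark that can be optimized away like this measures
anything useful is, of course, exactly the doubt both Hough and Mashey
raise.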
aglew@ccvaxa.UUCP (10/15/86)
>> The final column gives some feel for how effective these
>> processor/co-processor (just processor for the T414)
>> combinations are at turning MFLOPS into usable floating
>> point performance.
>
>As one who has looked at the relative merits of various processors
>and coprocessors before making a selection, I am not at all concerned
>about "how effective [a] processor/co-processor combination[] [is] at
>turning MFLOPS into usable floating point performance."  The bottom
>line for an application is closely tied to the numbers in the "kWhets"
>column.  The real question is, "How fast will it run my application?"
>
>	Steve Rice

Well, not only that, but perhaps also "How much will it cost to run my
application?"  Users don't care how effective an architecture is, they
only care what the final result is.  People interested in design,
however, may be interested in the ratios, since a low performance
product with good figures of merit may show an approach that should be
turned into a high performance product.
jlg@lanl.ARPA (Jim Giles) (10/15/86)
In article <8184@sun.uucp> dgh@sun.UUCP writes:
>
>                         Mflops Per MHz
>
>                          David Hough
>                         dhough@sun.com
>

Mflops: (Millions of FLoating point OPerations per Second)
MHz:    (Millions of cycles per second)

Therefore 'Mflops per MHz': (Millions^2 FLoating point OPeration cycles
per sec^2)

Sounds like an acceleration to me.  Must be a measure of how fast
computer speed is improving.  Still, the choice of units forces this
number to be small.  8-)

J. Giles
Los Alamos
mash@mips.UUCP (John Mashey) (10/16/86)
In article <8184@sun.uucp> dgh@sun.uucp (David Hough) writes:
>
>                         Mflops Per MHz
>
> I'd like to add to John Mashey's recent posting about floating-point
>performance....

Thanks; as I'd said, the parentage of the numbers was suspect, so it's
good to see some I trust more.

>
>Machine           CPU Mhz  FPU MHz    KW   KW/CPUMhz  KW/FPUMHz
>
>Sun-3/160+68881     16.7     16.7     955      60         60

Oops, I'd thought you guys used 12.5 Mhz 68881s at one point [but I
checked the current literature and it says no.  Has it changed
recently?]

> ....  Whetstone
>benchmark performance is mostly determined by the following code:
>	(bunch of code) ...
>
>On Weitek 1164/1165-based systems, execution time for the P3 loop is
>dominated by the division operation...
>something far more interesting: the best way to improve Whetstone
>performance is to do enough global inter-procedural optimization in your
>compiler to determine that P3 only needs to be called once.  This
>gives a 2X performance increase with no hardware work at all!  One MIPS
>paper suggests that the MIPS compiler does this or something similar.

Actually, that's an optional optimizing phase whose heuristics are still
being tuned: we didn't use it on this, and in fact, don't generally use
them on synthetic benchmarks at all: it's too destructive!  (There's
nothing like seeing functions being grabbed in-line, discovering that
they don't do anything, and then just optimizing the whole thing away.
At least Whetstone computes and prints some numbers, so some real work
got done.  Nevertheless, David's comments are appropriate, i.e., we
share the same skepticism of Whetstone, as I'd noted in the original
posting).

>Maybe benchmark performance should be normalized for software as well
>as hardware technology.

True!  Some interesting work on that line was done over at Stanford by
Fred Chow, who did a machine-independent optimizer with multiple
back-ends to be able to compare machines using the same compiler
technology.  That's probably the best way to factor it out.  The other
interesting way is to be able to turn optimizations on/off and see how
much difference they make.

>
> I've discussed benchmarking issues at length in the Floating-
>Point Programmer's Guide for the Sun Workstation, 3.2 Release, leading

Is this out yet?  Sounds good.  Previous memos have been useful.

>to the recommendation that the nonlinear optimization and zero-finding
>that P3 is intended to mimic is better benchmarked by the real thing,
>such as the SPICE program.

Yes, although it would be awfully nice to have smaller hunks of it that
could be turned into reasonable-size benchmarks, especially ones that
could be simulated (in advance of CPU design) a little easier.

>
> Linear problems are usually characterized by large dimension and
>therefore memory and bus performance is as important as peak
>floating-point performance; a Linpack benchmark with suitably
>dimensioned arrays is appropriate.

Yes.

>
> I don't know whether RISC or CISC designs will prove to give the
>most bang for the buck, but I do have some philosophical questions for
>RISC gurus: Is hardware floating point faster than software floating
>point on RISC systems?  If so, and it is because the FPU technology is
>faster than the CPU, then why isn't the CPU fabricated with that
>technology?  If it's just a matter of obtaining parallelism, then
>wouldn't two identical CPUs work just as well and be more flexible for
>non-floating-point applications?  If there are functional units on the
>FPU that aren't on the CPU, should they be on the CPU so
>non-floating-point instructions can use them if desirable?  If the CPU
>and FPU are one chip, cycle times should be slower, but would the
>reduced communication overhead compensate?  If you use separate
>heterogeneous processors, don't you end up with ... a CISC?

1) Is hardware FP faster?  Yes.

2) No, the technology is the same, at least in our case.  I don't know
what other people do.

3) It's not just parallelism, but dedicating the right kind of hardware.
A 32-bit integer CPU has no particular reason to have the kinds of
datapaths an FPU needs.  There are functional units on the FPU, but they
aren't ones that help the CPU much (or they would have been on the CPU
in the first place!)

4) Would reduced communication overhead compensate?  Probably not, at
the current state of technology that is generally available.  Right now,
at least in anything close to 2 micron CMOS, if the FPU is part of the
CPU chip, it just has to be heavily microcoded.  It's only when chip
shrinkage gets enough that you can put the fastest FPU together with the
CPU on 1 chip, and have nothing better to put on that chip, that it's
worth doing for performance.  (Note: there may be other reasons, or
different price/performance aim points for integrating them, but if you
want FP performance, you must dedicate significant silicon real-estate.)

5) Don't you end up with ... a CISC?  I'm not sure what this means.
RISC means different things to different people.  What it usually means
to us is:
  a) A design approach where hardware resources are concentrated on
     things that are performance-critical and universal.
  b) The belief that in making things fast, instructions and/or complex
     addressing formats drop out, NOT as a GOAL, but as a side-effect.

Thus, in our case, we designed a CPU that would go fast for integer
performance, and have a tightly-coupled coprocessor interface that would
let FP go fast also.  (Note: integer performance is universal, whereas
FP is mostly bimodal: people either don't care about it at all, or want
as much as they can get.)  When you measure integer programs, you make
choices to include or delete features, according to the statistics seen
in measuring substantial programs.  You do the same thing for
FP-intensive programs.  Guess what!  You discover that FP Adds,
Subtracts, Multiplies (and maybe Divides) are:
  a) Good Things
  b) Not simulatable by integer arithmetic very quickly.

However, suppose that we'd discovered that FP Divide happened so seldom
that it could be simulated in software at an adequate performance level,
and that taking that silicon and using it to make FP Mult faster gave
better overall performance.  In that case, we might have done it that
way.  In any case, we don't see any conflict in having a RISC with FP
(or decimal, or ... anything where some important class of application
needs hardware thrown at it and can justify the cost of having it).
Seymour Cray has been doing fast machines for years with similar design
principles (if at a different cost point!) and FP has certainly been
there.

Anyway, thanks for the additional data.  Also, I'd be happy to see more
discussion on what metrics are reasonable [especially since the original
posting invented "Whetstones/MHz" on the spur of the moment], and there
have been some interesting side discussions generated, both on:
  a) Are KWhets a good choice?
  b) What's a MHz?

As can be seen, this business is still clearly in need of benchmarks
that:
  a) measure something real.
  b) measure something understandable.
  c) are small enough that they can be run and simulated in reasonable
     time.
  d) predict real performance of adequate-sized classes of programs.
  e) are used by enough people that you can do comparisons.
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{decvax,ucbvax,ihnp4}!decwrl!mips!mash, DDD:  	408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
jbuck@epimass.UUCP (Joe Buck) (10/16/86)
In article <8575@lanl.ARPA> jlg@a.UUCP (Jim Giles) writes:
>Mflops: (Millions of FLoating point OPerations per Second)
>MHz:    (Millions of cycles per second)
>
>Therefore 'Mflops per MHz': (Millions^2 FLoating point OPeration cycles
>                             per sec^2)
>
>Sounds like an acceleration to me.  Must be a measure of how fast computer
>speed is improving.  Still, the choice of units forces this number to
>be small.  8-)

You multiplied instead of dividing.  If you had divided, you would have
found that the number measures floating point operations per cycle.
-- 
- Joe Buck 	{hplabs,fortune}!oliveb!epimass!jbuck, nsc!csi!epimass!jbuck
  Entropic Processing, Inc., Cupertino, California
dennisg@fritz.UUCP (Dennis Griesser) (10/17/86)
In article <8184@sun.uucp> dgh@sun.UUCP writes:
>                         Mflops Per MHz

In article <8575@lanl.ARPA> jlg@a.UUCP (Jim Giles) writes:
>Mflops: (Millions of FLoating point OPerations per Second)
>MHz:    (Millions of cycles per second)
>
>Therefore 'Mflops per MHz': (Millions^2 FLoating point OPeration cycles
>                             per sec^2)
>
>Sounds like an acceleration to me.  Must be a measure of how fast computer
>speed is improving.  Still, the choice of units forces this number to
>be small.

You are not factoring the units out correctly...

   million x flop       second             flop
   --------------  x  ---------------  =  -----
       second         million x cycle     cycle

Sounds reasonable to me.
ags@h.cc.purdue.edu (Dave Seaman) (10/17/86)
In article <8575@lanl.ARPA> jlg@a.UUCP (Jim Giles) writes:
>In article <8184@sun.uucp> dgh@sun.UUCP writes:
>>
>>                         Mflops Per MHz
>>
>>                          David Hough
>>                         dhough@sun.com
>>
>Mflops: (Millions of FLoating point OPerations per Second)
>MHz:    (Millions of cycles per second)
>
>Therefore 'Mflops per MHz': (Millions^2 FLoating point OPeration cycles
>                             per sec^2)

You didn't divide properly.  "Mflops per MHz" means "Mflops divided by
MHz", which, by the "invert the denominator and multiply" rule, comes
out to "floating point operations per cycle" after cancelling the
"millions of ... per second" from numerator and denominator.

I'm not claiming that this is a particularly useful measure, but that's
what it means.
-- 
Dave Seaman
ags@h.cc.purdue.edu
nather@ut-sally.UUCP (Ed Nather) (10/17/86)
When I first started messing with computers (longer ago than I like to
remember) I was discouraged to learn they could not handle numbers as
long as I sometimes needed.  Then I learned about floating point -- as a
way to get very large numbers into registers (and memory cells) of
limited length.  It sounded great until I learned that you give up
something when you do things that way -- simple operations become much
more complex (and slower) using standard hardware.  Also, the aphorism
about using a lot of floating operations was brought home to me:

"Using floating point is like moving piles of sand around.  Every time
you move one you lose a little sand, and pick up a little dirt."

Has hardware technology progressed to the point where we might want to
consider making a VERY LARGE integer machine -- with integers long
enough so floating point operations would be unnecessary?  I'm not sure
how long they would have to be, but 512 bits sounds about right to start
with.  This would allow integers to have values up to about 10E150 or
so, large enough for Avogadro's number or, with suitable scaling,
Planck's constant.  It would allow rapid integer operations in place of
floating point operations.  If you could add two 512-bit integers in a
couple of clock cycles, it should be pretty fast.  I guess this would be
a somewhat different way of doing parallel operations rather than serial
ones.

Is this crazy?
-- 
Ed Nather
Astronomy Dept, U of Texas @ Austin
{allegra,ihnp4}!{noao,ut-sally}!utastro!nather
nather@astro.AS.UTEXAS.EDU
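For a feel of what such hardware would replace, here is a software
sketch of one 512-bit add, with each operand held as 32 base-2**16
digits, least significant first.  The routine name, the digit size and
the choice to discard any carry out of the top digit are all mine,
purely for illustration.

      SUBROUTINE ADD512(A, B, C)
C     Add two unsigned 512-bit integers held as 32 base-2**16 digits,
C     least significant digit first, giving C = A + B.  Carry out of
C     the most significant digit is dropped.
      INTEGER A(32), B(32), C(32)
      INTEGER I, S, CARRY
      CARRY = 0
      DO 10 I = 1, 32
         S = A(I) + B(I) + CARRY
         C(I) = MOD(S, 65536)
         CARRY = S / 65536
   10 CONTINUE
      RETURN
      END

In hardware the loop becomes a single wide adder, and the "couple of
clock cycles" hinges on how quickly the carry can be propagated or
predicted across all 512 bits.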
bobmon@iuvax.UUCP (Robert Montante) (10/17/86)
>> Mflops Per MHz >> [...] >Mflops:(Millions of FLoating point OPerations per Second) >MHz: (Millions of cycles per second) > >Therefore 'Mflops per MHz':(Millions^2 FLoating point OPeration cycles per > sec^2) > >Sounds like an acceleration to me. Must be a measure of how fast computer >speed is improving. Still, the choice of units forces this number to >be small. 8-) I get: 10e6 X FLoating_point_OPerations / second ----------------------------------------------- 10e6 X Cycles / second which reduces to FLoating_point_OPerations / Cycle an apparent measure of instruction complexity. But then, if you use a Floating Point Accelerator, perhaps these interpretations are consistent. 8-> *-=-*-=-*-=-*-=-*-=-*-=-*-=-*-=-*-=-*-=-*-=-*-=-*-=-*-=-*-=-*-=-*-=-*-=-*-=-* RAMontante Computer Science "Have you hugged ME today?" Indiana University
henry@utzoo.UUCP (Henry Spencer) (10/17/86)
> ... marketing and management scientists ...
^ ^
Syntax error in above line: incompatible concepts!
--
Henry Spencer @ U of Toronto Zoology
{allegra,ihnp4,decvax,pyramid}!utzoo!henry
ehj@mordor.ARPA (Eric H Jensen) (10/17/86)
In article <6028@ut-sally.UUCP> nather@ut-sally.UUCP (Ed Nather) writes:
>more complex (and slower) using standard hardware.  Also, the aphorism
>about using a lot of floating operations was brought home to me:
>"Using floating point is like moving piles of sand around.  Every time
>you move one you lose a little sand, and pick up a little dirt."

I thought numerical analysis was the plastic sheet you place your sand
on - with some thought (algorithm changes) you can control your errors
most of the time, or at least understand them.  Then of course there is
always an Extended format ...

>Has hardware technology progressed to the point where we might want to
>consider making a VERY LARGE integer machine -- with integers long
>...
>scaling, Planck's constant.  It would allow rapid integer operations
>in place of floating point operations.  If you could add two 512-bit
>integers in a couple of clock cycles, it should be pretty fast.

I would not want to be the one to place and route the carry-lookahead
logic for a VERY fast 512 bit adder (you could avoid this by using the
tidbits approach, but that has many other implications).  The real
killers would be multiply and divide.  If you really want large integers
use an efficient bignum package; hardware can help by providing traps or
micro-code support for overflow conditions.
-- 
eric h. jensen (S1 Project @ Lawrence Livermore National Laboratory)
Phone: (415) 423-0229  USMail: LLNL, P.O. Box 5503, L-276, Livermore, Ca., 94550
ARPA:  ehj@angband     UUCP:  ...!decvax!decwrl!mordor!angband!ehj
josh@polaris.UUCP (Josh Knight) (10/18/86)
In article <16112@mordor.ARPA> ehj@mordor.UUCP (Eric H Jensen) writes:
>In article <6028@ut-sally.UUCP> nather@ut-sally.UUCP (Ed Nather) writes:
>>more complex (and slower) using standard hardware.  Also, the aphorism
>>about using a lot of floating operations was brought home to me:
>>"Using floating point is like moving piles of sand around.  Every time
>>you move one you lose a little sand, and pick up a little dirt."
>
>I thought numerical analysis was the plastic sheet you place your sand
>on - with some thought (algorithm changes) you can control your
>errors most of the time or at least understand them.  Then of course
>there is always an Extended format ...
>

There are two realistic limits to the precision used for a particular
problem, assuming that the time to accomplish the calculation is not an
issue.  The first is the precision of the input data (in astronomy this
tends to be less than single precision, sometimes significantly less),
and the second is the number of intermediate values you are willing to
store in your calculation (i.e. memory).  The number of intermediate
values includes things like the fineness of the grid you use in an
approximation to a continuous problem formulation (although quantum
mechanics makes everything "grainy" at some point, numerical
calculations aren't usually done to that level).  As Eric points out,
proper handling of the calculations and some extended precision (beyond
what is kept in long term storage) will provide all the precision that
is available with the given resources.

Indeed, the proposal to use long integers is wasteful of the very
resource that is usually in short supply in these calculations, namely
memory (reference to all the "no virtual memory on MY Cray!" verbiage).
When Ed stores the mass of a 10 solar mass star in his simulation of the
evolution of an open cluster as a 512 bit integer, approximately 500 of
the bits are wasted on meaningless precision.  The mass of the sun is of
order 10e33 grams, but the precision to which we know the mass is only
five or six decimal digits (limited, I believe, by the precision of G,
the gravitational coupling constant, but the masses of stars other than
the sun are typically much more poorly known), thus storing this number
in a 512 bit integer wastes almost all the bits; only 15-20 of them mean
anything.

I'll admit (before I get flamed) that the IBM 370 floating point format
has some deficiencies when it comes to numerical calculations
(truncating arithmetic and hexadecimal normalization).  I will also
disclaim that I speak only for myself, not my employer.
-- 
Josh Knight, IBM T.J. Watson Research
josh@ibm.com, josh@yktvmh.bitnet,  ...!philabs!polaris!josh
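(The bit count follows from log2(10) being about 3.3: five or six
significant decimal digits correspond to roughly 17 to 20 bits, so all
but a handful of the 512 bits carry no information.)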
bzs@bu-cs.BU.EDU (Barry Shein) (10/19/86)
From: nather@ut-sally.UUCP (Ed Nather)
>Has hardware technology progressed to the point where we might want to
>consider making a VERY LARGE integer machine -- with integers long
>enough so floating point operations would be unnecessary?

Why wouldn't the packed decimal formats of machines like the IBM/370 be
sufficient for most uses (31 decimal digits + sign expressed as nibbles,
slightly more complicated size range for multiplication and division
operands, basic arithmetic operations supported)?  That's not a huge
range, but it's a lot larger than 32-bit binary.  I believe it was
developed because a while ago people like the gov't noticed you can't do
anything with 32-bit regs and their budgets, and floating point was
unacceptable for many things.

There are no packed decimal registers, so I assume the instructions are
basically memory-bandwidth limited (not unusual).  The VAX seems to
support operand lengths up to 16-bits (65k*2 digits?  I've never tried
it.)  There is some primitive support for this (ABCD, SBCD) in the 68K.

	-Barry Shein, Boston University
josh@polaris.UUCP (Josh Knight) (10/19/86)
Sorry about all the typos in <753@polaris>. -- Josh Knight, IBM T.J. Watson Research josh@ibm.com, josh@yktvmh.bitnet, ...!philabs!polaris!josh
stuart@BMS-AT.UUCP (Stuart D. Gathman) (10/21/86)
In article <6028@ut-sally.UUCP>, nather@ut-sally.UUCP (Ed Nather) writes:
> long as I sometimes needed.  Then I learned about floating point . . .
> . . . .  It sounded great until I learned that you give up
> something when you do things that way --simple operations become much
> more complex (and slower) using standard hardware. . . .

For problems appropriate to floating point, the input is already
imprecise.  Planck's constant is not known to more than a dozen digits
at most.  Good floating point software keeps track of the remaining
precision as computations proceed.  Even if the results were computed
precisely using rational arithmetic, the results would be more imprecise
than the input.  Rounding in floating point hardware contributes only a
minor portion of the imprecision of the result in properly designed
software.

For problems unsuited to floating point, e.g. accounting, yes the
floating point hardware gets in the way.  For accounting one should use
large integers: 48 bits is plenty in practice and no special hardware is
needed.  The 'BCD' baloney often advocated is just that.  Monetary
amounts in accounting are integers.  'BCD' is sometimes used so that
decimal fractions round correctly, but the correct method is to use
integers.

Rational arithmetic is another place for large integers.  Numbers are
represented as the quotient of two large integers.  This is where
special hardware might help.  Symbolic math often uses rational
arithmetic, but the large integers should be variable length.  Numbers
such as '1' and '2' are far more common than 100 digit monsters.
-- 
Stuart D. Gathman	<..!seismo!{vrdxhq|dgis}!BMS-AT!stuart>
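As a small, hedged illustration of the scaled-integer style Gathman is
advocating (amounts kept in cents, rounding done explicitly), here is
one way a percentage might be applied.  The routine name, the
basis-point convention, and the assumption that the product fits in a
default INTEGER are all mine; a real ledger would use the wider integers
he mentions.

      INTEGER FUNCTION PCTAMT(CENTS, BP)
C     Apply a rate given in basis points (hundredths of a percent) to a
C     nonnegative amount in cents, rounding to the nearest cent by
C     adding half the divisor before truncating.  Sketch only: assumes
C     CENTS*BP fits in a default INTEGER.
      INTEGER CENTS, BP
      PCTAMT = (CENTS * BP + 5000) / 10000
      RETURN
      END

For example, 7% of $19.99 is PCTAMT(1999, 700) = 140 cents, with the
half-cent cases decided the same way every time instead of however the
floating point rounding happens to fall.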
gnu@hoptoad.uucp (John Gilmore) (10/22/86)
In article <6028@ut-sally.UUCP> nather@ut-sally.UUCP (Ed Nather) writes:
>"Using floating point is like moving piles of sand around.  Every time
>you move one you lose a little sand, and pick up a little dirt."

IEEE floating point provides an "inexact" exception, which can be made
to trap any time the result of an operation is not exact.  This lets
your software know that it has picked up dirt, if it cares, and lets
particularly smart software change to extended precision, long integers,
or whatever.

I was wondering how you represent values <1 in your 512-bit integers...
or are you going to figure out binary points on the fly?  In that case
you might as well let hardware do it -- that's called floating point!
-- 
John Gilmore  {sun,ptsfa,lll-crg,ihnp4}!hoptoad!gnu   jgilmore@lll-crg.arpa
  (C) Copyright 1986 by John Gilmore.  May the Source be with you!
rb@cci632.UUCP (Rex Ballard) (10/22/86)
In article <3153@h.cc.purdue.edu> ags@h.cc.purdue.edu.UUCP (Dave Seaman) writes:
>In article <8575@lanl.ARPA> jlg@a.UUCP (Jim Giles) writes:
>>In article <8184@sun.uucp> dgh@sun.UUCP writes:
>>>                         Mflops Per MHz
>
>	"Floating point operations per cycle"
>
>after cancelling the "millions of ... per second" from numerator and
>denominator.
>
>I'm not claiming that this is a particularly useful measure, but that's
>what it means.

Sounds to me like what is really wanted is average cycles/flop.  This
does give an indication of the micro-code efficiency of the architecture
used, based on averages rather than advertised figures.  Obviously, the
more cycles/flop, the less inherently efficient the chip architecture
is.

By figuring these values using Whetstones or similar benchmarks,
additional "off-chip" factors such as set-up overhead for the FPU calls
are correctly included.  This sounds like an interesting measure of
CPU/FPU architecture in the broader sense of the word.

>Dave Seaman
>ags@h.cc.purdue.edu
kissell@garth.UUCP (10/23/86)
In article <725@mips.UUCP> mash@mips.UUCP (John Mashey) writes:
>                                                  Right now, at least
>in anything close to 2micron CMOS, if the FPU is part of the CPU chip, it
>just has to be heavily microcoded.

Oh?  What law of physics are we violating?  ;-)

Kevin D. Kissell
Fairchild Advanced Processor Division
peters@cubsvax.UUCP (Peter S. Shenkin) (10/23/86)
In article <BMS-AT.253> stuart@BMS-AT.UUCP (Stuart D. Gathman) writes:
>
>For problems appropriate to floating point, the input is already
>imprecise.  Planck's constant is not known to more than a dozen
>digits at most.  Good floating point software keeps track of
>the remaining precision as computations proceed.

???  I've never heard of this.  Could you say more?  Until you do, I
will....  Read on.

> ...Rounding
>in floating point hardware contributes only a minor portion of
>the imprecision of the result in properly designed software.

I disagree.  Consider taking the average of many floating point numbers
which are read in from a file, and which differ greatly in magnitude.
How many there are to average may not be known until EOF is encountered.
The "obvious" way of doing this is to accumulate the sum, then divide by
n.  But if some numbers are very large, the very small ones will fall
off the low end of the dynamic range, even if there are a lot of them;
this problem is avoided if one uses higher precision (double or
extended) for the sum.  If declaring things this way is what you mean by
properly designed software, OK.  But the precision needed for
intermediate values of a computation may greatly exceed that needed for
input and output variables.  I call this a rounding problem.  I know of
no "floating point software" that will get rid of this.

There are, of course, programming techniques for handling it, some of
which are very clever.  Again, I suppose you could say that if you don't
implement them then you're not using properly designed software.  But
these techniques are time-consuming to build into programs, and
time-consuming to execute; therefore, they should only be used where
they're really needed.  But the whole point is that the precision needed
for intermediate results may GREATLY exceed that needed for input and
output variables, and an important part of numerical analysis is being
able to figure out where that is.

Peter S. Shenkin	Columbia Univ. Biology Dept., NY, NY  10027
{philabs,rna}!cubsvax!peters		cubsvax!peters@columbia.ARPA
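A minimal sketch of the higher-precision-accumulator fix Shenkin
mentions, reading single precision values until end of file and keeping
the running sum in DOUBLE PRECISION.  The program name and unit numbers
are arbitrary; compensated (Kahan) summation would be one of the
cleverer techniques he alludes to.

      PROGRAM AVGD
C     Average REAL values read from unit 5, accumulating in DOUBLE
C     PRECISION so small values are not lost against a large partial
C     sum.
      DOUBLE PRECISION SUM
      REAL X, AVG
      INTEGER N
      SUM = 0.0D0
      N = 0
   10 CONTINUE
      READ (5, *, END = 20) X
      SUM = SUM + DBLE(X)
      N = N + 1
      GO TO 10
   20 CONTINUE
      IF (N .GT. 0) THEN
         AVG = SNGL(SUM / DBLE(N))
         WRITE (6, *) 'AVERAGE =', AVG
      END IF
      END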
ludemann@ubc-cs.UUCP (10/24/86)
In article <253@BMS-AT.UUCP> stuart@BMS-AT.UUCP writes:
>For problems unsuited to floating point, e.g. accounting, yes the
>floating point hardware gets in the way.  For accounting one should
>use large integers: 48 bits is plenty in practice and no special hardware
>is needed.

As someone who has done accounting using floating point, I wish to point
out that 8-byte floating point has more precision than 15 digits of BCD.
Remembering that the exponent only takes a "few" bits, I'll happily use
floating point any day instead of integers (even 48 bit integers).

Integers work fine for accounting as long as one is adding and
subtracting, but if one has to multiply (admittedly, not often), there's
big trouble.  After quite a number of attempts to make things balance to
the penny, I changed to floating point and all my problems vanished (the
code ran faster, too).
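(For an IEEE-style 8-byte format the comparison works out as follows: a
53-bit significand carries 53 x log10(2), about 15.9 decimal digits,
which is just above the 15 digits plus sign that fit in the same eight
bytes of packed decimal.)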
franka@mmintl.UUCP (Frank Adams) (10/28/86)
In article <253@BMS-AT.UUCP> stuart@BMS-AT.UUCP writes:
>For problems unsuited to floating point, e.g. accounting, yes the
>floating point hardware gets in the way.  For accounting one should
>use large integers: 48 bits is plenty in practice and no special hardware
>is needed.

48 bits is not always adequate.  One sometimes has to perform operations
of the form a*(b/c), rounded to the nearest penny (integer).  Doing this
with integer arithmetic requires intermediate results with double the
precision of the final results.  With floating point, this is not
necessary.

Frank Adams                           ihnp4!philabs!pwa-b!mmintl!franka
Multimate International    52 Oakland Ave North    E. Hartford, CT 06108
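(The doubling comes straight from the multiply: if the amount a and the
numerator b can each need up to 48 bits, the exact product a*b can need
close to 96 bits before the division by c, which is why plain 48-bit
integer arithmetic is not enough; a double precision a*(b/c) stays
within one word at the cost of a rounding error far below a cent for
amounts well inside its 15-16 significant digits.)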
dik@mcvax.uucp (Dik T. Winter) (11/08/86)
In article <570@cubsvax.UUCP> peters@cubsvax.UUCP (Peter S. Shenkin) writes:
>In article <BMS-AT.253> stuart@BMS-AT.UUCP (Stuart D. Gathman) writes:
>>
>> Good floating point software keeps track of
>>the remaining precision as computations proceed.
>
>???  I've never heard of this.  Could you say more?  Until you do, I will....
>Read on.
>
>> ...Rounding
>>in floating point hardware contributes only a minor portion of
>>the imprecision of the result in properly designed software.
>
>I disagree.  Consider taking the average of many floating point numbers
>which are read in from a file, and which differ greatly in magnitude.
>How many there are to average may not be known until EOF is encountered.
>The "obvious" way of doing this is to accumulate the sum, then divide
>by n.  But if some numbers are very large, the very small ones will
>fall off the low end of the dynamic range, even if there are a lot of
>them; this problem is avoided if one uses higher precision (double
>or extended) for the sum.  If declaring things this way is what you mean by
>properly designed software, OK.  But the precision needed for intermediate
>values of a computation may greatly exceed that needed for input and
>output variables.  I call this a rounding problem.  I know of no "floating
>point software" that will get rid of this.
>

Well, there are at least three packages dealing with it: ACRITH from IBM
and ARITHMOS from Siemens (they are identical in fact) and a language
called PASCAL-SC on a KWS workstation (a bit obscure, I am sure).  They
are based on the work by Kulisch et al. from the University of
Karlsruhe.  They use arithmetic with directed rounding and accumulation
of dot products in long registers (168 bytes on IBM).  On IBM there is
microcode support for this on the 4341 (or 4381 or 43?? or some such
beast).

The main purpose is verification of results (at least, that is my
opinion).  For instance, for a set of linear equations, find a solution
interval that contains the true solution, with the constraint that the
interval is as small as possible.  They first find an approximate
solution using standard techniques, followed by an iterative scheme to
obtain a smallest interval using interval arithmetic combined with long
registers.  This is superior to standard interval arithmetic because the
latter tends to give much too large intervals.
-- 
dik t. winter, cwi, amsterdam, nederland
UUCP: {seismo,decvax,philabs,okstate,garfield}!mcvax!dik
or: dik@mcvax.uucp
ARPA: dik%mcvax.uucp@seismo.css.gov