news@solar.ARPA (01/17/86)
From: vax135!ariel!mtunf!solar!news@ucbvax.berkeley.edu
Newsgroups: mod.computers.workstations
Subject: Weitek 1164/5 Floating Point Accelerator
Message-ID: <8601150120.AA26926@caip.rutgers.edu>
Date: Mon, 13-Jan-86 22:26:57 EST
Organization: The ARPA Internet
Approved: works@red.rutgers.edu

This newsgroup is moderated, and cannot be posted to directly.  Please
mail your article to the moderator for posting.

              Weitek 1164/5 Floating Point Accelerators

                             David Hough

                               ABSTRACT

     Sun-3 Floating Point Accelerator measured performance exceeds
600,000 floating point operations per second on some popular
benchmarks.  Not all popular benchmarks are worth running, however;
the results of the Whetstone benchmark, in particular, are difficult
to interpret.  tbl/nroff source for this report is available from
ucbvax!sun!dhough.

     Sun Microsystems, along with many of its competitors, has
announced a Floating Point Accelerator product as an option for its
new 68020-based Sun-3 systems.  These Floating Point Accelerators are
often based on the Weitek 1164/1165 chip set.  The 1164/1165 set is
currently available only as engineering samples, so few of these FPAs
have been used by customers.  Consequently there is some uncertainty
as to exactly what performance to expect, although most vendors
projected similar results when they announced their products.  Any
performance differences among implementations are due to the hardware
surrounding the 1164/1165 and the quality of the compiler-generated
code.
     The purpose of this report is to indicate what I have measured
at Sun, and to encourage customers to report results they obtain from
measurement of Sun's or competitors' products.

     Here are the current single and double precision benchmark
results for Sun's software release 3.1, currently under development
and expected to be shipped to customers in quantity in the second
quarter of this year.  All programs were compiled with f77's -O
option for maximum optimization.  Results are measured in KFLOPS,
thousands of floating point operations per second, except Whetstone
results, which are measured in KWIPS, thousands of Whetstone
interpreter instructions per second.  Note that all these numbers are
MEASURED (not projections) except the spec sheet numbers, which are
estimates derived last summer.

Sun-3 SINGLE Precision KFLOPS:

    f77 option         -fswitch  -f68881  -f68881  -fswitch  -ffpa    FPA
    FP hardware          68881     68881    68881     FPA      FPA    spec
    FP clock MHz          12.5      12.5     16.7     16.7     16.7  sheet

    Whetstone KWIPS        530       860     1030     1400     2300   2000
    Linpack rolled          52        86      108      180      610    450
    Linpack unrolled        52        85      107      180      500    450
    Large Linpack  1                  79      100               370
    Large Linpack  2                 101      130               510
    Large Linpack  4                 115      150               630
    Large Linpack  8                 105      130               600
    Large Linpack 16                  96      120               400
    Livermore max                    210      280              1200
    Livermore median                  97      120               510
    Livermore harmonic                86      110               420
    Livermore loop #6                 80      103               430
    Livermore min                     41       51               130

Sun-3 DOUBLE Precision KFLOPS:

    f77 option         -fswitch  -f68881  -f68881  -fswitch  -ffpa    FPA
    FP hardware          68881     68881    68881     FPA      FPA    spec
    FP clock MHz          12.5      12.5     16.7     16.7     16.7  sheet

    Whetstone KWIPS        400       790      930      860     1700   1500
    Linpack rolled          39        80      101      100      400    350
    Linpack unrolled        39        80       99      100      310    350
    Large Linpack  1                  74       92               250
    Large Linpack  2                  95      120               370
    Large Linpack  4                 109      130               450
    Large Linpack  8                  98      120               380
    Large Linpack 16                  90      108               290
    Livermore max                    200      270               830
    Livermore median                  90      110               320
    Livermore harmonic                80      100               280
    Livermore loop #6                 75       92               270
    Livermore min                     38       48               110

     Production Sun-3s run the 68020 CPU at 16.7 MHz and the 68881
(mask set A79J) at 12.5 MHz.  The 16.7 MHz 68881 (mask set A93N) is
currently available only as engineering samples.

     Note the difference between switched floating point (-fswitch)
and inline floating point (-f68881 or -ffpa).  A program compiled
with switched floating point will use an FPA if it is there, or else
a 68881 if it is there.  A program compiled with inline code will run
only on the hardware for which it was compiled.  As is evident, there
is a considerable performance penalty for using switched instead of
inline floating point.

     The usual Linpack benchmark measures the time required to solve
a 100x100 system of linear equations.  The inner loop of the Linpack
benchmark looks like this when rolled:

          do 1 i = 1, n
        1 x(i  ) = x(i  ) + c * y(i  )

and like this when unrolled:

          do 1 i = 1, n, 4
          x(i  ) = x(i  ) + c * y(i  )
          x(i+1) = x(i+1) + c * y(i+1)
          x(i+2) = x(i+2) + c * y(i+2)
        1 x(i+3) = x(i+3) + c * y(i+3)

     The distributed version of the Linpack benchmark has the inner
loop unrolled because that was faster on certain mainframes common in
the mid-1970s.  However, the unrolling defeats many current
vectorizing compilers, so supercomputer manufacturers usually measure
the rolled speed.  Further complicating the issue, some compilers do
not generate optimum code for the inner loop whether rolled or
unrolled, so hand-coded assembly language is faster still.  For the
usual Linpack benchmark on the Sun-3, code compiled inline for rolled
loops is truly optimized and cannot be improved by hand coding in
assembly language.  Rolled loops are what a programmer would be most
likely to write, so it does not bother me that Sun's f77 compiler
does not generate quite as good code when the loops are unrolled.
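The rolled and unrolled inner loops above can be sketched as follows.  This is a Python transcription for illustration (the benchmark itself is Fortran 77); the vector length and data are made up, and the point is only that both codings perform the same update and the same two floating point operations per element, which is what the KFLOPS figures count.

```python
# Linpack inner loop x(i) = x(i) + c*y(i), rolled and unrolled 4 times.
# Each full pass over a length-n vector does 2*n floating point
# operations: one multiply and one add per element.

def daxpy_rolled(c, x, y):
    """x <- x + c*y, one element per iteration."""
    for i in range(len(x)):
        x[i] = x[i] + c * y[i]

def daxpy_unrolled(c, x, y):
    """Same update, four elements per iteration.

    len(x) must be a multiple of 4, as in the Fortran original."""
    for i in range(0, len(x), 4):
        x[i]     = x[i]     + c * y[i]
        x[i + 1] = x[i + 1] + c * y[i + 1]
        x[i + 2] = x[i + 2] + c * y[i + 2]
        x[i + 3] = x[i + 3] + c * y[i + 3]

n = 100                          # assumed vector length, not from the report
x1 = [i * 0.5 for i in range(n)]
x2 = list(x1)
y  = [i * 0.25 for i in range(n)]

daxpy_rolled(2.0, x1, y)
daxpy_unrolled(2.0, x2, y)
flops_per_pass = 2 * n           # 100 multiplies + 100 adds
```

Both codings compute identical results; they differ only in loop overhead per element, which is why the measured KFLOPS rates for rolled and unrolled loops can diverge.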
The FPA spec sheet projections were derived by considering the rolled
loop; it did not occur to me that the unrolled results would differ
until I measured the hardware.

     The usual Linpack benchmark is a good one for scientific and
engineering floating point calculations, in part because it measures
the performance of hardware and compiler in an indisputable way on a
realistic computation.  An optimizing compiler can't optimize away
any of the floating point work in the Linpack benchmark, although it
can organize it more or less efficiently.

     Less widely used than the program just discussed is the Large
Linpack benchmark, which measures the time required to solve a
300x300 system of linear equations, with the computation organized
rather differently from the usual Linpack benchmark.  The program
reports KFLOPS rates for solving the problem with source codings
corresponding to unrolling 1, 2, 4, 8, or 16 times.

     The Livermore Loops benchmark measures the time required to
perform 24 inner loops taken from important production codes run at
Livermore.  Max, min, median, and harmonic mean KFLOPS rates are
reported above for data vectors of length 468.  The KFLOPS rating for
loop #6 is also reported; it has been identified by Patterson as the
single loop best correlating with overall Livermore Loops
performance.

     Some vendors prefer to talk about results of the Whetstone
benchmark, which was synthesized to mimic the instruction stream
created by the Whetstone Algol interpreter of the 1960s.  Hardware
and software progress have rendered the Whetstone benchmark obsolete,
but relevance has seldom affected the science of marketing.  At least
one of Sun's competitors has claimed 3000K Whetstone instructions per
second for single precision, using the same 68020 and 1164/1165,
which is an amazing accomplishment.
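The max, median, and harmonic mean summary figures that the Livermore Loops tables above report can be computed as sketched below.  The rates here are invented for illustration, not the measured data; the point is that the harmonic mean (the right way to average rates over equal amounts of work) is pulled strongly toward the slowest loops, which is why it sits well below the median in the tables.

```python
# Summary statistics over per-loop KFLOPS rates, as reported for the
# 24 Livermore Loops.  Rates below are hypothetical.

def harmonic_mean(rates):
    """n divided by the sum of reciprocals: the mean rate if each
    loop performs the same number of floating point operations."""
    return len(rates) / sum(1.0 / r for r in rates)

def median(rates):
    """Middle value, or mean of the two middle values."""
    s = sorted(rates)
    mid = len(s) // 2
    return s[mid] if len(s) % 2 else (s[mid - 1] + s[mid]) / 2.0

rates = [130.0, 420.0, 510.0, 1200.0]   # hypothetical KFLOPS values

fastest = max(rates)       # 1200.0
typical = median(rates)    # 465.0
overall = harmonic_mean(rates)  # about 311, far below the median
```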
Anyone who can independently verify such claims should respond and
explain how it's done!

     In the meantime I might consider how to improve Sun's 2300K to
3000K.  About half the time in the Whetstone benchmark is taken by
the P3 subroutine, and on an 1164/1165 system about half the P3 time
is consumed by the division instruction.  The most direct way to
obtain a substantial improvement is to get rid of that division!
Looking at our hardware architecture and local compiler optimization,
I can't imagine any incremental improvements that would have a
significant effect.

     Certain types of global cross-procedural optimizations can have
a profound impact, however.  Since P3's division is by a global
variable whose value happens to be 2.0, in principle the division
could be converted to a multiplication by 0.5.  Another possibility
is to expand short procedures such as P3 inline in the calling code,
then notice that the expanded computation is invariant and can be
moved outside the do loop, leaving an empty loop.  Anyone who built
such inline expansion into their compiler would double their
Whetstone scores, and the only cost would be a substantial diversion
of software resources away from other projects that might actually
benefit customers.  Since critical loops in real applications are
usually source-coded by the programmer to avoid division by 2.0 or
invariant subroutine calls, corresponding optimizations in the
compiler seldom pay off in realistic floating point applications, so
Sun's efforts are focused elsewhere.

     The moral of this digression is "don't pay much attention to
Whetstone results".  If you want a single number to characterize
performance on scientific and engineering problems, use the usual
Linpack benchmark.  If you want lots of numbers, the Livermore Loops
benchmark provides them.  If you want accuracy and IEEE conformance
as well as speed... that's a topic for another report.

Code fragments from the Whetstone program...

          T  = .499975
          T2 = 2.0

later...

          DO 90 I = 1, N8
          CALL P3(X,Y,Z)
       90 CONTINUE

later...

          SUBROUTINE P3(X,Y,Z)
          IMPLICIT REAL*4 (A-H,O-Z)
          COMMON T,T1,T2,E1(4),J,K,L
          X1 = X
          Y1 = Y
          X1 = T * (X1 + Y1)
          Y1 = T * (X1 + Y1)
          Z  = (X1 + Y1) / T2
          RETURN
          END

Note that with the Weitek 1164/1165, the one division takes longer
than the three additions and two multiplications combined.
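The two cross-procedural optimizations discussed above can be sketched by applying them by hand to a Python transcription of the P3 kernel.  T and T2 stand in for the COMMON variables, with the values the benchmark assigns; the loop bodies are illustrative, not Sun's compiler output.  Since 0.5 is an exact power of two, dividing by 2.0 and multiplying by 0.5 give bit-identical IEEE results, so the transformation is safe here.

```python
# Hand-applied versions of the two optimizations: strength-reducing
# the division by the global T2 == 2.0 into a multiply, and inlining
# P3 so the loop-invariant call can be hoisted out of the DO loop.

T, T2 = 0.499975, 2.0   # values assigned by the Whetstone program

def p3(x, y):
    """Direct transcription of SUBROUTINE P3."""
    x1, y1 = x, y
    x1 = T * (x1 + y1)
    y1 = T * (x1 + y1)
    return (x1 + y1) / T2      # the expensive division

def p3_optimized(x, y):
    """Division by the global T2 == 2.0 replaced by a multiply."""
    x1, y1 = x, y
    x1 = T * (x1 + y1)
    y1 = T * (x1 + y1)
    return (x1 + y1) * 0.5     # bit-identical, but far cheaper

def loop_naive(x, y, n8):
    """The benchmark's DO 90 loop: N8 calls with invariant arguments."""
    for _ in range(n8):
        z = p3(x, y)
    return z

def loop_hoisted(x, y, n8):
    """After inline expansion, the whole body is invariant and can be
    hoisted, leaving an empty loop (omitted entirely here)."""
    return p3_optimized(x, y)
```

This is exactly the kind of transformation that doubles a Whetstone score while predicting nothing about realistic code, since real programs rarely contain an invariant call repeated N8 times.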
news@solar.ARPA (01/20/86)
From: GUTFREUND@UMASS-CS.CSNET ("Steven H. Gutfreund")
Newsgroups: mod.computers.workstations
Subject: Network chaos
Message-ID: <8601162259.AA15299@ucbvax.berkeley.edu>
Date: Wed, 15-Jan-86 09:23:00 EST
Organization: The ARPA Internet
Approved: works@red.rutgers.edu

Does anyone have any comments about the new consortium formed to
produce a uniform network model (file servers, print servers, etc.)
among all computer vendors (except IBM)?  Bell, DEC, Burroughs, CDC,
etc. announced this effort in the last week.

My feeling is that this was not so much a technical move.  There is a
lot of room for innovation in the area of servers, especially when
one realizes that in the future there will be voice servers, video
servers, and parallel process servers, and locking oneself into a
standard across hybrid operating systems will constrain all operating
system development at these firms.  The actual move seems to have
been more marketing focused.  I think everyone was shocked when they
woke up and realized that in one day IBM could say no, they would not
go Ethernet with the PC, they would go ring, and all the Interlans
and 3Coms and Apples of the world would have to dance to IBM's tune.
This effort seems to be aimed at coming up with a marketing
counterforce to keep IBM from making SNA a de facto standard.

                                        - Steve