news@solar.ARPA (01/17/86)
From: vax135!ariel!mtunf!solar!news@ucbvax.berkeley.edu
This newsgroup is moderated, and cannot be posted to directly.
Please mail your article to the moderator for posting.
Relay-Version: version B 2.10 5/3/83; site solar.UUCP
Posting-Version: version B 2.10.3 4.3bsd-beta 6/6/85; site ucbvax.BERKELEY.EDU
Path: solar!orion!mtunf!mtuni!mtunh!ariel!vax135!houxm!mhuxt!mhuxr!ulysses!ucbvax!works
From: works@ucbvax.UUCP
Newsgroups: mod.computers.workstations
Subject: Weitek 1164/5 Floating Point Accelerator
Message-ID: <8601150120.AA26926@caip.rutgers.edu>
Date: Mon, 13-Jan-86 22:26:57 EST
Article-I.D.: caip.8601150120.AA26926
Posted: Mon Jan 13 22:26:57 1986
Sender: daemon@ucbvax.BERKELEY.EDU
Organization: The ARPA Internet
Lines: 290
Approved: works@red.rutgers.edu
Weitek 1164/5 Floating Point Accelerators
David Hough
ABSTRACT
Sun-3 Floating Point Accelerator measured
performance exceeds 600,000 floating point opera-
tions per second on some popular benchmarks. Not
all popular benchmarks are worth running, however;
the results of the Whetstone benchmark, in partic-
ular, are difficult to interpret.
tbl/nroff source for this report is available
from ucbvax!sun!dhough.
Sun Microsystems, along with many of its competitors,
has announced a Floating Point Accelerator product as an
option for its new 68020-based Sun-3 systems. These Float-
ing Point Accelerators are often based on the Weitek
1164/1165 chip set. The 1164/1165 set is currently available
only as engineering samples, so few of these FPAs have
been used by customers. Consequently there is some uncer-
tainty as to exactly what performance to expect, although
most vendors projected similar results when they announced
their products. Any performance differences among implemen-
tations are due to the hardware surrounding the 1164/1165
and the quality of the compiler-generated code. The purpose
of this report is to indicate what I have measured at Sun,
and to encourage customers to report results they obtain
from measurement of Sun's or competitors' products.
Here are the current single and double precision bench-
mark results for Sun's software release 3.1, currently under
development and expected to be shipped to customers in quan-
tity in the second quarter of this year. All programs were
compiled with f77's -O option for maximum optimization.
Results are measured in KFLOPS, thousands of floating point
operations per second, except Whetstone results which are
measured in KWIPS, thousands of Whetstone interpreter
instructions per second. Note that all these numbers are
MEASURED (not projections) except the spec sheet numbers,
which are estimates derived last summer.
13 January 1986
Sun-3 SINGLE Precision KFLOPS:

f77 option           -fswitch  -f68881  -f68881  -fswitch  -ffpa   FPA
FP hardware          68881     68881    68881    FPA       FPA     spec
FP clock MHz         12.5      12.5     16.7     16.7      16.7    sheet

Whetstone KWIPS      530       860      1030     1400      2300    2000
Linpack rolled       52        86       108      180       610     450
Linpack unrolled     52        85       107      180       500     450

Large Linpack 1                79       100                370
Large Linpack 2                101      130                510
Large Linpack 4                115      150                630
Large Linpack 8                105      130                600
Large Linpack 16               96       120                400

Livermore max                  210      280                1200
Livermore median               97       120                510
Livermore harmonic             86       110                420
Livermore loop #6              80       103                430
Livermore min                  41       51                 130

Sun-3 DOUBLE Precision KFLOPS:

f77 option           -fswitch  -f68881  -f68881  -fswitch  -ffpa   FPA
FP hardware          68881     68881    68881    FPA       FPA     spec
FP clock MHz         12.5      12.5     16.7     16.7      16.7    sheet

Whetstone KWIPS      400       790      930      860       1700    1500
Linpack rolled       39        80       101      100       400     350
Linpack unrolled     39        80       99       100       310     350

Large Linpack 1                74       92                 250
Large Linpack 2                95       120                370
Large Linpack 4                109      130                450
Large Linpack 8                98       120                380
Large Linpack 16               90       108                290

Livermore max                  200      270                830
Livermore median               90       110                320
Livermore harmonic             80       100                280
Livermore loop #6              75       92                 270
Livermore min                  38       48                 110

Production Sun-3's run the 68020 CPU at 16.7 MHz and
68881 mask set A79J at 12.5 MHz. 16.7 MHz 68881 mask set
A93N is currently available only as engineering samples.
Note the difference between switched floating point
(-fswitch) and inline floating point (-f68881 or -ffpa). A
program compiled with switched floating point will use an
FPA if one is present, or else a 68881 if one is present. A
program compiled with inline code will run only on the
hardware for which it is compiled. As is evident, there is
a considerable performance penalty for using switched
instead of inline floating point.
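
The size of that penalty can be read off the single precision table
above. The following is a small sketch in Python (not the report's
Fortran); the function name inline_speedup is an illustrative
assumption, and the figures are copied from the rows that have both a
switched and an inline measurement on the same hardware.

```python
# Single precision figures copied from the table above.
# Each entry: (switched rate, inline rate) on the same hardware and clock.
# Units are KFLOPS for Linpack and KWIPS for Whetstone.
single = {
    "Whetstone, 68881 at 12.5 MHz": (530, 860),
    "Linpack rolled, 68881 at 12.5 MHz": (52, 86),
    "Whetstone, FPA": (1400, 2300),
    "Linpack rolled, FPA": (180, 610),
}


def inline_speedup(switched, inline):
    """Ratio of inline throughput to switched throughput (hypothetical helper)."""
    return inline / switched


for name, (switched, inline) in single.items():
    print(f"{name}: inline is {inline_speedup(switched, inline):.2f}x switched")
```

On these rows the ratio runs from roughly 1.6 to 3.4, with the largest
gap on rolled Linpack with the FPA.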
The usual Linpack benchmark measures the time required
to solve a 100x100 system of linear equations. The inner
loop of the Linpack benchmark looks like this when rolled:
      do 1 i = 1, n
    1 x(i) = x(i) + c * y(i)
and like this when unrolled:
      do 1 i = 1, n, 4
      x(i)   = x(i)   + c * y(i)
      x(i+1) = x(i+1) + c * y(i+1)
      x(i+2) = x(i+2) + c * y(i+2)
    1 x(i+3) = x(i+3) + c * y(i+3)
The distributed version of the Linpack benchmark has the
inner loop unrolled because that was faster on certain main-
frames common in the mid-1970's. However, the unrolling
defeats many current vectorizing compilers, so supercomputer
manufacturers usually measure the rolled speed. Further com-
plicating the issue is that some compilers do not generate
optimum code for the inner loop whether rolled or unrolled,
so hand coded assembly language is faster yet. The situa-
tion for the usual Linpack benchmark and the Sun-3 is that
code compiled inline for rolled loops is truly optimized and
cannot be improved by hand coding in assembly language.
Rolled loops are what a programmer would be most likely to
write, so it does not bother me that Sun's f77 compiler
does not generate quite as good code when the loops are
unrolled. The FPA spec sheet projections were derived by
considering the rolled loop; it did not occur to me that the
results would be different from unrolled until I measured
the hardware.
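
To make the rolled and unrolled forms concrete, here is a sketch of
both in Python rather than Fortran. The function names are
illustrative, and the cleanup loop for lengths that are not a multiple
of 4 is an assumption added so the sketch works for any n; the
fragment quoted above omits it. Both forms do the same arithmetic on
each element, so they differ only in loop overhead and in what the
compiler makes of them.

```python
def axpy_rolled(c, x, y):
    """x(i) = x(i) + c * y(i), one element per iteration."""
    for i in range(len(x)):
        x[i] = x[i] + c * y[i]


def axpy_unrolled4(c, x, y):
    """Same update, four elements per iteration."""
    n = len(x)
    m = n % 4
    # Cleanup loop (assumed here): handle the leftover elements first.
    for i in range(m):
        x[i] = x[i] + c * y[i]
    # Main loop, unrolled by four.
    for i in range(m, n, 4):
        x[i] = x[i] + c * y[i]
        x[i + 1] = x[i + 1] + c * y[i + 1]
        x[i + 2] = x[i + 2] + c * y[i + 2]
        x[i + 3] = x[i + 3] + c * y[i + 3]


a = [float(i) for i in range(10)]
b = list(a)
y = [0.5 * i for i in range(10)]
axpy_rolled(2.0, a, y)
axpy_unrolled4(2.0, b, y)
print(a == b)
```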
The usual Linpack benchmark is a good one for scien-
tific and engineering floating point calculations, in part
because it measures the performance of hardware and compiler
in an indisputable way on a realistic computation. An
optimizing compiler can't optimize away any of the floating
point work in the Linpack benchmark, although it can organ-
ize it more or less efficiently.
Less widely used than the program just discussed is the
Large Linpack benchmark, which measures the time required to
solve a 300x300 system of linear equations, with the compu-
tation organized rather differently than the usual Linpack
benchmark. The program reports KFLOPS rates for solving the
problem for different source codings corresponding to unrol-
ling 1, 2, 4, 8, or 16 times.
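
As a rough guide to what these KFLOPS rates mean in elapsed time, the
sketch below (Python, not the report's Fortran) assumes the
conventional Linpack operation count of 2n^3/3 + 2n^2 for solving an
n x n system. The report does not state which count its programs use,
and the Large Linpack program may count differently, so treat the
times as approximate. The function names are illustrative.

```python
def linpack_ops(n):
    """Conventional Linpack floating point operation count (assumed here)."""
    return 2.0 * n**3 / 3.0 + 2.0 * n**2


def kflops(n, seconds):
    """Rate in thousands of floating point operations per second."""
    return linpack_ops(n) / seconds / 1000.0


def seconds_at(n, rate_kflops):
    """Elapsed time implied by a given KFLOPS rate."""
    return linpack_ops(n) / (rate_kflops * 1000.0)


# 100x100 at the measured 610 KFLOPS (single precision, rolled, -ffpa).
print(f"n=100: {seconds_at(100, 610):.2f} s")
# 300x300 at the measured 630 KFLOPS (single precision, unrolled 4 times).
print(f"n=300: {seconds_at(300, 630):.1f} s")
```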
The Livermore Loops benchmark measures the time
required to perform 24 inner loops taken from important pro-
duction codes run at Livermore. Max, min, median, and har-
monic mean KFLOPS rates are reported above for data vectors
of length 468. The KFLOPS rating for loop #6 is also
reported; it has been identified by Patterson as the single
loop best correlating with overall Livermore Loops perfor-
mance.
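
For readers unfamiliar with the summary statistics, the following
Python sketch shows how max, min, median, and harmonic mean are
computed from a list of per-loop rates. The three rates are made up
for illustration and are not Livermore measurements. The harmonic mean
is the rate obtained when every loop performs the same amount of work,
which is why it sits below the median when a few loops are slow.

```python
import statistics

# Hypothetical per-loop KFLOPS rates, for illustration only.
rates = [100.0, 200.0, 400.0]

summary = {
    "max": max(rates),
    "median": statistics.median(rates),
    "harmonic": statistics.harmonic_mean(rates),
    "min": min(rates),
}

# Cross-check: equal work W per loop gives total work / total time.
W = 1.0
equal_work_rate = (W * len(rates)) / sum(W / r for r in rates)

for name, value in summary.items():
    print(f"{name}: {value:.1f}")
```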
Some vendors prefer to talk about results of the Whet-
stone benchmark, which was synthesized to mimic the instruc-
tion stream created by the Whetstone Algol interpreter of
the 1960's. Hardware and software progress have rendered the
Whetstone benchmark obsolete, but relevance has seldom
affected the science of marketing. At least one of Sun's
competitors has claimed 3000 K Whetstone instructions per
second for single precision, using the same 68020 and
1164/1165, which is an amazing accomplishment. Anyone who
can independently verify such claims should so respond and
explain how it's done!
In the meantime I might consider how to improve Sun's
2300K to 3000K. About half the time in the Whetstone benchmark
is taken by the P3 subroutine, and on an 1164/1165 system
about half the P3 time is consumed by the division
instruction. The most direct way to obtain a substantial
improvement is to get rid of that division! Looking at our
hardware architecture and local compiler optimization, I
can't imagine any incremental improvements that would have
significant effect.
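
The arithmetic behind this can be sketched as follows, in Python
rather than Fortran. It takes the two "about half" figures as exactly
0.5 and assumes the division could be removed at no cost, which is
optimistic because any replacement operation still takes some time.
The function name is illustrative.

```python
def speedup_if_removed(fraction_removed):
    """Overall speedup when a fraction of run time is eliminated entirely."""
    return 1.0 / (1.0 - fraction_removed)


p3_share = 0.5          # about half the Whetstone time is in P3
div_share_of_p3 = 0.5   # about half of P3 is the division
div_share = p3_share * div_share_of_p3   # about a quarter of total time

base_kwips = 2300       # measured single precision result with -ffpa
no_division = base_kwips * speedup_if_removed(div_share)

print(f"division share of total time: {div_share:.2f}")
print(f"KWIPS with the division gone: {no_division:.0f}")
```

Under these assumptions the division accounts for about a quarter of
the run, and removing it moves 2300K to a little over 3000K.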
Certain types of global cross-procedural optimizations
can have a profound impact, however. Since P3's division is
by a global variable whose value happens to be 2.0, in prin-
ciple the division could be converted to a multiplication by
0.5. Another possibility is to expand short procedures such
as P3 inline in the calling code, then notice that the
expanded computation is invariant and could be removed to
the outside of the do loop, leaving an empty loop. Anyone
who built such inline expansion into their compiler would
double their Whetstone scores, and the only cost would be a
substantial diversion of software resources away from other
projects that might actually benefit customers. Since crit-
ical loops in real applications are usually source coded by
the programmer to avoid division by 2.0 or invariant subrou-
tine calls, corresponding optimizations in the compiler sel-
dom pay off in realistic floating point applications, so
Sun's efforts are focused elsewhere.
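
Both transformations can be checked on a transcription of P3. The
sketch below is Python, not the report's Fortran, so it runs in double
precision rather than REAL*4, and the argument values and loop count
are illustrative assumptions. It shows that dividing by 2.0 and
multiplying by 0.5 give the same result, and that the loop's result
does not depend on how many times P3 is called.

```python
T = 0.499975
T2 = 2.0


def p3(x, y):
    """P3 as written: three additions, two multiplications, one division."""
    x1 = x
    y1 = y
    x1 = T * (x1 + y1)
    y1 = T * (x1 + y1)
    return (x1 + y1) / T2


def p3_mul(x, y):
    """P3 with the division by 2.0 replaced by multiplication by 0.5."""
    x1 = T * (x + y)
    y1 = T * (x1 + y)
    return (x1 + y1) * 0.5


def loop_as_written(x, y, n8):
    """The benchmark loop: call P3 n8 times with unchanged arguments."""
    z = None
    for _ in range(n8):
        z = p3(x, y)
    return z


def loop_hoisted(x, y, n8):
    """After inlining and hoisting: one evaluation, nothing left in the loop."""
    return p3(x, y) if n8 > 0 else None


print(p3(1.0, 1.0) == p3_mul(1.0, 1.0))
print(loop_as_written(1.0, 1.0, 1000) == loop_hoisted(1.0, 1.0, 1000))
```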
The moral of this digression is "don't pay much atten-
tion to Whetstone results". If you want a single number to
characterize performance on scientific and engineering prob-
lems, use the usual Linpack benchmark. If you want lots of
numbers, the Livermore loops benchmark provides them. If
you want accuracy and IEEE conformance as well as speed...
that's a topic for another report.
Code fragments from the Whetstone program...
      T  = .499975
      T2 = 2.0
later...
      DO 90 I=1,N8
      CALL P3(X,Y,Z)
   90 CONTINUE
later...
      SUBROUTINE P3(X,Y,Z)
      IMPLICIT REAL*4 (A-H,O-Z)
      COMMON T,T1,T2,E1(4),J,K,L
      X1 = X
      Y1 = Y
      X1 = T * (X1 + Y1)
      Y1 = T * (X1 + Y1)
      Z  = (X1 + Y1) / T2
      RETURN
      END
Note that with the Weitek 1164/1165, the one division takes longer than
the three additions and two multiplications combined...
news@solar.ARPA (01/20/86)
From: vax135!ariel!mtunf!solar!news@ucbvax.berkeley.edu
Relay-Version: version B 2.10 5/3/83; site solar.UUCP
Posting-Version: version B 2.10.3 4.3bsd-beta 6/6/85; site ucbvax.BERKELEY.EDU
Path: solar!orion!mtunf!mtuni!mtunh!ariel!vax135!houxm!mhuxt!mhuxr!ulysses!ucbvax!works
From: GUTFREUND@UMASS-CS.CSNET ("Steven H. Gutfreund")
Newsgroups: mod.computers.workstations
Subject: Network chaos
Message-ID: <8601162259.AA15299@ucbvax.berkeley.edu>
Date: Wed, 15-Jan-86 09:23:00 EST
Article-I.D.: ucbvax.8601162259.AA15299
Posted: Wed Jan 15 09:23:00 1986
Sender: usenet@ucbvax.BERKELEY.EDU
Organization: The ARPA Internet
Lines: 22
Approved: works@red.rutgers.edu
Does anyone have any comments about the new consortium formed to
produce a uniform network model (file servers, print servers, etc.)
among all computer vendors (except IBM)? Bell, DEC, Burroughs, CDC,
etc. announced this effort in the last week.
My feeling was that this was not so much a technical move (there is a
lot of room for innovation in the area of servers, especially when one
realizes that in the future there will be voice-servers,
video-servers, parallel process servers - and locking oneself into a
standard across hybrid operating systems will constrain all operating
system development at these firms).
The actual move seems to have been more marketing focused. I think
everyone was shocked when they woke up and realized that in one day,
IBM could say no, they would not go Ethernet with the PC, they would
go ring - and all the Interlans and 3Coms and Apples of the world
would have to dance to IBM's tune. This effort seems to be aimed at
coming up with a marketing counterforce to keep IBM from making SNA a
de facto standard.
- Steve