[comp.arch] Macho flops versus Megaflops

fouts@orville.nas.nasa.gov (Marty Fouts) (11/07/87)

Kevin Buchs asks why the ETA-10 is advertised at 375 MFLOPS but does 10
MFLOPS on Linpack, and whether other machines such as the Cray 2 have the
same problem.  The answer to the second question is yes; most machines
have a different peak number and "average" number.  I believe that an
EPA-style sticker needs to be put on (super)computer claims:  'Vendor
factory calculations show a maximum performance of X Units.  Use this
number as a guide; your floppage may vary according to programming
style and problem conditions.'

The advertised peak performance number is just that: peak performance.
It is frequently referred to around here as the "guaranteed not to
exceed this speed" number, and is usually obtained (for a
supercomputer) by applying the following logic:

The machine, when running in full-blown vector mode, can pump out one
floating-point result per functional unit every N clock periods.  (We try
to make N 1 also ;-)  It has M functional units which can be active
simultaneously, and a clock period of T nanoseconds.  Therefore, if you
have an application which can be coded to use every functional unit, and
which is entirely vector, you can achieve M / (N * T) results per
nanosecond -- that is, 1000 * M / (N * T) MFLOPS.  This is the rate the
vendor quotes.
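
To make that arithmetic concrete, here is a minimal C sketch of the
calculation; the M, N, and T values below are invented for illustration,
not any vendor's actual specifications:

    #include <stdio.h>

    int main(void)
    {
        double m = 2.0;   /* functional units active simultaneously (made up) */
        double n = 1.0;   /* clock periods per result per unit (made up)      */
        double t = 4.1;   /* clock period in nanoseconds (made up)            */

        /* m results every n*t nanoseconds -> results per second */
        double peak_flops = m / (n * t) * 1.0e9;

        printf("sticker peak = %.1f MFLOPS\n", peak_flops / 1.0e6);
        return 0;
    }

With those made-up numbers the sticker reads about 488 MFLOPS; nothing in
the calculation says anything about how often a real program keeps all the
units busy.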

On a real application, this rate can be slowed by many things.  First
of all, your application isn't entirely vector adds and multiplies; it
has to do other work.  This leads to the vector/scalar trade-off which
Gene Amdahl loves so much -- hot vector computers aren't nearly as hot
when running scalar code.  (A 10 to 1 time ratio between scalar and
vector execution is not uncommon.)  If you have a code which is 10%
scalar and 90% vector on a machine with that 10 to 1 performance ratio,
that code is going to spend as much time doing scalar work as it does
doing vector work.
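
That claim takes only a few lines of C to check; the 10%/90% split and
the 10-to-1 ratio below are just the numbers from the paragraph above:

    #include <stdio.h>

    int main(void)
    {
        double vector_frac = 0.90;  /* fraction of the work that vectorizes */
        double scalar_frac = 0.10;  /* fraction that stays scalar           */
        double slowdown    = 10.0;  /* scalar code runs 10 times slower     */

        /* time for one unit of work, measured in "all-vector" time units */
        double vector_time = vector_frac;
        double scalar_time = scalar_frac * slowdown;
        double total_time  = vector_time + scalar_time;

        printf("vector time %.2f, scalar time %.2f\n", vector_time, scalar_time);
        printf("effective speed = %.0f%% of all-vector speed\n",
               100.0 / total_time);
        return 0;
    }

It prints nearly equal times (0.90 vs. 1.00) and an effective speed of
about 53% of the all-vector rate, which is Amdahl's law doing its usual
damage.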

Secondly, even if your application is all vector, there is probably
some architectural gotcha that will keep the machine from getting peak
performance.  For example, you really need 3 adds and 1 multiply per
result, but the machine has 2 adders and 2 multipliers, so a multiplier
is idle part of the time while the adders handle the extra work load.
There are many of these kinds of gotchas, relating to vector length,
number and type of functional units, and memory reference patterns.  (On
the Cray 2, it can take 1.5 times as long to reference the same memory
bank twice in a row as it does to reference two successive memory banks.)
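
The add/multiply mismatch in that example is easy to put numbers on; a
sketch of the bookkeeping (the unit counts are just the ones above, not
any real machine's):

    #include <stdio.h>

    int main(void)
    {
        /* the example above: 3 adds and 1 multiply needed per result,
           on a machine with 2 adders and 2 multipliers */
        double adds_needed = 3.0, muls_needed = 1.0;
        double adders = 2.0, multipliers = 2.0;

        /* once streaming, each unit retires one operation per cycle,
           so the busiest class of units sets the pace */
        double add_cycles = adds_needed / adders;       /* 1.5 */
        double mul_cycles = muls_needed / multipliers;  /* 0.5 */
        double cycles = (add_cycles > mul_cycles) ? add_cycles : mul_cycles;

        double ops_done     = adds_needed + muls_needed;        /* 4 useful ops    */
        double ops_possible = (adders + multipliers) * cycles;  /* 6 at full speed */

        printf("achieved %.0f%% of peak\n", 100.0 * ops_done / ops_possible);
        return 0;
    }

Two-thirds of peak, before any of the other gotchas even come into play.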

Thirdly, there is the quality of the compiler technology.  The better
a compiler is at detecting optimizable code, the better performance it
can achieve.  Originally, the Cray 2 C compiler would produce about 7
million Whetstones and the Fortran compiler about 15 million.  Now,
the C compiler produces 11 and the Fortran compiler 20.  (Dusty-deck
double precision in all four cases.)

Finally, I/O can do you in.  You might have a machine with a small
physical memory and a backing store, such as the SSD on an X/MP or
virtual memory on a 205, and you have to keep moving your data between
very fast main memory and not-so-fast backing store so that your CPU can
get to it.  (Small is relative; an X/MP 4-16 has 16 MWords = 128 MBytes
of main memory, which is big compared to a PC but small compared to the
2048 MBytes of memory on a Cray 2.  The key point is that all of the
data being crunched doesn't fit.)

And all of these things occur at a gross physical level, so the
programmer has to be painfully aware of them.  I have written simple
loops on the Cray 2 in C or Fortran which get 15-20 MFLOPS and which
can be replaced by hand-coded assembly that gets 150-1200 MFLOPS.
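
For reference, the kind of simple loop I mean is a vector triad like the
one below (a sketch only; the array size, repetition count, and crude
clock() timing are arbitrary, and the rate you measure depends entirely
on the machine and the compiler):

    #include <stdio.h>
    #include <time.h>

    #define N 100000

    static double a[N], b[N], c[N];

    int main(void)
    {
        int i, rep, reps = 1000;
        clock_t t0, t1;
        double secs, mflops;

        for (i = 0; i < N; i++) {
            b[i] = 1.0;
            c[i] = 2.0;
        }

        t0 = clock();
        for (rep = 0; rep < reps; rep++)
            for (i = 0; i < N; i++)
                a[i] = 2.5 * b[i] + c[i];   /* one multiply + one add per element */
        t1 = clock();

        secs   = (double)(t1 - t0) / CLOCKS_PER_SEC;
        mflops = 2.0 * N * reps / secs / 1.0e6;

        /* print a[0] so the compiler can't discard the loop entirely */
        printf("a[0] = %g, rate = %.1f MFLOPS\n", a[0], mflops);
        return 0;
    }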

The bottom line is that the vendor reports the peak speed, while
Linpack, the Livermore Loops, the NAS kernels, Whetstone, et al. report
how well the vendor's compiler technology translates a particular
algorithm into code that runs on the vendor's architecture.  My favorite
pathological case is a C program I wrote which runs twice as fast on a
Vax as on the Cray 2, simply because I coded for pathological behavior
on the 2.

mrc@CAM.UNISYS.COM (Ken Leonard --> ken@oahu.mcl.unisys.com@sdcjove.cam.unisys.com) (11/09/87)

In article <3322@ames.arpa>, fouts@orville.nas.nasa.gov (Marty Fouts) writes:
> Kevin Buchs asks why the ETA-10 is advertised at 375 MFLOPS but does 10
> MFLOPS on Linpack...
> 
> ...
> 
> The bottom line is that the vendor reports the peak speed, while
> Linpack, the Livermore Loops, the NAS kernels, Whetstone, et al. report
> how well the vendor's compiler technology translates a particular
> algorithm into code that runs on the vendor's architecture.  My favorite
> pathological case is a C program I wrote which runs twice as fast on a
> Vax as on the Cray 2, simply because I coded for pathological behavior
> on the 2.

First of all, how many folk believe that such mundane things as hardware
resource distribution/partition/allocation/proportion are at least as
important as clock speed?  If you judge by the number of systems with
per-character comm controllers (vs. DMA) or few-as-possible,
large-as-possible disk stores (vs. many-as-needed, fast-as-possible,
allocated/assigned-per-application structure), you may soon conclude
that the answer is "not very darn many."

Second, you omitted any (direct) reference to operating system architecture--
either as a matter of base family or quality of implementation/port.

UNIX is NOT a "bad system", as I am sometimes accused of saying.  Neither is
MVS, nor GCOS, nor MS-DOS!  But ANY system can be a bad choice for a given
application, and a particular "port" of a particular system may be a
DISASTER for a particular application.

Does a number-cruncher (application) with relatively little I/O run better
on an interrupt-driven kernel, or with an interrupt-stacking kernel?  What
about a disk-cruncher?  What about a comm-cruncher?

Does a kernel with a single paging philosophy suit very many applications
very well?  (HECK, NO!)

Should a compiler care that extreme "registerizing" of variables in a
high-interrupt environment means that the average context-switch time is 
doubled?  Probably, it should (be able to be told to) care VERY much.

Should a compiler care that bank-interleaved allocation of the biggest array
may not be as important as being told which array actually is subject to
the majority of vector operations?

Gee, what ever happened to the idea of considering the whole problem
before leaping to a solution?

regardz to all,
Ken Leonard

!--> ken@oahu.mcl.unisys.com@sdcjove.cam.unisys.com

henry@utzoo.UUCP (Henry Spencer) (11/11/87)

> Gee, what ever happened to the idea of considering the whole problem
> before leaping to a solution?

Same thing that always happens to it:  everybody says "gee, that would be
a great idea if (a) we were only interested in one problem, (b) we knew
in advance what it was like, and (c) we could afford to invest the effort
needed to understand it completely and optimize the whole system for it".
-- 
Those who do not understand Unix are |  Henry Spencer @ U of Toronto Zoology
condemned to reinvent it, poorly.    | {allegra,ihnp4,decvax,utai}!utzoo!henry

muller@alliant.Alliant.COM (Jim Muller) (11/17/87)

In <3322@ames.arpa> fouts@orville.nas.nasa.gov.UUCP (Marty Fouts) writes:
>...most machines have a different peak number and "average" number.
>The advertised peak performance number is frequently referred to around here
>as the "guaranteed not to exceed this speed" number...

He then gave a good explanation of how peak numbers are calculated, and
why they really are "peak" instead of "average sustainable" speed.  His
reasons were quite accurate, and can be summarized into four areas:

>First of all, your application isn't entirely vector adds and multiplies...
>Secondly, there is probably some architectural gotcha that will keep the
>   machine (from) getting peak performance...relating to vector length,
>   number and type of functional units, and memory reference patterns.
>Thirdly, there is the quality of the compiler technology.
>Finally, I/O can do you in.

There is one more item to the story, though.  All of these factors influence
the speed of any given code, but the gap between peak and average speed goes
beyond the quirks of "typical" applications.  In other words, such test codes
should be written so as *not* to trip over these things.

The extra item is something you cannot work around, i.e. the "ramp-up" time
of vector instructions.  Typically, a vector instruction takes N cycles to
load up, followed by M cycles of output.  It is the rate during those M
output cycles that is used for "peak" speeds.  The average sustained speed,
though, is only M / (N + M) of the peak.  If the ramp-up requires half as
many cycles as the vector length, then the sustainable rate will be only 2/3
of the peak rate, EVEN IF THE CODE IS PERFECTLY MATCHED TO THE OTHER
ARCHITECTURAL FEATURES OF THE MACHINE!  It has nothing to do with the four
"real world" factors that Marty explained so well.
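
Written out as arithmetic (a sketch; the values of N and M below are
invented, chosen to match the ramp-up-equals-half-the-vector-length case
just described):

    #include <stdio.h>

    int main(void)
    {
        double m = 64.0;     /* output cycles, one result each (made-up length) */
        double n = m / 2.0;  /* ramp-up cycles before the first result appears  */

        /* "peak" counts only the output phase;
           "sustained" counts the whole instruction */
        double peak      = 1.0;          /* results per cycle during output */
        double sustained = m / (n + m);  /* results per cycle overall       */

        printf("sustained/peak = %.2f\n", sustained / peak);
        return 0;
    }

With the ramp-up set to half the vector length it prints 0.67, i.e. the
2/3 of peak mentioned above.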

"So why not list sustainable rates, instead?  Or give the ramp-up times too?"
you ask.  Simply because it isn't that simple.  The peak rates quoted may be
for the fairly busy triadic vector operations.  Simpler vector operations may
require fewer ramp-up cycles, but still output one datum per cycle.  Yet the
nominal flop-rate (both peak and sustainable) is lower because that operation
is doing less work (e.g. an add is only half as many nominal operations as a
multiply followed by an add).  Inotherwords, there is no single answer.
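
As a toy illustration of why there is no single number (all the ramp-up
figures here are made up, nothing vendor-specific):

    #include <stdio.h>

    int main(void)
    {
        double len        = 100.0;  /* elements per vector instruction (made up) */
        double ramp_triad = 20.0;   /* ramp-up cycles for a triad (made up)      */
        double ramp_add   = 10.0;   /* ramp-up cycles for a plain add (made up)  */

        /* a triad (a = s*b + c) counts as 2 flops per element, a plain add as
           1, even though both stream one result per cycle once started */
        double triad_rate = 2.0 * len / (ramp_triad + len);  /* flops per cycle */
        double add_rate   = 1.0 * len / (ramp_add   + len);

        printf("triad %.2f flops/cycle, add %.2f flops/cycle\n",
               triad_rate, add_rate);
        return 0;
    }

The add starts up faster yet quotes a lower flop rate, simply because it
gets less arithmetic credit per element.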

One thing that machine designers (should) try to do is reduce the ramp-up time
for vector instructions, since this will result in a real-time speedup of the
vector portions of any code.  However, while this improves both the
theoretical sustainable rate and the real throughput rate, it has no impact
on the peak rate.  Thus, the true speed of a machine is obscured before you
ever get into the question of "real world" applications.

Highly tuned, avoid-all-the-architectural-pitfalls codes for the Alliant FX/8
have managed to reach sustained output rates near the *sustainable* rate as
described here.  I have no doubt that other super- and mini-super-computer
builders have done this too.  However, no code will ever go faster than the
sustainable rate, and will never even reach the peak rate, unless you measure
the output rate during the body of a single vector instruction.  BTW, these
highly tuned codes are usually worthless except as academic studies, since
real-life applications are often dominated by the other architectural
weaknesses, i.e. you start from the sustainable rate and work down!

-----------------------------------------------------------------------------
My employer did not sanction this posting, nor did they require or request me
to make this disclaimer.  Thanks for listening.  - Jim

himel@mips.UUCP (Mark I. Himelstein) (11/18/87)

I've wondered, looking at peak versus sustained numbers, whether anything
can be deduced from the difference.  For example, looking at Linpack versus
peak, it appeared to me that the smaller the difference between the two, the
more reasonably priced the machine was (most of the time).  It may be that
achieving that leading-edge performance on sustained numbers requires
extravagant hardware, leading to higher cost.

Mark I. Himelstein

Disclaimer: This posting only contains my ideas, not my employer's.

lamaster@pioneer.arpa (Hugh LaMaster) (11/19/87)

In article <860@alliant.Alliant.COM> muller@alliant.UUCP (Jim Muller) writes:

>
>The extra item is something you cannot work around, i.e. the "ramp-up" time
>of vector instructions.  Typically, vector instructions take N cycles to load
>up, followed by M cycles with output.  It is the rate of the M outputs that is
(discussion of vector start up times omitted)

Followers of the megaflops sweepstakes may be interested to know that the
latest (Nov 3) Dongarra report now lists the ETA-10 (both the nitrogen-cooled
10.5 ns and the air-cooled 24 ns models).  The ETA-10 is currently the
fastest full-precision compiled-Fortran machine, at 52 MFLOPS, displacing the
previous record holder, the NEC SX-2 (a mere 43 MFLOPS).  The 24 ns ETA10-P
rates a respectable 23 MFLOPS (faster than a Cray-2).  Further results are
needed to confirm the performance, but it does indicate that ETA has been
able to significantly reduce vector start-up times compared with the Cyber
205, which only ran at 17 MFLOPS (20 ns clock).

  Hugh LaMaster, m/s 233-9,  UUCP {topaz,lll-crg,ucbvax}!
  NASA Ames Research Center                ames!pioneer!lamaster
  Moffett Field, CA 94035    ARPA lamaster@ames-pioneer.arpa
  Phone:  (415)694-6117      ARPA lamaster@pioneer.arc.nasa.gov

(Disclaimer: "All opinions solely the author's responsibility")