[comp.arch] linpack

rpeglar@eta.ETA.COM (Rob Peglar) (11/05/87)

A recent posting read as follows:


>  
>  I have seen it many times in the Media.  The 10-P is rated
>  at 375 Mflops, but does Linpack at 25.  Why the big difference?
>  Do other supercomputers behave similarly?  What does a Cray 2
>  do on Linpack?
>  -- 
>  Kevin Buchs   3500 Zycad Dr. Oakdale, MN 55109  (612)779-5548
>  Zycad Corp.   {rutgers,ihnp4,amdahl,umn-cs}!meccts!nis!zycad!kjb


1)  The 10-P rating at 375 Mflops is PEAK rate.  This means long (500
elements, for example) vectors of 32-bit precision running in a linked
triad computation (e.g. A = B*C + D, all vectors).  The quoted example
of Linpack does not perform at peak rates.
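
To make the peak-rate assumption concrete, the rating corresponds to a loop
of roughly this shape (a sketch only; the array contents are filler, and the
500-element length and 32-bit REALs follow the example above):

      PROGRAM TRIAD
C     Linked triad A = B*C + D over whole vectors in 32-bit precision,
C     the kind of loop the 375 Mflops peak rating assumes.
      INTEGER N, I
      PARAMETER (N = 500)
      REAL A(N), B(N), C(N), D(N)
      DO 10 I = 1, N
         B(I) = 1.0
         C(I) = 2.0
         D(I) = 3.0
 10   CONTINUE
C     Two flops (one multiply, one add) per element, fully vectorizable.
      DO 20 I = 1, N
         A(I) = B(I)*C(I) + D(I)
 20   CONTINUE
      WRITE (*,*) 'A(1) =', A(1)
      END

Linpack's inner loops are shorter, run in 64-bit precision, and are
interleaved with scalar pivoting code, so it cannot approach that rate.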

2)  Yes, other supercomputers behave similarly, in a loose sense.  The
ETA architecture is quite different from, for example, a Cray X-MP, but
both machines' Linpack ratings will be lower than their peak ratings.

3)  What does a Cray-2 do on Linpack?  Dr. Dongarra's report dated
October 30, 1986 (about a year old now) shows the Cray-2 running at
15 Mflops in one processor using CFT2.7, rolled BLAS.  No listing for
coded BLAS.  

*** It must be noted that the published rates on the ETA10-P (and all
other models of ETA processors) are for 100x100 Linpack, 64-bit (full)
precision, all Fortran.  ***  (The Cray-2 number above is the same
calculation.)
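
For anyone who has not looked inside the benchmark, "rolled BLAS" simply
means the straightforward Fortran loop, left for the compiler to vectorize.
A simplified sketch of that kernel (unit stride only; the real DAXPY also
takes increment arguments):

      SUBROUTINE DAXPY(N, DA, DX, DY)
C     Rolled level-1 BLAS kernel: DY = DY + DA*DX in 64-bit precision.
C     The 100x100 Linpack spends nearly all of its time in this loop,
C     so the published numbers largely measure how well a compiler and
C     the memory system handle it.
      INTEGER N, I
      DOUBLE PRECISION DA, DX(N), DY(N)
      IF (N .LE. 0) RETURN
      DO 10 I = 1, N
         DY(I) = DY(I) + DA*DX(I)
 10   CONTINUE
      RETURN
      END

"Coded BLAS" means this routine is replaced by a hand-tuned (typically
assembly-coded) version, which the report lists separately.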

Many articles in the literature explain the various supercomputer
architectures.  Linpack results can only be understood by studying them.

BTW, the Cray X-MP single processor rates at 24 Mflops, using CFT 1.13,
rolled BLAS.  Remember, these numbers vary with compilers' astuteness.
Using CFT 1.15, for example, may give better numbers.  Try it out.

The ETA10-P rating used the ETA Fortran 77 compiler 1.0.   

Rob Peglar
ETA Systems, Inc.

Disclaimer:  The above is not an opinion of ETA Systems, Inc.  It's the truth.

preston@titan.rice.edu (Preston Briggs) (10/18/89)

In article <9079@batcomputer.tn.cornell.edu> kahn@tcgould.tn.cornell.edu writes:

>Throw away ALL your copies of the LINPACK 100x100 benchmark if you
>are interested in supercomputers.  The 300x300 is barely big enough

Danny Sorenson mentioned recently that linpack is sort of intended
to show how *bad* a computer can be.  The sizes are kept
deliberately small so that the vector machines barely have a chance
to get rolling.

So, big if you're optimistic; small otherwise.

Preston Briggs

kahn@batcomputer.tn.cornell.edu (Shahin Kahn) (10/19/89)

In article <2203@brazos.Rice.edu> preston@titan.rice.edu (Preston Briggs) writes:
>In article <9079@batcomputer.tn.cornell.edu> kahn@tcgould.tn.cornell.edu writes:
>>Throw away ALL your copies of the LINPACK 100x100 benchmark if you
>>are interested in supercomputers.  The 300x300 is barely big enough
>Danny Sorenson mentioned recently that linpack is sort of intended
>to show how *bad* a computer can be.  The sizes are kept
>deliberately small so that the vector machines barely have a chance
>to get rolling.

It certainly is biased towards micros with limited memory and is
absolutely irrelevant as a *supercomputer* application.  Yes, it
can show how bad a supercomputer can be.  I don't believe, however,
that it was *intended* to do that.  My theory is that the size was
set smallish because a 100x100 matrix wasn't considered so small
back then, and the algorithm is suboptimal because that's what
level-1 BLAS and the LINPACK library did back then.  There were
people who were implementing equivalents of level-2 BLAS and even
the emerging LAPACK kernels, but those didn't get blessed by the
Linpack benchmark until the 300x300 and finally the 1000x1000 were
included.
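
To make the level-1 vs. level-2 distinction concrete, here is a
hypothetical sketch (not the actual LINPACK or LAPACK source) of one
elimination step written as a single level-2 operation:

      SUBROUTINE ELIMK(N, K, A, LDA)
C     Hypothetical k-th elimination step of LU factorization, written
C     as one rank-1 update (DGER in the level-2 BLAS).  Assumes the
C     multipliers A(I,K)/A(K,K) are already stored in column K and
C     pivoting is omitted.  The original LINPACK instead makes one
C     level-1 DAXPY call per column J, hiding the outer loop from
C     the library.
      INTEGER N, K, LDA, I, J
      DOUBLE PRECISION A(LDA,N)
      DO 20 J = K + 1, N
         DO 10 I = K + 1, N
            A(I,J) = A(I,J) - A(I,K)*A(K,J)
 10      CONTINUE
 20   CONTINUE
      RETURN
      END

Handing the whole update to the library as one call is what lets it be
blocked for vector registers or cache, which is much of the reason the
larger sizes with level-2 and level-3 kernels look so much better.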

DON'T compare supercomputers with the 100x100 linpack.
If you must use linpack, at least use the 300x300!  And then
submit 37 copies of the program at once and see which machine
does 37 copies the fastest, and which one allows you to use the
keyboard while it's running them!!

Read Jack Worlton's paper in Supercomputer Review of Dec. 88 or Jan. 89,
and then if you still want to use linpack, at least you'll do it
with better knowledge.

mccalpin@masig3.masig3.ocean.fsu.edu (John D. McCalpin) (10/19/89)

In article <9079@batcomputer.tn.cornell.edu> kahn@tcgould.tn.cornell.edu 
writes:
>Throw away ALL your copies of the LINPACK 100x100 benchmark if you
>are interested in supercomputers.  The 300x300 is barely big enough 

In article <2203@brazos.Rice.edu> preston@titan.rice.edu (Preston Briggs) 
writes:
>Danny Sorenson mentioned recently that linpack is sort of intended
>to show how *bad* a computer can be.  The sizes are kept
>deliberately small so that the vector machines barely have a chance
>to get rolling.

In article <9089@batcomputer.tn.cornell.edu> kahn@batcomputer.tn.cornell.edu 
(Shahin Kahn) writes:
>It certainly is biased towards micros with limited memory and is
>absolutely irrelevant as a *supercomputer* application.  Yes, it
>can show how bad a supercomputer can be.

Well, I'll throw in my $0.02 of disagreement with this thread.  It
has been my experience that the poor performance of the LINPACK
100x100 test on supercomputers is *entirely typical* of what users
actually run on the things.  There are plenty of applications burning up
Cray, Cyber 205, and ETA-10 cycles which have average vector lengths
*shorter* than the average of 66 elements for the LINPACK test, and
which are furthermore loaded down with scalar code.
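
That 66 is easy to reproduce: in factoring an n x n matrix, step k makes
n-k DAXPY calls, each on a vector of length n-k, so the call-weighted
mean length works out to roughly 2n/3.  A quick check of the arithmetic
(my sketch, not part of the benchmark):

      PROGRAM AVLEN
C     Call-weighted average DAXPY vector length in an n x n LU
C     factorization: step k issues n-k calls of length n-k each,
C     so the mean is sum(m**2)/sum(m), about 2n/3 (66.3 for n = 100).
      INTEGER N, K, M
      DOUBLE PRECISION CALLS, WORK
      N = 100
      CALLS = 0.0D0
      WORK  = 0.0D0
      DO 10 K = 1, N - 1
         M = N - K
         CALLS = CALLS + DBLE(M)
         WORK  = WORK  + DBLE(M)*DBLE(M)
 10   CONTINUE
      WRITE (*,*) 'average vector length =', WORK/CALLS
      END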

The 100x100 test case is not representative of *everyone's* jobs, but
it is not an unreasonable "average" case, either.  I think it is much
more representative of what most users will see than the 1000x1000
case, for example.  The 1000x1000 case is a very good indicator of
what the *best* performance will be with *careful optimization* on
codes that are essentially 100% vectorizable.  Most of the
supercomputer workload does not fall into that category....
--
John D. McCalpin - mccalpin@masig1.ocean.fsu.edu
		   mccalpin@scri1.scri.fsu.edu
		   mccalpin@delocn.udel.edu

mcdonald@uxe.cso.uiuc.edu (10/20/89)

>Well, I'll throw in my $0.02 of disagreement with this thread.  It
>has been my experience that the poor performance of the LINPACK
>100x100 test on supercomputers is *entirely typical* of what users
>actually run on the things.  There are plenty of applications burning up
>Cray, Cyber 205, and ETA-10 cycles which have average vector lengths
>*shorter* than the average of 66 elements for the LINPACK test, and
>which are furthermore loaded down with scalar code.

>The 100x100 test case is not representative of *everyone's* jobs, but
>it is not an unreasonable "average" case, either.  I think it is much
>more representative of what most users will see than the 1000x1000
>case, for example.  The 1000x1000 case is a very good indicator of

I heartily agree with that: I have two projects that I tried on
a big supercomputer: one diagonalizes zillions of 20x20 matrices,
the other wants to diagonalize a single 1000000x1000000 matrix.
Neither was suitable for a Cray!

Doug McDonald 

swarren@eugene.uucp (Steve Warren) (10/20/89)

In article <2203@brazos.Rice.edu> preston@titan.rice.edu (Preston Briggs) writes:
>In article <9079@batcomputer.tn.cornell.edu> kahn@tcgould.tn.cornell.edu writes:
>
>>Throw away ALL your copies of the LINPACK 100x100 benchmark if you
>>are interested in supercomputers.  The 300x300 is barely big enough
>
>Danny Sorenson mentioned recently that linpack is sort of intended
>to show how *bad* a computer can be.  The sizes are kept
>deliberately small so that the vector machines barely have a chance
>to get rolling.
>
>So, big if you're optimistic; small otherwise.
>
>Preston Briggs


That is sort of like testing a dump-truck on a slalom course.  A vector
machine should have balanced performance on scalar code, but you are
buying vector performance primarily.

So, big if you want to see if it can do what you are paying for, small
if you want to see if it can do anything else.

--Steve
-------------------------------------------------------------------------
	  {uunet,sun}!convex!swarren; swarren@convex.COM

lamaster@ames.arc.nasa.gov (Hugh LaMaster) (10/21/89)

In article <MCCALPIN.89Oct19090641@masig3.masig3.ocean.fsu.edu> mccalpin@masig3.masig3.ocean.fsu.edu (John D. McCalpin) writes:
>In article <9079@batcomputer.tn.cornell.edu> kahn@tcgould.tn.cornell.edu 
>writes:
>>Throw away ALL your copies of the LINPACK 100x100 benchmark if you
>>are interested in supercomputers.  The 300x300 is barely big enough 
>
>In article <2203@brazos.Rice.edu> preston@titan.rice.edu (Preston Briggs) 
>writes:
>>Danny Sorenson mentioned recently that linpack is sort of intended
>>to show how *bad* a computer can be.  The sizes are kept
>>deliberately small so that the vector machines barely have a chance
>>to get rolling.
>
>In article <9089@batcomputer.tn.cornell.edu> kahn@batcomputer.tn.cornell.edu 
>(Shahin Kahn) writes:
>>It certainly is biased towards micros with limited memory and is
>>absolutely irrelevant as a *supercomputer* application.  Yes, it
>>can show how bad a supercomputer can be.

I found this particularly amusing.  As a longtime defender of Linpack, I have
often been accused of being biased towards big vector machines, because of
the sensitivity of Linpack to memory and FPU bandwidth, and, particularly,
the ability to stream from memory to FPU and back to memory.  Now, this
happens to be a very important property of a CPU for effectively running
many of the codes I have seen over the years.  I never rate machines on
the basis of Linpack in absolute terms, but you can tell a lot about a
machine with low Linpack numbers.  I never could understand why people
bought 11/780's, for example :-)


>Well, I'll throw in my $0.02 of disagreement with this thread.  It
>has been my experience that the poor performance of the LINPACK
>100x100 test on supercomputers is *entirely typical* of what users
>actually run on the things.

I agree that vector startup time is extremely important, and Linpack is a
fairly "nice" program with respect to average vector length, so if vector
startup time is long enough to slow it down noticeably, that matters to
users.  On the other hand, the performance is not so poor as it once was.
See below.

> There are plenty of applications burning up
>Cray, Cyber 205, and ETA-10 cycles which have average vector lengths
>*shorter* than the average of 66 elements for the LINPACK test, and
>which are furthermore loaded down with scalar code.

I note, at this point, that the ~7 ns (~142 MHz) ETA10G achieved the fastest
single processor Linpack score of 93 MFLOPS, or, .65 FLOPs/cycle.  The
Cyber 205, using earlier compilers, achieved only 17 MFLOPS, on a 20 ns
clock, or, .34 FLOPs/cycle.  The Cray Y-MP gets .50 FLOPs/cycle, while
the Cray 1/S (in 1983) got only .15 FLOPs/cycle.   The same Cray 1/S today
gets .34 FLOPs/cycle.  (It has less memory bandwidth than the Cray X-MP
and Y-MP, so you can see this effect clearly.)  The Cray X-MP/Y-MP and ETA
machines are capable of achieving around 2 FLOPs/cycle in hardware.  My point is
that there has been considerable improvement in both hardware and software
and startup time penalties have been correspondingly reduced.
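
(The per-cycle figures above are just MFLOPS divided by the clock rate in
MHz, i.e. by 1000 over the period in nanoseconds.  A sketch of the
arithmetic for the ETA10 case quoted above:)

      PROGRAM FPC
C     FLOPs per cycle = MFLOPS / MHz, with MHz = 1000 / (period in ns).
C     For the 93 MFLOPS, ~7 ns result quoted above this gives about .65.
      DOUBLE PRECISION MFLOPS, PERNS, CLKMHZ
      MFLOPS = 93.0D0
      PERNS  = 7.0D0
      CLKMHZ = 1000.0D0 / PERNS
      WRITE (*,*) 'FLOPs per cycle =', MFLOPS / CLKMHZ
      END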

What is the relevance of Linpack today?  Well, it still has *some* of the
same significance that it always had, but tells less than it used to.  When
caches were small, you could extrapolate the 100x100 results to bigger jobs
without worrying.  On the big iron, your performance went *up* with larger
problem sizes, so even if 300x300x300 was typical of your problem, you knew
what to expect.  Now, with 100x100 fitting in some small caches, you need to
run a bigger job to make sure performance doesn't go *down* dramatically.
(Which it does on some micro based systems, of course.)  On the other hand, if
you switch to 300x300, you lose the information contained in the 100x100 case
wrt startup time.  So, good numbers tell you even less than they did before,
but bad numbers, in a sense, tell you even more, for the same reason.  
I wouldn't buy a machine with a bad Linpack result to do these kinds of
problems, but I would look hard at the set of machines with good results,
and would look further, to see which one was the best for the job at hand.
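
For scale (my arithmetic, not from the postings above), the matrix sizes
work out as follows, which is why 100x100 can now live in a small cache
while the larger cases generally cannot:

      PROGRAM WSET
C     Rough working-set size of the Linpack matrix at the three
C     standard sizes, assuming 8-byte (64-bit) elements and ignoring
C     workspace.
      INTEGER NSIZE(3), NB, I
      DATA NSIZE / 100, 300, 1000 /
      DO 10 I = 1, 3
         NB = 8*NSIZE(I)*NSIZE(I)
         WRITE (*,*) NSIZE(I), ' x ', NSIZE(I), ' = ', NB, ' bytes'
 10   CONTINUE
      END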

Sometimes I use a "grep" benchmark just for fun.  The Cray Y-MP still greps
faster than any other machine I have tested, but, I agree, it isn't the
world's most cost-effective grepper :-)  As with all benchmarks, you have
to be careful not to fool yourself...  I would guess that an amd29000-based
system might be the fastest on that particular test.

  Hugh LaMaster, m/s 233-9,  UUCP ames!lamaster
  NASA Ames Research Center  ARPA lamaster@ames.arc.nasa.gov
  Moffett Field, CA 94035     
  Phone:  (415)694-6117