[comp.sys.super] ATTACK OF THE KILLER MICROS

mccalpin@perelandra.cms.udel.edu (John D. McCalpin) (07/31/90)

		ATTACK OF THE KILLER MICROS --- AGAIN....
	     (a summary of my recent postings in comp.arch)

ABSTRACT:
--------
The price and performance of the new IBM RS/6000-based workstations
have forced me to reconsider my position on the roles of dedicated
microprocessor-based machines versus the use of shared supercomputer
facilities for floating-point intensive scientific computing.  There
are large classes of problems, traditionally run on supercomputers,
that can now be solved more cost-effectively and often with faster
wall-clock turnaround on IBM's RS/6000-based machines.

DISCLAIMER:
----------- 
I have no financial interest in *any* of the computer companies
mentioned in this text.  I am not an official spokesman for anyone ---
these comments are my own.  Most of the numbers in this text are
approximate, though I have made some effort to document sources.
Corrections are welcomed, and I will gladly make retractions or
clarifications of any severe errors.


LINPACK 1000x1000 MFLOPS PER DOLLAR:
------------------------------------
Stardent has recently been running an advertisement in Supercomputing
Review which uses the measure of LINPACK 1000x1000 MFLOPS per Dollar
to evaluate several computers --- specifically the IBM 3090/180VF, the
Alliant FX/80, the SGI 4D/240, and the Stardent 3040.

Just for the hell of it, I decided to put the IBM Power Station 320
and the Cray Y/MP on the chart, and have reproduced the expanded chart
below.

	Machine			MFLOPS		Price		Ratio
	--------------------------------------------------------------
   ****	IBM Power Station 320	13.26		  13,000	 36.6 ****(1)
	Stardent 3040		77		 162,500	 17.0
	SGI 4D/240		17		 158,000	  3.9
	Alliant FX/80		69		 650,000	  3.8
   ****	Cray Y/MP-8	      2144	      18,000,000	  4.3 ****(2)
   **** Cray Y/MP	       324	       2,200,000	  5.3 ****(2)
	IBM 3090/180VF		92	       3,300,000	  1.0
	--------------------------------------------------------------
	- Unmarked numbers are from the Stardent advertisement, and 
	  the MFLOPS values appear to be in agreement with Dongarra's 
	  LINPACK benchmark summary dated May 30, 1990.
	- Note that the Cray prices are revised from my previous posting!

Notes:

(1) The 13.26 MFLOPS on the IBM 320 was observed by me, using an 
8-column block-mode solver written in FORTRAN by Earl Killian at 
MIPS (earl@mips.com).  The standard version of LINPACK with unrolled 
BLAS runs at 8.4 MFLOPS.
   Note that the 1000x1000 LINPACK benchmark specifically allows any
solver code to be run, and most (if not all) of the above results
utilize highly optimized solvers -- not necessarily written in
FORTRAN. 

   The $13,000 configuration includes no monitor or graphics adapter,
etc.  It is strictly a server, configured with 16 MB RAM and 120 MB
disk.  NFS is used to store results directly onto the disks of my graphics
workstation.  The price is a list price quote from the Wilmington, DE,
IBM office.
   IBM's prices for memory are a bit steep --- almost $600/MB (list)
--- but several 3rd-parties are already at work on cloning the memory
boards, which should drop the price to under $200/MB.  The machine can
be configured with up to 32 MB (4 MW) using 1 Mbit technology and 128
MB (16 MW) using 4 Mbit technology.  I do not believe that memory
boards based on 4 Mbit chips are shipping yet.

(2) The Cray prices are from a personal communication with a source
inside Cray.  I don't know if he wants me to use his name, so I will
leave it out of this public posting.
   These prices improve the Cray's ratios over the values derived from
the higher prices I quoted in my comp.arch articles, though the
changes were less than a factor of 2.

   The Cray performance numbers are from the May 30, 1990 LINPACK
benchmark report, reprinted in Supercomputing Review.  Please note
that the Y/MP single-processor performance is better than on my
previous posting (which was incorrect).
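The Ratio column in the table above is simply MFLOPS per dollar,
normalized so that the IBM 3090/180VF comes out at 1.0.  A quick
sketch of that arithmetic (figures copied from the table; prices are
approximate list prices):

```python
# MFLOPS-per-dollar ratios, normalized to the IBM 3090/180VF.
# (MFLOPS, price-in-dollars) pairs come from the table above.
machines = {
    "IBM Power Station 320": (13.26,     13_000),
    "Stardent 3040":         (77,       162_500),
    "SGI 4D/240":            (17,       158_000),
    "Alliant FX/80":         (69,       650_000),
    "Cray Y/MP-8":           (2144,  18_000_000),
    "Cray Y/MP":             (324,    2_200_000),
    "IBM 3090/180VF":        (92,     3_300_000),
}

baseline = 92 / 3_300_000          # MFLOPS/$ for the 3090/180VF

def ratio(mflops, price):
    """MFLOPS per dollar, relative to the 3090/180VF baseline."""
    return (mflops / price) / baseline

for name, (mflops, price) in machines.items():
    print(f"{name:24s} {ratio(mflops, price):5.1f}")
```

Running this reproduces the Ratio column to one decimal place
(36.6 for the IBM 320, 4.3 and 5.3 for the two Cray entries, etc.).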



PERFORMANCE ON MY APPLICATION CODES:
------------------------------------
My application codes are two and three-dimensional ocean models, using
various combinations of finite-difference, finite-element, and spectral
collocation methods.  The two 3D codes (SPEM and McModel) are highly 
vectorized, each running at speeds in excess of 120 MFLOPS on a single
CPU of the Cray Y/MP.  At least one of the codes (McModel) is also
highly parallelizable, with a speedup of about 6 estimated for 8 cpus
(observed speedup was 4.8 on 6 cpus).  The 2D code currently has a
scalar divide bottleneck.

Based on the results of the two 3D codes, I estimate the performance
of the IBM 320 as 1/25 of a single cpu of the Cray Y/MP.  The code
with the scalar bottleneck runs on the IBM 320 at 1/3 the speed of the
Y/MP.

I should note that these jobs are *definitely not* cache-contained.
They are basically composed of lots of consecutive dyadic vector
operations with very few reduction operations.  No effort has been
made to optimize these codes for cached-memory machines.  Of course,
all calculations are done in 64-bit precision.

The startling conclusion from this is that even for fully vectorizable
application codes I can get 1/25 of a Cray Y/MP for under $10,000
(with University discounts).  This is equivalent to one Cray Y/MP
hour/calendar day, or 30 Cray hours/month, or 360 Cray hours/year.  I
don't believe that I can get allocations that large at the national
supercomputing centers, and if I did, then having the calculations
done locally would still have the advantage of convenience.
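The equivalence above is simple arithmetic: a machine running around
the clock at 1/25 the speed of one Y/MP cpu delivers a bit under one
Cray-cpu-hour of work per day.  A sketch (all figures approximate):

```python
# Work delivered by a dedicated machine running 24 hours/day
# at 1/25 the speed of a single Cray Y/MP cpu.
speed_fraction = 1.0 / 25.0        # IBM 320 relative to one Y/MP cpu

cray_hours_per_day = 24 * speed_fraction          # ~0.96, call it 1
cray_hours_per_month = cray_hours_per_day * 30    # ~29
cray_hours_per_year = cray_hours_per_day * 365    # ~350

print(cray_hours_per_day, cray_hours_per_month, cray_hours_per_year)
```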



WALL-CLOCK TURNAROUND:
----------------------
An anecdote:
  I recently submitted a proposal to the NSF to do some cpu-intensive
studies of the equations governing a theoretical two-dimensional
ocean.  The calculations are estimated to require 200 hours of Cray
Y/MP time.  I don't consider this a trivial expenditure....
With an IBM 320, I would probably be able to finish all of the 
calculations before the proposal even completes the review process!

Is this crazy?  I don't think so.  It can easily take 2 months to
write a proposal, and then 6-8 months before any funding becomes
available.  Then a separate proposal for supercomputer time must be
written, and then the jobs must actually be run through the job
queues.  It is easy to see that obtaining 200 hours of time can take
in excess of 12 months, while equivalent time on an IBM 320 can be
obtained in about 8 months, and at significantly lower cost.




LIMITATIONS OF THE DEDICATED PROCESSOR APPROACH:
------------------------------------------------
(1) Memory

The configuration that I quoted has a rather small memory by current
supercomputer standards, but 2 MW (64-bit) can be quite useful for
many computationally-intensive problems that are currently run on
supercomputers.  As soon as 3rd-party vendors start delivering memory
boards at competitive prices, the machine will be upgradable to 4 MW
(64-bit) for a few thousand dollars.  Since the machine was designed
to accept 4 Mbit technology, it is possible to configure it with up to
16 MW of memory.  I expect that it will be a few months before IBM
releases any boards based on 4 Mbit chips, and then a few more months
before clones are available from 3rd parties.  Estimated cost for a
full 128 MB = 16 MW is about $20,000 in addition to the base price of
$8700 for the machine.

(2) Disk

Disk space is still not cheap, but it is a great deal cheaper to buy
760 MB SCSI drives for my workstation than to buy more DD-49's (or
whatever is the current Cray disk) for the Cray.  The last time I
checked, generic SCSI drives were about 1/10 the cost of
top-of-the-line supercomputer drives, and had about 1/4 (or somewhat
less) of the performance.

(3) System administration

For many, this can be important; however, the proliferation of graphics
workstations means that many of us already have the administration
headaches, and one more compute server is not a substantial additional
burden.

(4) Unutilized cycles

It is much more difficult to share cycles on a KILLER MICRO than on a
nationally-shared supercomputer.  Certainly the price-performance
advantage is cancelled if the machine sits idle most of the time.
Fortunately, I have several `inexhaustible' problems for my machine to
work on. 


COST OF SUPERCOMPUTER TIME:
---------------------------
A point which seems insufficiently appreciated within the scientific
community is that time on large supercomputers has a real-world cost
of up to several hundred dollars per hour, split roughly evenly
between depreciation and operations/maintenance/utilities.  Some
sample numbers for a Cray Y/MP-8/864:

Amortization of Purchase price over 4 years:
    $18,000,000/4 years/8 cpus/8000 hours per year = $70/cpu hour

Operating/Utility/Maintenance Expenses:
    $5,000,000 per year/8 cpus/8000 hours per year = $78/cpu hour

PLEASE note that these figures are *approximate*.   I don't think that
I could be as much as a factor of 2 off overall, though.  The $18
Million for the Y/MP comes from Cray, and the $5 Million for
operations comes from some literature from the Ohio Supercomputer
Center dated January, 1990, estimating their expenses for their Cray
Y/MP-8/864 as $22.22 Million for equipment and $4.5 Million for annual
operating expenses (including staff and maintenance).
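For the record, the per-hour figures fall out of the division directly
(a sketch; all dollar figures approximate, as stated above):

```python
# Approximate cost per cpu-hour for a Cray Y/MP-8/864,
# amortizing the purchase over 4 years at 8000 hours/cpu/year.
purchase_price   = 18_000_000   # dollars (figure from Cray)
annual_operating =  5_000_000   # dollars/year (staff, maintenance, utilities)
cpus             = 8
hours_per_year   = 8000         # usable hours per cpu per year

amortization = purchase_price / 4 / cpus / hours_per_year   # ~$70/cpu-hour
operating    = annual_operating / cpus / hours_per_year     # ~$78/cpu-hour
total        = amortization + operating                     # ~$148/cpu-hour

# At roughly $150/hour, a 200-hour calculation costs about $30,000.
print(round(amortization), round(operating), round(total))
```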

Back to the subject --- At $150/hour, the 200 hour Cray calculation is
costing us taxpayers about $30,000.  This does not count my salary while
I write the proposals, the time spent by the mail reviewers and the
panels reviewing the proposals, the salaries of the administrators and
paper-pushers, etc, etc, etc....  Compare this to using about $20,000
of hardware for about 1/4 of its useful life span.

Note that a cost of $150/hour is *damned cheap* for Cray time.  I have
not seen any site that actually charges less than 4 times that much,
so I could easily have underestimated the true costs.




SUMMARY:
--------
As many other people have pointed out, the choice of a computational
platform is a multivariate constrained optimization problem.  Some of
the constraints are:
	(1) The cost must be within the available budget.
	    This includes the cost of porting the code as well.
	(2) The wall-clock turnaround must be within the limits
	    of the research project.
	(3) Point (2) usually requires sufficient memory to make
	    the problem core-containable.
	(4) Sufficient mass storage space and access speed must be
	    available to save intermediate and permanent results
	    without slowing down the calculation past the constraints
	    of point (2).

I contend that the introduction of the IBM 320 marks an important jump
in the parameter space of problems that can be dealt with effectively
on "KILLER MICROS".



WHAT FUTURE FOR SUPERCOMPUTERS?
-------------------------------
I do not believe that the preceding discussion in any way diminishes
the usefulness of supercomputers.  I can still get 1500 MFLOPS
performance levels on one of my codes on an 8-cpu Y/MP.  What it does
shift is the *length* of the jobs for which the faster machine is
*required*.  Since I work on projects with annual sorts of time scales,
and am willing to run a calculation for 6 months or so, the Cray is
only going to be required if I need more than 180 Cray hours in a
6-month period.

There are a number of "Grand Challenge" sorts of projects that require
that sort of investment in time, but the dividing line of what
projects can be done in my office vs what projects must be done at a
remote supercomputer site is shifting rapidly toward the largest of
projects.  I was pleased to note that the Ohio Supercomputer Center
makes minimum allocations of 400 hours in its visitors program.

Perhaps the biggest problem associated with this shift is that fewer
and fewer people will see the need to dedicate themselves to becoming
truly proficient in making state-of-the-art supercomputers run
effectively.  If I had seen the trends more clearly 5-6 years ago, I
doubt that I would have invested the significant time in
supercomputers that I ended up actually investing.  I also would not
have invested so much of that time with ETA equipment, but that is
another story.  :-(

--
John D. McCalpin			mccalpin@perelandra.cms.udel.edu
Assistant Professor			mccalpin@vax1.udel.edu
College of Marine Studies, U. Del.	J.MCCALPIN/OMNET