mccalpin@perelandra.cms.udel.edu (John D. McCalpin) (07/31/90)
ATTACK OF THE KILLER MICROS --- AGAIN....
(a summary of my recent postings in comp.arch)

ABSTRACT:
--------
The price and performance of the new IBM RS/6000-based workstations
have forced me to reconsider my position on the roles of dedicated
microprocessor-based machines versus the use of shared supercomputer
facilities for floating-point intensive scientific computing.  There
are large classes of problems that have traditionally been performed
on supercomputers that can now be performed more cost-effectively,
and often with faster wall-clock turnaround, on IBM's RS/6000-based
machines.

DISCLAIMER:
-----------
I have no financial interest in *any* of the computer companies
mentioned in this text.  I am not an official spokesman for anyone
--- these comments are my own.  Most of the numbers in this text are
approximate, though I have made some effort to document sources.
Corrections are welcomed, and I will gladly make retractions or
clarifications of any serious errors.

LINPACK 1000x1000 MFLOPS PER DOLLAR:
------------------------------------
Stardent has recently been running an advertisement in Supercomputing
Review which uses the measure of LINPACK 1000x1000 MFLOPS per dollar
to evaluate several computers --- specifically the IBM 3090/180VF,
the Alliant FX/80, the SGI 4D/240, and the Stardent 3040.  Just for
the hell of it, I decided to put the IBM Power Station 320 and the
Cray Y/MP on the chart, and have reproduced the expanded chart below.

     Machine                  MFLOPS      Price ($)    Ratio
     --------------------------------------------------------------
**** IBM Power Station 320     13.26         13,000     36.6 ****(1)
     Stardent 3040             77           162,500     17.0
     SGI 4D/240                17           158,000      3.9
     Alliant FX/80             69           650,000      3.8
**** Cray Y/MP-8             2144        18,000,000      4.3 ****(2)
**** Cray Y/MP                324         2,200,000      5.3 ****(2)
     IBM 3090/180VF            92         3,300,000      1.0
     --------------------------------------------------------------
     ("Ratio" is MFLOPS per dollar, normalized so that the IBM
      3090/180VF = 1.0.)

- Unmarked numbers are from the Stardent advertisement, and the
  MFLOPS values appear to be in agreement with Dongarra's LINPACK
  benchmark summary dated May 30, 1990.
- Note that the Cray prices are revised from my previous posting!

Notes:
(1) The 13.26 MFLOPS on the IBM 320 was observed by me, using an
    8-column block-mode solver written in FORTRAN by Earl Killian at
    MIPS (earl@mips.com).  The standard version of LINPACK with
    unrolled BLAS runs at 8.4 MFLOPS.  Note that the 1000x1000
    LINPACK benchmark specifically allows any solver code to be run,
    and most (if not all) of the above results use highly optimized
    solvers --- not necessarily written in FORTRAN.  (A sketch of
    the blocking idea follows these notes.)

    The $13,000 configuration includes no monitor, graphics adapter,
    etc.  It is strictly a server, configured with 16 MB RAM and
    120 MB disk.  NFS is used to store results directly onto the
    disks of my graphics workstation.  The price is a list-price
    quote from the Wilmington, DE, IBM office.

    IBM's prices for memory are a bit steep --- almost $600/MB
    (list) --- but several 3rd parties are already at work on
    cloning the memory boards, which should drop the price to under
    $200/MB.  The machine can be configured with up to 32 MB (4 MW)
    using 1 Mbit technology and 128 MB (16 MW) using 4 Mbit
    technology.  I do not believe that memory boards based on 4 Mbit
    chips are shipping yet.

(2) The Cray prices are from a personal communication with a source
    inside Cray.  I don't know if he wants me to use his name, so I
    will leave it out of this public posting.
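For the curious, here is a minimal sketch of the blocking idea
mentioned in note (1).  This is *not* Killian's solver --- the two
subroutines below are made up purely for illustration --- but it
shows why a block-mode elimination beats one-column-at-a-time AXPYs
on a cached machine: the loads and stores of the target column are
amortized over several eliminator columns.

      SUBROUTINE AXPY1(N, A, X, Y)
C     One elimination step at a time: the target column Y is read
C     and written once per eliminator column.
      INTEGER N, I
      DOUBLE PRECISION A, X(N), Y(N)
      DO 10 I = 1, N
         Y(I) = Y(I) + A*X(I)
   10 CONTINUE
      END

      SUBROUTINE AXPY4(N, A1, A2, A3, A4, X1, X2, X3, X4, Y)
C     Four elimination steps fused into one pass (the solver in note
C     (1) fuses eight): the same flops, but Y is loaded and stored
C     once instead of four times.
      INTEGER N, I
      DOUBLE PRECISION A1, A2, A3, A4
      DOUBLE PRECISION X1(N), X2(N), X3(N), X4(N), Y(N)
      DO 20 I = 1, N
         Y(I) = Y(I) + A1*X1(I) + A2*X2(I) + A3*X3(I) + A4*X4(I)
   20 CONTINUE
      END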
These prices improve the Cray's ratios over the values derived from
the higher prices I quoted in my comp.arch articles, though the
changes were less than a factor of 2.  The Cray performance numbers
are from the May 30, 1990 LINPACK benchmark report, reprinted in
Supercomputing Review.  Please note that the Y/MP single-processor
performance is better than in my previous posting (which was
incorrect).

PERFORMANCE ON MY APPLICATION CODES:
------------------------------------
My application codes are two- and three-dimensional ocean models,
using various combinations of finite-difference, finite-element, and
spectral collocation methods.  The two 3D codes (SPEM and McModel)
are highly vectorized, each running at speeds in excess of 120
MFLOPS on a single CPU of the Cray Y/MP.  At least one of the codes
(McModel) is also highly parallelizable, with a speedup of about 6
estimated for 8 cpus (the observed speedup was 4.8 on 6 cpus).  The
2D code currently has a scalar divide bottleneck.

Based on the results of the two 3D codes, I estimate the performance
of the IBM 320 as 1/25 of a single cpu of the Cray Y/MP.  The code
with the scalar bottleneck runs on the IBM 320 at 1/3 the speed of
the Y/MP.

I should note that these jobs are *definitely not* cache-contained.
They are basically composed of long sequences of consecutive dyadic
vector operations with very few reduction operations (see the sketch
after this section).  No effort has been made to optimize these
codes for cached-memory machines.  Of course, all calculations are
done in 64-bit precision.

The startling conclusion from this is that even for fully
vectorizable application codes I can get 1/25 of a Cray Y/MP for
under $10,000 (with University discounts).  This is equivalent to
one Cray Y/MP hour per calendar day, or 30 Cray hours/month, or 360
Cray hours/year.  I don't believe that I can get allocations that
large at the national supercomputing centers, and even if I did,
having the calculations done locally would still have the advantage
of convenience.
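To make the preceding concrete, the two kernels below give the
*shape* of the operations described above --- they are illustrative
stand-ins, not excerpts from the model codes.  With vectors much
longer than the cache, the dyadic form streams every operand from
main memory on every pass, so performance is set by memory bandwidth
rather than by cache size or peak flops.

      SUBROUTINE DYADIC(N, A, B, C)
C     Dyadic vector operation: two vectors in, one vector out.
C     Nothing is reused, so a cache buys little once N is large.
      INTEGER N, I
      DOUBLE PRECISION A(N), B(N), C(N)
      DO 10 I = 1, N
         C(I) = A(I)*B(I)
   10 CONTINUE
      END

      DOUBLE PRECISION FUNCTION REDUCE(N, A, B)
C     Reduction: vectors in, one scalar out, with a recurrence on S.
C     The codes described above contain very few of these.
      INTEGER N, I
      DOUBLE PRECISION A(N), B(N), S
      S = 0.0D0
      DO 20 I = 1, N
         S = S + A(I)*B(I)
   20 CONTINUE
      REDUCE = S
      END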
WALL-CLOCK TURNAROUND:
----------------------
An anecdote: I recently submitted a proposal to the NSF to do some
cpu-intensive studies of the equations governing a theoretical
two-dimensional ocean.  The calculations are estimated to require
200 hours of Cray Y/MP time.  I don't consider this a trivial
expenditure....  With an IBM 320, I would probably be able to finish
all of the calculations before the proposal even completes the
review process!

Is this crazy?  I don't think so.  It can easily take 2 months to
write a proposal, and then 6-8 months before any funding becomes
available.  Then a separate proposal for supercomputer time must be
written, and then the jobs must actually be run through the job
queues.  It is easy to see that obtaining 200 hours of Cray time can
take in excess of 12 months, while the equivalent time on an IBM 320
can be obtained in about 8 months, and at significantly lower cost.

LIMITATIONS OF THE DEDICATED PROCESSOR APPROACH:
------------------------------------------------
(1) Memory
    The configuration that I quoted has a rather small memory by
    current supercomputer standards, but 2 MW (64-bit) can be quite
    useful for many computationally-intensive problems that are
    currently run on supercomputers.  As soon as 3rd-party vendors
    start delivering memory boards at competitive prices, the
    machine will be upgradable to 4 MW (64-bit) for a few thousand
    dollars.  Since the machine was designed to accept 4 Mbit
    technology, it is possible to configure it with up to 16 MW of
    memory.  I expect that it will be a few months before IBM
    releases any boards based on 4 Mbit chips, and then a few more
    months before clones are available from 3rd parties.  The
    estimated cost for a full 128 MB = 16 MW is about $20,000 in
    addition to the base price of $8700 for the machine.

(2) Disk
    Disk space is still not cheap, but it is a great deal cheaper to
    buy 760 MB SCSI drives for my workstation than to buy more
    DD-49's (or whatever the current Cray disk is) for the Cray.
    The last time I checked, generic SCSI drives were about 1/10 the
    cost of top-of-the-line supercomputer drives, and had about 1/4
    (or somewhat less) of the performance.

(3) System administration
    For many, this can be important; however, the proliferation of
    graphics workstations means that many of us already have the
    administration headaches, and one more compute server is not a
    substantial additional burden.

(4) Unutilized cycles
    It is much more difficult to share cycles on a KILLER MICRO than
    on a nationally-shared supercomputer.  Certainly the
    price-performance advantage is cancelled if the machine sits
    idle most of the time.  Fortunately, I have several
    `inexhaustible' problems for my machine to work on.

COST OF SUPERCOMPUTER TIME:
---------------------------
A point which seems insufficiently appreciated within the scientific
community is that time on large supercomputers has a real-world cost
of up to several hundred dollars per hour, split roughly evenly
between depreciation and operations/maintenance/utilities.  Some
sample numbers for a Cray Y/MP-8/864 (a back-of-envelope version of
this arithmetic follows this section):

    Amortization of purchase price over 4 years:
        $18,000,000 / 4 years / 8 cpus / 8000 hours per year
            = $70/cpu-hour
    Operating/Utility/Maintenance expenses:
        $5,000,000 per year / 8 cpus / 8000 hours per year
            = $78/cpu-hour

PLEASE note that these figures are *approximate*.  I don't think
that I could be off by as much as a factor of 2 overall, though.
The $18 Million for the Y/MP comes from Cray, and the $5 Million for
operations comes from some literature from the Ohio Supercomputer
Center dated January, 1990, estimating their expenses for their Cray
Y/MP-8/864 as $22.22 Million for equipment and $4.5 Million for
annual operating expenses (including staff and maintenance).

Back to the subject --- at $150/hour, the 200-hour Cray calculation
is costing us taxpayers about $30,000.  This does not count my
salary while I write the proposals, the time spent by the mail
reviewers and the panels reviewing the proposals, the salaries of
the administrators and paper-pushers, etc., etc., etc....  Compare
this to the use of about $20,000 of hardware for about 1/4 of its
useful life span.

Note that a cost of $150/hour is *damned cheap* for Cray time.  I
have not seen any site that actually charges less than 4 times that
much, so I could easily have underestimated the true costs.
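For those who want to check or vary the assumptions, here is the
arithmetic above as a trivial program.  All of the inputs are the
approximate figures quoted in the text, nothing more.

      PROGRAM CRCOST
C     Back-of-envelope Cray Y/MP-8 cost model using the approximate
C     figures quoted above.
      DOUBLE PRECISION PRICE, OPSYR, YEARS, CPUS, HRSYR
      DOUBLE PRECISION AMORT, OPS, TOTAL
C     purchase price and annual operating/utility/maintenance costs
      PRICE = 18.0D6
      OPSYR = 5.0D6
      YEARS = 4.0D0
      CPUS  = 8.0D0
      HRSYR = 8.0D3
C     amortized purchase price per cpu-hour (about $70)
      AMORT = PRICE/YEARS/CPUS/HRSYR
C     operating expenses per cpu-hour (about $78)
      OPS   = OPSYR/CPUS/HRSYR
      TOTAL = AMORT + OPS
      PRINT *, 'amortization  $', AMORT, ' per cpu-hour'
      PRINT *, 'operations    $', OPS, ' per cpu-hour'
      PRINT *, '200-hour job  $', 200.0D0*TOTAL
      END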
SUMMARY:
--------
As many other people have pointed out, the choice of a computational
platform is a multivariate constrained optimization problem.  Some
of the constraints are:
(1) The cost must be within the available budget.  This includes
    the cost of porting the code as well.
(2) The wall-clock turnaround must be within the limits of the
    research project.
(3) Point (2) usually requires sufficient memory to make the
    problem core-containable.
(4) Sufficient mass storage space and access speed must be
    available to save intermediate and permanent results without
    slowing the calculation past the constraints of point (2).

I contend that the introduction of the IBM 320 marks an important
jump in the parameter space of problems that can be dealt with
effectively on "KILLER MICROS".

WHAT FUTURE FOR SUPERCOMPUTERS?
-------------------------------
I do not believe that the preceding discussion in any way diminishes
the usefulness of supercomputers.  I can still get 1500 MFLOPS
performance levels on one of my codes on an 8-cpu Y/MP.  What it
does shift is the *length* of the jobs for which the faster machine
is *required*.  Since I work on projects with time scales on the
order of a year, and am willing to run a calculation for 6 months or
so, the Cray is only going to be required if I need more than 180
Cray hours in a 6-month period.  There are a number of "Grand
Challenge" sorts of projects that require that sort of investment in
time, but the dividing line between what projects can be done in my
office and what projects must be done at a remote supercomputer site
is shifting rapidly toward the largest of projects.  I was pleased
to note that the Ohio Supercomputer Center makes minimum allocations
of 400 hours in its visitors program.

Perhaps the biggest problem associated with this shift is that fewer
and fewer people will see the need to dedicate themselves to
becoming truly proficient at making state-of-the-art supercomputers
run effectively.  If I had seen the trends more clearly 5-6 years
ago, I doubt that I would have invested the significant time in
supercomputers that I ended up actually investing.  I also would not
have invested so much of that time with ETA equipment, but that is
another story.  :-(
--
John D. McCalpin                      mccalpin@perelandra.cms.udel.edu
Assistant Professor                   mccalpin@vax1.udel.edu
College of Marine Studies, U. Del.    J.MCCALPIN/OMNET