fouts@orville.nas.nasa.gov (Marty Fouts) (11/07/87)
Kevin Buchs asks why the ETA-10 is advertised at 375 MFLOPS but does 10 MFLOPS on Linpack, and whether other machines such as the Cray 2 have the same problem. The answer to the second question is yes; most machines have a different peak number and "average" number. I believe that an EPA sticker needs to be put on (super)computer claims: 'Vendor factory calculations show a maximum performance of X units. Use this number as a guide; your floppage may vary, according to programming style and problem conditions.'

The advertised peak performance number is just that: peak performance. It is frequently referred to around here as the "guaranteed not to exceed this speed" number, and is usually obtained (for a supercomputer) by application of the following logic: the machine, when running in full-blown vector mode, can pump out one floating-point result per functional unit per N clock periods. (We try to make N 1 also ;-) It has M functional units which can be active simultaneously, and a clock period of T nanoseconds. Therefore, if you have an application which can be coded to use every functional unit, and is entirely vector in behavior, you can achieve M / (N * T) FLOPS. This is the rate the vendor quotes.

On a real application, this rate can be slowed by many things. First of all, your application isn't entirely vector adds and multiplies; it has to do other work. This leads to the vector/scalar trade-off which Gene Amdahl loves so much -- hot vector computers aren't nearly as hot when running scalar code. (A 10 to 1 time ratio is not uncommon.) If you have a code which is 10% scalar and 90% vector on a machine which has the 10 to 1 performance ratio, that code is going to spend as much time doing scalar work as it does doing vector work.
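[Both calculations above are easy to sketch in a few lines of Python. The machine parameters below are made up for illustration, not any vendor's actual figures.]

```python
def peak_mflops(m_units, n_clocks, t_ns):
    """Vendor-style peak rate: M functional units, each producing one
    result every N clock periods of T nanoseconds.  M / (N * T) gives
    results per nanosecond; multiply by 1000 to get MFLOPS."""
    return m_units / (n_clocks * t_ns) * 1000.0

def relative_times(vector_fraction, scalar_slowdown):
    """Relative time spent in vector vs. scalar work, if scalar code
    runs `scalar_slowdown` times slower than vector code."""
    t_vector = vector_fraction
    t_scalar = (1.0 - vector_fraction) * scalar_slowdown
    return t_vector, t_scalar

# Hypothetical machine: 2 functional units, 1 clock per result, 4.1 ns clock.
print(peak_mflops(2, 1, 4.1))      # roughly 488 MFLOPS

# 90% vector, 10% scalar, 10-to-1 slowdown: roughly equal time in each.
print(relative_times(0.9, 10))
```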
Secondly, even if your application is all vector, there is probably some architectural gotcha that will keep the machine from getting peak performance: for instance, you really need 3 adds and 1 multiply, but the machine has 2 adders and 2 multipliers, so a multiplier is idle part of the time while the adders handle the extra work load. There are many of these kinds of gotchas, relating to vector length, number and type of functional units, and memory reference patterns. (On the Cray 2, it can take 1.5 times as long to reference the same memory bank twice in a row as it does to reference two successive memory banks.)

Thirdly, there is the quality of the compiler technology. The better a compiler is at detecting optimizable code, the better performance it can achieve. Originally, the Cray 2 C compiler would produce about 7 million whetstones and the Fortran compiler about 15 million. Now, the C compiler produces 11 and the Fortran compiler 20. (Dusty-deck double precision in all four cases.)

Finally, I/O can do you in. You might have a machine with a small physical memory and backing stores, such as SSD on an X/MP or virtual memory on a 205, and you have to keep moving your data between very fast main memory and not-so-fast backing store so that your CPU can get to it. (Small is relative; an X/MP 4-16 has 16 MWord = 128 MByte of main memory, which is big compared to a PC, but small compared to the 2048 MByte of memory on a Cray 2. The key feature is that all of the data being crunched doesn't fit.)

And all of these things occur at a gross physical level, so the programmer has to be painfully aware of them. I have written simple loops on the Cray 2 in C or Fortran which get 15-20 MFlops, which can be replaced by hand-coded assembly and get 150-1200 MFlops. The bottom line is that the vendor reports the peak speed, and Linpack, the Livermore Loops, the NAS kernels, Whetstone, et al.
report how the vendor's compiler technology translates a particular algorithm into code to run on the vendor's architecture. My favorite pathological case is a C program I wrote which runs twice as fast on a Vax as on the Cray 2, simply because I coded for pathological behavior on the 2.
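[The memory-bank gotcha Marty mentions can be modeled in a few lines of Python. The 1.5x same-bank penalty is the figure from the posting; the bank count and strides are assumed for illustration.]

```python
def reference_cost(addresses, n_banks=128, same_bank_penalty=1.5):
    """Toy cost model: a reference to the same bank as the previous
    reference costs `same_bank_penalty` units; any other reference
    costs 1 unit.  n_banks=128 is an assumed figure, not a Cray-2
    specification."""
    cost, prev = 0.0, None
    for addr in addresses:
        bank = addr % n_banks
        cost += same_bank_penalty if bank == prev else 1.0
        prev = bank
    return cost

# Unit stride walks successive banks; a stride equal to the bank count
# hammers one bank and pays the penalty on every reference but the first.
print(reference_cost(range(64)))               # 64.0
print(reference_cost(range(0, 64 * 128, 128))) # 95.5
```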
mrc@CAM.UNISYS.COM (Ken Leonard --> ken@oahu.mcl.unisys.com@sdcjove.cam.unisys.com) (11/09/87)
In article <3322@ames.arpa>, fouts@orville.nas.nasa.gov (Marty Fouts) writes:
> Kevin Buchs asks why the ETA-10 is advertised at 375 MFLOP but does 10
> MFLOP on Linpack...
> ...
> The bottom line is that the vendor reports the peak speed and Linpack,
> the Livermore Loops, the NAS kernels, Whetstone, et al. report how the
> vendor's compiler technology translates a particular algorithm into
> code to run the vendors architecture.  My favorite pathological case
> is a C program I wrote which runs twice as fast on a Vax as on the
> Cray 2; simply because I coded for pathological behavior on the 2.

First of all, how many folk believe that such mundane things as hardware resource distribution/partition/allocation/proportion are at least as important as clock speed? If you judge by the number of systems with per-character comm controllers (vs. DMA) or few-as-possible large-as-possible disk stores (vs. many-as-needed fast-as-possible allocated/assigned-per-application-structure), you may soon conclude that the answer is "not very darn many."

Second, you omitted any (direct) reference to operating system architecture -- either as a matter of base family or quality of implementation/port. UNIX is NOT a "bad system," as I am sometimes accused of saying. Neither is MVS, nor GCOS, nor MS-DOS! But ANY system can be a bad choice for a given application, and a particular "port" of a particular system may be a DISASTER for a particular application.

Does a number-cruncher (application) with relatively little I/O run better on an interrupt-driven kernel, or with an interrupt-stacking kernel? What about a disk-cruncher? What about a comm-cruncher? Does a kernel with a single paging philosophy suit very many applications very well? (HECK, NO!) Should a compiler care that extreme "registerizing" of variables in a high-interrupt environment means that the average context-switch time is doubled? Probably, it should (be able to be told to) care VERY much.
Should a compiler care that bank-interleaved allocation of the biggest array may not be as important as being told which array actually is subject to the majority of vector operations?

Gee, what ever happened to the idea of considering the whole problem before leaping to a solution?

regardz to all,
Ken Leonard !--> ken@oahu.mcl.unisys.com@sdcjove.cam.unisys.com
henry@utzoo.UUCP (Henry Spencer) (11/11/87)
> Gee, what ever happened to the idea of considering the whole problem
> before leaping to a solution?

Same thing that always happens to it: everybody says "gee, that would be a great idea if (a) we were only interested in one problem, (b) we knew in advance what it was like, and (c) we could afford to invest the effort needed to understand it completely and optimize the whole system for it".
-- 
Those who do not understand Unix are | Henry Spencer @ U of Toronto Zoology
condemned to reinvent it, poorly.    | {allegra,ihnp4,decvax,utai}!utzoo!henry
muller@alliant.Alliant.COM (Jim Muller) (11/17/87)
In <3322@ames.arpa> fouts@orville.nas.nasa.gov.UUCP (Marty Fouts) writes:
>...most machines have a different peak number and "average" number.
>The advertised peak performance number is frequently refered to around here
>as the "guarenteed not to exceed this speed" number...

He then gave a good explanation of how peak numbers are calculated, and why they really are "peak" instead of "average sustainable" speed. His reasons were quite accurate, and can be summarized into four areas:

>First of all, your application isn't entirely vector adds and multiplies...
>Secondly, there is probably some architectural gotcha that will keep the
> machine (from) getting peak performance...relating to vector length,
> number and type of functional units, and memory reference patterns.
>Thirdly, there is the quality of the compiler technology.
>Finally, I/O can do you in.

There is one more item to the story, though. All of these factors influence the speed of any given code, but the business of peak vs. average speed goes beyond the features of "typical" applications; in other words, benchmark codes can be written so as *not* to trip over these things. The extra item is something you cannot work around, i.e. the "ramp-up" time of vector instructions. Typically, vector instructions take N cycles to load up, followed by M cycles with output. It is the rate of the M outputs that is used for "peak" speeds. The average sustained speed, though, is only M / (N + M) of the peak. If the ramp-up requires half as many cycles as the vector length, then the sustainable rate will be only 2/3 of the peak rate, EVEN IF THE CODE IS PERFECTLY MATCHED TO THE OTHER ARCHITECTURAL FEATURES OF THE MACHINE! It has nothing to do with the four "real world" factors that Marty explained so well.

"So why not list sustainable rates instead? Or give the ramp-up times too?" you ask. Simply because it isn't that simple. The peak rates quoted may be for the fairly busy triadic vector operations.
Simpler vector operations may require fewer ramp-up cycles, but still output one datum per cycle. Yet the nominal flop-rate (both peak and sustainable) is lower because that operation is doing less work (e.g. an add is only half as many nominal operations as a multiply followed by an add). In other words, there is no single answer.

One thing that machine designers (should) try to do is reduce the ramp-up time for vector instructions, since this will result in a real-time speedup of the vector portions of any code. However, while improving both the theoretical sustainable rate and the real throughput rate, it has no impact on the peak rate. Thus, the true speed of a machine is obscured before you ever get into the question of "real world" applications.

Highly tuned, avoid-all-the-architectural-pitfalls codes for the Alliant FX/8 have managed to reach sustained output rates near the *sustainable* rate as described here. I have no doubt that other super- and mini-super-computer builders have done this too. However, no code will ever go faster than the sustainable rate, and will never even reach the peak rate, unless you measure output rate during the body of a single vector instruction. BTW, these highly tuned codes are usually worthless except as academic studies, since real-life applications are often dominated by the other architectural weaknesses, i.e. you start from the sustainable rate and work down!
-----------------------------------------------------------------------------
My employer did not sanction this posting, nor did they require or request me to make this disclaimer. Thanks for listening. - Jim
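[Jim's ramp-up arithmetic can be checked with a short Python sketch; the cycle counts below are illustrative.]

```python
def sustained_fraction(ramp_cycles, output_cycles):
    """With N ramp-up cycles before M cycles that each deliver one
    result, the sustained rate is M / (N + M) of the peak rate."""
    return output_cycles / (ramp_cycles + output_cycles)

# Ramp-up half as long as the vector length: only 2/3 of peak,
# exactly as in the posting.
print(sustained_fraction(32, 64))
```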
himel@mips.UUCP (Mark I. Himelstein) (11/18/87)
I've wondered, looking at peak versus sustained numbers, whether anything can be deduced from the difference. For example, looking at Linpack versus peak, it appeared to me that the smaller the difference between the two, the more reasonably priced the machine was (most of the time). It may be that achieving leading-edge performance on sustained numbers requires extravagant hardware, leading to higher cost.

Mark I. Himelstein
Disclaimer: This posting only contains my ideas, not my employer's.
lamaster@pioneer.arpa (Hugh LaMaster) (11/19/87)
In article <860@alliant.Alliant.COM> muller@alliant.UUCP (Jim Muller) writes:
>The extra item is something you cannot work around, i.e. the "ramp-up" time
>of vector instructions. Typically, vector instructions take N cycles to load
>up, followed by M cycles with output.
(discussion of vector start-up times omitted)

Followers of the megaflops sweepstakes may be interested to know that the latest (Nov 3) Dongarra report now lists the ETA-10 (both the nitrogen-cooled 10.5 ns and air-cooled 24 ns models). The ETA-10 is currently the fastest full-precision compiled-Fortran machine, at 52 MFLOPS, displacing the previous record holder, the NEC SX-2 (a mere 43 MFLOPS). The 24 ns ETA10-P rates a respectable 23 MFLOPS (faster than a Cray-2). Further results are needed to confirm the performance, but it does indicate that ETA has been able to significantly reduce vector start-up times compared with the Cyber 205, which only ran at 17 MFLOPS (20 ns clock).

Hugh LaMaster, m/s 233-9,   UUCP {topaz,lll-crg,ucbvax}!ames!pioneer!lamaster
NASA Ames Research Center   ARPA lamaster@ames-pioneer.arpa
Moffett Field, CA 94035     ARPA lamaster@pioneer.arc.nasa.gov
Phone: (415)694-6117
(Disclaimer: "All opinions solely the author's responsibility")