xxremak@csduts1.lerc.nasa.gov (David A. Remaklus) (06/29/90)
In a recent conversation with some colleagues of mine at the Ames NAS
facility concerning parallel processing, they mentioned their experiences
porting a code to the Intel i860 hypercube located there (128 nodes,
7.5 gigaFLOPS peak). On this particular code they were able to achieve
about 300 MFLOPS, for an efficiency factor of about 2.5%. This low
efficiency factor didn't seem to bother them, but it sure bothered me.
Other colleagues of ours at the United Technologies Research Center in
East Hartford, CT ported similar codes to their 1/4 CM-2 and achieved
anywhere from 600 to 800 MFLOPS, for an efficiency factor of more than 50%.

It is our contention that it is necessary to achieve an efficiency factor
of at least 50% before a particular implementation of a code can be
considered appropriate for execution on that parallel processor system.
What are your opinions on this matter? Are there any published papers
that deal with this subject?

Dave R.
--
David A. Remaklus
NASA Lewis Research Center
Cleveland, Ohio 44135
xxremak@csduts1.lerc.nasa.gov
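(For concreteness: the "efficiency factor" used throughout this thread is
just sustained performance divided by peak performance. A minimal sketch
in Python; the numbers are purely illustrative placeholders, not figures
from any of the posts:)

    def efficiency_factor(sustained_mflops, peak_mflops):
        # Fraction of the machine's peak rate actually delivered.
        return sustained_mflops / peak_mflops

    # Illustrative only: a code sustaining 50 MFLOPS on a machine with a
    # 1000-MFLOPS peak is running at 5% of peak.
    print("%.1f%%" % (100.0 * efficiency_factor(50.0, 1000.0)))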
lush@EE.MsState.Edu (Edward Luke) (06/29/90)
In article <9508@hubcap.clemson.edu> xxremak@csduts1.lerc.nasa.gov
(David A. Remaklus) writes:

> In a recent conversation with some colleagues of mine at the Ames NAS
> facility concerning parallel processing, they mentioned their experiences
> porting a code to the Intel i860 hypercube located there (128 nodes,
> 7.5 gigaFLOPS peak).

ANY peak performance number is bogus. A peak number can be considered a
number "guaranteed not to be exceeded", and no more. Intel claims that
the i860 can put out something like 80 MFLOPS peak, but you would be
lucky to get 10 MFLOPS out of one on real applications.

> It is our contention that it is necessary to achieve an efficiency factor
> of at least 50% before the particular implementation of the code can be
> considered appropriate for execution on that parallel processor system.
> What are your opinions on this matter? Are there any published papers
> that deal with this subject?

I would say that the most important factors are:

1) Cost per *real* MFLOP (see the sketch below).
2) The MFLOPS rate required by the application (e.g., application A
   requires 300 MFLOPS to run in one hour).
3) Ease of programming, or program development time.
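(Luke's first criterion reduces to simple arithmetic: system price divided
by the MFLOPS rate the system actually sustains on your application. A
minimal sketch in Python; the machine names, prices, and rates are
hypothetical placeholders, not vendor figures:)

    # Cost per *real* (sustained) MFLOP, per criterion 1 above.
    machines = {
        # name: (system cost in dollars, sustained MFLOPS on the application)
        "machine A": (3000000.0, 300.0),
        "machine B": (5000000.0, 700.0),
    }
    for name, (cost, sustained) in sorted(machines.items()):
        print("%s: $%.0f per sustained MFLOP" % (name, cost / sustained))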
fyodor@decwrl.dec.com (Chris Kuszmaul) (06/30/90)
In article <9508@hubcap.clemson.edu>, xxremak@csduts1.lerc.nasa.gov
(David A. Remaklus) writes:

> In a recent conversation with some colleagues of mine at the Ames NAS
> facility concerning parallel processing, they mentioned their experiences
> porting a code to the Intel i860 hypercube located there (128 nodes,
> 7.5 gigaFLOPS peak). On this particular code they were able to
> achieve about 300 MFLOPS for an efficiency factor of about 2.5%. This
> low efficiency factor didn't seem to bother them but it sure bothered
> me. Other colleagues of ours at the United Technologies Research Center
> in East Hartford, CT ported similar codes to their 1/4 CM-2 and achieved
> anywhere from 600 to 800 MFLOPS for an efficiency factor of more than 50%.
>
> It is our contention that it is necessary to achieve an efficiency factor
> of at least 50% before the particular implementation of the code can be
> considered appropriate for execution on that parallel processor system.
> What are your opinions on this matter? Are there any published papers
> that deal with this subject?
>
> Dave R.
> --
> David A. Remaklus
> NASA Lewis Research Center
> Cleveland, Ohio 44135
> xxremak@csduts1.lerc.nasa.gov

The following table is derived from information published in the CSRD
Perfect Report of March 1990. One of my colleagues, Dr. Ken Jacobsen,
Director of the Applications Group at MasPar Computer Corporation,
compiled it. It contains performance figures, and percentages of peak
performance, for each of several computer systems on each of several
applications. There are two categories: 'Base' performance is from
directly compiled code, and 'Optimized' performance is from hand-tuned
code.

You will note that, except for workstations, for which the percentages
are a little suspect (consider the DEC 6000-410S, which gets 107.7 (!!)
percent of peak on hand-tuned DYFESM), the percentage of peak achieved
by ANY high-end computer rarely approaches 50 percent, let alone a
parallel computer.

Based on the numbers below, I suggest that any computer (parallel or
otherwise) that is getting more than ten percent of peak performance on
a given application is doing at least a reasonably acceptable job. If a
system gets fifty percent, then you have an unusually good
application/machine matchup. 2.5 percent is a little low, but really is
not that bad.

Not that I would want anyone to buy anything other than a MasPar MP-1,
which, by the way, gets roughly 30% of peak performance on ARC3D.
CLK

                         PERFECT CLUB SUMMARY

Machine          Peak           ADM   ARC2D   ARC3D   FLO52   OCEAN  SPEC77    BDNA
____________________________________________________________________________________

Cray XMP/14SE    200.0
  Base Sec                     47.2    26.8     -.-     8.8    74.7     -.-    18.0
  Base MFLOPS                  10.7    68.7     -.-    62.4    22.2     -.-    50.3
  Base % Peak                   5.4    34.4     -.-    31.2    11.1     -.-    25.2
  Opt  Sec                     47.2    26.8     -.-     8.8    74.7     -.-    18.0
  Opt  MFLOPS                  10.7    68.7     -.-    62.4    22.2     -.-    50.3
  Opt  % Peak                   5.4    34.4     -.-    31.2    11.1     -.-    25.2

Cray XMP/416     941.0
  Base Sec                     34.1    10.0    31.8     2.8    66.1    61.6    10.9
  Base MFLOPS                  14.8   183.8   130.7   194.0    25.1    30.1    83.0
  Base % Peak                   1.6    19.5    13.9    20.6     2.7     3.2     8.8
  Opt  Sec                      8.2     7.0    13.4     2.5    12.1     8.4     5.8
  Opt  MFLOPS                  61.1   261.7   310.9   218.7   136.7   220.0   156.4
  Opt  % Peak                   6.5    27.8    33.0    23.2    14.5    23.4    16.6

Cray YMP/832     2666.0
  Base Sec                     27.0     4.1    17.8     1.7    46.7    36.0     7.5
  Base MFLOPS                  18.7   448.2   233.1   328.7    35.5    51.5   121.5
  Base % Peak                   0.7    16.8     8.7    12.3     1.3     1.9     4.6
  Opt  Sec                      5.6     2.7     5.2     1.6     6.0     3.4     3.1
  Opt  MFLOPS                  90.6   682.3   792.6   347.4   275.4   543.3   288.4
  Opt  % Peak                   3.4    25.6    29.7    13.0    10.3    20.4    10.8

Cray 2S/4128     1952.0
  Base Sec                     38.4    18.4    71.0     8.9    73.2   105.9    11.1
  Base MFLOPS                  13.1   100.3    58.5    61.7    22.7    17.5    81.3
  Base % Peak                   0.7     5.1     3.0     3.2     1.2     0.9     4.2
  Opt  Sec                     26.1    15.6    71.0     7.3    21.0    66.6    10.8
  Opt  MFLOPS                  19.3   118.5    58.5    75.6    78.8    27.8    83.5
  Opt  % Peak                   1.0     6.1     3.0     3.9     4.0     1.4     4.2

Cyber 205        400.0
  Base Sec                    111.3     -.-   355.1    39.7   287.9   297.3    57.0
  Base MFLOPS                   4.5     -.-    11.7    13.8     5.8     6.2    15.9
  Base % Peak                   1.1     -.-     2.9     3.5     1.5     1.6     4.0
  Opt  Sec                    111.3     -.-   355.1    15.0    84.3   297.3    57.0
  Opt  MFLOPS                   4.5     -.-    11.7    36.7    19.7     6.2    15.9
  Opt  % Peak                   1.1     -.-     2.9     9.2     4.9     1.6     4.0

ETA 10E          380.0
  Base Sec                      -.-     -.-   187.3    13.2     -.-   284.5     -.-
  Base MFLOPS                   -.-     -.-    22.2    41.6     -.-     6.5     -.-
  Base % Peak                   -.-     -.-     5.8    10.9     -.-     1.7     -.-
  Opt  Sec                      -.-     -.-   124.8    13.0     -.-   258.2     -.-
  Opt  MFLOPS                   -.-     -.-    33.3    42.3     -.-     7.2     -.-
  Opt  % Peak                   -.-     -.-     8.8    11.1     -.-     1.9     -.-

ETA 10G          570.0
  Base Sec                     68.6     -.-   124.8     8.8   172.0   189.6    15.5
  Base MFLOPS                   7.3     -.-    33.3    62.2     9.6     9.8    58.4
  Base % Peak                   1.3     -.-     5.8    10.9     1.7     1.7    10.2
  Opt  Sec                     68.6     -.-    83.1     8.7   172.0   171.7    15.5
  Opt  MFLOPS                   7.3     -.-    50.0    63.2     9.6    10.8    58.4
  Opt  % Peak                   1.3     -.-     8.8    11.1     1.7     1.9    10.2

ETA 10Q          210.0
  Base Sec                      -.-     -.-   320.7    23.9     -.-   514.6     -.-
  Base MFLOPS                   -.-     -.-    13.0    23.0     -.-     3.6     -.-
  Base % Peak                   -.-     -.-     6.2    11.0     -.-     1.7     -.-
  Opt  Sec                      -.-     -.-   214.3    23.5     -.-   493.0     -.-
  Opt  MFLOPS                   -.-     -.-    19.4    23.4     -.-     3.8     -.-
  Opt  % Peak                   -.-     -.-     9.2    11.1     -.-     1.8     -.-

Fujitsu VP100    285.7 (32-bit)
  Base Sec                     62.8    20.9     -.-     8.8   226.2    98.2    19.5
  Base MFLOPS                   8.0    88.1     -.-    62.4     7.3    18.9    46.4
  Base % Peak                   2.8    30.8     -.-    21.8     2.6     6.6    16.2
  Opt  Sec                     62.8    20.9     -.-     8.8   226.2    98.2    19.5
  Opt  MFLOPS                   8.0    88.1     -.-    62.4     7.3    18.9    46.4
  Opt  % Peak                   2.8    30.8     -.-    21.8     2.6     6.6    16.2

Hitachi S820/80  3000.0 (32-bit)
  Base Sec                     22.6     3.7     -.-     2.4     -.-    48.9     7.3
  Base MFLOPS                  22.3   499.2     -.-   229.7     -.-    37.9   123.5
  Base % Peak                   0.7    16.6     -.-     7.7     -.-     1.3     4.1
  Opt  Sec                     22.6     3.7     -.-     2.4     -.-    48.9     7.3
  Opt  MFLOPS                  22.3   499.2     -.-   229.7     -.-    37.9   123.5
  Opt  % Peak                   0.7    16.6     -.-     7.7     -.-     1.3     4.1

NEC SX/2         1300.0 (32-bit)
  Base Sec                     31.4     -.-    58.2     3.1    97.8    45.0     5.3
  Base MFLOPS                  16.1     -.-    71.4   177.1    16.9    41.2   170.2
  Base % Peak                   1.2     -.-     5.5    13.6     1.3     3.2    13.1
  Opt  Sec                     31.4     -.-    32.1     3.1    97.8    35.2     5.3
  Opt  MFLOPS                  16.1     -.-   129.5   177.1    16.9    52.7   170.2
  Opt  % Peak                   1.2     -.-    10.0    13.6     1.3     4.1    13.1

INTEL iPSC/1     4.0 (32-bit)
  Base Sec                      -.-     -.-     -.-     -.-     -.-     -.-     -.-
  Base MFLOPS                   -.-     -.-     -.-     -.-     -.-     -.-     -.-
  Base % Peak                   -.-     -.-     -.-     -.-     -.-     -.-     -.-
  Opt  Sec                      -.-     -.-     -.-     -.-     -.-     -.-     -.-
  Opt  MFLOPS                   -.-     -.-     -.-     -.-     -.-     -.-     -.-
  Opt  % Peak                   -.-     -.-     -.-     -.-     -.-     -.-     -.-

Mark III         12.8 (32-bit)
  Base Sec                      -.-     -.-     -.-     -.-     -.-     -.-     -.-
  Base MFLOPS                   -.-     -.-     -.-     -.-     -.-     -.-     -.-
  Base % Peak                   -.-     -.-     -.-     -.-     -.-     -.-     -.-
  Opt  Sec                      -.-     -.-     -.-     -.-     -.-     -.-     -.-
  Opt  MFLOPS                   -.-     -.-     -.-     -.-     -.-     -.-     -.-
  Opt  % Peak                   -.-     -.-     -.-     -.-     -.-     -.-     -.-

NCUBE NCUBE/10   205.0 (32-bit)
  Base Sec                      -.-     -.-     -.-     -.-     -.-     -.-     -.-
  Base MFLOPS                   -.-     -.-     -.-     -.-     -.-     -.-     -.-
  Base % Peak                   -.-     -.-     -.-     -.-     -.-     -.-     -.-
  Opt  Sec                      -.-     -.-     -.-     -.-     -.-     -.-     -.-
  Opt  MFLOPS                   -.-     -.-     -.-     -.-     -.-     -.-     -.-
  Opt  % Peak                   -.-     -.-     -.-     -.-     -.-     -.-     -.-

ALLIANT FX/8     94.4 (32-bit)
  Base Sec                    493.4     -.-   743.5   113.6  2068.0  1310.3   195.2
  Base MFLOPS                   1.0     -.-     5.6     4.8     0.8     1.4     4.6
  Base % Peak                   1.1     -.-     5.9     5.1     0.8     1.5     4.9
  Opt  Sec                    493.4     -.-   642.0    89.5  2068.0   194.8   195.2
  Opt  MFLOPS                   1.0     -.-     6.5     6.1     0.8     9.5     4.6
  Opt  % Peak                   1.1     -.-     6.9     6.5     0.8    10.1     4.9

ALLIANT FX/80    188.8 (32-bit)
  Base Sec                    353.1   228.5   538.3    84.4  1582.0   975.4   116.5
  Base MFLOPS                   1.4     8.1     7.7     6.5     1.0     1.9     7.8
  Base % Peak                   0.7     4.3     4.1     3.4     0.5     1.0     4.1
  Opt  Sec                    353.1   228.5   538.3    64.5  1582.0   189.6   116.5
  Opt  MFLOPS                   1.4     8.1     7.7     8.5     1.0     9.8     7.8
  Opt  % Peak                   0.7     4.3     4.1     4.5     0.5     5.2     4.1

ARDENT Titan 2   16.0 (32-bit)
  Base Sec                    333.9     -.-     -.-   108.7     -.-   915.6   639.4
  Base MFLOPS                   1.5     -.-     -.-     5.1     -.-     2.0     1.4
  Base % Peak                  10.0     -.-     -.-    31.9     -.-    12.5     8.8
  Opt  Sec                    333.9     -.-     -.-   108.7     -.-   915.6   639.4
  Opt  MFLOPS                   1.5     -.-     -.-     5.1     -.-     2.0     1.4
  Opt  % Peak                  10.0     -.-     -.-    31.9     -.-    12.5     8.8

CONVEX C220      100.0 (32-bit)
  Base Sec                    126.8     -.-   430.8    44.6   501.3   953.2    72.2
  Base MFLOPS                   4.0     -.-     9.6    12.3     3.3     1.9    12.5
  Base % Peak                   4.0     -.-     9.6    12.3     3.3     1.9    12.5
  Opt  Sec                    126.8     -.-   430.8    44.6   501.3   953.2    72.2
  Opt  MFLOPS                   4.0     -.-     9.6    12.3     3.3     1.9    12.5
  Opt  % Peak                   4.0     -.-     9.6    12.3     3.3     1.9    12.5

STARDENT 3010    48.0 (32-bit)
  Base Sec                     96.0   184.0     -.-    78.0   510.0   482.0   124.0
  Base MFLOPS                   5.2    10.0     -.-     7.0     3.3     3.8     7.3
  Base % Peak                  10.8    20.8     -.-    14.6     6.9     7.9    15.2
  Opt  Sec                     96.0   184.0     -.-    78.0   510.0   482.0   124.0
  Opt  MFLOPS                   5.2    10.0     -.-     7.0     3.3     3.8     7.3
  Opt  % Peak                  10.8    20.8     -.-    14.6     6.9     7.9    15.2

DEC 6000-410S    3.25 (32-bit)
  Base Sec                    324.0  2495.0  2794.0   339.0  1268.0  1283.0   718.0
  Base MFLOPS                   1.6     0.7     1.5     1.6     1.3     1.4     1.3
  Base % Peak                  49.2    21.5    46.2    49.2    40.0    43.1    40.0
  Opt  Sec                    324.0  2495.0  2794.0   339.0  1268.0  1283.0   718.0
  Opt  MFLOPS                   1.6     0.7     1.5     1.6     1.3     1.4     1.3
  Opt  % Peak                  49.2    21.5    46.2    49.2    40.0    43.1    40.0

ENCORE Multimax  0.4 (32-bit)
  Base Sec                   4231.2     -.- 35395.2  4540.5 23172.7 13584.0  8151.4
  Base MFLOPS                   0.1     -.-     0.1     0.1     0.1     0.1     0.1
  Base % Peak                  25.0     -.-    25.0    25.0    25.0    25.0    25.0
  Opt  Sec                   3345.4     -.- 35395.2  3874.0 14650.1 13584.0  8151.4
  Opt  MFLOPS                   0.2     -.-     0.1     0.1     0.1     0.1     0.1
  Opt  % Peak                  50.0     -.-    25.0    25.0    25.0    25.0    25.0

VAX 11/780       1.25 (32-bit)
  Base Sec                   2121.0 56530.0 14097.0  2063.0  8337.0  8242.0  5339.0
  Base MFLOPS                   0.2     0.0     0.3     0.3     0.2     0.2     0.2
  Base % Peak                  16.0     0.0    24.0    24.0    16.0    16.0    16.0
  Opt  Sec                   2121.0 56530.0 14097.0  2063.0  8337.0  8242.0  5339.0
  Opt  MFLOPS                   0.2     0.0     0.3     0.3     0.2     0.2     0.2
  Opt  % Peak                  16.0     0.0    24.0    24.0    16.0    16.0    16.0

APOLLO DSP10040  40.0 (32-bit)
  Base Sec                    143.7   877.3     -.-   216.2     -.-     -.-   247.1
  Base MFLOPS                   3.5     2.1     -.-     2.5     -.-     -.-     3.7
  Base % Peak                   8.8     5.3     -.-     6.3     -.-     -.-     9.3
  Opt  Sec                    143.7   877.3     -.-   216.2     -.-     -.-   247.1
  Opt  MFLOPS                   3.5     2.1     -.-     2.5     -.-     -.-     3.7
  Opt  % Peak                   8.8     5.3     -.-     6.3     -.-     -.-     9.3

DEC 3100         2.0 (32-bit)
  Base Sec                    378.7     -.-  1773.9   457.9     -.-     -.-   577.8
  Base MFLOPS                   1.3     -.-     2.3     1.2     -.-     -.-     1.6
  Base % Peak                  65.0     -.-   115.0    60.0     -.-     -.-    80.0
  Opt  Sec                    257.1     -.-  1773.9   260.3     -.-     -.-   577.8
  Opt  MFLOPS                   2.0     -.-     2.3     2.1     -.-     -.-     1.6
  Opt  % Peak                 100.0     -.-   115.0   105.0     -.-     -.-    80.0

MIPS M/120       8.3 (32-bit)
  Base Sec                    439.4     -.-     -.-   209.7     -.-     -.-   515.5
  Base MFLOPS                   1.1     -.-     -.-     2.6     -.-     -.-     1.8
  Base % Peak                  13.3     -.-     -.-    31.3     -.-     -.-    21.7
  Opt  Sec                    439.4     -.-     -.-   209.7     -.-     -.-   515.5
  Opt  MFLOPS                   1.1     -.-     -.-     2.6     -.-     -.-     1.8
  Opt  % Peak                  13.3     -.-     -.-    31.3     -.-     -.-    21.7

MIPS RS2030      8.35 (32-bit)
  Base Sec                    475.5     -.-     -.-   244.7   992.7  1877.3   575.6
  Base MFLOPS                   1.1     -.-     -.-     2.2     1.7     1.0     1.6
  Base % Peak                  13.2     -.-     -.-    26.3    20.4    12.0    19.2
  Opt  Sec                    475.5     -.-     -.-   244.7   992.7  1877.3   575.6
  Opt  MFLOPS                   1.1     -.-     -.-     2.2     1.7     1.0     1.6
  Opt  % Peak                  13.2     -.-     -.-    26.3    20.4    12.0    19.2

SUN SPARC 1      5.5 (32-bit)
  Base Sec                    601.4     -.-  3474.6   540.7     -.-     -.-   822.9
  Base MFLOPS                   0.8     -.-     1.2     1.0     -.-     -.-     1.1
  Base % Peak                  14.5     -.-    21.8    18.2     -.-     -.-    20.0
  Opt  Sec                    341.9     -.-  2633.0   354.5     -.-     -.-   822.9
  Opt  MFLOPS                   1.5     -.-     1.6     1.5     -.-     -.-     1.1
  Opt  % Peak                  27.3     -.-    29.1    27.3     -.-     -.-    20.0

SUN SPARC 330    7.0 (32-bit)
  Base Sec                    442.8     -.-  2598.9   419.5     -.-     -.-   584.7
  Base MFLOPS                   1.1     -.-     1.6     1.3     -.-     -.-     1.5
  Base % Peak                  15.7     -.-    22.9    18.6     -.-     -.-    21.4
  Opt  Sec                    260.8     -.-  1865.5   183.3     -.-     -.-   584.7
  Opt  MFLOPS                   1.9     -.-     2.2     1.9     -.-     -.-     1.5
  Opt  % Peak                  27.1     -.-    31.4    27.1     -.-     -.-    21.4

SUN 3/280        0.25 (32-bit)
  Base Sec                   4435.0     -.- 30119.9  4537.5     -.-     -.-  6269.8
  Base MFLOPS                   0.1     -.-     0.1     0.1     -.-     -.-     0.1
  Base % Peak                  40.0     -.-    40.0    40.0     -.-     -.-    40.0
  Opt  Sec                   3152.8     -.- 30119.9  4268.5     -.-     -.-  6269.8
  Opt  MFLOPS                   0.2     -.-     0.1     0.1     -.-     -.-     0.1
  Opt  % Peak                  80.0     -.-    40.0    40.0     -.-     -.-    40.0

Machine                         MDG      QCD     TRFD   DYFESM    SPICE     MG3D    TRACK     Total
___________________________________________________________________________________________________

Cray XMP/14SE
  Base Sec                    452.2     40.9     17.4     19.5      -.-      -.-     17.7     547.7
  Base MFLOPS                   7.6      6.3     24.8     28.3      -.-      -.-      4.6      18.7
  Base % Peak                   3.8      3.2     12.4     14.2      -.-      -.-      2.3       9.3
  Opt  Sec                    452.2     40.9     17.4     19.5      -.-      -.-     17.7     547.5
  Opt  MFLOPS                   7.6      6.3     24.8     28.3      -.-      -.-      4.6      18.7
  Opt  % Peak                   3.8      3.2     12.4     14.2      -.-      -.-      2.3       9.3

Cray XMP/416
  Base Sec                    256.7     25.7      9.6     13.4     11.9    522.5     12.7    1069.8
  Base MFLOPS                  13.4     10.1     44.8     41.1      3.9     21.2      6.5      25.6
  Base % Peak                   1.4      1.1      4.8      4.4      0.4      2.3      0.7       2.7
  Opt  Sec                     17.5      3.2      2.1      2.9      3.2     24.4      3.3     114.0
  Opt  MFLOPS                 195.9     81.4    206.2    191.7     14.7    453.3     24.7     240.0
  Opt  % Peak                  20.8      8.7     21.9     20.4      1.6     48.2      2.6      25.5

Cray YMP/832
  Base Sec                    207.2     20.7      7.6      9.3      8.2    407.9     10.3     812.0
  Base MFLOPS                  16.6     12.6     56.4     59.4      5.7     27.1      7.9      33.7
  Base % Peak                   0.6      0.5      2.1      2.2      0.2      1.0      0.3       1.3
  Opt  Sec                      5.8      1.0      1.0      1.9      2.5      9.7      2.1      51.6
  Opt  MFLOPS                 594.9    249.6    444.2    295.2     18.9   1146.2     38.7     530.2
  Opt  % Peak                  22.3      9.4     16.7     11.1      0.7     43.0      1.5      19.9

Cray 2S/4128
  Base Sec                    244.6     32.8     16.7     17.3     12.1    569.2     16.0    1235.6
  Base MFLOPS                  14.0      7.9     25.8     32.0      3.9     19.4      5.1      22.1
  Base % Peak                   0.7      0.4      1.3      1.6      0.2      1.0      0.3       1.1
  Opt  Sec                    120.6     16.5      8.3      7.4      7.0     74.0     15.0     467.2
  Opt  MFLOPS                  28.5     15.8     52.2     74.3      6.7    149.5      5.4      58.6
  Opt  % Peak                   1.5      0.8      2.7      3.8      0.3      7.7      0.3       3.0

Cyber 205
  Base Sec                    950.6     69.3     49.1     44.2     36.6   1312.2     39.9    3650.2
  Base MFLOPS                   3.6      3.7      8.8     12.5      1.3      8.4      2.0       7.0
  Base % Peak                   0.9      0.9      2.2      3.1      0.3      2.1      0.5       1.7
  Opt  Sec                    950.6     69.3     49.1     44.2     36.6   1312.2     39.9    3421.9
  Opt  MFLOPS                   3.6      3.7      8.8     12.5      1.3      8.4      2.0       7.5
  Opt  % Peak                   0.9      0.9      2.2      3.1      0.3      2.1      0.5       1.9

ETA 10E
  Base Sec                      -.-      -.-      -.-     23.0      -.-      -.-      -.-     508.0
  Base MFLOPS                   -.-      -.-      -.-     24.0      -.-      -.-      -.-      14.0
  Base % Peak                   -.-      -.-      -.-      6.3      -.-      -.-      -.-       3.7
  Opt  Sec                      -.-      -.-      -.-     12.9      -.-      -.-      -.-     408.9
  Opt  MFLOPS                   -.-      -.-      -.-     42.9      -.-      -.-      -.-      17.4
  Opt  % Peak                   -.-      -.-      -.-     11.3      -.-      -.-      -.-       4.6

ETA 10G
  Base Sec                    572.6     48.4     17.5     15.4     22.4   1700.8     22.1    2978.5
  Base MFLOPS                   6.0      5.4     24.7     36.0      2.1      6.5      3.7       8.6
  Base % Peak                   1.1      0.9      4.3      6.3      0.4      1.1      0.6       1.5
  Opt  Sec                    572.6     48.4     17.5      8.5     22.4   1700.8     22.1    2911.9
  Opt  MFLOPS                   6.0      5.4     24.7     64.6      2.1      6.5      3.7       8.8
  Opt  % Peak                   1.1      0.9      4.3      6.3      0.4      1.1      0.6       1.5

ETA 10Q
  Base Sec                      -.-      -.-      -.-     41.7      -.-      -.-      -.-     900.0
  Base MFLOPS                   -.-      -.-      -.-     13.3      -.-      -.-      -.-       7.9
  Base % Peak                   -.-      -.-      -.-      6.3      -.-      -.-      -.-       3.8
  Opt  Sec                      -.-      -.-      -.-     23.3      -.-      -.-      -.-     754.1
  Opt  MFLOPS                   -.-      -.-      -.-     23.7      -.-      -.-      -.-       9.4
  Opt  % Peak                   -.-      -.-      -.-     11.3      -.-      -.-      -.-       4.5

Fujitsu VP100
  Base Sec                    473.3     56.5     14.4     17.2     15.9    641.3     20.1    1675.1
  Base MFLOPS                   7.3      4.6     29.9     32.1      2.9     17.2      4.1      13.8
  Base % Peak                   2.6      1.6     10.5     11.2      1.0      6.0      1.4       4.8
  Opt  Sec                    473.3     56.5     14.4     17.2     15.9    641.3     20.1    1675.1
  Opt  MFLOPS                   7.3      4.6     29.9     32.1      2.9     17.2      4.1      13.8
  Opt  % Peak                   2.6      1.6     10.5     11.2      1.0      6.0      1.4       4.8

Hitachi S820/80
  Base Sec                    225.9     28.0      -.-      7.9      8.2      -.-     11.0     365.9
  Base MFLOPS                  15.2      9.3      -.-     69.8      5.7      -.-      7.4      27.4
  Base % Peak                   0.5      0.3      -.-      2.3      0.2      -.-      0.2       0.9
  Opt  Sec                    225.9     28.0      -.-      7.9      8.2      -.-     11.0     365.9
  Opt  MFLOPS                  15.2      9.3      -.-     69.8      5.7      -.-      7.4      27.4
  Opt  % Peak                   0.5      0.3      -.-      2.3      0.2      -.-      0.2       0.9

NEC SX/2
  Base Sec                    243.3     27.0      7.4      8.3     10.0    315.7     13.2     865.7
  Base MFLOPS                  14.1      9.6     57.9     66.6      4.7     35.0      6.2      29.5
  Base % Peak                   1.1      0.7      4.5      5.1      0.4      2.7      0.5       2.3
  Opt  Sec                     24.8     27.0      7.4      8.0     10.0    315.7     13.2     611.0
  Opt  MFLOPS                 138.6      9.6     57.9     69.3      4.7     35.0      6.2      41.8
  Opt  % Peak                  10.7      0.7      4.5      5.3      0.4      2.7      0.5       3.2

INTEL iPSC/1
  Base Sec                  23049.9    353.8      -.-      -.-      -.-      -.-      -.-   23403.7
  Base MFLOPS                   0.1      0.7      -.-      -.-      -.-      -.-      -.-       0.2
  Base % Peak                   2.5     17.5      -.-      -.-      -.-      -.-      -.-       4.0
  Opt  Sec                   2386.0    102.5      -.-      -.-      -.-      -.-      -.-    2488.5
  Opt  MFLOPS                   1.4      2.5      -.-      -.-      -.-      -.-      -.-       1.5
  Opt  % Peak                  35.0     62.5      -.-      -.-      -.-      -.-      -.-      37.2

Mark III
  Base Sec                  25520.0     66.8      -.-      -.-      -.-      -.-      -.-   25586.8
  Base MFLOPS                   0.1      3.9      -.-      -.-      -.-      -.-      -.-       0.1
  Base % Peak                   0.8     30.5      -.-      -.-      -.-      -.-      -.-       1.1
  Opt  Sec                   1094.2     40.9      -.-      -.-      -.-      -.-      -.-    1135.1
  Opt  MFLOPS                   3.1      6.3      -.-      -.-      -.-      -.-      -.-       3.3
  Opt  % Peak                  24.2     49.2      -.-      -.-      -.-      -.-      -.-      25.5

NCUBE NCUBE/10
  Base Sec                  40125.9     90.1      -.-      -.-      -.-      -.-      -.-   40216.0
  Base MFLOPS                   0.1      2.9      -.-      -.-      -.-      -.-      -.-       0.1
  Base % Peak                   0.1      1.4      -.-      -.-      -.-      -.-      -.-       0.0
  Opt  Sec                    369.6      8.3      -.-      -.-      -.-      -.-      -.-     377.9
  Opt  MFLOPS                   9.3     31.3      -.-      -.-      -.-      -.-      -.-       9.8
  Opt  % Peak                   4.5     15.3      -.-      -.-      -.-      -.-      -.-       4.8

ALLIANT FX/8
  Base Sec                   2972.1    356.6    298.2    237.7     97.3  11651.6    141.5   20679.0
  Base MFLOPS                   1.2      0.7      1.4      2.3      0.5      0.9      0.6       1.2
  Base % Peak                   1.3      0.7      1.5      2.4      0.5      1.0      0.6       1.3
  Opt  Sec                    618.2    119.9    298.2     95.7     23.9  11651.6    118.1   16608.5
  Opt  MFLOPS                   5.5      2.2      1.4      5.8      2.0      0.9      0.7       1.5
  Opt  % Peak                   5.8      2.3      1.5      6.1      2.1      1.0      0.7       1.6

ALLIANT FX/80
  Base Sec                   2118.6    238.1    264.1    199.1     67.7   8586.1     89.5   15441.4
  Base MFLOPS                   1.6      1.1      1.6      2.8      0.7      1.3      0.9       1.8
  Base % Peak                   0.8      0.6      0.8      1.5      0.4      0.7      0.5       0.9
  Opt  Sec                    500.7     86.4    264.1     89.9     17.8   8586.1     84.2   12701.7
  Opt  MFLOPS                   6.9      3.0      1.6      6.1      2.6      1.3      1.0       2.2
  Opt  % Peak                   3.7      1.6      0.8      3.2      1.4      0.7      0.5       1.1

ARDENT Titan 2
  Base Sec                   4505.0    261.5    137.2    364.3      -.-      -.-      -.-    7265.6
  Base MFLOPS                   0.8      1.0      3.1      1.5      -.-      -.-      -.-       1.2
  Base % Peak                   5.0      6.3     19.4      9.4      -.-      -.-      -.-       7.3
  Opt  Sec                   4505.0    261.5    137.2    364.3      -.-      -.-      -.-    7265.6
  Opt  MFLOPS                   0.8      1.0      3.1      1.5      -.-      -.-      -.-       1.2
  Opt  % Peak                   5.0      6.3     19.4      9.4      -.-      -.-      -.-       7.3

CONVEX C220
  Base Sec                   1357.7    136.5     77.6    185.5     31.5   4519.8     47.3    8484.8
  Base MFLOPS                   2.5      1.9      5.6      3.0      1.5      2.4      1.7       3.0
  Base % Peak                   2.5      1.9      5.6      3.0      1.5      2.4      1.7       3.0
  Opt  Sec                   1357.7    136.5     77.6    185.5     31.5   4519.8     47.3    8484.8
  Opt  MFLOPS                   2.5      1.9      5.6      3.0      1.5      2.4      1.7       3.0
  Opt  % Peak                   2.5      1.9      5.6      3.0      1.5      2.4      1.7       3.0

STARDENT 3010
  Base Sec                    768.0     70.0     94.0     64.0     24.0   1235.7     32.0    3761.7
  Base MFLOPS                   4.5      3.7      4.6      8.6      1.9      9.0      2.6       6.2
  Base % Peak                   9.4      7.7      9.6     17.9      4.0     18.8      5.4      12.8
  Opt  Sec                    768.0     70.0     94.0     64.0     24.0   1235.7     32.0    3761.7
  Opt  MFLOPS                   4.5      3.7      4.6      8.6      1.9      9.0      2.6       6.2
  Opt  % Peak                   9.4      7.7      9.6     17.9      4.0     18.8      5.4      12.8

DEC 6000-410S
  Base Sec                   2554.0    184.0    309.0    160.0     61.0  10646.0     85.0   23220.0
  Base MFLOPS                   1.3      1.4      1.4      3.5      0.8      1.0      1.0       1.2
  Base % Peak                  40.0     43.1     43.1    107.7     24.6     30.8     30.8      36.2
  Opt  Sec                   2554.0    184.0    309.0    160.0     61.0  10646.0     85.0   23220.0
  Opt  MFLOPS                   1.3      1.4      1.4      3.5      0.8      1.0      1.0       1.2
  Opt  % Peak                  40.0     43.1     43.1    107.7     24.6     30.8     30.8      36.2

ENCORE Multimax
  Base Sec                  25615.9   2134.0   3426.5      -.-    464.5      -.-    681.4  121397.3
  Base MFLOPS                   0.1      0.1      0.1      -.-      0.1      -.-      0.1       0.1
  Base % Peak                  25.0     25.0     25.0      -.-     25.0      -.-     25.0      28.6
  Opt  Sec                  25615.9   1761.2   3384.9      -.-    464.5      -.-    681.4  110908.0
  Opt  MFLOPS                   0.1      0.1      0.1      -.-      0.1      -.-      0.1       0.1
  Opt  % Peak                  25.0     25.0     25.0      -.-     25.0      -.-     25.0      31.3

VAX 11/780
  Base Sec                  20349.0   1124.0   2906.0   1100.0    423.0  65228.0    605.0  188464.0
  Base MFLOPS                   0.2      0.2      0.1      0.5      0.1      0.2      0.1       0.1
  Base % Peak                  16.0     16.0      8.0     40.0      8.0     16.0      8.0      11.6
  Opt  Sec                  20349.0   1124.0   2906.0   1100.0    423.0  65228.0    605.0  188464.0
  Opt  MFLOPS                   0.2      0.2      0.1      0.5      0.1      0.2      0.1       0.1
  Opt  % Peak                  16.0     16.0      8.0     40.0      8.0     16.0      8.0      11.6

APOLLO DSP10040
  Base Sec                    994.8     88.6    233.9    136.8      -.-   3555.9      -.-    6494.3
  Base MFLOPS                   3.5      2.9      1.8      4.0      -.-      3.1      -.-       3.0
  Base % Peak                   8.8      7.3      4.5     10.0      -.-      7.8      -.-       7.5
  Opt  Sec                    994.8     88.6    233.9    136.8      -.-   3555.9      -.-    6494.3
  Opt  MFLOPS                   3.5      2.9      1.8      4.0      -.-      3.1      -.-       3.0
  Opt  % Peak                   8.8      7.3      4.5     10.0      -.-      7.8      -.-       7.5

DEC 3100
  Base Sec                   2059.4      -.-      -.-    195.6      -.-      -.-      -.-    5443.3
  Base MFLOPS                   1.7      -.-      -.-      2.8      -.-      -.-      -.-       1.9
  Base % Peak                  85.0      -.-      -.-    140.0      -.-      -.-      -.-      92.8
  Opt  Sec                   2059.4      -.-      -.-    125.8      -.-      -.-      -.-    5054.3
  Opt  MFLOPS                   1.7      -.-      -.-      4.4      -.-      -.-      -.-       2.0
  Opt  % Peak                  85.0      -.-      -.-    220.0      -.-      -.-      -.-     100.0

MIPS M/120
  Base Sec                   1667.1    106.7    338.1    106.4     54.9      -.-      -.-    3437.8
  Base MFLOPS                   2.1      2.4      1.3      5.2      0.9      -.-      -.-       1.9
  Base % Peak                  25.3     28.9     15.7     62.7     10.8      -.-      -.-      23.4
  Opt  Sec                   1667.1    106.7    338.1    106.4     54.9      -.-      -.-    3437.8
  Opt  MFLOPS                   2.1      2.4      1.3      5.2      0.9      -.-      -.-       1.9
  Opt  % Peak                  25.3     28.9     15.7     62.7     10.8      -.-      -.-      23.4

MIPS RS2030
  Base Sec                   1738.5    109.2    351.3    112.7     67.3   5167.9     21.8   11734.5
  Base MFLOPS                   2.0      2.4      1.2      4.9      0.7      2.1      3.7       1.8
  Base % Peak                  24.0     28.7     14.4     58.7      8.4     25.1     44.2      21.8
  Opt  Sec                   1738.5    109.2    351.3    112.7     67.3   5167.9     21.8   11734.5
  Opt  MFLOPS                   2.0      2.4      1.2      4.9      0.7      2.1      3.7       1.8
  Opt  % Peak                  24.0     28.7     14.4     58.7      8.4     25.1     44.2      21.8

SUN SPARC 1
  Base Sec                   3243.0    138.3    353.7    224.3     66.5      -.-      -.-    9465.4
  Base MFLOPS                   1.1      1.9      1.2      2.5      0.7      -.-      -.-       1.1
  Base % Peak                  20.0     34.5     21.8     45.5     12.7      -.-      -.-      20.8
  Opt  Sec                   3243.0    138.3    353.7    224.3     66.5      -.-      -.-    8178.1
  Opt  MFLOPS                   1.1      1.9      1.2      2.5      0.7      -.-      -.-       1.3
  Opt  % Peak                  20.0     34.5     21.8     45.5     12.7      -.-      -.-      24.1

SUN SPARC 330
  Base Sec                   2377.3    110.0    240.0    151.6     48.1      -.-      -.-    6756.9
  Base MFLOPS                   1.4      2.4      1.8      3.6      1.0      -.-      -.-       1.6
  Base % Peak                  20.0     34.3     25.7     51.4     14.3      -.-      -.-      22.9
  Opt  Sec                   2377.3    110.0    240.0    141.9     48.1      -.-      -.-    5811.6
  Opt  MFLOPS                   1.4      2.4      1.8      3.9      1.0      -.-      -.-       1.9
  Opt  % Peak                  20.0     34.3     25.7     55.7     14.3      -.-      -.-      26.7

SUN 3/280
  Base Sec                  19155.0   1515.9   3907.8   2026.5    378.3      -.-      -.-   72344.8
  Base MFLOPS                   0.2      0.2      0.1      0.3      0.1      -.-      -.-       0.1
  Base % Peak                  80.0     80.0     40.0    120.0     40.0      -.-      -.-      60.0
  Opt  Sec                  19155.0   1515.9   3907.8   1864.0    378.3      -.-      -.-   70632.0
  Opt  MFLOPS                   0.2      0.2      0.1      0.3      0.1      -.-      -.-       0.2
  Opt  % Peak                  80.0     80.0     40.0    120.0     40.0      -.-      -.-      61.4
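(The "% Peak" rows above are simply the MFLOPS rows scaled by each
machine's peak rating, which makes the table easy to spot-check. A
one-line check in Python, using the Cray XMP/416 hand-tuned ARC3D entry:)

    peak_mflops = 941.0    # Cray XMP/416 peak, from the table
    opt_mflops = 310.9     # its "Opt MFLOPS" entry for ARC3D
    print(round(100.0 * opt_mflops / peak_mflops, 1))   # 33.0, matching "Opt % Peak"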
ian@decwrl.dec.com (Ian L. Kaplan) (06/30/90)
>In a recent conversation with some colleagues of mine at the Ames NAS
>facility concerning parallel processing, they mentioned their experiences
>porting a code to the Intel i860 hypercube located there (128 nodes,
>7.5 gigaFLOPS peak). On this particular code they were able to
>achieve about 300 MFLOPS for an efficiency factor of about 2.5%. This
>low efficiency factor didn't seem to bother them but it sure bothered
>me. Other colleagues of ours at the United Technologies Research Center
>in East Hartford, CT ported similar codes to their 1/4 CM-2 and achieved
>anywhere from 600 to 800 MFLOPS for an efficiency factor of more than 50%.
>
>David A. Remaklus
>NASA Lewis Research Center
>Cleveland, Ohio 44135
>xxremak@csduts1.lerc.nasa.gov

This is somewhat tangential to the issue, but I could not resist
mentioning it. Perhaps the difference in execution efficiency between the
Intel cube (an MIMD machine) and the CM-2 (a SIMD machine) is due to the
fact (no doubt hotly contested) that SIMD systems are easier to program.
Easier to program also means easier to fit one's problem to. MIMD
architecture and programming continues to be a hot topic in the computer
science research community. Some people theorize that this is because
MIMD programming is so difficult that it provides a challenging research
problem and a fertile field for PhD theses.

A term like "ease of programming" is often used without much definition,
so I will try to flesh out my claims. One definition of ease of
programming is that much of the machine architecture is abstracted away,
and the programmer can think about writing a program that describes the
problem rather than about shoehorning the problem onto the machine. SIMD
systems can be programmed in _standard_ Fortran 90. MIMD systems can only
be programmed in a language that contains extensions for synchronization.
The SIMD programmer need only consider the machine architecture when it
comes to making the program run more efficiently. The MIMD programmer
must consider the machine architecture or the program will not run
deterministically. Full symbolic debugging can also be supported on a
SIMD machine. Has anyone done a symbolic debugger for a large-scale MIMD
system?

Of course I am biased.

  Ian Kaplan
  MasPar Computer Corp.
  ian@maspar.com
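(A rough modern analogy to this contrast, sketched in Python, with NumPy
whole-array expressions standing in for a data-parallel SIMD language and
multiprocessing standing in for message-passing MIMD. The point is where
the explicit partitioning and synchronization live, not performance; all
names here are illustrative:)

    import numpy as np
    from multiprocessing import Pool

    def worker(chunk):
        # Each "node" runs the same computation on its own partition.
        return np.sqrt(chunk) + 1.0

    if __name__ == "__main__":
        a = np.arange(1000000, dtype=np.float64)

        # "SIMD style": one whole-array expression, no visible synchronization.
        simd_result = np.sqrt(a) + 1.0

        # "MIMD style": explicitly partition the data, fork workers, join.
        # The join (pool.map returning) is the synchronization point the
        # programmer must manage.
        with Pool(4) as pool:
            mimd_result = np.concatenate(pool.map(worker, np.array_split(a, 4)))

        assert np.allclose(simd_result, mimd_result)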
mccalpin@vax1.udel.edu (John D Mccalpin) (07/03/90)
In article <9508@hubcap.clemson.edu> xxremak@csduts1.lerc.nasa.gov
(David A. Remaklus) writes:

>In a recent conversation with some colleagues of mine at the Ames NAS
>facility concerning parallel processing, they mentioned their experiences
>porting a code to the Intel i860 hypercube located there (128 nodes,
>7.5 gigaFLOPS peak). On this particular code they were able to
>achieve about 300 MFLOPS for an efficiency factor of about 2.5%. This
>low efficiency factor didn't seem to bother them but it sure bothered
>me.

The question of efficiency is complicated in this case by the choice of
the i860 as the CPU. The peak performance quoted corresponds to about
60 MFLOPS/CPU, which may not be attainable even for optimally coded
assembly language routines. Preston Briggs at Rice University has spent
some time working on this processor, and on a real, live piece of
hardware he was unable to obtain more than about 33 MFLOPS for a
hand-coded 64-bit matrix-multiply kernel. Code compiled from FORTRAN
using existing compiler technology typically produced performance in the
2-5 MFLOPS range.

The 300 MFLOPS observed performance is about 2.3 MFLOPS/CPU, which may
indicate very good performance, all things considered. So a more
reasonable estimate of efficiency for this case is to look at the
parallel speedup. I would be surprised if one CPU gave better than
5 MFLOPS, so the "efficiency" in this case would be close to
50% = (300 MFLOPS)/(128 CPUs * 5 MFLOPS/CPU).
--
John D. McCalpin                       mccalpin@vax1.udel.edu
Assistant Professor                    mccalpin@delocn.udel.edu
College of Marine Studies, U. Del.     mccalpin@scri1.scri.fsu.edu
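(McCalpin's alternative measure, as arithmetic: divide the aggregate rate
by the CPU count times a realistic single-CPU rate, rather than by the
nominal peak. A sketch in Python using the numbers from his post; note
the 5 MFLOPS/CPU figure is his estimate, not a measurement:)

    def parallel_efficiency(aggregate_mflops, n_cpus, per_cpu_mflops):
        # Speedup over one CPU, divided by the number of CPUs.
        return aggregate_mflops / (n_cpus * per_cpu_mflops)

    print("%.0f%%" % (100.0 * parallel_efficiency(300.0, 128, 5.0)))  # ~47%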
dmcmilla@cfctech.cfc.com (Don McMillan CS 50) (07/03/90)
In article <9508@hubcap.clemson.edu>, xxremak@csduts1.lerc.nasa.gov
(David A. Remaklus) writes:

|> It is our contention that it is necessary to achieve an efficiency factor
|> of at least 50% before the particular implementation of the code can be
|> considered appropriate for execution on that parallel processor system.
|> What are your opinions on this matter? Are there any published papers
|> that deal with this subject?

You're in good company. See "Speedup Versus Efficiency in Parallel
Systems", IEEE Trans. on Computers, vol. 38, no. 3, March 1989.
Basically, the authors define a method for determining the "average
parallelism" of a given algorithm, and from that a way of selecting the
most appropriate number of processors such that at least 50% of the
maximum possible speedup is attained, with at least 50% efficiency.

Don McMillan          __  .  .   Phone:    (313) 986-1436
CS Department        /  ` |\ /|  UUCP:     {umich,cfctech}!rphroy!rcsuna!dmcmilla
GM Research Labs     |  ,_ | | | CSNet:    mcmillan@gmr.com
Warren, MI 48090 USA \__/  | |   Internet: dmcmilla%rcsuna.uucp@umich.edu
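(For reference, the bounds in that paper are simple enough to state in a
few lines. Assuming A is the algorithm's "average parallelism", the
authors show that speedup on n processors is at least nA/(n+A-1) and
efficiency is at least A/(n+A-1); running on n = A processors then
guarantees both at least half the maximum speedup and at least 50%
efficiency. A sketch in Python, with A chosen arbitrarily:)

    def speedup_lower_bound(n, A):
        # Eager/Zahorjan/Lazowska lower bound on speedup with n processors,
        # given average parallelism A.
        return (n * A) / (n + A - 1.0)

    def efficiency_lower_bound(n, A):
        return speedup_lower_bound(n, A) / n

    A = 20.0       # hypothetical average parallelism of some algorithm
    n = int(A)     # the recommended operating point: n = A
    print(speedup_lower_bound(n, A) / A)   # >= 0.5 of the maximum speedup
    print(efficiency_lower_bound(n, A))    # >= 0.5 efficiency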
carroll@beaver.cs.washington.edu (Jeff Carroll) (07/05/90)
In article <9521@hubcap.clemson.edu> argosy!ian@decwrl.dec.com
(Ian L. Kaplan) writes:

> (David Remaklus writes:)
>>In a recent conversation with some colleagues of mine at the Ames NAS
>>facility concerning parallel processing, they mentioned their experiences
>>porting a code to the Intel i860 hypercube located there (128 nodes,
>>7.5 gigaFLOPS peak). On this particular code they were able to
>>achieve about 300 MFLOPS for an efficiency factor of about 2.5%. This
>>low efficiency factor didn't seem to bother them but it sure bothered
>>me. Other colleagues of ours at the United Technologies Research Center
>>in East Hartford, CT ported similar codes to their 1/4 CM-2 and achieved
>>anywhere from 600 to 800 MFLOPS for an efficiency factor of more than 50%.

We have an application that runs at roughly 70% efficiency on our
iPSC/860. Email me for details.

> Perhaps the difference in the execution efficiency between the Intel
>cube (an MIMD machine) and the CM-2 (a SIMD machine) is due to the
>fact (no doubt hotly contested) that SIMD systems are easier to
>program. Easier to program also means easier to fit one's problem to...

I thought marketing was taboo on the net. :^)

In this case I think it's far more likely that the low efficiencies are
due to the fact that there are no good market-ready compilers for the
i860 as yet.

> A term like "ease of programming" is often used without giving much
>definition, so I will try to flesh out my claims. One definition of
>ease of programming is that much of the machine architecture is
>abstracted and the programmer can think about writing a program that
>describes the problem rather than thinking about shoehorning the
>problem onto the machine.

Maybe. But for problems that defy application of a data-parallel
algorithm (and thus run at very ordinary speeds on a data-parallel
machine), one is quickly persuaded to learn to use a shoehorn.

>... SIMD systems can be programmed in
>_standard_ Fortran 90. MIMD systems can only be programmed in a
>language that contains extensions for synchronization.

Well, no. An iPSC can be programmed in standard FORTRAN 77 (who uses
FORTRAN 90 anyway?). Think of a network of Unix-like systems supporting
an RPC mechanism; that's the way you program an iPSC. Once you've
finished decomposing your problem, it's no more painful than writing
FORTRAN under VMS (groan...). What you say is true of some (if not all)
other MIMD systems.

> ...The SIMD
>programmer need only consider machine architecture when it comes to
>making their program run more efficiently. The MIMD programmer must
>consider the machine architecture or the program will not run
>deterministically.

Granted: but, in my opinion, the rewards are great.

> Of course I am biased.
>
>   Ian Kaplan
>   MasPar Computer Corp.
>   ian@maspar.com

Let's hear it for truth in advertising.

Jeff Carroll
carroll@atc.boeing.com

disclaimer #1: I am not associated with Intel Corporation, except as a
satisfied customer.
disclaimer #2: These are personal opinions, not official positions of
the Boeing Company (though the two may coincide, especially regarding
such things as FORTRAN 90).
xxremak@csduts1.lerc.nasa.gov (David A. Remaklus) (07/14/90)
In article <9661@hubcap.clemson.edu> dfk@grad13.cs.duke.edu
(David F. Kotz) writes:

>What is difficult about programming many MIMD (or SIMD) systems is to
>make an *efficient* program (the original point of this thread of
>discussion). Just getting your program to run is not always that
>hard, although of course some systems/languages are easier than others
>in this respect. To make an efficient program, unfortunately, one must
>usually pay attention to architectural details.

After the original posting, I received a number of comments by email. It
seems that we all have different definitions of efficiency as it applies
to parallel processing. I think the core issue that needs to be
considered (which is related to efficiency) is determining the
"appropriateness" of executing a given application/algorithm on a given
architecture and parallel system. For example, few (I would think) would
dispute that it is inappropriate to run an entirely or predominantly
scalar code on a CRAY supercomputer. It all comes down to whether or not
the application/algorithm can take advantage of the resources presented
to it by the architecture/machine. If it can't, then it doesn't belong
there.
--
David A. Remaklus
NASA Lewis Research Center
Cleveland, Ohio 44135
xxremak@csduts1.lerc.nasa.gov