dakramer@phoenix.princeton.edu (David Anthony Kramer) (10/25/89)
This is the summary of the replies to my posting of a couple of weeks ago.
Sorry that this is a little late, but I have been a little busy. My
original posting went like this:

>I am looking for benchmarks of FFT and matrix multiplication algorithms
>running on parallel machines such as CM, MPP, DAP etc, as well as
>on fast vector processors such as Crays, Cybers etc. I am particularly
>interested in how performance is affected by increase in the number
>of bits per data element. Are there any landmark papers in this area?
>I would also appreciate any actual data that someone may have measured
>off one or more actual machines.

I received several replies with actual benchmarks, and several pointers
to papers. The papers are:

----------------------------------------------------------------------------
From israel@neat.cs.toronto.edu there is a paper by Jack Dongarra called
"Performance of Various Computers Using Standard Linear Equations Software
in a Fortran Environment". You can get it by writing the

    Math and Comp. Sci. Division
    Argonne National Labs
    Argonne, Illinois 60439

It seems that Prof. Dongarra is no longer at Argonne, and Jan Griffin of
Argonne sent me this letter:

    We are in receipt of your letter of 10/9 inquiring about a copy of
    the paper by Jack Dongarra on Performance of Various Computers.
    Dr. Dongarra is no longer with Argonne. I suggest you call his
    secretary, Mary Drake, at the Univ. of Tennessee to request a copy.
    Her # is 615-974-5067.

----------------------------------------------------------------------------
Peter Montgomery of UCLA (pmontgom@math.ucla.edu) sent me the following
hint:

I am writing an integer FFT (more precisely, one which does an FFT (of
order circa 16384) over the ring of integers mod N, where N is a composite
integer of unknown factorization with approximately 200 decimal digits).
The initial target machine is an Alliant (a vector machine); at our site
up to six processors run in parallel.
Do you want data on this when I get it? What do you mean by "bits per
data element"?

----------------------------------------------------------------------------
Bertil Svensson (bertil@sm.luth.se) of Lulea Univ. of Technology,
Div. of Comp. Eng., S-95187 LULEA, Sweden had the following advice:

In a parallel processor implementation project, LUCAS, at University of
Lund, Sweden, a few years ago, we also ran and studied FFT and matrix
mult. The machine has bit-serial PEs and a perfect shuffle/exchange
network. It is all well documented in volume 216 of Springer-Verlag's
Lecture Notes in Computer Science:

    Fernstrom C, Kruzela I, and Svensson B: "LUCAS Associative Array
    Processor. Design, Programming and Application Studies".

See Chapter 7 of the book, where you also find the effects of different
data length, and Chapter 10, where a suggestion for the inclusion of a
bit-serial multiplier is made. Hope you can find the book if you are
interested. I can send you copies of the relevant parts and additional
information if you like.

----------------------------------------------------------------------------
John M.A. Roy (714) 856-5039
ICS Dept., Univ. Calif., Irvine CA 92717
Internet: roy@ics.uci.edu
sent the following:

For Cray performance you might try: "Performance Comparison of the CRAY
X-MP/24 with SDD and the CRAY-2" in The Journal of Supercomputing, Vol. 1,
pp 409-419 (1988) by Anderson, Grimes, and Simon. They ran two 1-D and
one 2-D FFTs (along with 21 other benchmarks). I've not looked at it in
detail, but they did run FFT benchmarks on real machines and do give a
lot of detail about their procedures.
----------------------------------------------------------------------------
Rich Stamm's (argosy!stamm@decwrl.dec.com) contribution:

We at MasPar are aware of a couple of sources of information on FFT's and
matrix multiplication on parallel machines:

  - "Scientific Applications on the Connection Machine", edited by
    Horst Simon, NASA Ames Research Center.
    Publisher: World Scientific, 687 Hartwell Street, Teaneck, NJ 07666

  - "Frontiers of Massively Parallel Scientific Computation",
    NASA Goddard Spaceflight Center, 1986.
    Editor: Jim Fischer, Code 635, Greenbelt MD 20771, phone: 301-286-3221

  - "Frontiers of Massively Parallel Computation", IEEE Computer Society,
    Order Number 892, IEEE Catalog Number 88CH2649-2

I hope these are useful to you. If you would like any information on our
massively parallel SIMD computer, please contact Will Workman at
301-571-9480.

----------------------------------------------------------------------------
Swami (swm@unccvax.uncc.edu) sent the following:

The following reference may be of interest to you, though it is on a MIMD
machine.

    Alan Norton and A. Silberger, "Parallelization and performance
    analysis of the Cooley-Tukey FFT algorithm for shared memory
    architectures", IEEE Transactions on Computers, Vol. C-36, No. 5,
    May 1987

----------------------------------------------------------------------------
And now for the benchmarks. It should be noted that these are fairly
varied, and each has to be evaluated carefully lest we compare apples with
crays (excuse me, I couldn't help that). I have therefore included all of
the explanations leading up to the benchmarks that I have. This makes for
a rather long summary, I am afraid.

----------------------------------------------------------------------------
Howard Wasserman of LANL (hjw%alpha@LANL.GOV) sent the following:

We have a benchmark of each type that we have run on many machines. I've
excerpted part of a report that contains results below.
The "excerption" was kind of quick and dirty, since you were only
interested in these two codes and we have others. So the table may look
funny. If you want the full report, it is:

    H. J. Wasserman, ``Los Alamos National Laboratory Computer
    Benchmarking 1988,'' Los Alamos National Laboratory Report
    LA-11465-MS (1988).

If you want the codes I can mail them to you.

_______________________________________________________________________________
Table 3. Benchmark Execution Times (in seconds) for CRAY-1S, CRAY X-MP/48,
         CRAY X-MP/416, and CRAY X-MP/SE
_______________________________________________________________________________
Code     CRAY-1S    X-MP/48    X-MP/416   X-MP/416   X-MP/416   X-MP/SE
         UNICOS     CTSS       COS        CTSS       CTSS       UNICOS
         CFT77 2.0  CFT77 2.0  CFT 1.14   CFT 1.14   CFT77 2.0  CFT77 2.0
_______________________________________________________________________________
FFT        5.6        4.4        4.4        4.3        3.9       10.2
MATRIX    65.6       41.9       36.2       54.7       34.9       48.9

_____________________________________________________________
Table 4. Benchmark Execution Times (in seconds) for CRAY-2
_____________________________________________________________
Code     120-ns DRAMa  80-ns DRAM   55-ns SRAMb  55-ns SRAM
         (SN 2003)     (SN 2011)    (SN 2012)    (SN 2012)
         CFT77 1.3     CFT77 1.3    CFT77 1.3    CFT77 2.0
_____________________________________________________________
FFT        10.7          10.5          9.6          4.1
MATRIX     61.9          59.4         57.3         57.0

________________________________________________
Table 7. Benchmark Execution Times (in seconds)
         for NEC SX-2, Hitachi S-820/80, and ETA-10
________________________________________________
Code     NEC SX-2   Hitachi S-820/80   ETA-10
________________________________________________
FFT        2.8          2.0              8.2
MATRIX     2.5         25.6            109.0

_______________________________________________________________________________
Table 8. Benchmark Execution Times (in seconds) for the Alliant FX/8,
         Alliant FX/80, SCS-40, Multiflow Trace 7/200, CONVEX C-1, and
         DEC VAX 8600
_______________________________________________________________________________
         Alliant FX/8    Alliant FX/80            Multiflow    Convex  DEC VAX
Code     1 CE    8 CEs   1 CE    8 CEs   SCS-40   Trace 7/200  C-1     8600
_______________________________________________________________________________
FFT       82.0    40.0    71.9    38.3    21.0      25.3        97.0   222.2
MATRIX   953.1   156.5   815.0   114.0   167.0     479.1       529.0     -

FFT:     A fast Fourier transform (FFT) code that is highly vectorizable.
         This code measures the speed of single 64-point transforms.
         Because it executes many operations with short vector lengths,
         it is very sensitive to vector start-up times. FFT library
         routines supplied by supercomputer manufacturers generally
         perform multiple FFTs at much higher execution rates than this
         benchmark code. No I/O is performed.

MATRIX:  Basic matrix operations, including multiplication and transpose,
         on matrices of order 100. The code is highly vectorizable but
         not optimized for vector computers.

----------------------------------------------------------------------------
Dominic Herity (dherity%cs.tcd.ie)
Distributed Systems Group
Computer Science Dept.
Trinity College Dublin
Ireland
sent the following comparison:

I have done a comparison of speedup on FFT between an Intel Hypercube
(32 x 8MHz '286s with '287s) and a multi-CPU VAX using 4 CPUs on a Qbus.
I limited data to 1K points to avoid exceeding 64K data on the '286.
Code was in C with lotsa #ifdefs to handle library differences.
I found that with up to 4 CPUs used, the VAX setup gave similar speedup
and that using more CPUs on the hypercube gave little or negative
improvement in runtime. Here are (briefly) results.
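[Note: the speedup figures in Herity's Tables 3 and 4 below appear to be
the single-CPU runtime divided by the runtime on p CPUs. The following is
only a minimal Python sketch of that arithmetic, assuming this definition;
the function name `speedup` and the dictionary layout are illustrative and
are not part of Herity's C code.]

```python
def speedup(runtimes_ms):
    """Return {cpus: T(1) / T(cpus)} rounded to two decimals.

    runtimes_ms maps a CPU count to a measured runtime in milliseconds
    and must contain an entry for 1 CPU (the baseline).
    """
    base = runtimes_ms[1]
    return {cpus: round(base / t, 2) for cpus, t in runtimes_ms.items()}


# Hypercube runtimes (ms) for the 1024-point FFT, copied from TABLE 1.
hypercube_1024 = {1: 3030, 2: 1870, 4: 1310, 8: 1055, 16: 1010, 32: 1050}

for cpus, s in speedup(hypercube_1024).items():
    print(f"{cpus:2d} CPUs: speedup {s:.2f}")
```

Under that assumption the 1024-point row of TABLE 3 is reproduced, which
suggests (but does not prove) that this is how the tables were derived.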
        CPUS      1      2      4      8     16     32
POINTS
    64          170    125    160    185    175      -
   256          660    425    320    320    375    400
  1024         3030   1870   1310   1055   1010   1050

        TABLE 1 : Hypercube runtimes (ms)

        CPUS      1      2      4
POINTS
    64          100    100     90
   256          300    210    170
  1024         1300    790    550

        TABLE 2 : MicroVAX runtimes (ms)

        CPUS      1      2      4      8     16     32
POINTS
    64         1.00   1.36   1.06   0.92   0.97      -
   256         1.00   1.55   2.06   2.06   1.76   1.65
  1024         1.00   1.62   2.31   2.87   3.00   2.89

        TABLE 3 : Hypercube speedup

        CPUS      1      2      4
POINTS
    64         1.00   1.00   1.11
   256         1.00   1.43   1.76
  1024         1.00   1.65   2.36

        TABLE 4 : MicroVAX speedup

----------------------------------------------------------------------------
Paul K. Rodman / KA1ZA / rodman@multiflow.com
Multiflow Computer, Inc. Tel. 203 488 6090 x 236
Branford, Ct. 06405
submitted the following benchmarks:

I happened to write the FFT library for the Trace 7, 14 and 28/300 VLIW
minisupercomputers. Here are the performance numbers for 32 and 64 bit,
complex to complex, 1d, 2d and 3d, ffts, (whew!)...

The times INCLUDE system time (mapping the pages for a large FFT takes a
little bit of time, mustn't cheat and not count it!).

"Mfl" is the achieved megaflops on the hardware. "rad2" is the equivalent
radix-2 algorithm megaflops you would need to equal my radix4/2 code.
(Radix 4 is about 17% faster).

-------------------------------------------------------------
Trace 7/300 single precision 1d fft times:
starting power of 2 = 3   ending power of 2 = 20
n=       8  cfft =    0.0209 ms.  Mfl =   4.0  rad2 =   4.7
n=      16  cfft =    0.0316 ms.  Mfl =   8.6  rad2 =  10.1
n=      32  cfft =    0.0503 ms.  Mfl =  12.1  rad2 =  14.2
n=      64  cfft =    0.0913 ms.  Mfl =  17.9  rad2 =  21.0
n=     128  cfft =    0.1779 ms.  Mfl =  19.8  rad2 =  23.3
n=     256  cfft =    0.3682 ms.  Mfl =  23.6  rad2 =  27.8
n=     512  cfft =    0.7791 ms.  Mfl =  23.7  rad2 =  27.8
n=    1024  cfft =    1.6792 ms.  Mfl =  25.9  rad2 =  30.5
n=    2048  cfft =    3.6028 ms.  Mfl =  25.3  rad2 =  29.8
n=    4096  cfft =    7.7836 ms.  Mfl =  26.8  rad2 =  31.6
n=    8192  cfft =   16.6511 ms.  Mfl =  26.1  rad2 =  30.7
n=   16384  cfft =   35.7757 ms.  Mfl =  27.2  rad2 =  32.1
n=   32768  cfft =   75.7264 ms.  Mfl =  26.6  rad2 =  31.3
n=   65536  cfft =  161.9817 ms.  Mfl =  27.5  rad2 =  32.4
n=  131072  cfft =  342.1053 ms.  Mfl =  26.8  rad2 =  31.6
n=  262144  cfft =  722.2223 ms.  Mfl =  27.8  rad2 =  32.7
n=  524288  cfft = 1514.5286 ms.  Mfl =  27.2  rad2 =  32.0
n= 1048576  cfft = 3185.8552 ms.  Mfl =  28.0  rad2 =  32.9

Trace 14/300 single precision 1d fft times:
starting power of 2 = 3   ending power of 2 = 20
n=       8  cfft =    0.0217 ms.  Mfl =   3.9  rad2 =   4.5
n=      16  cfft =    0.0262 ms.  Mfl =  10.4  rad2 =  12.2
n=      32  cfft =    0.0378 ms.  Mfl =  16.1  rad2 =  18.9
n=      64  cfft =    0.0612 ms.  Mfl =  26.7  rad2 =  31.4
n=     128  cfft =    0.1079 ms.  Mfl =  32.6  rad2 =  38.4
n=     256  cfft =    0.2098 ms.  Mfl =  41.5  rad2 =  48.8
n=     512  cfft =    0.4266 ms.  Mfl =  43.2  rad2 =  50.8
n=    1024  cfft =    0.8981 ms.  Mfl =  48.5  rad2 =  57.0
n=    2048  cfft =    1.9074 ms.  Mfl =  47.8  rad2 =  56.2
n=    4096  cfft =    4.0974 ms.  Mfl =  51.0  rad2 =  60.0
n=    8192  cfft =    8.6544 ms.  Mfl =  50.2  rad2 =  59.0
n=   16384  cfft =   18.5307 ms.  Mfl =  52.6  rad2 =  61.9
n=   32768  cfft =   39.1996 ms.  Mfl =  51.4  rad2 =  60.5
n=   65536  cfft =   83.7445 ms.  Mfl =  53.2  rad2 =  62.6
n=  131072  cfft =  176.3980 ms.  Mfl =  52.0  rad2 =  61.2
n=  262144  cfft =  370.6140 ms.  Mfl =  54.1  rad2 =  63.7
n=  524288  cfft =  776.8641 ms.  Mfl =  53.0  rad2 =  62.3
n= 1048576  cfft = 1635.6908 ms.  Mfl =  54.5  rad2 =  64.1

Trace 28/300 single precision 1d fft times:
starting power of 2 = 3   ending power of 2 = 20
n=       8  cfft =    0.0244 ms.  Mfl =   3.4  rad2 =   4.0
n=      16  cfft =    0.0293 ms.  Mfl =   9.3  rad2 =  10.9
n=      32  cfft =    0.0336 ms.  Mfl =  18.1  rad2 =  21.3
n=      64  cfft =    0.0486 ms.  Mfl =  33.6  rad2 =  39.5
n=     128  cfft =    0.0778 ms.  Mfl =  45.2  rad2 =  53.2
n=     256  cfft =    0.1340 ms.  Mfl =  64.9  rad2 =  76.4
n=     512  cfft =    0.2646 ms.  Mfl =  69.7  rad2 =  81.9
n=    1024  cfft =    0.5196 ms.  Mfl =  83.8  rad2 =  98.5
n=    2048  cfft =    1.1107 ms.  Mfl =  82.1  rad2 =  96.5
n=    4096  cfft =    2.2842 ms.  Mfl =  91.5  rad2 = 107.6
n=    8192  cfft =    5.0909 ms.  Mfl =  85.3  rad2 = 100.3
n=   16384  cfft =   10.2632 ms.  Mfl =  95.0  rad2 = 111.7
n=   32768  cfft =   22.3112 ms.  Mfl =  90.3  rad2 = 106.3
n=   65536  cfft =   45.9704 ms.  Mfl =  96.9  rad2 = 114.0
n=  131072  cfft =   99.7807 ms.  Mfl =  92.0  rad2 = 108.2
n=  262144  cfft =  204.0159 ms.  Mfl =  98.3  rad2 = 115.6
n=  524288  cfft =  438.3224 ms.  Mfl =  93.9  rad2 = 110.5
n= 1048576  cfft =  901.5899 ms.  Mfl =  98.9  rad2 = 116.3

Trace 7/300 double precision 1d fft times:
starting power of 2 = 3   ending power of 2 = 20
n=       8  dcfft =    0.0234 ms.  Mfl =   3.6  rad2 =   4.2
n=      16  dcfft =    0.0411 ms.  Mfl =   6.6  rad2 =   7.8
n=      32  dcfft =    0.0749 ms.  Mfl =   8.1  rad2 =   9.6
n=      64  dcfft =    0.1514 ms.  Mfl =  10.8  rad2 =  12.7
n=     128  dcfft =    0.3150 ms.  Mfl =  11.2  rad2 =  13.1
n=     256  dcfft =    0.6866 ms.  Mfl =  12.7  rad2 =  14.9
n=     512  dcfft =    1.4806 ms.  Mfl =  12.4  rad2 =  14.6
n=    1024  dcfft =    3.2431 ms.  Mfl =  13.4  rad2 =  15.8
n=    2048  dcfft =    7.0022 ms.  Mfl =  13.0  rad2 =  15.3
n=    4096  dcfft =   15.1922 ms.  Mfl =  13.8  rad2 =  16.2
n=    8192  dcfft =   32.6350 ms.  Mfl =  13.3  rad2 =  15.7
n=   16384  dcfft =   70.1754 ms.  Mfl =  13.9  rad2 =  16.3
n=   32768  dcfft =  149.0771 ms.  Mfl =  13.5  rad2 =  15.9
n=   65536  dcfft =  317.1601 ms.  Mfl =  14.1  rad2 =  16.5
n=  131072  dcfft =  672.3318 ms.  Mfl =  13.6  rad2 =  16.1
n=  262144  dcfft = 1421.8750 ms.  Mfl =  14.1  rad2 =  16.6
n=  524288  dcfft = 3128.8379 ms.  Mfl =  13.2  rad2 =  15.5
n= 1048576  dcfft = 6471.4912 ms.  Mfl =  13.8  rad2 =  16.2

Trace 14/300 double precision 1d fft times:
starting power of 2 = 3   ending power of 2 = 20
n=       8  dcfft =    0.0196 ms.  Mfl =   4.3  rad2 =   5.1
n=      16  dcfft =    0.0294 ms.  Mfl =   9.2  rad2 =  10.9
n=      32  dcfft =    0.0491 ms.  Mfl =  12.4  rad2 =  14.6
n=      64  dcfft =    0.0896 ms.  Mfl =  18.2  rad2 =  21.4
n=     128  dcfft =    0.1751 ms.  Mfl =  20.1  rad2 =  23.6
n=     256  dcfft =    0.3667 ms.  Mfl =  23.7  rad2 =  27.9
n=     512  dcfft =    0.7762 ms.  Mfl =  23.7  rad2 =  27.9
n=    1024  dcfft =    1.6808 ms.  Mfl =  25.9  rad2 =  30.5
n=    2048  dcfft =    3.5950 ms.  Mfl =  25.4  rad2 =  29.8
n=    4096  dcfft =    7.7551 ms.  Mfl =  26.9  rad2 =  31.7
n=    8192  dcfft =   16.6725 ms.  Mfl =  26.0  rad2 =  30.6
n=   16384  dcfft =   35.7785 ms.  Mfl =  27.2  rad2 =  32.1
n=   32768  dcfft =   75.6102 ms.  Mfl =  26.7  rad2 =  31.4
n=   65536  dcfft =  160.3618 ms.  Mfl =  27.8  rad2 =  32.7
n=  131072  dcfft =  342.1053 ms.  Mfl =  26.8  rad2 =  31.6
n=  262144  dcfft =  723.4101 ms.  Mfl =  27.7  rad2 =  32.6
n=  524288  dcfft = 1532.3464 ms.  Mfl =  26.9  rad2 =  31.6
n= 1048576  dcfft = 3242.8730 ms.  Mfl =  27.5  rad2 =  32.3

Trace 28/300 double precision 1d fft times:
starting power of 2 = 3   ending power of 2 = 20
n=       8  dcfft =    0.0221 ms.  Mfl =   3.8  rad2 =   4.5
n=      16  dcfft =    0.0259 ms.  Mfl =  10.5  rad2 =  12.4
n=      32  dcfft =    0.0396 ms.  Mfl =  15.4  rad2 =  18.1
n=      64  dcfft =    0.0629 ms.  Mfl =  25.9  rad2 =  30.5
n=     128  dcfft =    0.1167 ms.  Mfl =  30.2  rad2 =  35.5
n=     256  dcfft =    0.2175 ms.  Mfl =  40.0  rad2 =  47.1
n=     512  dcfft =    0.4603 ms.  Mfl =  40.0  rad2 =  47.1
n=    1024  dcfft =    0.9302 ms.  Mfl =  46.8  rad2 =  55.0
n=    2048  dcfft =    2.0565 ms.  Mfl =  44.3  rad2 =  52.1
n=    4096  dcfft =    4.2072 ms.  Mfl =  49.7  rad2 =  58.4
n=    8192  dcfft =    9.3226 ms.  Mfl =  46.6  rad2 =  54.8
n=   16384  dcfft =   19.1722 ms.  Mfl =  50.8  rad2 =  59.8
n=   32768  dcfft =   41.9885 ms.  Mfl =  48.0  rad2 =  56.5
n=   65536  dcfft =   86.5954 ms.  Mfl =  51.5  rad2 =  60.5
n=  131072  dcfft =  187.9797 ms.  Mfl =  48.8  rad2 =  57.4
n=  262144  dcfft =  450.7950 ms.  Mfl =  44.5  rad2 =  52.3
n=  524288  dcfft =  883.7720 ms.  Mfl =  46.6  rad2 =  54.8
n= 1048576  dcfft = 1767.5441 ms.  Mfl =  50.4  rad2 =  59.3

Trace 7/300 single precision 2d fft times:
Starting power of 2 = 3   ending power of 2 = 11
n=    8  2d cfft =  0.00030 sec,  Mflops =   4.5  rad2 =   5.3
n=   16  2d cfft =  0.00074 sec,  Mflops =  11.7  rad2 =  13.8
n=   32  2d cfft =  0.00275 sec,  Mflops =  14.2  rad2 =  16.7
n=   64  2d cfft =  0.01069 sec,  Mflops =  19.5  rad2 =  23.0
n=  128  2d cfft =  0.04327 sec,  Mflops =  20.8  rad2 =  24.5
n=  256  2d cfft =  0.18353 sec,  Mflops =  24.3  rad2 =  28.6
n=  512  2d cfft =  0.78399 sec,  Mflops =  24.1  rad2 =  28.3
n= 1024  2d cfft =  3.40680 sec,  Mflops =  26.2  rad2 =  30.8
n= 2048  2d cfft = 15.35197 sec,  Mflops =  24.3  rad2 =  28.6

Trace 14/300 single precision 2d fft times:
Starting power of 2 = 3   ending power of 2 = 11
n=    8  2d cfft =  0.00021 sec,  Mflops =   6.3  rad2 =   7.4
n=   16  2d cfft =  0.00057 sec,  Mflops =  15.2  rad2 =  17.9
n=   32  2d cfft =  0.00172 sec,  Mflops =  22.7  rad2 =  26.7
n=   64  2d cfft =  0.00741 sec,  Mflops =  28.2  rad2 =  33.2
n=  128  2d cfft =  0.02444 sec,  Mflops =  36.9  rad2 =  43.4
n=  256  2d cfft =  0.09978 sec,  Mflops =  44.7  rad2 =  52.5
n=  512  2d cfft =  0.41338 sec,  Mflops =  45.7  rad2 =  53.7
n= 1024  2d cfft =  1.78180 sec,  Mflops =  50.0  rad2 =  58.8
n= 2048  2d cfft =  7.86842 sec,  Mflops =  47.4  rad2 =  55.8

Trace 28/300 single precision 2d fft times:
Starting power of 2 = 3   ending power of 2 = 11
n=    8  2d cfft =  0.00021 sec,  Mflops =   6.3  rad2 =   7.4
n=   16  2d cfft =  0.00057 sec,  Mflops =  15.2  rad2 =  17.9
n=   32  2d cfft =  0.00129 sec,  Mflops =  30.2  rad2 =  35.5
n=   64  2d cfft =  0.00428 sec,  Mflops =  48.8  rad2 =  57.5
n=  128  2d cfft =  0.01527 sec,  Mflops =  59.0  rad2 =  69.4
n=  256  2d cfft =  0.05524 sec,  Mflops =  80.7  rad2 =  94.9
n=  512  2d cfft =  0.24945 sec,  Mflops =  75.7  rad2 =  89.0
n= 1024  2d cfft =  0.96930 sec,  Mflops =  92.0  rad2 = 108.2
n= 2048  2d cfft =  4.49013 sec,  Mflops =  83.1  rad2 =  97.8
(4k x 4k also run recently : 17.7 seconds 96.6 megaflops)

Trace 7/300 double precision 2d fft times:
Starting power of 2 = 3   ending power of 2 = 11
n=    8  2d dcfft =  0.00034 sec,  Mflops =   3.9  rad2 =   4.6
n=   16  2d dcfft =  0.00126 sec,  Mflops =   6.9  rad2 =   8.1
n=   32  2d dcfft =  0.00464 sec,  Mflops =   8.4  rad2 =   9.9
n=   64  2d dcfft =  0.01910 sec,  Mflops =  10.9  rad2 =  12.9
n=  128  2d dcfft =  0.08044 sec,  Mflops =  11.2  rad2 =  13.2
n=  256  2d dcfft =  0.35101 sec,  Mflops =  12.7  rad2 =  14.9
n=  512  2d dcfft =  1.51809 sec,  Mflops =  12.4  rad2 =  14.6
n= 1024  2d dcfft =  6.65680 sec,  Mflops =  13.4  rad2 =  15.8
n= 2048  2d dcfft = 28.82237 sec,  Mflops =  13.0  rad2 =  15.2

Trace 14/300 double precision 2d fft times:
Starting power of 2 = 3   ending power of 2 = 11
n=    8  2d dcfft =  0.00021 sec,  Mflops =   6.3  rad2 =   7.4
n=   16  2d dcfft =  0.00086 sec,  Mflops =  10.1  rad2 =  11.9
n=   32  2d dcfft =  0.00266 sec,  Mflops =  14.6  rad2 =  17.2
n=   64  2d dcfft =  0.01055 sec,  Mflops =  19.8  rad2 =  23.3
n=  128  2d dcfft =  0.04276 sec,  Mflops =  21.1  rad2 =  24.8
n=  256  2d dcfft =  0.18174 sec,  Mflops =  24.5  rad2 =  28.8
n=  512  2d dcfft =  0.77686 sec,  Mflops =  24.3  rad2 =  28.6
n= 1024  2d dcfft =  3.43531 sec,  Mflops =  25.9  rad2 =  30.5
n= 2048  2d dcfft = 14.69627 sec,  Mflops =  25.4  rad2 =  29.9

Trace 28/300 double precision 2d fft times:
Starting power of 2 = 3   ending power of 2 = 11
n=    8  2d dcfft =  0.00021 sec,  Mflops =   6.3  rad2 =   7.4
n=   16  2d dcfft =  0.00063 sec,  Mflops =  13.8  rad2 =  16.3
n=   32  2d dcfft =  0.00180 sec,  Mflops =  21.6  rad2 =  25.4
n=   64  2d dcfft =  0.00627 sec,  Mflops =  33.3  rad2 =  39.2
n=  128  2d dcfft =  0.02545 sec,  Mflops =  35.4  rad2 =  41.6
n=  256  2d dcfft =  0.09800 sec,  Mflops =  45.5  rad2 =  53.5
n=  512  2d dcfft =  0.44189 sec,  Mflops =  42.7  rad2 =  50.3
n= 1024  2d dcfft =  1.79605 sec,  Mflops =  49.6  rad2 =  58.4
n= 2048  2d dcfft =  8.08224 sec,  Mflops =  46.2  rad2 =  54.3
(4k x 4k also run recently : 32.8 seconds 52.2 megaflops)

Trace 7/300 single precision 3d fft times:
Starting power of 2 = 3   ending power of 2 = 7
n=   8  time for 3d cfft =  0.00300 sec,  Mflops =  5.376
n=  16  time for 3d cfft =  0.02167 sec,  Mflops =  9.641
n=  32  time for 3d cfft =  0.14500 sec,  Mflops = 12.881
n=  64  time for 3d cfft =  1.21667 sec,  Mflops = 16.483
n= 128  time for 3d cfft = 10.00000 sec,  Mflops = 17.302

Trace 14/300 single precision 3d fft times:
Starting power of 2 = 3   ending power of 2 = 7
n=   8  time for 3d cfft =  0.00250 sec,  Mflops =  6.451
n=  16  time for 3d cfft =  0.01333 sec,  Mflops = 15.667
n=  32  time for 3d cfft =  0.08667 sec,  Mflops = 21.551
n=  64  time for 3d cfft =  0.70000 sec,  Mflops = 28.649
n= 128  time for 3d cfft =  5.63333 sec,  Mflops = 30.713

Trace 28/300 single precision 3d fft times:
Starting power of 2 = 3   ending power of 2 = 7
n=   8  time for 3d cfft =  0.00283 sec,  Mflops =  5.692
n=  16  time for 3d cfft =  0.01167 sec,  Mflops = 17.905
n=  32  time for 3d cfft =  0.06167 sec,  Mflops = 30.288
n=  64  time for 3d cfft =  0.48333 sec,  Mflops = 41.491
n= 128  time for 3d cfft =  3.71667 sec,  Mflops = 46.551
(256 ** 3 case run recently, 24.4s, 70 mflps)

Trace 7/300 double precision 3d fft times:
Starting power of 2 = 3   ending power of 2 = 7
n=   8  time for 3d dcfft =  0.00500 sec,  Mflops =  3.226
n=  16  time for 3d dcfft =  0.03500 sec,  Mflops =  5.968
n=  32  time for 3d dcfft =  0.26333 sec,  Mflops =  7.093
n=  64  time for 3d dcfft =  2.28333 sec,  Mflops =  8.783
n= 128  time for 3d dcfft = 21.31667 sec,  Mflops =  8.116

Trace 14/300 double precision 3d fft times:
Starting power of 2 = 3   ending power of 2 = 7
n=   8  time for 3d dcfft =  0.00300 sec,  Mflops =  5.376
n=  16  time for 3d dcfft =  0.02167 sec,  Mflops =  9.641
n=  32  time for 3d dcfft =  0.14500 sec,  Mflops = 12.881
n=  64  time for 3d dcfft =  1.28333 sec,  Mflops = 15.627
n= 128  time for 3d dcfft = 10.61667 sec,  Mflops = 16.297

Trace 28/300 double precision 3d fft times:
Starting power of 2 = 3   ending power of 2 = 7
n=   8  time for 3d dcfft =  0.00250 sec,  Mflops =  6.451
n=  16  time for 3d dcfft =  0.01333 sec,  Mflops = 15.667
n=  32  time for 3d dcfft =  0.08833 sec,  Mflops = 21.145
n=  64  time for 3d dcfft =  0.80000 sec,  Mflops = 25.068
n= 128  time for 3d dcfft =  6.70000 sec,  Mflops = 25.823
(256 ** 3 case run recently, 44.2, 38 mflps)
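
[Note: Rodman does not give the operation count behind the "rad2" column.
The sketch below assumes the conventional figure of 5*N*log2(N) floating
point operations for an N-point complex radix-2 FFT; that count and the
function name `nominal_radix2_mflops` are assumptions, not taken from his
message. With that count, the rad2 entries checked for sizes that are
powers of 4 (for example n = 1024 in 1d, and 1024 x 1024 in 2d with
N = n*n) are reproduced. For odd powers of 2 the listed figures come out
lower than this formula gives, so a different count appears to be in use
there.]

```python
import math


def nominal_radix2_mflops(n_points, seconds):
    """Mflops rate assuming 5*N*log2(N) flops for an N-point complex FFT.

    The 5*N*log2(N) count is the usual radix-2 convention and is an
    assumption here; the original message does not state its count.
    """
    flops = 5.0 * n_points * math.log2(n_points)
    return flops / seconds / 1.0e6


# Trace 7/300, single precision, 1d, n = 1024: 1.6792 ms (listed rad2 = 30.5)
print(round(nominal_radix2_mflops(1024, 1.6792e-3), 1))

# Trace 7/300, single precision, 2d, 1024 x 1024: 3.40680 sec (listed rad2 = 30.8)
print(round(nominal_radix2_mflops(1024 * 1024, 3.40680), 1))
```

In the 1d and 2d tables the ratio Mfl/rad2 is close to 0.85 throughout,
which is consistent with the remark that radix 4 is about 17% faster
(1/0.85 is roughly 1.18).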
----------------------------------------------------------------------------
I haven't included all of the replies I received, but there is a lot of
interesting info in all this. It should at least give a good feel for
what is out there.

David

Internet, Bitnet: dakramer@olympus.princeton.edu