[comp.parallel] Summary of FFT and Matrix operations

dakramer@phoenix.princeton.edu (David Anthony Kramer) (10/25/89)
This is the summary of the replies to my posting of a couple of weeks ago. 
Sorry that this is a little late, but I have been a little busy. My
original posting went like this:


>I am looking for benchmarks of FFT and matrix multiplication algorithms
>running on parallel machines such as CM, MPP, DAP etc, as well as
>on fast vector processors such as Crays, Cybers etc. I am particularly 
>interested in how performance is affected by increase in the number
>of bits per data element. Are there any landmark papers in this area?
>I would also appreciate any actual data that someone may have measured
>off one or more actual machines. 

I received several replies with actual benchmarks, and several pointers to
papers. The papers are:
----------------------------------------------------------------------------
From israel@neat.cs.toronto.edu:
There is a paper by Jack Dongarra called
"Performance of Various Computers Using Standard
Linear Equations Software in a Fortran Environment".

You can get it by writing to the
Math and Comp. Sci. Division
Argonne National Labs
Argonne, Illinois 60439

It seems that Prof. Dongarra is no longer at Argonne and Jan Griffin of 
Argonne sent me this letter:

We are in receipt of your letter of 10/9 inquiring about
a copy of the paper by Jack Dongarra on Performance of Various
Computers.  Dr. Dongarra is no longer with Argonne.  I suggest
you call his secretary, Mary Drake, at the Univ. of Tennessee
to request a copy.  Her # is 615-974-5067.

----------------------------------------------------------------------------
Peter Montgomery of UCLA (pmontgom@math.ucla.edu) sent me the following 
hint:

I am writing an integer FFT (more precisely, one which
does an FFT (of order circa 16384) over the ring of integers mod N, 
where N is a composite integer of unknown factorization 
with approximately 200 decimal digits).
The initial target machine is an Alliant (a vector machine); at our
site up to six processors run in parallel.  Do you want data on this
when I get it?  What do you mean by "bits per data element"?
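
For readers who have not met an integer FFT: it has the same butterfly
structure as the complex FFT, but the arithmetic is done modulo an integer
and a modular root of unity takes the place of exp(2*pi*i/n). The sketch
below is only an illustration in Python and is not Montgomery's code. It
assumes a prime modulus P = 998244353 with primitive root 3 (both chosen
here for convenience); the composite 200-digit modulus of unknown
factorization described above needs machinery this sketch does not attempt.

```python
# Illustrative number-theoretic transform (FFT over the integers mod a prime).
# ASSUMPTION: P and G are picked for this sketch; they are not from the post.
P = 998244353  # prime with P - 1 = 119 * 2**23, so power-of-two lengths up to 2**23 work
G = 3          # a primitive root modulo P

def ntt(a, invert=False):
    """Recursive radix-2 transform of a list whose length is a power of two."""
    n = len(a)
    if n == 1:
        return a[:]
    w_n = pow(G, (P - 1) // n, P)      # principal n-th root of unity mod P
    if invert:
        w_n = pow(w_n, P - 2, P)       # modular inverse via Fermat's little theorem
    even = ntt(a[0::2], invert)
    odd = ntt(a[1::2], invert)
    out = [0] * n
    w = 1
    for k in range(n // 2):
        t = w * odd[k] % P
        out[k] = (even[k] + t) % P             # butterfly, upper half
        out[k + n // 2] = (even[k] - t) % P    # butterfly, lower half
        w = w * w_n % P
    return out

def cyclic_convolve(x, y):
    """Cyclic convolution of two equal-length integer lists, reduced mod P."""
    n = len(x)
    prod = [u * v % P for u, v in zip(ntt(x), ntt(y))]
    n_inv = pow(n, P - 2, P)           # undo the factor n left by the inverse transform
    return [v * n_inv % P for v in ntt(prod, invert=True)]

print(cyclic_convolve([1, 2, 3, 4], [5, 6, 7, 8]))
```

Because every operation is exact integer arithmetic, there is no rounding
error to track, which is the usual reason for preferring such a transform
when the inputs are large integers.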

----------------------------------------------------------------------------

Bertil Svensson (bertil@sm.luth.se) of Lulea Univ. of Technology,
Div. of Comp. Eng., S-95187 LULEA, Sweden
had the following advice:

In a parallel processor implementation project, LUCAS, at University
of Lund, Sweden, a few years ago, we also ran and studied FFT and matrix mult. 
The machine has bit-serial PEs and a perfect shuffle/exchange network.
It is all well documented in volume 216 of Springer-Verlag's Lecture Notes
in Computer Science:
	Fernstrom C, Kruzela I, and Svensson B:
	"LUCAS Associative Array Processor. Design, Programming
	 and Application Studies".

See Chapter 7 of the book, where you also find the effects of different data
length, and Chapter 10, where a suggestion for the inclusion of a
bit-serial multiplier is made.
Hope you can find the book if you are interested. I can send you copies of 
the relevant parts and additional information if you like.
----------------------------------------------------------------------------

John M.A. Roy (714) 856-5039
ICS Dept., Univ. Calif., Irvine CA 92717
Internet: roy@ics.uci.edu 
sent the following:
For Cray performance you might try:

	"Performance Comparison of the CRAY X-MP/24 with SDD
	 and the CRAY-2" in The Journal of Supercomputing, Vol.1,
	 pp 409-419 (1988) by Anderson, Grimes, and Simon.

They ran two 1-D and one 2-D FFTs (along with 21 other benchmarks).
I've not looked at it in detail, but they did run FFT benchmarks on
real machines and do give a lot of detail about their procedures.

----------------------------------------------------------------------------

Rich Stamm's (argosy!stamm@decwrl.dec.com) contribution:

We at MasPar are aware of a couple of sources of information on FFTs and
matrix multiplication on parallel machines:

- "Scientific Applications on the Connection Machine", edited by Horst Simon,
  NASA Ames Research Center, Publisher: World Scientific - 687 Hartwell Street,
  Teaneck, NJ 07666

- "Frontiers of Massively Parallel Scientific Computation" - NASA Goddard
  Spaceflight Center, 1986,  editor: Jim Fischer, Code 635, Greenbelt MD 20771
  phone: 301-286-3221

- "Frontiers of Massively Parallel Computation" - IEEE Computer Society, Order
  Number 892,  IEEE Catalog Number 88CH2649-2

I hope these are useful to you.  If you would like any information on our
massively parallel SIMD computer, please contact Will Workman at 301-571-9480.

----------------------------------------------------------------------------

Swami (swm@unccvax.uncc.edu) sent the following:
The following reference
may be of interest to you, though it is on a MIMD
machine.

Alan Norton and A. Silberger, "Parallelization and performance
analysis of the Cooley-Tukey FFT algorithm for shared memory
architectures", IEEE Transactions on Computers, Vol. C-36, No. 5, May 1987
----------------------------------------------------------------------------

And now for the benchmarks. It should be noted that these are fairly varied, 
and each has to be evaluated carefully lest we compare apples with crays
(excuse me, I couldn't help that). I have therefore included all of
the explanations leading up to the benchmarks that I have. This makes for
a rather long summary, I am afraid.

----------------------------------------------------------------------------
Howard Wasserman of LANL  (hjw%alpha@LANL.GOV) sent the following:

We have a benchmark of each type that we
have run on many machines.  I've excerpted part
of a report that contains results below.  The 
"excerption" was kind of quick and dirty, since you
were only interested in these two codes and we have
others.  So the table may look funny.

If you want the full report, it is:
H. J. Wasserman, ``Los Alamos National Laboratory Computer
Benchmarking 1988,''
Los Alamos National Laboratory Report LA-11465-MS (1988).

If you want the codes I can mail them to you.

_______________________________________________________________________________
               Table 3.  Benchmark Execution Times (in seconds)
          for CRAY-1S, CRAY X-MP/48, CRAY X-MP/416, and CRAY X-MP/SE
_______________________________________________________________________________
 Code     CRAY-1S     X-MP/48    X-MP/416   X-MP/416   X-MP/416     X-MP/SE
          UNICOS       CTSS        COS        CTSS       CTSS       UNICOS
         CFT77 2.0   CFT77 2.0   CFT 1.14   CFT 1.14   CFT77 2.0   CFT77 2.0
_______________________________________________________________________________
FFT         5.6         4.4         4.4        4.3        3.9        10.2

MATRIX     65.6        41.9        36.2       54.7       34.9        48.9



_____________________________________________________________
      Table 4.  Benchmark Execution Times (in seconds)
                         for CRAY-2
_____________________________________________________________
Code     120-ns DRAMa   80-ns DRAM   55-ns SRAMb   55-ns SRAM
          (SN 2003)     (SN 2011)     (SN 2012)    (SN 2012)
          CFT77 1.3     CFT77 1.3     CFT77 1.3    CFT77 2.0
_____________________________________________________________
FFT          10.7          10.5          9.6           4.1

MATRIX       61.9          59.4         57.3          57.0



      ________________________________________________
      Table 7.  Benchmark Execution Times (in seconds)
         for NEC SX-2, Hitachi S-820/80, and ETA-10
      ________________________________________________
      Code     NEC SX-2   Hitachi S-820/80   ETA-10
      ________________________________________________
      FFT        2.8             2.0           8.2
      MATRIX     2.5            25.6         109.0



___________________________________________________________________________________________________
                     Table 8.  Benchmark Execution Times (in seconds) for the
                    Alliant FX/8, Alliant FX/80, SCS-40, Multiflow Trace 7/200,
                                   CONVEX C-1, and DEC VAX 8600
___________________________________________________________________________________________________
         Alliant FX/8    Alliant FX/80
Code     1 CE    8 CEs   1 CE    8 CEs   SCS-40   Multiflow Trace 7/200   Convex C-1   DEC VAX 8600
___________________________________________________________________________________________________
FFT       82.0    40.0    71.9    38.3    21.0             25.3              97.0         222.2

MATRIX   953.1   156.5   815.0   114.0   167.0            479.1             529.0           -



FFT:
A fast Fourier transform (FFT) code that is highly vectorizable.
This code measures the speed of single 64-point transforms.
Because it executes many operations with short vector lengths,
it is very sensitive to vector start-up times. FFT library
routines supplied by supercomputer manufacturers generally
perform multiple FFTs at much higher execution rates than this
benchmark code. No I/O is performed.

MATRIX:
Basic matrix operations, including multiplication and transpose,
on matrices of order 100. The code is highly vectorizable but
not optimized for vector computers.


----------------------------------------------------------------------------

Dominic Herity (dherity%cs.tcd.ie)
Distributed Systems Group
Computer Science Dept.
Trinity College
Dublin
Ireland

sent the following comparison:
I have done a comparison of speedup on FFT between an Intel Hypercube (32 x 8MHz
'286s with '287s) and a multi-CPU VAX using 4 CPUs on a Qbus. I limited data to
1K points to avoid exceeding 64K data on the '286. Code was in C with lotsa
#ifdefs to handle library differences. I found that with up to 4 CPUs used, the
VAX setup gave similar speedup and that using more CPUs on the hypercube gave
little or negative improvement in runtime. Here are (briefly) the results.

        CPUS       1       2       4       8      16      32

        POINTS
          64     170     125     160     185     175      -
         256     660     425     320     320     375     400
        1024    3030    1870    1310    1055    1010    1050
TABLE 1 : Hypercube runtimes (ms)

        CPUS       1       2       4

        POINTS
          64     100     100      90
         256     300     210     170
        1024    1300     790     550
TABLE 2 : MicroVAX runtimes (ms)

        CPUS       1       2       4       8      16      32

        POINTS
          64    1.00    1.36    1.06    0.92    0.97       -
         256    1.00    1.55    2.06    2.06    1.76    1.65
        1024    1.00    1.62    2.31    2.87    3.00    2.89
TABLE 3 : Hypercube speedup

        CPUS       1       2       4

        POINTS
          64    1.00    1.00    1.11
         256    1.00    1.43    1.76
        1024    1.00    1.65    2.36
TABLE 4 : MicroVAX speedup
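
The speedup tables follow directly from the runtime tables: each entry is
the 1-CPU runtime divided by the p-CPU runtime. A minimal Python sketch of
that calculation, using the 1024-point rows of TABLE 1 and TABLE 2. The
efficiency figure (speedup divided by CPU count) is an addition of this
sketch and was not part of the original reply.

```python
# Runtimes in ms for the 1024-point FFT, copied from TABLE 1 and TABLE 2.
hypercube_ms = {1: 3030, 2: 1870, 4: 1310, 8: 1055, 16: 1010, 32: 1050}
microvax_ms = {1: 1300, 2: 790, 4: 550}

def speedup(runtimes):
    """Return {cpus: T(1) / T(cpus)} for a dict of runtimes keyed by CPU count."""
    base = runtimes[1]
    return {p: base / t for p, t in runtimes.items()}

def efficiency(runtimes):
    """Return {cpus: speedup / cpus}; 1.0 would be perfect scaling."""
    return {p: s / p for p, s in speedup(runtimes).items()}

for name, table in (("Hypercube", hypercube_ms), ("MicroVAX", microvax_ms)):
    s = speedup(table)
    e = efficiency(table)
    for p in sorted(table):
        print(f"{name:9s} {p:2d} CPUs  speedup {s[p]:.2f}  efficiency {e[p]:.2f}")
```

On the 1024-point hypercube run, the 32-CPU speedup of 2.89 corresponds to
an efficiency of roughly 9%, which restates the "little or negative
improvement" remark as a number.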


----------------------------------------------------------------------------

Paul K. Rodman / KA1ZA /   rodman@multiflow.com
Multiflow Computer, Inc.   Tel. 203 488 6090 x 236
Branford, Ct. 06405

submitted the following benchmarks:
I happened to write the FFT library for the Trace 7,14 and 28/300
VLIW minisupercomputers. Here are the performance numbers for
32 and 64 bit, complex to complex, 1d, 2d and 3d, ffts, (whew!)...

The times INCLUDE system time (mapping the pages for a large FFT 
takes a little bit of time, mustn't cheat and not count it!).

"Mfl" is the achieved megaflops on the hardware.

"rad2" is the equivalent radix-2 algorithm megaflops you would
need to equal my radix4/2 code. (Radix 4 is about 17% faster).
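
The reply does not say how the flop counts behind "Mfl" and "rad2" were
obtained. The Python sketch below is a guess at the bookkeeping, not the
author's code. It assumes the conventional counts of 5*n*log2(n) flops for
a radix-2 complex FFT and 4.25*n*log2(n) for radix 4, plus 2*n for one
twiddle-free radix-2 stage when log2(n) is odd. Those assumptions appear to
reproduce the rows spot-checked here (n=8 and n=1024 in the first 1d table,
n=1024 in the first 2d table, n=128 in the first 3d table), and the ratio
5/4.25 matches the "about 17%" figure. The function names are hypothetical.

```python
def radix42_flops_1d(n):
    """ASSUMED flop count for one complex radix-4/2 FFT of length n = 2**k."""
    k = n.bit_length() - 1               # log2(n) for a power of two
    if k % 2 == 0:
        return 4.25 * n * k              # every stage is radix 4
    return 4.25 * n * (k - 1) + 2 * n    # radix-4 stages plus one radix-2 stage

def mflops(n, seconds, dims=1):
    """Return (Mfl, rad2) for an n**dims transform that took `seconds`.

    A dims-dimensional transform is counted as dims * n**(dims - 1)
    one-dimensional transforms of length n.
    """
    flops = dims * n ** (dims - 1) * radix42_flops_1d(n)
    mfl = flops / seconds / 1e6
    rad2 = mfl * 5.0 / 4.25              # equivalent radix-2 rate
    return mfl, rad2

# Trace 7/300, single precision, n = 1024, 1.6792 ms
mfl, rad2 = mflops(1024, 1.6792e-3)
print(f"Mfl = {mfl:.1f}  rad2 = {rad2:.1f}")
```

Note that the 1d times are listed in milliseconds and the 2d and 3d times in
seconds, so convert before calling mflops().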

-------------------------------------------------------------
Trace 7/300 single precision 1d fft times:
starting power of 2 =      3  ending power of 2 =     20
 n=        8 cfft =     0.0209 ms. Mfl =   4.0  rad2 =   4.7
 n=       16 cfft =     0.0316 ms. Mfl =   8.6  rad2 =  10.1
 n=       32 cfft =     0.0503 ms. Mfl =  12.1  rad2 =  14.2
 n=       64 cfft =     0.0913 ms. Mfl =  17.9  rad2 =  21.0
 n=      128 cfft =     0.1779 ms. Mfl =  19.8  rad2 =  23.3
 n=      256 cfft =     0.3682 ms. Mfl =  23.6  rad2 =  27.8
 n=      512 cfft =     0.7791 ms. Mfl =  23.7  rad2 =  27.8
 n=     1024 cfft =     1.6792 ms. Mfl =  25.9  rad2 =  30.5
 n=     2048 cfft =     3.6028 ms. Mfl =  25.3  rad2 =  29.8
 n=     4096 cfft =     7.7836 ms. Mfl =  26.8  rad2 =  31.6
 n=     8192 cfft =    16.6511 ms. Mfl =  26.1  rad2 =  30.7
 n=    16384 cfft =    35.7757 ms. Mfl =  27.2  rad2 =  32.1
 n=    32768 cfft =    75.7264 ms. Mfl =  26.6  rad2 =  31.3
 n=    65536 cfft =   161.9817 ms. Mfl =  27.5  rad2 =  32.4
 n=   131072 cfft =   342.1053 ms. Mfl =  26.8  rad2 =  31.6
 n=   262144 cfft =   722.2223 ms. Mfl =  27.8  rad2 =  32.7
 n=   524288 cfft =  1514.5286 ms. Mfl =  27.2  rad2 =  32.0
 n=  1048576 cfft =  3185.8552 ms. Mfl =  28.0  rad2 =  32.9
Trace 14/300 single precision 1d fft times:
starting power of 2 =      3  ending power of 2 =     20
 n=        8 cfft =     0.0217 ms. Mfl =   3.9  rad2 =   4.5
 n=       16 cfft =     0.0262 ms. Mfl =  10.4  rad2 =  12.2
 n=       32 cfft =     0.0378 ms. Mfl =  16.1  rad2 =  18.9
 n=       64 cfft =     0.0612 ms. Mfl =  26.7  rad2 =  31.4
 n=      128 cfft =     0.1079 ms. Mfl =  32.6  rad2 =  38.4
 n=      256 cfft =     0.2098 ms. Mfl =  41.5  rad2 =  48.8
 n=      512 cfft =     0.4266 ms. Mfl =  43.2  rad2 =  50.8
 n=     1024 cfft =     0.8981 ms. Mfl =  48.5  rad2 =  57.0
 n=     2048 cfft =     1.9074 ms. Mfl =  47.8  rad2 =  56.2
 n=     4096 cfft =     4.0974 ms. Mfl =  51.0  rad2 =  60.0
 n=     8192 cfft =     8.6544 ms. Mfl =  50.2  rad2 =  59.0
 n=    16384 cfft =    18.5307 ms. Mfl =  52.6  rad2 =  61.9
 n=    32768 cfft =    39.1996 ms. Mfl =  51.4  rad2 =  60.5
 n=    65536 cfft =    83.7445 ms. Mfl =  53.2  rad2 =  62.6
 n=   131072 cfft =   176.3980 ms. Mfl =  52.0  rad2 =  61.2
 n=   262144 cfft =   370.6140 ms. Mfl =  54.1  rad2 =  63.7
 n=   524288 cfft =   776.8641 ms. Mfl =  53.0  rad2 =  62.3
 n=  1048576 cfft =  1635.6908 ms. Mfl =  54.5  rad2 =  64.1
Trace 28/300 single precision 1d fft times:
starting power of 2 =      3  ending power of 2 =     20
 n=        8 cfft =     0.0244 ms. Mfl =   3.4  rad2 =   4.0
 n=       16 cfft =     0.0293 ms. Mfl =   9.3  rad2 =  10.9
 n=       32 cfft =     0.0336 ms. Mfl =  18.1  rad2 =  21.3
 n=       64 cfft =     0.0486 ms. Mfl =  33.6  rad2 =  39.5
 n=      128 cfft =     0.0778 ms. Mfl =  45.2  rad2 =  53.2
 n=      256 cfft =     0.1340 ms. Mfl =  64.9  rad2 =  76.4
 n=      512 cfft =     0.2646 ms. Mfl =  69.7  rad2 =  81.9
 n=     1024 cfft =     0.5196 ms. Mfl =  83.8  rad2 =  98.5
 n=     2048 cfft =     1.1107 ms. Mfl =  82.1  rad2 =  96.5
 n=     4096 cfft =     2.2842 ms. Mfl =  91.5  rad2 = 107.6
 n=     8192 cfft =     5.0909 ms. Mfl =  85.3  rad2 = 100.3
 n=    16384 cfft =    10.2632 ms. Mfl =  95.0  rad2 = 111.7
 n=    32768 cfft =    22.3112 ms. Mfl =  90.3  rad2 = 106.3
 n=    65536 cfft =    45.9704 ms. Mfl =  96.9  rad2 = 114.0
 n=   131072 cfft =    99.7807 ms. Mfl =  92.0  rad2 = 108.2
 n=   262144 cfft =   204.0159 ms. Mfl =  98.3  rad2 = 115.6
 n=   524288 cfft =   438.3224 ms. Mfl =  93.9  rad2 = 110.5
 n=  1048576 cfft =   901.5899 ms. Mfl =  98.9  rad2 = 116.3
Trace 7/300 double precision 1d fft times:
starting power of 2 =      3  ending power of 2 =     20
 n=        8  dcfft =     0.0234 ms. Mfl =   3.6  rad2 =   4.2
 n=       16  dcfft =     0.0411 ms. Mfl =   6.6  rad2 =   7.8
 n=       32  dcfft =     0.0749 ms. Mfl =   8.1  rad2 =   9.6
 n=       64  dcfft =     0.1514 ms. Mfl =  10.8  rad2 =  12.7
 n=      128  dcfft =     0.3150 ms. Mfl =  11.2  rad2 =  13.1
 n=      256  dcfft =     0.6866 ms. Mfl =  12.7  rad2 =  14.9
 n=      512  dcfft =     1.4806 ms. Mfl =  12.4  rad2 =  14.6
 n=     1024  dcfft =     3.2431 ms. Mfl =  13.4  rad2 =  15.8
 n=     2048  dcfft =     7.0022 ms. Mfl =  13.0  rad2 =  15.3
 n=     4096  dcfft =    15.1922 ms. Mfl =  13.8  rad2 =  16.2
 n=     8192  dcfft =    32.6350 ms. Mfl =  13.3  rad2 =  15.7
 n=    16384  dcfft =    70.1754 ms. Mfl =  13.9  rad2 =  16.3
 n=    32768  dcfft =   149.0771 ms. Mfl =  13.5  rad2 =  15.9
 n=    65536  dcfft =   317.1601 ms. Mfl =  14.1  rad2 =  16.5
 n=   131072  dcfft =   672.3318 ms. Mfl =  13.6  rad2 =  16.1
 n=   262144  dcfft =  1421.8750 ms. Mfl =  14.1  rad2 =  16.6
 n=   524288  dcfft =  3128.8379 ms. Mfl =  13.2  rad2 =  15.5
 n=  1048576  dcfft =  6471.4912 ms. Mfl =  13.8  rad2 =  16.2
Trace 14/300 double precision 1d fft times:
starting power of 2 =      3  ending power of 2 =     20
 n=        8  dcfft =     0.0196 ms. Mfl =   4.3  rad2 =   5.1
 n=       16  dcfft =     0.0294 ms. Mfl =   9.2  rad2 =  10.9
 n=       32  dcfft =     0.0491 ms. Mfl =  12.4  rad2 =  14.6
 n=       64  dcfft =     0.0896 ms. Mfl =  18.2  rad2 =  21.4
 n=      128  dcfft =     0.1751 ms. Mfl =  20.1  rad2 =  23.6
 n=      256  dcfft =     0.3667 ms. Mfl =  23.7  rad2 =  27.9
 n=      512  dcfft =     0.7762 ms. Mfl =  23.7  rad2 =  27.9
 n=     1024  dcfft =     1.6808 ms. Mfl =  25.9  rad2 =  30.5
 n=     2048  dcfft =     3.5950 ms. Mfl =  25.4  rad2 =  29.8
 n=     4096  dcfft =     7.7551 ms. Mfl =  26.9  rad2 =  31.7
 n=     8192  dcfft =    16.6725 ms. Mfl =  26.0  rad2 =  30.6
 n=    16384  dcfft =    35.7785 ms. Mfl =  27.2  rad2 =  32.1
 n=    32768  dcfft =    75.6102 ms. Mfl =  26.7  rad2 =  31.4
 n=    65536  dcfft =   160.3618 ms. Mfl =  27.8  rad2 =  32.7
 n=   131072  dcfft =   342.1053 ms. Mfl =  26.8  rad2 =  31.6
 n=   262144  dcfft =   723.4101 ms. Mfl =  27.7  rad2 =  32.6
 n=   524288  dcfft =  1532.3464 ms. Mfl =  26.9  rad2 =  31.6
 n=  1048576  dcfft =  3242.8730 ms. Mfl =  27.5  rad2 =  32.3
Trace 28/300 double precision 1d fft times:
starting power of 2 =      3  ending power of 2 =     20
 n=        8  dcfft =     0.0221 ms. Mfl =   3.8  rad2 =   4.5
 n=       16  dcfft =     0.0259 ms. Mfl =  10.5  rad2 =  12.4
 n=       32  dcfft =     0.0396 ms. Mfl =  15.4  rad2 =  18.1
 n=       64  dcfft =     0.0629 ms. Mfl =  25.9  rad2 =  30.5
 n=      128  dcfft =     0.1167 ms. Mfl =  30.2  rad2 =  35.5
 n=      256  dcfft =     0.2175 ms. Mfl =  40.0  rad2 =  47.1
 n=      512  dcfft =     0.4603 ms. Mfl =  40.0  rad2 =  47.1
 n=     1024  dcfft =     0.9302 ms. Mfl =  46.8  rad2 =  55.0
 n=     2048  dcfft =     2.0565 ms. Mfl =  44.3  rad2 =  52.1
 n=     4096  dcfft =     4.2072 ms. Mfl =  49.7  rad2 =  58.4
 n=     8192  dcfft =     9.3226 ms. Mfl =  46.6  rad2 =  54.8
 n=    16384  dcfft =    19.1722 ms. Mfl =  50.8  rad2 =  59.8
 n=    32768  dcfft =    41.9885 ms. Mfl =  48.0  rad2 =  56.5
 n=    65536  dcfft =    86.5954 ms. Mfl =  51.5  rad2 =  60.5
 n=   131072  dcfft =   187.9797 ms. Mfl =  48.8  rad2 =  57.4
 n=   262144  dcfft =   450.7950 ms. Mfl =  44.5  rad2 =  52.3
 n=   524288  dcfft =   883.7720 ms. Mfl =  46.6  rad2 =  54.8
 n=  1048576  dcfft =  1767.5441 ms. Mfl =  50.4  rad2 =  59.3
Trace 7/300 single precision 2d fft times:
Starting power of 2 =      3  ending power of 2 =     11
 n=     8  2d cfft =   0.00030 sec, Mflops =   4.5 rad2 =   5.3
 n=    16  2d cfft =   0.00074 sec, Mflops =  11.7 rad2 =  13.8
 n=    32  2d cfft =   0.00275 sec, Mflops =  14.2 rad2 =  16.7
 n=    64  2d cfft =   0.01069 sec, Mflops =  19.5 rad2 =  23.0
 n=   128  2d cfft =   0.04327 sec, Mflops =  20.8 rad2 =  24.5
 n=   256  2d cfft =   0.18353 sec, Mflops =  24.3 rad2 =  28.6
 n=   512  2d cfft =   0.78399 sec, Mflops =  24.1 rad2 =  28.3
 n=  1024  2d cfft =   3.40680 sec, Mflops =  26.2 rad2 =  30.8
 n=  2048  2d cfft =  15.35197 sec, Mflops =  24.3 rad2 =  28.6
Trace 14/300 single precision 2d fft times:
Starting power of 2 =      3  ending power of 2 =     11
 n=     8  2d cfft =   0.00021 sec, Mflops =   6.3 rad2 =   7.4
 n=    16  2d cfft =   0.00057 sec, Mflops =  15.2 rad2 =  17.9
 n=    32  2d cfft =   0.00172 sec, Mflops =  22.7 rad2 =  26.7
 n=    64  2d cfft =   0.00741 sec, Mflops =  28.2 rad2 =  33.2
 n=   128  2d cfft =   0.02444 sec, Mflops =  36.9 rad2 =  43.4
 n=   256  2d cfft =   0.09978 sec, Mflops =  44.7 rad2 =  52.5
 n=   512  2d cfft =   0.41338 sec, Mflops =  45.7 rad2 =  53.7
 n=  1024  2d cfft =   1.78180 sec, Mflops =  50.0 rad2 =  58.8
 n=  2048  2d cfft =   7.86842 sec, Mflops =  47.4 rad2 =  55.8
Trace 28/300 single precision 2d fft times:
Starting power of 2 =      3  ending power of 2 =     11
 n=     8  2d cfft =   0.00021 sec, Mflops =   6.3 rad2 =   7.4
 n=    16  2d cfft =   0.00057 sec, Mflops =  15.2 rad2 =  17.9
 n=    32  2d cfft =   0.00129 sec, Mflops =  30.2 rad2 =  35.5
 n=    64  2d cfft =   0.00428 sec, Mflops =  48.8 rad2 =  57.5
 n=   128  2d cfft =   0.01527 sec, Mflops =  59.0 rad2 =  69.4
 n=   256  2d cfft =   0.05524 sec, Mflops =  80.7 rad2 =  94.9
 n=   512  2d cfft =   0.24945 sec, Mflops =  75.7 rad2 =  89.0
 n=  1024  2d cfft =   0.96930 sec, Mflops =  92.0 rad2 = 108.2
 n=  2048  2d cfft =   4.49013 sec, Mflops =  83.1 rad2 =  97.8
 (4k x 4k also run recently : 17.7 seconds 96.6 megaflops)
Trace 7/300 double precision 2d fft times:
Starting power of 2 =      3  ending power of 2 =     11
 n=     8  2d dcfft =   0.00034 sec, Mflops =   3.9 rad2 =   4.6
 n=    16  2d dcfft =   0.00126 sec, Mflops =   6.9 rad2 =   8.1
 n=    32  2d dcfft =   0.00464 sec, Mflops =   8.4 rad2 =   9.9
 n=    64  2d dcfft =   0.01910 sec, Mflops =  10.9 rad2 =  12.9
 n=   128  2d dcfft =   0.08044 sec, Mflops =  11.2 rad2 =  13.2
 n=   256  2d dcfft =   0.35101 sec, Mflops =  12.7 rad2 =  14.9
 n=   512  2d dcfft =   1.51809 sec, Mflops =  12.4 rad2 =  14.6
 n=  1024  2d dcfft =   6.65680 sec, Mflops =  13.4 rad2 =  15.8
 n=  2048  2d dcfft =  28.82237 sec, Mflops =  13.0 rad2 =  15.2
Trace 14/300 double precision 2d fft times:
Starting power of 2 =      3  ending power of 2 =     11
 n=     8  2d dcfft =   0.00021 sec, Mflops =   6.3 rad2 =   7.4
 n=    16  2d dcfft =   0.00086 sec, Mflops =  10.1 rad2 =  11.9
 n=    32  2d dcfft =   0.00266 sec, Mflops =  14.6 rad2 =  17.2
 n=    64  2d dcfft =   0.01055 sec, Mflops =  19.8 rad2 =  23.3
 n=   128  2d dcfft =   0.04276 sec, Mflops =  21.1 rad2 =  24.8
 n=   256  2d dcfft =   0.18174 sec, Mflops =  24.5 rad2 =  28.8
 n=   512  2d dcfft =   0.77686 sec, Mflops =  24.3 rad2 =  28.6
 n=  1024  2d dcfft =   3.43531 sec, Mflops =  25.9 rad2 =  30.5
 n=  2048  2d dcfft =  14.69627 sec, Mflops =  25.4 rad2 =  29.9
Trace 28/300 double precision 2d fft times:
Starting power of 2 =      3  ending power of 2 =     11
 n=     8  2d dcfft =   0.00021 sec, Mflops =   6.3 rad2 =   7.4
 n=    16  2d dcfft =   0.00063 sec, Mflops =  13.8 rad2 =  16.3
 n=    32  2d dcfft =   0.00180 sec, Mflops =  21.6 rad2 =  25.4
 n=    64  2d dcfft =   0.00627 sec, Mflops =  33.3 rad2 =  39.2
 n=   128  2d dcfft =   0.02545 sec, Mflops =  35.4 rad2 =  41.6
 n=   256  2d dcfft =   0.09800 sec, Mflops =  45.5 rad2 =  53.5
 n=   512  2d dcfft =   0.44189 sec, Mflops =  42.7 rad2 =  50.3
 n=  1024  2d dcfft =   1.79605 sec, Mflops =  49.6 rad2 =  58.4
 n=  2048  2d dcfft =   8.08224 sec, Mflops =  46.2 rad2 =  54.3
 (4k x 4k also run recently : 32.8 seconds 52.2 megaflops)
Trace 7/300 single precision 3d fft times:
Starting power of 2 =      3  ending power of 2 =      7
 n=     8 time for 3d  cfft =   0.00300 sec, Mflops = 5.376
 n=    16 time for 3d  cfft =   0.02167 sec, Mflops = 9.641
 n=    32 time for 3d  cfft =   0.14500 sec, Mflops =12.881
 n=    64 time for 3d  cfft =   1.21667 sec, Mflops =16.483
 n=   128 time for 3d  cfft =  10.00000 sec, Mflops =17.302
Trace 14/300 single precision 3d fft times:
Starting power of 2 =      3  ending power of 2 =      7
 n=     8 time for 3d  cfft =   0.00250 sec, Mflops = 6.451
 n=    16 time for 3d  cfft =   0.01333 sec, Mflops =15.667
 n=    32 time for 3d  cfft =   0.08667 sec, Mflops =21.551
 n=    64 time for 3d  cfft =   0.70000 sec, Mflops =28.649
 n=   128 time for 3d  cfft =   5.63333 sec, Mflops =30.713
Trace 28/300 single precision 3d fft times:
Starting power of 2 =      3  ending power of 2 =      7
 n=     8 time for 3d  cfft =   0.00283 sec, Mflops = 5.692
 n=    16 time for 3d  cfft =   0.01167 sec, Mflops =17.905
 n=    32 time for 3d  cfft =   0.06167 sec, Mflops =30.288
 n=    64 time for 3d  cfft =   0.48333 sec, Mflops =41.491
 n=   128 time for 3d  cfft =   3.71667 sec, Mflops =46.551
(256 ** 3 case run recently, 24.4s, 70 mflps)
Trace 7/300 double precision 3d fft times:
Starting power of 2 =      3  ending power of 2 =      7
 n=     8 time for 3d dcfft =   0.00500 sec, Mflops = 3.226
 n=    16 time for 3d dcfft =   0.03500 sec, Mflops = 5.968
 n=    32 time for 3d dcfft =   0.26333 sec, Mflops = 7.093
 n=    64 time for 3d dcfft =   2.28333 sec, Mflops = 8.783
 n=   128 time for 3d dcfft =  21.31667 sec, Mflops = 8.116
Trace 14/300 double precision 3d fft times:
Starting power of 2 =      3  ending power of 2 =      7
 n=     8 time for 3d dcfft =   0.00300 sec, Mflops = 5.376
 n=    16 time for 3d dcfft =   0.02167 sec, Mflops = 9.641
 n=    32 time for 3d dcfft =   0.14500 sec, Mflops =12.881
 n=    64 time for 3d dcfft =   1.28333 sec, Mflops =15.627
 n=   128 time for 3d dcfft =  10.61667 sec, Mflops =16.297
Trace 28/300 double precision 3d fft times:
Starting power of 2 =      3  ending power of 2 =      7
 n=     8 time for 3d dcfft =   0.00250 sec, Mflops = 6.451
 n=    16 time for 3d dcfft =   0.01333 sec, Mflops =15.667
 n=    32 time for 3d dcfft =   0.08833 sec, Mflops =21.145
 n=    64 time for 3d dcfft =   0.80000 sec, Mflops =25.068
 n=   128 time for 3d dcfft =   6.70000 sec, Mflops =25.823
(256 ** 3 case run recently, 44.2s, 38 mflps)


----------------------------------------------------------------------------

I haven't included all of the replies I received, but there is a lot of
interesting info in all this. It should at least give a good feel for what
is out there. 

David
Internet, Bitnet: dakramer@olympus.princeton.edu