[comp.sys.amiga.hardware] Performance of A3000 for numerical work

mjl@ut-emx.UUCP (Maurice LeBrun) (07/26/90)
Hello all,

Thanks to all (esp.  Dave Haynie) who have contributed articles on high-end
Amiga performance & architecture.  I've ordered an Amiga 3000 (what else?),
which I intend to put to use in my research for developing and testing
plasma simulation codes, including perhaps post-processing work & graphics.
I still have several performance questions, and I would appreciate your
response.

For reference, numerical codes often spend most of their time doing
SAXPY-like operations (written here in pseudo-Fortran):

	do i = 1, N
	    j = jmin + (i-1)*jinc
	    y(j) = a(j) + b(j) * x(j)
	enddo

	(a,b,x,y   are  real*4)
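
For anyone who would rather read C, here is a minimal sketch of the same
loop.  The function name strided_triad and the 0-based indexing are
assumptions made for illustration, and the caller is assumed to guarantee
that every index stays in bounds.

```c
#include <stddef.h>

/* Strided vector triad: y[j] = a[j] + b[j] * x[j] for n elements,
 * starting at index jmin and stepping by jinc.  Indices are 0-based
 * here, unlike the 1-based pseudo-Fortran.  No bounds checking is
 * done; that is the caller's job. */
void strided_triad(float *y, const float *a, const float *b,
                   const float *x, size_t n, size_t jmin, size_t jinc)
{
    size_t i;

    for (i = 0; i < n; i++) {
        size_t j = jmin + i * jinc;   /* element touched on this pass */
        y[j] = a[j] + b[j] * x[j];    /* one multiply and one add     */
    }
}
```

Each pass does three loads, one store, one multiply and one add, which is
why both memory traffic and floating point speed matter here.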

Exactly where the time is spent in this loop depends heavily on the
architecture.  On high-end machines (e.g. Crays), the memory fetches for
vectors a, b, x and the store for y are often the most time-consuming part.
On Crays, the throughput depends critically on the stride through memory
(jinc), with large powers of 2 being exceptionally bad because of the way
memory is organized into banks.  Even a stride as low as 2 is bad on some
large-memory machines (Cray-2), where the clock cycle is much, much shorter
than the memory cycle time.  A random stride can also be bad (these occur a
lot in particle/mesh codes).
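
To make the stride dependence concrete, here is a rough sketch of how many
memory banks a constant stride actually touches.  It assumes simple
low-order interleaving (word w lives in bank w mod nbanks); real machines
are more complicated, and the names gcd_u and banks_touched, like any bank
count fed in, are hypothetical.

```c
/* Greatest common divisor by Euclid's algorithm. */
static unsigned gcd_u(unsigned a, unsigned b)
{
    while (b != 0) {
        unsigned t = a % b;
        a = b;
        b = t;
    }
    return a;
}

/* Number of distinct banks visited by a constant stride, ASSUMING
 * word w sits in bank (w mod nbanks).  nbanks must be nonzero.
 * Fewer banks touched means more accesses queue up behind a bank
 * that is still busy with the previous one. */
unsigned banks_touched(unsigned stride, unsigned nbanks)
{
    return nbanks / gcd_u(stride, nbanks);
}
```

With 64 banks (an illustrative figure, not a spec for any machine), a
stride of 1 or 3 reaches all 64 banks, a stride of 2 reaches 32, and a
stride of 64 hammers a single bank.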

From what I've read here, the 68030's "burst mode" memory transfers can
give high transfer rates.  It seems the trick is similar to the one used on
the more expensive machines.

#1:
    How will stride affect this transfer rate?

#2:
    I've seen the phrase "1 word every 1 or 2 cycles" used to describe
    this transfer rate.  Can someone be more exact?

#3: (a bit more ho-hum)
    I'll need to buy some SCRAMs to fill out my motherboard. :-)
    I'll be buying 8 of the 4Mb variety.  Does it make much of a
    difference in performance to get 80ns ones, or will 100ns do?
    How do they compare in price?

The other place where the SAXPY operation spends time is, of course, the
floating point ops.  These are so cheap on the high-end machines because of
vectorization or pipelining -- once the pipeline is set up, you get a
result out every clock cycle.  Which leads me to...

#4:
    Does the 68030/68882 employ a pipelining scheme similar to the big
    boys'?  If so, how many clocks does each floating point add/multiply
    take once the pipeline is going, and how many does the first one take?
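
The distinction behind question #4 can be put as a simple cycle count.
This is only a sketch under stated assumptions: an idealized unit with a
fixed startup latency and a fixed issue interval, working on independent
operands.  The function names and any numbers plugged in are hypothetical
and are not 68882 figures.

```c
/* Clocks for n independent operations on an idealized pipelined unit:
 * the first result appears after `latency` clocks, and each later
 * result follows `interval` clocks after the one before it. */
unsigned long pipelined_cycles(unsigned long n, unsigned long latency,
                               unsigned long interval)
{
    if (n == 0)
        return 0;
    return latency + (n - 1) * interval;
}

/* Clocks for the same n operations with no overlap at all:
 * every operation pays the full latency. */
unsigned long unpipelined_cycles(unsigned long n, unsigned long latency)
{
    return n * latency;
}
```

For example, with a made-up latency of 10 clocks and one result per clock,
100 operations take 109 clocks pipelined versus 1000 unpipelined.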

Thanks in advance.

Maurice LeBrun                Institute for Fusion Studies  
mjl@fusion.ph.utexas.edu      University of Texas at Austin