mjl@ut-emx.UUCP (Maurice LeBrun) (07/26/90)
Hello all,

Thanks to all (esp. Dave Haynie) who have contributed articles on
high-end Amiga performance & architecture. I've ordered an Amiga 3000
(what else?), which I intend to put to use in my research for developing
and testing plasma simulation codes, including perhaps post-processing
work & graphics. I still have several performance questions, and I would
appreciate your response.

For reference, numerical codes often spend most of their time doing
SAXPY operations (written in pseudo-Fortran):

    do i=1,N
       j = jmin + (i-1)*jinc
       y(j) = a(j) + b(j) * x(j)
    enddo

(a,b,x,y are real*4)

Exactly where the time is spent in this loop depends heavily on
architecture. On high-end machines (e.g. Crays), the memory fetch for
vectors a,b,x and the store for y are often the most time-intensive
part. For Crays, the throughput depends critically on the stride through
memory (jinc), with large powers of 2 being exceptionally bad due to the
way memory is organized. Even a stride as low as 2 is bad on some large
memory machines (Cray-2), where the clock cycle is much, much smaller
than the memory refresh time. Also, a random stride can be bad (they
occur a lot in particle/mesh codes).

From what I've read here, the 68030's "burst mode" of memory transfers
can give high rates of transfer. It seems the trick is similar to that
used on the more expensive machines.

#1: How will stride affect this transfer rate?

#2: I've seen the phrase "1 word every 1 or 2 cycles" used to describe
    this transfer rate. Can someone be more exact?

#3: (a bit more ho-hum) I'll need to buy some SCRAMs to fill out my
    motherboard. :-) I'll be buying 8 of the 4Mb variety. Does it make
    much of a difference in performance to get 80ns ones, or will 100ns
    do? How do they compare in price?

The other place where the SAXPY operation spends time is, of course, the
floating point ops. The reason these are so cheap on the high-end
machines is vectorization or pipelining: once the pipeline is set up,
you get a result out every clock cycle. Which leads me to...

#4: Does the 68030/68882 employ a similar pipelining scheme to the big
    boys? If so, how many clocks to get out a floating point
    add/multiply once the pipeline is going, and how many for the first
    one?

Thanks in advance.

Maurice LeBrun             Institute for Fusion Studies
mjl@fusion.ph.utexas.edu   University of Texas at Austin
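
P.S. In case anyone wants to try question #1 on their own hardware, the loop above might be sketched in C roughly as follows. This is only a minimal sketch under my own assumptions: the function name strided_madd is made up for illustration, indices are 0-based rather than Fortran's 1-based, real*4 is taken to be C float, only positive strides are handled, and it assumes a C99 compiler.

```c
#include <assert.h>
#include <stddef.h>

/* Strided multiply-add from the post: y(j) = a(j) + b(j)*x(j),
   with j stepping through memory by jinc on every iteration.
   Hypothetical helper, 0-based: jmin is the first element touched.
   The caller must size every array to at least jmin + (n-1)*jinc + 1. */
void strided_madd(size_t n, size_t jmin, size_t jinc,
                  const float *a, const float *b,
                  const float *x, float *y)
{
    assert(jinc >= 1);               /* assumption: positive stride only */

    size_t j = jmin;
    for (size_t i = 0; i < n; i++) {
        /* three 4-byte loads and one 4-byte store per iteration */
        y[j] = a[j] + b[j] * x[j];
        j += jinc;
    }
}
```

Timing repeated calls (for example with clock() from <time.h>) at jinc = 1, 2, 4, ... while keeping n fixed should show how much the stride matters on a given machine; I make no claim here about what the numbers will look like on a 68030.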