rtp1@quads.uchicago.edu (raymond thomas pierrehumbert) (05/04/91)
McCalpin comments that he finds vectorization (even on the Cyber 205)
simpler, more intuitive and more transportable than the optimization
techniques used on cached machines like the RS/6000. I think this is
partly because the vector model of parallelism is so rigid;
optimization for the superscalars involves a bigger bag of tricks.
Still, I have found that there are fewer things they choke on, and
that it is easier to localize optimization in a few reusable routines.
Two case studies:

(a) I have some semi-spectral 2D fluid codes (finite difference in one
direction, spectral in the other) which I never got around to
optimizing on the Cyber, because it would have involved some major
structural changes. On the other hand, on the RS/6000, i860-based
machines, and even my hated DN10000, 1D FFT's scream right along at
nearly the machine's top speed (lots of data re-use). In this case, a
simple plug-in of canned FFT's gave a major speed-up.

(b) Tridiagonal solving. Comes up in lots of codes, and it is a real
vector-breaker. In fact, vector machines choke on all sorts of
recursion, whereas the superscalars just love it. On the RS/6000, the
tridiagonal code basically vanished, whereas on the vector Stardent it
was a bottleneck.

A third example that occurs to me is evaluation of transcendental
functions. Lots of recursion, and pretty efficient on the RISCs. On a
vector machine, you have to keep iterating the vector until the
slowest-converging argument is done converging (unless you do a lot of
reshuffling in memory).

Now, the $64 question: Why no supercomputer based on an architecture
for the processor like the RS/6000, BUT with your extra $2M buying
bandwidth to memory like the Crays (no cache)? This would seem to be a
real winner. You could simulate vectorization on it, but it would have
all the flexibility of the newer machines.
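[To make the tridiagonal case concrete: the standard elimination (the
Thomas algorithm) carries each iteration's result into the next, which
is exactly the kind of recursion a vector pipeline stalls on. A minimal
C sketch, editorial and not from the thread; names are illustrative:]

```c
#include <stddef.h>

/* Thomas algorithm for one tridiagonal system.  a = sub-diagonal,
 * b = main diagonal, c = super-diagonal, d = right-hand side.
 * b and d are overwritten; the solution ends up in d.
 * Each iteration reads the b and d values the previous iteration
 * just wrote -- a loop-carried dependence that a vector unit must
 * wait out, one full pipeline latency per element. */
void tridiag(size_t n, const double *a, double *b,
             const double *c, double *d)
{
    for (size_t i = 1; i < n; i++) {      /* forward elimination */
        double m = a[i] / b[i-1];         /* needs b[i-1] from last pass */
        b[i] -= m * c[i-1];
        d[i] -= m * d[i-1];               /* needs d[i-1] from last pass */
    }
    d[n-1] /= b[n-1];
    for (size_t i = n - 1; i-- > 0; )     /* back substitution */
        d[i] = (d[i] - c[i] * d[i+1]) / b[i];   /* needs d[i+1] */
}
```

[Both loops are first-order recurrences; a cached superscalar simply
overlaps the independent multiplies and subtracts around each
dependence, which is why the RS/6000 digests this so well.]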
hrubin@pop.stat.purdue.edu (Herman Rubin) (05/04/91)
In article <1991May4.031835.7979@midway.uchicago.edu>,
rtp1@quads.uchicago.edu (raymond thomas pierrehumbert) writes:

> McCalpin comments that he finds vectorization (even on the Cyber 205)
> simpler, more intuitive and more transportable than the optimization
> techniques used on cached machines like the RS/6000.
> I think this is partly because the vector model of parallelism is
> so rigid; optimization for the superscalars involves a bigger bag
> of tricks. Still, I have found that there are fewer things they
> choke on, and that it is easier to localize optimization in a few
> reusable routines. Two case-studies: ...................

There are differences in the various types of vector machines; most
have highly rigid vectors, but the 205 has quite a bit of flexibility.

> (b) Tridiagonal solving. Comes up in lots of codes, and it is
> a real vector-breaker. In fact, vector machines choke on all
> sorts of recursion, whereas the superscalars just love them.
> On the RS/6000, the tridiag code basically vanished, whereas on
> the vector Stardent, it was a bottleneck.

There are tricky ways of doing this efficiently on vector machines,
especially flexible ones, using partitioning.

> A third example that occurs to me is evaluation of transcendental
> functions. Lots of recursion, and pretty efficient on the RISCS.
> On a vector machine, you have to keep iterating the vector until
> the slowest converging argument is done converging (unless you
> do a lot of reshuffling in memory)

But this reshuffling in memory on the 205 is dirt cheap! It could be
improved by adding more instructions compatible with a CISC vector
machine. In fact, this type of management is quite useful in the most
efficient type of generation of non-uniform random numbers,
acceptance-rejection; the vector methods are quite competitive on
scalar machines with the older scalar versions, and only use more
memory, no longer a bottleneck here.
--
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN 47907-1399
Phone: (317) 494-6054
hrubin@l.cc.purdue.edu (Internet, bitnet)   {purdue,pur-ee}!l.cc!hrubin (UUCP)
rmarq@iago.caltech.edu (Marquardt, Ron R.) (05/05/91)
In article <11875@mentor.cc.purdue.edu>, hrubin@pop.stat.purdue.edu
(Herman Rubin) writes [regarding previous comments contrasting the
RS/6000 superscalar architecture with the Stardent vector
architecture]:

>> (b) Tridiagonal solving. Comes up in lots of codes, and it is
>> a real vector-breaker. In fact, vector machines choke on all
>> sorts of recursion, whereas the superscalars just love them.
>> On the RS/6000, the tridiag code basically vanished, whereas on
>> the vector Stardent, it was a bottleneck.
>
> There are tricky ways of doing this efficiently on vector machines,
> especially flexible ones. This uses partitioning.

Our experience with these machines (we have two IBM RS/6000's, a 530
and a 540, and three of what are now Stardent 3000's) has shown that
for matrix problems exceeding 200x200, tridiagonal or otherwise, the
Stardents can be considerably faster than the RS/6000's. The relative
performance difference increases with the size of the problem and can
reach as much as a factor of 10 (for naively written code). This
problem can be traced to the small TLB (mapping 512 KB) on the IBM
machines. Back when Stardent was Ardent, they had similar problems,
which were more easily fixed since their TLB is external. The Stardent
ETLBs can now map more than 128 MB (and I believe it goes as high as
the maximum memory allowed in the system, which escapes me).

For an admittedly simple benchmark, a traditionally coded 701x701
matrix multiply, the 530 is actually SLOWER than a Sparc 1, and the
540 only marginally faster [530: 568 sec., Sparc 1: 525 sec., 540:
470 sec.]. Before people complain too loudly: it is very easy to
recode a matrix multiply for better efficiency and get the 540 time
down to 49 sec., BUT it is not necessarily true that such a recoding
is simple, or even possible, for all matrix routines.

Better compiler technology would certainly help, and IBM's latest beta
version for the RS/6000's (exact version ID escapes me) promises a
great deal of improvement. For arbitrarily complex and general
problems, however, a larger TLB would certainly benefit the RS/6000
family.

By the way, although they are not superscalar, the latest HP
workstations seem to be avoiding the TLB problems of the IBMs.
Included in the TLB scheme for the Snakes is TLB mapping for four
large blocks of memory, each 16 MB in size. From a user's point of
view, 64 MB is far better than the 512 KB mapped by the RS/6000's TLB.
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Ron Marquardt                         E-mail: rmarq@iago.caltech.edu
Solid State Device Physics Group      California Institute of Technology
eugene@nas.nasa.gov (Eugene N. Miya) (05/05/91)
Vectors are simpler and "rigid:"

In article <1991May4.031835.7979@midway.uchicago.edu>
rtp1@quads.uchicago.edu (raymond thomas pierrehumbert) writes:

> Now, the $64 question: Why no supercomputer based on an
> architecture for the processor like the RS/6000, BUT with
> your extra $2M buying bandwidth to memory like the Cray's
> (no cache)?

1) Few supercomputers exist. New architectures are expensive
   propositions.
2) Vectors were proposed in the 1960s by IBM, built in the 1970s by
   Cray and others. Superscalars were not implemented until the 1980s.
3) Technological and economic factors play on any person trying to
   design and build a super. Do you want to build one? I didn't think
   so.

Can I get my $64? Ain't hindsight wonderful? 8^) It's not only clear,
but it's cheap.
mccalpin@perelandra.cms.udel.edu (John D. McCalpin) (05/05/91)
> On 4 May 91 03:18:35 GMT, rtp1@quads.uchicago.edu (raymond thomas
pierrehumbert) said:
raymond> McCalpin comments that he finds vectorization (even on the
raymond> Cyber 205) simpler, more intuitive and more transportable
raymond> than the optimization techniques used on cached machines like
raymond> the RS/6000.
raymond> (a) I have some semi-spectral 2D fluid codes (finite diff in
raymond> one direction, spectral in the other) which I never got
raymond> around to optimizing on the Cyber, because it would have involved
raymond> some major structural changes.
I have had no trouble optimizing similar codes for the ETA-10. Of
course, my codes were designed with long vectors in mind, and it
sounds like yours were not. It also helps if the problems are large
enough. On the ETA-10, one did not get 90% of full speed on FFT's
until you were calculating a minimum of several hundred of them
simultaneously. I agree that it was often very difficult to
optimize codes written for other machines on the Cyber. But I found
the long vector programming model so easy to use that I often re-wrote
the applications from scratch faster than I could have "optimized" the
original...
raymond> (b) Tridiagonal solving. Comes up in lots of codes, and it is
raymond> a real vector-breaker. In fact, vector machines choke on all
raymond> sorts of recursion, whereas the superscalars just love them.
raymond> On the RS/6000, the tridiag code basically vanished, whereas on
raymond> the vector Stardent, it was a bottleneck.
In fluid dynamics codes, tridiagonal systems almost always arise in
groups (eg. one system per row of the 2-D domain). In these cases it
is trivial to vectorize sideways across the systems and get full
vector performance. You still have a vector divide or two in there,
but as long as you are not working on an IBM 3090VF that should not be
too much of a problem. ;-)
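[The "vectorize sideways" arrangement can be sketched as follows -- an
editorial illustration, not McCalpin's code; the [row][system] array
layout and the count of 64 simultaneous systems are assumptions:]

```c
#define NSYS 64   /* number of independent systems (illustrative) */

/* Solve NSYS independent tridiagonal systems of order n at once.
 * Arrays are laid out [row][system], so the inner loop runs across
 * independent systems: the recurrence is carried only by the OUTER
 * loop, and each inner loop is a dependence-free vector of length
 * NSYS. */
void tridiag_batch(int n, double a[][NSYS], double b[][NSYS],
                   double c[][NSYS], double d[][NSYS])
{
    for (int i = 1; i < n; i++)                 /* forward elimination */
        for (int s = 0; s < NSYS; s++) {        /* vectorizable */
            double m = a[i][s] / b[i-1][s];
            b[i][s] -= m * c[i-1][s];
            d[i][s] -= m * d[i-1][s];
        }
    for (int s = 0; s < NSYS; s++)
        d[n-1][s] /= b[n-1][s];
    for (int i = n - 2; i >= 0; i--)            /* back substitution */
        for (int s = 0; s < NSYS; s++)          /* vectorizable */
            d[i][s] = (d[i][s] - c[i][s] * d[i+1][s]) / b[i][s];
}
```

[The divide still sits in the inner loop, per the remark above, but it
now amortizes over NSYS independent systems per recurrence step.]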
Once again, I agree that (for example) the RS/6000 is a wonderful
machine. I am now completing the optimization of a 2-D finite
difference vorticity model which (I currently estimate) will run at a
sustained 15 MFLOPS (64-bit) on the RS/6000-320. *But* the
optimizations which enabled me to do this are *much* uglier and lead
to *much* less maintainable code than the original long vector code.
If I did not anticipate a need for 200 days wall time to complete the
work with the original code, I would not have considered these
optimizations worthwhile.
(The optimizations that I use are: (1) use IBM ESSL routines for FFT's
and tridiagonal solves; (2) inner loop unrolling; (3) outer loop
unrolling followed by inner loop jamming/fusion; (4) subroutine
inlining followed by manual loop jamming followed by unrolling.)
--
John D. McCalpin mccalpin@perelandra.cms.udel.edu
Assistant Professor mccalpin@brahms.udel.edu
College of Marine Studies, U. Del. J.MCCALPIN/OMNET
csrdh@marlin.jcu.edu.au (Rowan Hughes) (05/06/91)
>raymond> (b) Tridiagonal solving. Comes up in lots of codes, and it is
>raymond> a real vector-breaker. In fact, vector machines choke on all
>raymond> sorts of recursion, whereas the superscalars just love them.
>raymond> On the RS/6000, the tridiag code basically vanished, whereas on
>raymond> the vector Stardent, it was a bottleneck.

The latest Fujitsu compiler (for the VP2000 series) will handle
first-order backward recurrence, i.e. A(I) = A(I-1) + B(I), as well as
the usual forward recurrence, in full vector mode. Many old scalar
problems can now be vectorized. I'd be interested to hear if
Cray/ETA/Alliant etc. have this capability too.
--
Rowan Hughes                              James Cook University
Marine Modelling Unit                     Townsville, Australia
Dept. Civil and Systems Engineering       csrdh@marlin.jcu.edu.au
rtp1@quads.uchicago.edu (raymond thomas pierrehumbert) (05/06/91)
> recurrence, in full vector mode. Many old scalar problems can now be
> vectorized. I'd be interested to hear if Cray/ETA/Alliant etc have this
> capability too.

ETA is irrelevant, except for comp.folklore.computers, and I don't
know about the current crop of Crays, but the new Alliants (FX/800,
FX/2800) are based on the i860 chip, and so they do their
"vectorization" by generating pipelined code for the RISC chip. The
i860 has no problems with recursion, except for possible delays in
re-use of a result (during which you usually want to store something
else, or do another computation not dependent on the result).

I wonder how the Fujitsu handles the recursion. For long enough
vectors to make it worthwhile, there is a pretty well-known hack to
make sum-reduction vectorizable (split the vector in two, do a vector
add, split again, etc.). Works for dot products as well, of course.
The RISC architectures can handle a much more general kind of
recursion.

Concerning McCalpin's comments on vectorizability and maintainability
of code, I'm not claiming that the codes I'm talking about are
intrinsically unvectorizable. I'm just giving some examples from one
particular (and perhaps rather lazy, as far as coding goes) real-world
user. My experience also reflects that of a lot of postdocs and
students who have worked for me. We write a lot of banged-out code,
get interesting results, and somehow never get around to doing much
optimization until the optimization becomes more interesting than
looking at the scientific results (at which point we've basically
moved on to something else anyway). My own experience with the IBM
RS/6000 architecture is that we have gotten more payoff from a little
optimization effort, and that it is easier to plug in optimized ESSL
routines.

Tridiagonal solving is a case in point. True, in fluids codes you can
always vectorize in the direction(s) perpendicular to the recursion.
On the other hand, this still means you have to do a little thinking
to write a special-purpose tridiagonal solver suited to your
particular storage configuration. Not really all that hard, but I find
it a lot easier to have a super-optimized canned tridiagonal solver I
don't even have to recompile, and then simply do a bunch of calls to
this solver; this code structure lays the groundwork for
parallelization as well. True, you do still have to think about
storage layout because of the performance hit for non-unit strides on
cache-based architectures (wish somebody would build a cache that
could pre-load more general patterns than contiguous memory).

While I'm waxing nostalgic, some of the codes I have seen for
cache-based machines have seemed hauntingly familiar, and I have only
recently realized why: they hark back to the 70's, when computers had
too little RAM to hold your field, and so doing fluid dynamics on
computers meant loading stuff into memory from disk, crunching the
life out of it, and squirting the result back to disk. There was of
course a huge premium on re-use of data (so I'm told -- this was
really before my time, of course).
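[The sum-reduction hack described above -- split the vector in two,
vector-add the halves, repeat -- looks like this in C. An editorial
sketch, assuming for brevity that the length is a power of two and
that a scratch buffer is available:]

```c
#include <stddef.h>
#include <string.h>

/* Sum-reduction by repeated halving: add the top half of the vector
 * onto the bottom half, then recurse on the bottom half.  Each pass
 * is a dependence-free vector add of length n/2, so a vector unit
 * runs it at full speed; only log2(n) passes are needed.
 * n is assumed to be a power of two. */
double vector_sum(const double *x, size_t n, double *scratch)
{
    memcpy(scratch, x, n * sizeof *scratch);
    for (size_t half = n / 2; half >= 1; half /= 2)
        for (size_t i = 0; i < half; i++)   /* vectorizable add */
            scratch[i] += scratch[i + half];
    return scratch[0];
}
```

[The same skeleton handles dot products: replace the memcpy with an
element-wise multiply of the two input vectors.]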
hrubin@pop.stat.purdue.edu (Herman Rubin) (05/06/91)
In article <1991May6.035310.26794@marlin.jcu.edu.au>,
csrdh@marlin.jcu.edu.au (Rowan Hughes) writes:

> The latest Fujitsu compiler (for the VP2000 series) will handle 1st order
> backwards recurrence, ie A(I)=A(I-1)+B(I), as well as the usual forward
> recurrence, in full vector mode. Many old scalar problems can now be
> vectorized. I'd be interested to hear if Cray/ETA/Alliant etc have this
> capability too.

The question is not what the compiler can handle; the older vector
machines' compilers could handle this very well. The question is what
the vector hardware can handle. Vector, and other, computers get their
speed by allowing the interval between operations to be short compared
to the time for an operation to complete. If instructions can issue in
one cycle, and the adder is segmented so that a new addition can start
one cycle after the previous one, but an addition takes 5 cycles to
complete (not at all unusual), the cited code cannot use the vector
capabilities as written, but can only perform an operation every 5
cycles. Prohibiting the hardware from doing this cannot pay.
--
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN 47907-1399
Phone: (317) 494-6054
hrubin@l.cc.purdue.edu (Internet, bitnet)   {purdue,pur-ee}!l.cc!hrubin (UUCP)
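[Herman's arithmetic can be dodged on any pipelined adder the same
way: keep several independent partial sums in flight so that
consecutive adds never wait on each other's results. An editorial C
sketch; the count of five accumulators just mirrors his 5-cycle
example and is not tuned for any real machine:]

```c
#include <stddef.h>

/* With a segmented adder that can START an add every cycle but takes
 * ~5 cycles to COMPLETE one, a single running sum serializes on its
 * own result.  Five independent partial sums keep the pipeline full,
 * because no add in the loop body depends on another add in the same
 * iteration. */
double sum_pipelined(const double *x, size_t n)
{
    double s0 = 0, s1 = 0, s2 = 0, s3 = 0, s4 = 0;
    size_t i;
    for (i = 0; i + 5 <= n; i += 5) {   /* five independent add chains */
        s0 += x[i];
        s1 += x[i+1];
        s2 += x[i+2];
        s3 += x[i+3];
        s4 += x[i+4];
    }
    for (; i < n; i++)                  /* leftover elements */
        s0 += x[i];
    return s0 + s1 + s2 + s3 + s4;      /* final combine, off the hot path */
}
```

[Note that, like recursive doubling, this regroups the additions and
so can change the rounded result slightly.]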
csrdh@marlin.jcu.edu.au (Rowan Hughes) (05/07/91)
In <1991May6.055943.6234@midway.uchicago.edu> rtp1@quads.uchicago.edu
(raymond thomas pierrehumbert) writes:

> I wonder how the Fujitsu handles the recursion. For long enough
> vectors to make it worthwhile, there is a pretty well known hack
> to make sum-reduction vectorizable (split the vector in two,
> do a vector add, split again, etc.). Works for dot products as
> well, of course. The RISC architectures can handle a much
> more general kind of recursion.

The sum-reduction has been on vector machines for many years. The
recursion I meant was A(I) = A(I-1) + ... in general, when reordering
is just not possible. The theoretical solution has also been around
for several years, but Fujitsu appears to be the first vendor to have
used it. See Willi Schonauer's book for the theory:

"Scientific Computing on Vector Computers", Willi Schonauer, Elsevier
Science Publishers, 1987, ISBN 0-444-70288-1.

Although a little dated now, it gives an excellent review of several
vector architectures and algorithms.
--
Rowan Hughes                              James Cook University
Marine Modelling Unit                     Townsville, Australia
Dept. Civil and Systems Engineering       csrdh@marlin.jcu.edu.au
glew@pdx007.intel.com (Andy Glew) (05/07/91)
Just blueskying, but:

The biggest advantage of vector instructions is that they convey to
the memory system the access pattern. You have to add a lot of logic
to your memory buffers to detect patterns beyond the simple "the next
address is a stride of 64 past the previous" address differencing.

A secondary advantage of vector instructions is that they map to
multiple simple operations. Most vector implementations are pipelined,
not element-by-element parallel. It isn't too hard to do
superscalar/superpipelined implementations of scalar instruction sets
that obtain parallelism comparable to most vector implementations.
Multiple memory ports get harder, but are doable. But the access
pattern info isn't there.

A disadvantage of architecting a vector instruction set is the
combinatoric explosion: first you provide element-by-element vector
ops, then vector reductions, then recurrences; then you have to decide
which vector ops are going to chain...

So: why not combine vector memory access instructions that convey
access pattern, with scalar computational operations?
--
Andy Glew, glew@ichips.intel.com
Intel Corp., M/S JF1-19, 5200 NE Elam Young Parkway,
Hillsboro, Oregon 97124-6497
This is a private posting; it does not indicate opinions or positions
of Intel Corp.
jlg@cochiti.lanl.gov (Jim Giles) (05/08/91)
In article <1991May6.035310.26794@marlin.jcu.edu.au>,
csrdh@marlin.jcu.edu.au (Rowan Hughes) writes:

> [...]
> The latest Fujitsu compiler (for the VP2000 series) will handle 1st order
> backwards recurrence, ie A(I)=A(I-1)+B(I), as well as the usual forward
> recurrence, in full vector mode. Many old scalar problems can now be
> vectorized. I'd be interested to hear if Cray/ETA/Alliant etc have this
> capability too.

First-order recurrence can be rewritten using a technique called
recursive doubling (which is, presumably, what the Fujitsu compiler is
doing automatically). If you can't get your compiler to do it
automatically, use the following code as a guide to the technique, or
read "A Parallel Algorithm for the Efficient Solution of a General
Class of Recurrence Equations", Kogge, P.M. and H.S. Stone, IEEE
Transactions on Computers, Vol. C-22, No. 8, August 1973, pp. 786-793
(yes, the method really _has_ been around for 18 years):

      DO 10 i = 1, N
         A(i) = B(i)
   10 continue
      j = 1
   20 IF (j < N) THEN
         DO 30 i = j+1, N
            A(i) = A(i) + A(i-j)
   30    continue
         j = 2 * j
         GO TO 20
      ENDIF

This is Fortran 77 code (with lower case as a readability extension).
The contents of both DO loops fully vectorize on the Crays, etc. If
the DO loops are replaced with the Fortran 90 style array assignments

      A(1:N) = B(1:N)

and

      A(j+1:N) = A(j+1:N) + A(1:N-j)

respectively, then the code will be fully parallel on the Connection
Machine. The 'while' loop is scalar and will execute ceiling(LOG2(N))
times. Although it would be nicer if the compiler automatically
recognized recursions of this kind and applied this optimization, this
is not too bad and gives the user direct control. Fujitsu apparently
already does this; maybe some of the compilers of the other companies
on your list do also (or will soon follow suit).

J. Giles
jlg@cochiti.lanl.gov (Jim Giles) (05/08/91)
In article <23348@lanl.gov>, jlg@cochiti.lanl.gov (Jim Giles) writes:

|> [...]
|>       DO 30 i = j+1, N
|>          A(i) = A(i) + A(i-j)
|>    30 continue

Whoops! I just realized I still have a dependency here! That's what I
get for reading articles on functional programming just before a
Fortran-like posting. The '30' loop should be replaced with the two
loops below:

      DO 30 i = j+1, N
         temp(i) = A(i) + A(i-j)
   30 continue
      DO 40 i = j+1, N
         A(i) = temp(i)
   40 continue

These vectorize and eliminate the dependency. It's all so much simpler
in a language with whole-array syntax! Sorry for the inconvenience.

J. Giles
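[For readers without Fortran at hand, Giles's corrected scheme
transcribes directly into C; it turns the recurrence A(i) = A(i-1) +
B(i) into the prefix sums of B in about log2(N) passes, each free of
carried dependences. An editorial transcription, with a
caller-supplied scratch array standing in for temp:]

```c
#include <stddef.h>
#include <string.h>

/* Recursive doubling for the first-order recurrence, after Giles's
 * corrected Fortran above.  Each pass adds a copy of the array
 * shifted by j into a temporary, then copies back, so both inner
 * loops are dependence-free; j doubles until the full prefix sum
 * a[i] = b[0] + ... + b[i] is in place. */
void prefix_sum(size_t n, const double *b, double *a, double *temp)
{
    memcpy(a, b, n * sizeof *a);
    for (size_t j = 1; j < n; j *= 2) {
        for (size_t i = j; i < n; i++)   /* vectorizable */
            temp[i] = a[i] + a[i-j];
        for (size_t i = j; i < n; i++)   /* vectorizable copy-back */
            a[i] = temp[i];
    }
}
```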
csrdh@marlin.jcu.edu.au (Rowan Hughes) (05/08/91)
In <11921@mentor.cc.purdue.edu> hrubin@pop.stat.purdue.edu (Herman
Rubin) writes:

> The question is not what the compiler can handle; the older vector machines'
> compilers could handle this very well. The question is what can the vector
                                                         ^^^^^^^^^^^^^^^^

Herman, I think you'll find that backwards recurrence has been around
for a while. Forward recurrence is still a problem (or is it the other
way around?).
--
Rowan Hughes                              James Cook University
Marine Modelling Unit                     Townsville, Australia
Dept. Civil and Systems Engineering       csrdh@marlin.jcu.edu.au
bjornl@sics.se (Björn Lisper) (05/08/91)
In article <23348@lanl.gov> jlg@cochiti.lanl.gov (Jim Giles) writes:

> In article <1991May6.035310.26794@marlin.jcu.edu.au>,
> csrdh@marlin.jcu.edu.au (Rowan Hughes) writes:
>> [...]
>> The latest Fujitsu compiler (for the VP2000 series) will handle 1st order
>> backwards recurrence, ie A(I)=A(I-1)+B(I), as well as the usual forward
>> recurrence, in full vector mode. Many old scalar problems can now be
>> vectorized. I'd be interested to hear if Cray/ETA/Alliant etc have this
>> capability too.
>
> First order recurrence can be rewritten using a technique called
> recursive doubling (which is, presumably, what the Fujitsu compiler
> is doing automatically).

The transformation of linear recurrences by recursive doubling uses
associativity and distributivity of the arithmetic operators. Thus,
the numerical properties of the code are altered (since floating point
operations are not perfectly associative, due to rounding errors). I
think compilers should use such transformations with great care. Does
the Fujitsu compiler provide a switch to control the vectorization of
recurrences?

Bjorn Lisper
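[Lisper's caution is easy to demonstrate: regrouping a sum changes
which intermediate results get rounded. A tiny editorial C example;
the volatile qualifiers merely keep a compiler from folding the
constants away:]

```c
/* Floating-point addition is not associative: regrouping a reduction
 * (as recursive doubling does) can change the rounded result.
 * Returns 1 when the two groupings of 0.1 + 0.2 + 0.3 differ in
 * IEEE double arithmetic. */
int regrouping_changes_result(void)
{
    volatile double a = 0.1, b = 0.2, c = 0.3;
    double left  = (a + b) + c;   /* left-to-right order */
    double right =  a + (b + c);  /* regrouped order */
    return left != right;         /* same math, different bits */
}
```

[On IEEE doubles the left grouping yields 0.6000000000000001 while the
right yields 0.6, so the difference is not merely theoretical.]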
anik@crhc.uiuc.edu (Sadun Anik) (05/08/91)
In article <GLEW.91May7175310@pdx007.intel.com> glew@pdx007.intel.com (Andy Glew) writes:
[ stuff deleted ]
So: why not combine vector memory access instructions that convey
access pattern, with scalar computational operations?
I thought Davidson's SMA architecture could do this. If I remember
correctly, a decoupled memory access unit could handle vector access and other
data structures. I don't know how much of the original SMA concepts are
implemented in ZS1 processor.
--
Sadun Anik, U of Illinois at Urbana-Champaign
e-mail: anik@crhc.uiuc.edu
rtp1@quads.uchicago.edu (raymond thomas pierrehumbert) (05/08/91)
Andrew Glew, of Intel, writes:

> So: why not combine vector memory access instructions that convey
> access pattern, with scalar computational operations?

Yes, yes, my point exactly. Please do it! I'll take 10.

In fact, I think a fruitful area for extending the current
architectures is a more general model for pre-loading the cache. Doing
pre-loads by constant strides rather than by contiguous lines would
already be a big performance boost, but I could also imagine more
general solutions, where you gave the memory subsystem a vector of
addresses which specified the pre-load pattern.

I'm just a user, though. I'm not at all clear how hard these things
would be to implement in hardware.
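[Absent such hardware, a rough software approximation exists: issue
prefetch hints a fixed distance ahead of a strided loop. An editorial
sketch using the GCC/Clang-specific __builtin_prefetch extension,
which is only a hint and may be ignored; the look-ahead distance AHEAD
is an untuned, illustrative constant:]

```c
#include <stddef.h>

#define AHEAD 8   /* prefetch distance in loop iterations (untuned) */

/* Sum every stride-th element, hinting the element AHEAD iterations
 * in the future into cache before it is needed.  The third argument
 * to __builtin_prefetch (temporal locality, 0..3) is set low since
 * each element is touched once. */
double strided_sum(const double *x, size_t n, size_t stride)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i += stride) {
        if (i + AHEAD * stride < n)
            __builtin_prefetch(&x[i + AHEAD * stride], 0 /* read */, 1);
        s += x[i];
    }
    return s;
}
```

[This recovers some of the benefit of a strided pre-load instruction,
at the cost of extra issue slots; it cannot express the more general
"vector of addresses" pattern without a gather-style prefetch loop.]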
fu@crhc.uiuc.edu (John W. C. Fu) (05/09/91)
In article <1991May8.155455.14491@midway.uchicago.edu>
rtp1@quads.uchicago.edu (raymond thomas pierrehumbert) writes:

> Andrew Glew, of Intel, writes
>> So: why not combine vector memory access instructions that convey
>> access pattern, with scalar computational operations?
>
> Yes, yes, my point exactly. Please do it! I'll take 10.
> In fact, I think a fruitful area for extending the current
> architectures is in a more general model for pre-loading
> the cache. Doing pre-loads by constant strides rather
> than by contiguous lines would already be a big
> performance boost, but I could also imagine more
> general solutions, where you gave the memory subsystem a
> vector of addresses which specified the pre-load pattern.

It just so happens we have done some preliminary work on prefetching
data into vector caches. The initial results seem quite encouraging.
If you are interested, the references are:

J. W. C. Fu and J. H. Patel, "Data Prefetching Strategies for Vector
Cache Memories," 5th Int'l Parallel Processing Symp., May 1991.

J. W. C. Fu and J. H. Patel, "Data Prefetching in Multiprocessor
Vector Cache Memories," Proc. of the 18th Int'l Symp. on Computer
Architecture, May 1991.

John Fu
Center for Reliable and High Performance Computing
University of Illinois at Urbana-Champaign
fu@crhc.uiuc.edu
kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) (05/09/91)
In article <1991May8.155455.14491@midway.uchicago.edu>
rtp1@quads.uchicago.edu (raymond thomas pierrehumbert) writes:

> In fact, I think a fruitful area for extending the current
> architectures is in a more general model for pre-loading
> the cache. Doing pre-loads by constant strides rather
> than by contiguous lines would already be a big
> performance boost, but I could also imagine more
> general solutions, where you gave the memory subsystem a
> vector of addresses which specified the pre-load pattern.
>
> I'm just a user though. I'm not at all clear how hard these
> things would be to implement in hardware.

The best place to preload the cache is in the compiler. :-)

-kym
billm@oakhill.sps.mot.com (Bill Moyer) (05/11/91)
In article <GLEW.91May7175310@pdx007.intel.com> glew@pdx007.intel.com
(Andy Glew) writes:

> Just blueskying, but:
>
> The biggest advantage of vector instructions is that they convey
> to the memory system the access pattern. You have to add a lot of
> logic to your memory buffers to detect patterns beyond the simple "the
> next address is a stride of 64 past the previous" address differencing.
>
> Most vector implementations are pipelined, not element-by-element
> parallel. It isn't too hard to do superscalar/superpipelined
> implementations of scalar instruction sets that obtain parallelism
> comparable to most vector implementations. Multiple memory ports get
> harder, but are doable. But the access pattern info isn't there.
>
> So: why not combine vector memory access instructions that convey
> access pattern, with scalar computational operations?

Check out the article "The WM Computer Architecture" by Wm. (Bill)
Wulf (I believe it appeared in CAN circa early 1988). He describes an
interesting multi-op/instruction load/store architecture (RISC-like)
based on a producer/consumer model, with special "streaming" (his
terminology) loads and stores which enter FIFO queues. The semantics
are:

    Sin.size   fifo#, stride, base, count
    Sout.size  fifo#, stride, base, count

As Wulf notes: "The similarity of streaming to 'vector load/store'
operations should be apparent, and the rationale for its existence is
similar -- it reduces the number of instructions in loop bodies and
signals predictable reference patterns to the memory system."

All in all a very interesting paper...

Bill Moyer
insert disclaimer here --> <>
Senior Paleontologist / MC88110 design
Motorola Microprocessor Products Group
lindsay@gandalf.cs.cmu.edu (Donald Lindsay) (05/12/91)
In article <2646@fornax.UUCP> bremner@cs.sfu.ca (David Bremner)
writes:

>> I think this is partly because the vector model of parallelism is
>> so rigid; optimization for the superscalars involves a bigger bag
>> of tricks.
>
> I think the vector model is actually quite a bit nicer.

There's yet another model -- the data-parallel one. By this I mean
that one says, e.g., "a=5", and 64,000 a's are set. This is obviously
the natural programming style of SIMDs like the Connection Machine --
or so I thought, until I heard about the CM Fortran 1.0 compiler from
Thinking Machines. It uses the Fortran 90 style of vector operations,
e.g.

      x(1:100:2) = y(1:50) + z(1:50)

and (for scientific problems) it seems to be generating faster code
than ever before. TMC is claiming that they commonly see 5 GFLOPS,
whereas data-parallel Paris was getting more like 1.5 GFLOPS.
--
Don D.C.Lindsay Carnegie Mellon Robotics Institute