[comp.arch] Vector vs Cache/Superscalar

rtp1@quads.uchicago.edu (raymond thomas pierrehumbert) (05/04/91)

McCalpin comments that he finds vectorization (even on the Cyber 205)
simpler, more intuitive and more transportable than the optimization
techniques used on cached machines like the RS/6000.

I think this is partly because the vector model of parallelism is
so rigid; optimization for the superscalars involves a bigger bag
of tricks.  Still, I have found that there are fewer things they
choke on, and that it is easier to localize optimization in a few
reusable routines.  Two case-studies:

(a)  I have some semi-spectral 2D fluid codes (finite diff in
one direction, spectral in the other) which I never got
around to optimizing on the Cyber, because it would have involved
some major structural changes.  On the other hand, on the RS/6000,
i860-based machines, and even my hated DN10000, 1D FFT's scream
right along at nearly the machine's top speed (lots of data re-use).
In this case, a simple plug-in of canned FFT's gave a major
speed-up.

(b) Tridiagonal solving.  Comes up in lots of codes, and it is
a real vector-breaker.  In fact, vector machines choke on all
sorts of recursion, whereas the superscalars just love them.
On the RS/6000, the tridiag code basically vanished, whereas on
the vector Stardent, it was a bottleneck.
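
(For concreteness, here is a bare-bones sketch of the usual scalar
Thomas algorithm; names are mine, not from any library.  Note that
both the forward sweep and the back substitution are first-order
recurrences in i, which is exactly what a vector pipeline cannot
chain on.)

      subroutine tridag(a, b, c, d, cp, dp, x, n)
c     solve a tridiagonal system: a = sub-, b = main-, c = super-
c     diagonal, d = right-hand side, x = solution; cp, dp = scratch.
      integer n, i
      real a(n), b(n), c(n), d(n), cp(n), dp(n), x(n), m
c     forward sweep: step i needs the results of step i-1
      cp(1) = c(1)/b(1)
      dp(1) = d(1)/b(1)
      do 10 i = 2, n
         m = b(i) - a(i)*cp(i-1)
         cp(i) = c(i)/m
         dp(i) = (d(i) - a(i)*dp(i-1))/m
  10  continue
c     back substitution: another first-order recurrence
      x(n) = dp(n)
      do 20 i = n-1, 1, -1
         x(i) = dp(i) - cp(i)*x(i+1)
  20  continue
      return
      end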

A third example that occurs to me is evaluation of transcendental
functions.  Lots of recursion, and pretty efficient on the RISCs.
On a vector machine, you have to keep iterating the vector until
the slowest-converging argument is done converging (unless you
do a lot of reshuffling in memory).
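
(A toy illustration of what I mean, entirely mine: Newton's iteration
for the square root, applied elementwise.  Each sweep over the vector
vectorizes beautifully, but the sweeps repeat until the slowest
element converges, so the fast ones get recomputed for nothing.)

c     y(i) converges to sqrt(x(i)); assumes x > 0 and y initialized
c     to a positive guess, e.g. y = x, with tol a relative tolerance.
c     All n elements are iterated until the worst one passes the test.
      logical done
  10  done = .true.
      do 20 i = 1, n
         ynew = 0.5*(y(i) + x(i)/y(i))
         if (abs(ynew - y(i)) .gt. tol*ynew) done = .false.
         y(i) = ynew
  20  continue
      if (.not. done) go to 10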

Now, the $64 question:  Why no supercomputer based on an 
architecture for the processor like the RS/6000, BUT with
your extra $2M buying bandwidth to memory like the Cray's
(no cache)?  This would seem to be a real winner. You could
simulate vectorization on it, but it would have all the
flexibility of the newer machines.

hrubin@pop.stat.purdue.edu (Herman Rubin) (05/04/91)

In article <1991May4.031835.7979@midway.uchicago.edu>, rtp1@quads.uchicago.edu (raymond thomas pierrehumbert) writes:
> McCalpin comments that he finds vectorization (even on the Cyber 205)
> simpler, more intuitive and more transportable than the optimization
> techniques used on cached machines like the RS/6000.
 
> I think this is partly because the vector model of parallelism is
> so rigid; optimization for the superscalars involves a bigger bag
> of tricks.  Still, I have found that there are fewer things they
> choke on, and that it is easier to localize optimization in a few
> reusable routines.  Two case-studies:

			...................

There are differences in the various types of vector machines; most
have highly rigid vectors, but the 205 has quite a bit of flexibility.

> (b) Tridiagonal solving.  Comes up in lots of codes, and it is
> a real vector-breaker.  In fact, vector machines choke on all
> sorts of recursion, whereas the superscalars just love them.
> On the RS/6000, the tridiag code basically vanished, whereas on
> the vector Stardent, it was a bottleneck.

There are tricky ways of doing this efficiently on vector machines,
especially flexible ones.  These use partitioning: the system is split
into many blocks whose eliminations are independent, and can therefore
be done as vector operations across the blocks, leaving only a small
reduced system to solve.

> A third example that occurs to me is evaluation of transcendental
> functions.  Lots of recursion, and pretty efficient on the RISCs.
> On a vector machine, you have to keep iterating the vector until
> the slowest-converging argument is done converging (unless you
> do a lot of reshuffling in memory).

But this reshuffling in memory on the 205 is dirt cheap!  It could
be made cheaper still by adding more instructions, in the manner of a
CISC vector machine.  In fact, this type of management is quite useful
in the most efficient approach to generating non-uniform random
numbers, acceptance-rejection.  The vector methods are quite
competitive with the older scalar versions even on scalar machines;
they merely use more memory, which is no longer a bottleneck here.
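
As a self-contained toy example (mine, and deliberately simple):
sampling points uniformly in the unit disk by rejection from the
enclosing square.  The accept test is a vector mask, and packing the
accepted points to the front of the output is exactly the compress
operation that the 205 makes cheap.

      subroutine disk(xs, ys, n, m)
c     generate n candidates; the m .le. n accepted points end up
c     compressed into xs(1..m), ys(1..m).  rand() is assumed to
c     return a uniform deviate in [0,1); a common extension, not
c     standard Fortran 77.
      integer n, m, i
      real xs(n), ys(n), u, v, rand
      m = 0
      do 10 i = 1, n
         u = 2.0*rand() - 1.0
         v = 2.0*rand() - 1.0
         if (u*u + v*v .le. 1.0) then
            m = m + 1
            xs(m) = u
            ys(m) = v
         endif
  10  continue
      return
      end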
-- 
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907-1399
Phone: (317)494-6054
hrubin@l.cc.purdue.edu (Internet, bitnet)   {purdue,pur-ee}!l.cc!hrubin(UUCP)

rmarq@iago.caltech.edu (Marquardt, Ron R.) (05/05/91)

In article <11875@mentor.cc.purdue.edu>, hrubin@pop.stat.purdue.edu
(Herman Rubin) writes...

[regarding previous comments contrasting the RS/6000 superscalar
 architecture with the Stardent vector architecture]

>> (b) Tridiagonal solving.  Comes up in lots of codes, and it is
>> a real vector-breaker.  In fact, vector machines choke on all
>> sorts of recursion, whereas the superscalars just love them.
>> On the RS/6000, the tridiag code basically vanished, whereas on
>> the vector Stardent, it was a bottleneck.
> 
>There are tricky ways of doing this efficiently on vector machines,
>especially flexible ones.  These use partitioning: the system is split
>into many blocks whose eliminations are independent, and can therefore
>be done as vector operations across the blocks, leaving only a small
>reduced system to solve.
> 

Our experience with these machines (we have two IBM RS/6000's, a
530 and a 540, and three of what are now Stardent 3000's) has 
shown that for matrix problems exceeding 200x200, tridiagonal
or otherwise, the Stardents can be considerably faster than
the RS/6000's.  The relative performance difference increases
with the size of the problem and can be as large as a factor
of 10 (for naively written code).  This problem can be traced to
the small TLB (mapping 512K) on the IBM machines.  Back when Stardent
was Ardent, they had similar problems, which were more easily
fixed since their TLB is external.  The Stardent ETLBs now can
map >128 Mb (and I believe it goes as high as the maximum allowed
memory in the system, which escapes me).

For an admittedly simple benchmark, a traditionally coded matrix
multiply, the 530 is actually SLOWER than a Sparc 1, and the 540
only marginally faster, for 701x701 [530: 568 sec., Sparc 1: 525 sec.,
and 540: 470 sec.].  Before people complain too loudly, it is very
easy to recode a matrix multiply for better efficiency and get the 540
time down to 49 sec., BUT it is not necessarily true that for all
matrix routines such a recoding is simple, or even possible.  Better
compiler technology would certainly help, and IBM's latest beta version
for the RS/6000's (exact version ID escapes me) promises a great
deal of improvement.  For arbitrarily complex and general problems,
however, a larger TLB would certainly benefit the RS/6000 family.
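
(For the curious, the usual recoding is the standard blocking trick.
A generic sketch, with the block size nb purely illustrative: the
triple loop is traversed in nb x nb tiles, so each tile of the
operands is reused while it still sits in the cache and in the pages
the TLB has mapped.)

      parameter (nb = 64)
c     c = c + a*b, all n x n, traversed tile by tile
      do 50 jj = 1, n, nb
         do 40 kk = 1, n, nb
            do 30 j = jj, min(jj+nb-1, n)
               do 20 k = kk, min(kk+nb-1, n)
                  do 10 i = 1, n
                     c(i,j) = c(i,j) + a(i,k)*b(k,j)
  10              continue
  20           continue
  30        continue
  40     continue
  50  continue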

By the way, although they are not superscalar, the latest HP 
workstations seem to be avoiding the TLB problems of the IBMs.
Included in the TLB scheme for the Snakes is TLB mapping for
four large blocks of memory, each 16Mb in size.  From a user's point
of view, 64 Mb is far better than the 512Kb mapped by the RS/6000's
TLB.


-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Ron Marquardt				E-mail: rmarq@iago.caltech.edu
Solid State Device Physics Group
California Institute of Technology

eugene@nas.nasa.gov (Eugene N. Miya) (05/05/91)

Vectors are simpler and "rigid:"
In article <1991May4.031835.7979@midway.uchicago.edu> rtp1@quads.uchicago.edu
(raymond thomas pierrehumbert) writes:
>Now, the $64 question:  Why no supercomputer based on an 
>architecture for the processor like the RS/6000, BUT with
>your extra $2M buying bandwidth to memory like the Cray's
>(no cache)?

1) Few supercomputers exist.  New architectures are expensive propositions.
2) Vectors were proposed in the 1960s by IBM, built in the 1970s by Cray
and others.  Superscalars were not implemented until the 1980s.
3) Technological and economic factors play on any person trying to
design and build a super.  Do you want to build one?  I didn't think so.
Can I get my $64?

Ain't hindsight wonderful? 8^)  It's not only clear, but it's cheap.

mccalpin@perelandra.cms.udel.edu (John D. McCalpin) (05/05/91)

> On 4 May 91 03:18:35 GMT, rtp1@quads.uchicago.edu (raymond thomas pierrehumbert) said:

raymond> McCalpin comments that he finds vectorization (even on the
raymond> Cyber 205) simpler, more intuitive and more transportable
raymond> than the optimization techniques used on cached machines like
raymond> the RS/6000.

raymond> (a)  I have some semi-spectral 2D fluid codes (finite diff in
raymond> one direction, spectral in the other) which I never got
raymond> around to optimizing on the Cyber, because it would have involved
raymond> some major structural changes. 

I have had no trouble optimizing similar codes for the ETA-10.  Of
course, my codes were designed with long vectors in mind, and it
sounds like yours were not.  It also helps if the problems are large
enough.  On the ETA-10, one did not get 90% of full speed on FFT's
until one was calculating a minimum of several hundred of them
simultaneously.  I agree that it was often very difficult to
optimize codes written for other machines on the Cyber.  But I found
the long vector programming model so easy to use that I often re-wrote
the applications from scratch faster than I could have "optimized" the
original... 

raymond> (b) Tridiagonal solving.  Comes up in lots of codes, and it is
raymond> a real vector-breaker.  In fact, vector machines choke on all
raymond> sorts of recursion, whereas the superscalars just love them.
raymond> On the RS/6000, the tridiag code basically vanished, whereas on
raymond> the vector Stardent, it was a bottleneck.

In fluid dynamics codes, tridiagonal systems almost always arise in
groups (e.g. one system per row of the 2-D domain).  In these cases it
is trivial to vectorize sideways across the systems and get full
vector performance.  You still have a vector divide or two in there,
but as long as you are not working on an IBM 3090VF that should not be
too much of a problem. ;-)
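
(In outline, with my own array names: the forward elimination for m
systems of length n looks like the following.  The inner loop runs
across the systems, carries no dependence, and vectorizes with stride
one; back substitution is handled the same way.)

c     a, b, c = sub-, main-, super-diagonals; d = right-hand sides;
c     system k occupies row k of each array.
      do 10 k = 1, m
         c(k,1) = c(k,1)/b(k,1)
         d(k,1) = d(k,1)/b(k,1)
  10  continue
      do 30 i = 2, n
         do 20 k = 1, m
            b(k,i) = b(k,i) - a(k,i)*c(k,i-1)
            c(k,i) = c(k,i)/b(k,i)
            d(k,i) = (d(k,i) - a(k,i)*d(k,i-1))/b(k,i)
  20     continue
  30  continue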

Once again, I agree that (for example) the RS/6000 is a wonderful
machine.  I am now completing the optimization of a 2-D finite
difference vorticity model which (I currently estimate) will run at a
sustained 15 MFLOPS (64-bit) on the RS/6000-320.  *But* the
optimizations which enabled me to do this are *much* uglier and lead
to *much* less maintainable code than the original long vector code.
If I did not anticipate a need for 200 days wall time to complete the
work with the original code, I would not have considered these
optimizations worthwhile.

(The optimizations that I use are: (1) use IBM ESSL routines for FFT's
and tridiagonal solves; (2) inner loop unrolling; (3) outer loop
unrolling followed by inner loop jamming/fusion; (4) subroutine
inlining followed by manual loop jamming followed by unrolling.)
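
(To give the flavor of (2) and (3), a generic sketch rather than the
actual model code: for a matrix-vector product, unroll the outer loop
by two and jam the two copies of the inner loop together, so that each
x(i) fetched from the cache feeds two columns of the matrix.  Assumes
n even.)

      do 20 j = 1, n, 2
         s0 = 0.0
         s1 = 0.0
         do 10 i = 1, n
c           one load of x(i) now serves two columns of a
            s0 = s0 + a(i,j)*x(i)
            s1 = s1 + a(i,j+1)*x(i)
  10     continue
         y(j)   = s0
         y(j+1) = s1
  20  continue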
--
John D. McCalpin			mccalpin@perelandra.cms.udel.edu
Assistant Professor			mccalpin@brahms.udel.edu
College of Marine Studies, U. Del.	J.MCCALPIN/OMNET

csrdh@marlin.jcu.edu.au (Rowan Hughes) (05/06/91)

>raymond> (b) Tridiagonal solving.  Comes up in lots of codes, and it is
>raymond> a real vector-breaker.  In fact, vector machines choke on all
>raymond> sorts of recursion, whereas the superscalars just love them.
>raymond> On the RS/6000, the tridiag code basically vanished, whereas on
>raymond> the vector Stardent, it was a bottleneck.

The latest Fujitsu compiler (for the VP2000 series) will handle 1st order
backwards recurrence, ie A(I)=A(I-1)+B(I), as well as the usual forward
recurrence, in full vector mode. Many old scalar problems can now be
vectorized. I'd be interested to hear if Cray/ETA/Alliant etc have this
capability too.

-- 
Rowan Hughes                                James Cook University
Marine Modelling Unit                       Townsville, Australia.
Dept. Civil and Systems Engineering         csrdh@marlin.jcu.edu.au

rtp1@quads.uchicago.edu (raymond thomas pierrehumbert) (05/06/91)

>recurrence, in full vector mode. Many old scalar problems can now be
>vectorized. I'd be interested to hear if Cray/ETA/Alliant etc have this
>capability too.

ETA is irrelevant, except for comp.folklore.computers, and I don't 
know about the current crop of Crays, but the new Alliants
(FX/800, FX/2800) are based on the i860 chip, and so they do
their "vectorization" by generating pipelined code for the
RISC chip.  The i860 has no problems with recursion, except
for possible delays in re-use of a result (during which you
usually want to store something else, or do another computation
not dependent on the result).

I wonder how the Fujitsu handles the recursion.  For long enough
vectors to make it worthwhile, there is a pretty well known hack
to make sum-reduction vectorizable (split the vector in two,
do a vector add, split again, etc.).  Works for dot products as
well, of course.  The RISC architectures can handle a much
more general kind of recursion.
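
(In code, the hack looks like the following; for brevity assume n is
a power of two and that clobbering s is acceptable.  Each pass is a
dependence-free vector add of the top half into the bottom half.)

      m = n
  10  if (m .gt. 1) then
         m = m/2
         do 20 i = 1, m
            s(i) = s(i) + s(i+m)
  20     continue
         go to 10
      endif
c     the sum of the original s(1..n) is now in s(1)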

Concerning McCalpin's comments on vectorizability and maintainability
of code, I'm not claiming that the codes I'm talking about are
intrinsically unvectorizable.  I'm just giving some examples from
one particular (and perhaps rather lazy, as far as coding goes)
real-world user.  My experience also reflects that of a lot of
postdocs and students that have worked for me.  We write a
lot of banged-out code, get interesting results, and somehow
never get around to doing much optimization until the optimization
becomes more interesting to do than looking at the scientific
results (at which point, we've basically moved onto something 
else anyway).  My own experience with the IBM RS/6000 architecture
is that we have gotten more payoff from a little optimization effort,
and that it is easier to plug in optimized ESSL routines.  

Tridiagonal solving is a case in point.  True, in fluids codes,
you can always vectorize in the direction(s) perpendicular to
the recursion.  On the other hand, this still means you have
to do a little thinking to write a special purpose tridiag 
solver suited to your particular storage configuration. Not
really all that hard, but I find it a lot easier to have
a super-optimized canned tridiag solver I don't even have
to recompile, and then simply do a bunch of calls to this
solver; this code structure lays the groundwork for 
parallelization as well. True, you do still have to think
about storage layout because of the performance hit for non-unit
strides on cache-based architectures (wish somebody would build
a cache that could pre-load more general patterns than contiguous
memory).

While I'm waxing nostalgic, some of the codes I have seen for
cache-based machines have seemed hauntingly familiar, and I 
have only recently realized why:  They hark back to the 70's,
when computers had too little RAM to hold your field, and
so doing fluid dynamics on computers meant loading stuff into
memory from disk, crunching the life out of it, and squirting
the result back to disk.  There was of course a huge premium
on re-use of data (so I'm told-- this was really before my
time, of course).

hrubin@pop.stat.purdue.edu (Herman Rubin) (05/06/91)

In article <1991May6.035310.26794@marlin.jcu.edu.au>, csrdh@marlin.jcu.edu.au (Rowan Hughes) writes:

			.....................

> The latest Fujitsu compiler (for the VP2000 series) will handle 1st order
> backwards recurrence, ie A(I)=A(I-1)+B(I), as well as the usual forward
> recurrence, in full vector mode. Many old scalar problems can now be
> vectorized. I'd be interested to hear if Cray/ETA/Alliant etc have this
> capability too.

The question is not what the compiler can handle; the older vector machines'
compilers could handle this very well.  The question is what can the vector
hardware handle.  Vector, and other, computers get their speed by allowing
the interval between operations to be short compared to the time for an
operation to complete.

If instructions can issue in one cycle, and the adder is segmented so
that a new addition can start one cycle after the previous one, but an
addition takes 5 cycles to complete (not at all unusual), the cited
code cannot use the vector capabilities as written; it can only
complete an operation every 5 cycles.
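
Concretely, with those numbers a length-N recurrence takes about 5N
cycles, while N independent additions would stream through the same
pipeline in about N+4 cycles: nearly a factor of five lost to the
dependence.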

Prohibiting the hardware from doing this cannot pay.
-- 
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907-1399
Phone: (317)494-6054
hrubin@l.cc.purdue.edu (Internet, bitnet)   {purdue,pur-ee}!l.cc!hrubin(UUCP)

csrdh@marlin.jcu.edu.au (Rowan Hughes) (05/07/91)

In <1991May6.055943.6234@midway.uchicago.edu> rtp1@quads.uchicago.edu (raymond thomas pierrehumbert) writes:
>I wonder how the Fujitsu handles the recursion.  For long enough
>vectors to make it worthwhile, there is a pretty well known hack
>to make sum-reduction vectorizable (split the vector in two,
>do a vector add, split again, etc.).  Works for dot products as
>well, of course.  The RISC architectures can handle a much
>more general kind of recursion.

The sum-reduction has been on vector machines for many years.  The
recursion I meant was A(I)=A(I-1)+ ... in general, when reordering
is just not possible.  The theoretical solution has also been around for
several years, but Fujitsu appears to be the first vendor to have used it.
See Willi Schonauer's book for the theory:
"Scientific Computing on Vector Computers"
Willi Schonauer, Elsevier Sci Pub 1987, ISBN 0-444-70288-1
Although a little dated now, it gives an excellent review of several
vector architectures, and algorithms.

-- 
Rowan Hughes                                James Cook University
Marine Modelling Unit                       Townsville, Australia.
Dept. Civil and Systems Engineering         csrdh@marlin.jcu.edu.au

glew@pdx007.intel.com (Andy Glew) (05/07/91)

Just blueskying, but:

    The biggest advantage of vector instructions is that they convey
to the memory system the access pattern.  You have to add a lot of
logic to your memory buffers to detect patterns beyond the simple "the
next address is a stride of 64 past the previous" address
differencing.

    A secondary advantage of vector instructions is that they map to
multiple simple operations.

    Most vector implementations are pipelined, not element-by-element
parallel.  It isn't too hard to do superscalar/superpipelined
implementations of scalar instruction sets that obtain parallelism
comparable to most vector implementations. Multiple memory ports get
harder, but are doable. But the access pattern info isn't there.

    A disadvantage of architecting a vector instruction set is the
combinatoric explosion: first you provide element by element vector
ops, then vector reductions, then recurrences; then you have to decide
which vector ops are going to chain...

    So: why not combine vector memory access instructions that convey
access pattern, with scalar computational operations?



--

Andy Glew, glew@ichips.intel.com
Intel Corp., M/S JF1-19, 5200 NE Elam Young Parkway, 
Hillsboro, Oregon 97124-6497

This is a private posting; it does not indicate opinions or positions
of Intel Corp.

jlg@cochiti.lanl.gov (Jim Giles) (05/08/91)

In article <1991May6.035310.26794@marlin.jcu.edu.au>, csrdh@marlin.jcu.edu.au (Rowan Hughes) writes:
> [...]
> The latest Fujitsu compiler (for the VP2000 series) will handle 1st order
> backwards recurrence, ie A(I)=A(I-1)+B(I), as well as the usual forward
> recurrence, in full vector mode. Many old scalar problems can now be
> vectorized. I'd be interested to hear if Cray/ETA/Alliant etc have this
> capability too.

First order recurrences can be rewritten using a technique called
recursive doubling (which is presumably what the Fujitsu compiler is
doing automatically).  If you can't get your compiler to do it
automatically, use the following code as a guide to the technique (or
read "A Parallel Algorithm for the Efficient Solution of a General
Class of Recurrence Equations", Kogge, P.M. and H.S. Stone, IEEE
Transactions on Computers, Vol. C-22, No. 8, August 1973, pp. 786-793)
- (yes, the method really _has_ been around for 18 years):

   
      DO 10 i = 1, N
         A(i) = B(i)
  10  continue

      j = 1

  20  IF (j .lt. N) THEN

         DO 30 i = j+1, N
            A(i) = A(i) + A(i-j)
  30     continue

         j = 2 * j

         GO TO 20
      ENDIF

This is Fortran 77 code (with lower case as a readability extension).
The contents of both DO loops fully vectorize on the Crays, etc.  If the
DO loops are replaced with the Fortran 90 style array assignments:

   A(1:N) = B(1:N)
and
   A(j+1:N) = A(j+1:N) + A(1:N-j)

respectively, then the code will be fully parallel on the Connection Machine.
The 'while' loop is scalar and will execute ceiling(LOG2(N)) times.

Although it would be nicer if the compiler automatically saw recursions of
this kind and applied this optimization, this is not too bad and gives the
user direct control.  Fujitsu apparently already does this, maybe some of the
compilers of the other companies on your list do also (or will soon follow
suit).

J. Giles

jlg@cochiti.lanl.gov (Jim Giles) (05/08/91)

In article <23348@lanl.gov>, jlg@cochiti.lanl.gov (Jim Giles) writes:
|> [...]
|> 
|>          DO 30 i = j+1, N
|>             A(i) = A(i) + A(i-j)
|>   30     continue

Whoops!  I just realized I still have a dependency here!  That's what I get
for reading articles on functional programming just before a Fortran-like
posting.

The '30' loop should be replaced with the two loops below:

         DO 30 i = j+1, N
            temp(i) = A(i) + A(i-j)
  30     continue

         DO 40 i = j+1, N
            A(i) = temp(i)
  40     continue

These vectorize and eliminate the dependency.  It's all so much simpler in
a language with whole array syntax!
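
For the record, in Fortran 90 array syntax the whole corrected kernel
collapses to the lines below, and the temporary disappears, since the
right-hand side is evaluated in full before anything is stored:

   A(1:N) = B(1:N)
   j = 1
   DO WHILE (j .lt. N)
      A(j+1:N) = A(j+1:N) + A(1:N-j)
      j = 2*j
   END DO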

Sorry for the inconvenience.

J. Giles 

csrdh@marlin.jcu.edu.au (Rowan Hughes) (05/08/91)

In <11921@mentor.cc.purdue.edu> hrubin@pop.stat.purdue.edu (Herman Rubin) writes:

>The question is not what the compiler can handle; the older vector machines'
>compilers could handle this very well.  The question is what can the vector
                    ^^^^^^^^^^^^^^^^
Herman,

I think you'll find that backwards recurrence has been around for a while.
Forward recurrence is still a problem (or is it the other way around?).

-- 
Rowan Hughes                                James Cook University
Marine Modelling Unit                       Townsville, Australia.
Dept. Civil and Systems Engineering         csrdh@marlin.jcu.edu.au

bjornl@sics.se (Bj|rn Lisper) (05/08/91)

In article <23348@lanl.gov> jlg@cochiti.lanl.gov (Jim Giles) writes:
>In article <1991May6.035310.26794@marlin.jcu.edu.au>,
>csrdh@marlin.jcu.edu.au (Rowan Hughes) writes:
>> [...]
>> The latest Fujitsu compiler (for the VP2000 series) will handle 1st order
>> backwards recurrence, ie A(I)=A(I-1)+B(I), as well as the usual forward
>> recurrence, in full vector mode. Many old scalar problems can now be
>> vectorized. I'd be interested to hear if Cray/ETA/Alliant etc have this
>> capability too.

>First order recurrences can be rewritten using a technique called
>recursive doubling (which is presumably what the Fujitsu compiler is
>doing automatically).

The transformation of linear recurrences by recursive doubling uses
associativity and distributivity of the arithmetical operators. Thus, the
numerical properties of the code are altered (since floating point
operations are not perfectly associative, due to rounding errors). I think
compilers should use such transformations with great care.  Does the Fujitsu
compiler provide a switch to control the vectorization of recurrences?
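
A standard example of the hazard: in single precision,
(1.0e20 - 1.0e20) + 1.0 evaluates to 1.0, while
1.0e20 + (-1.0e20 + 1.0) evaluates to 0.0, because the 1.0 is absorbed
completely when added to 1.0e20.  A reduction reordered by the compiler
can therefore return a different answer than the loop as written.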

Bjorn Lisper

anik@crhc.uiuc.edu (Sadun Anik) (05/08/91)

In article <GLEW.91May7175310@pdx007.intel.com> glew@pdx007.intel.com (Andy Glew) writes:

      [ stuff deleted ]   
   
       So: why not combine vector memory access instructions that convey
   access pattern, with scalar computational operations?


   I thought Davidson's SMA (structured memory access) architecture
could do this.  If I remember correctly, a decoupled memory access unit
could handle vector accesses and other data structures.  I don't know
how many of the original SMA concepts are implemented in the ZS-1
processor.

--
Sadun Anik, U of Illinois at Urbana-Champaign
e-mail: anik@crhc.uiuc.edu

rtp1@quads.uchicago.edu (raymond thomas pierrehumbert) (05/08/91)

Andrew Glew, of Intel, writes
>    So: why not combine vector memory access instructions that convey
>access pattern, with scalar computational operations?

Yes, yes, my point exactly.  Please do it!  I'll take 10.  
In fact, I think a fruitful area for extending the current
architectures is in a more general model for pre-loading
the cache.  Doing pre-loads by constant strides rather
than by contiguous lines would already be a big
performance boost, but I could also imagine more
general solutions, where you gave the memory subsystem a 
vector of addresses which specified the pre-load pattern.

I'm just a user though.  I'm not at all clear how hard these
things would be to implement in hardware.

fu@crhc.uiuc.edu (John W. C. Fu) (05/09/91)

In article <1991May8.155455.14491@midway.uchicago.edu> rtp1@quads.uchicago.edu (raymond thomas pierrehumbert) writes:
>Andrew Glew, of Intel, writes
>>    So: why not combine vector memory access instructions that convey
>>access pattern, with scalar computational operations?
>
>Yes, yes, my point exactly.  Please do it!  I'll take 10.  
>In fact, I think a fruitful area for extending the current
>architectures is in a more general model for pre-loading
>the cache.  Doing pre-loads by constant strides rather
>than by contiguous lines would already be a big
>performance boost, but I could also imagine more
>general solutions, where you gave the memory subsystem a 
>vector of addresses which specified the pre-load pattern.
>

It just so happens we have done some preliminary work in prefetching data
into vector caches. The initial results seem quite encouraging. If you
are interested, the references are:

J. W. C. Fu and J. H. Patel, "Data Prefetching Strategies for Vector
Cache Memories," 5th. Int'l Parallel Processing Symp., May 1991

J. W. C. Fu and J. H. Patel, "Data Prefetching in Multiprocessor
Vector Cache Memories," Proc. of 18th. Int'l Symp. on Computer    
Architecture, May 1991.


John Fu
Center for Reliable and High Performance Computing
University of Illinois at Urbana-Champaign
fu@crhc.uiuc.edu

kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) (05/09/91)

In article <1991May8.155455.14491@midway.uchicago.edu> rtp1@quads.uchicago.edu (raymond thomas pierrehumbert) writes:
>Andrew Glew, of Intel, writes
>In fact, I think a fruitful area for extending the current
>architectures is in a more general model for pre-loading
>the cache.  Doing pre-loads by constant strides rather
>than by contiguous lines would already be a big
>performance boost, but I could also imagine more
>general solutions, where you gave the memory subsystem a 
>vector of addresses which specified the pre-load pattern.
>
>I'm just a user though.  I'm not at all clear how hard these
>things would be to implement in hardware.

The best place to preload the cache is in the compiler. :-)

-kym

billm@oakhill.sps.mot.com (Bill Moyer) (05/11/91)

In article <GLEW.91May7175310@pdx007.intel.com> glew@pdx007.intel.com (Andy Glew) writes:
>Just blueskying, but:
>
>    The biggest advantage of vector instructions is that they convey
>to the memory system the access pattern.  You have to add a lot of
>logic to your memory buffers to detect patterns beyond the simple "the
>next address is a stride of 64 past the previous" address

>    Most vector implementations are pipelined, not element-by-element
>parallel.  It isn't too hard to do superscalar/superpipelined
>implementations of scalar instruction sets that obtain parallelism
>comparable to most vector implementations. Multiple memory ports get
>harder, but are doable. But the access pattern info isn't there.
>

>    So: why not combine vector memory access instructions that convey
>access pattern, with scalar computational operations?


Check out the article "The WM Computer Architecture" by Wm. (Bill) Wulf
(I believe it appeared in CAN circa early 1988).

He describes an interesting multi-op/instruction load/store architecture
(RISC-like) based on a producer/consumer model,
with special "Streaming" (his terminology) loads and stores which
enter fifo queues.

The semantics are 
      Sin.size fifo#, stride, base, count
      Sout.size fifo#, stride, base, count

As Wulf notes: "The similarity of streaming to 'vector load/store'
operations should be apparent, and the rationale for its existence
is similar -- it reduces the number of instructions in loop bodies
and signals predictable reference patterns to the memory system."


All in all a very interesting paper...

Bill Moyer				insert disclaimer here --> <>
Senior Paleontologist / MC88110 design
Motorola Microprocessor Products Group

lindsay@gandalf.cs.cmu.edu (Donald Lindsay) (05/12/91)

In article <2646@fornax.UUCP> bremner@cs.sfu.ca (David Bremner) writes:
>>I think this is partly because the vector model of parallelism is
>>so rigid; optimization for the superscalars involves a bigger bag
>>of tricks.
>I think the vector model is actually quite a bit nicer.

There's yet another model - the data parallel one. By this, I mean
that one says e.g. "a=5", and 64,000 a's are set. This is obviously
the natural programming style of SIMDs like the Connection Machine.

- or so I thought, until I heard about the CM Fortran 1.0 compiler
from Thinking Machines. It uses the Fortran-90 style of vector
operations, e.g.
	x(1:100:2) = y(1:50) + z(1:50)

and (for scientific problems) it seems to be generating faster code
than ever before. TMC is claiming that they commonly see 5 GFLOPS,
whereas data-parallel Paris was getting more like 1.5 GFLOPS.
-- 
Don		D.C.Lindsay 	Carnegie Mellon Robotics Institute