[comp.parallel] SIMD Extensions

jps@cat.cmu.edu (James Salsman) (06/30/89)

I really hate having to deal with indirect addressing on
most SIMD machines.  I wish someone would build a SIMD array
using PE's with address buffers.  Just one tiny address
buffer per processor is all I want... nothing fancy.  As
long as *ALL* the memory addresses have to come over the
global instruction stream and thus are the *SAME* for each
element, a lot of potential processing power is going to
waste!
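
To make the contrast concrete, here is a toy C sketch (all the
names and sizes are mine, nothing to do with real CM hardware) of
the difference between a broadcast address and a per-PE address
buffer, simulating the PE array in software:

  /* Toy simulation of an array of PEs, each with local memory.
     Everything here is made up for illustration. */
  #define N_PES   16
  #define MEM_SZ  64

  int mem[N_PES][MEM_SZ];   /* per-PE local memory              */
  int acc[N_PES];           /* per-PE accumulator               */
  int addr_buf[N_PES];      /* the per-PE address buffer I want */

  /* What SIMD machines give you now: one broadcast address. */
  void load_broadcast(int addr)
  {
      for (int pe = 0; pe < N_PES; pe++)
          acc[pe] = mem[pe][addr];          /* same slot everywhere   */
  }

  /* What an address buffer per PE would buy: a local gather. */
  void load_indirect(void)
  {
      for (int pe = 0; pe < N_PES; pe++)
          acc[pe] = mem[pe][addr_buf[pe]];  /* each PE picks its own  */
  }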

For example, on the Connection Machine in *Lisp, indirect
aref!!'s take FOREVER.  This is SERIOUSLY slowing down the
Production System that I wrote in *Lisp (even so, it's
faster than CMU/Soar's Production System Machine or any
other implementation of a production system that I've heard
about).  TMC has added something called "sideways arrays" to
help with indirect addressing, but the *Lisp manual is totally
obscure (so what else is new?) and from what I can tell, it
looks like "sideways" means "spread out over several
physical processors."  Ack/Pft!

The way I would hack an address buffer into the CM is by
adding a shift register to each PE.

  (1) Add a new nanoinstruction pin [or two] that selects
      memory input to the ALU between "Address A [or B]"
      from the instruction pins and the contents of the
      indirection register.

  (2) Add a new nanoinstruction pin that causes the output
      from the ALU to be shifted into the indirection register.

That's all there is to it.  Two or three new pins, a shift
register, and PE memory indirection takes 13 nanocycles
instead of a zillion.
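
For anyone who wants it spelled out, here is a rough C model of
how I picture a single PE behaving per nanocycle once the new
pins exist.  The field names and widths are mine, not TMC's, and
real nanocode obviously looks nothing like this:

  #include <stdint.h>

  #define ADDR_BITS 12      /* illustrative local address width */

  struct pe {
      uint8_t  mem[1 << ADDR_BITS];  /* 1 bit per entry, modeled as a byte */
      uint16_t ind_reg;              /* the new shift register             */
  };

  /* One nanocycle, driven by the broadcast nanoinstruction:
     use_ind    -- new pin: take the address from ind_reg
     shift_in   -- new pin: shift the ALU output into ind_reg
     bcast_addr -- "Address A" coming over the instruction pins */
  uint8_t pe_step(struct pe *p, int use_ind, int shift_in,
                  uint16_t bcast_addr, uint8_t alu_out)
  {
      uint16_t addr = use_ind ? p->ind_reg : bcast_addr;
      uint8_t  bit  = p->mem[addr];    /* memory operand for the ALU */

      if (shift_in)                    /* build up the next address  */
          p->ind_reg = (uint16_t)((p->ind_reg << 1) | (alu_out & 1));

      return bit;
  }

(Shifting the address bits in at one per nanocycle, plus a cycle
to use the result, is roughly where a figure like 13 comes from.)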

I am sure that a similar thing could be done to other SIMD
architectures.  If anybody thinks that indirect addressing
is not worth a register and a couple of new nanopins, I'd
like to hear why.

:James

P.S. If anyone wants to use this idea :-) it's free -- I think
patents are morally wrong.
-- 

:James P. Salsman (jps@CAT.CMU.EDU)

prins@prins.cs.unc.edu (Jan Prins) (07/03/89)

In article <5886@hubcap.clemson.edu>, jps@cat.cmu.edu (James Salsman) writes:

> I really hate having to deal with indirect addressing on
> most SIMD machines.  I wish someone would build a SIMD array
> using PE's with address buffers.   [...]

Early proposals for SIMD parallel computers included indirect
addressing.  But when you build a massively parallel processor, the
wiring to bring 65K (or however many) individual addresses out to the
memories from the PEs is daunting.

> The way I would hack an address buffer into the CM is by
> adding a shift register to each PE.
> 
>   (1) Add a new nanoinstruction pin [or two] that selects
>       memory input to the ALU between "Address A [or B]"
>       from the instruction pins and the contents of the
>       indirection register.
> 
>   (2) Add a new nanoinstruction pin that causes the output
>       from the ALU to be shifted into the indirection register.
> 
> That's all there is to it.  Two or three new pins, a shift
> register, and PE memory indirection takes 13 nanocycles
> instead of a zillion.

Where would this shift register reside?  If it is on chip with the PEs,
then you suddenly have a lot of extra address lines to bring off chip
-- with 16 PEs and 16 bits of local addressing, that amounts to 256
extra wires!  If the register is off-chip, say in the memory, then you
can fill it bit-serially without extra wires, but you need logic to use
it as an address.  I was under the impression that TMC used standard
memory parts (or was that only for the CM-1?), so the latter approach
would be extremely cumbersome in that setting.
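
(Trivial, but just to put numbers on it -- a throwaway C
calculation.  Only the 16 x 16 case is the one discussed above;
the other configurations are guesses for comparison.)

  #include <stdio.h>

  /* Extra address pins needed if every PE on a chip drives its
     own full local address off-chip. */
  int main(void)
  {
      struct { int pes_per_chip, addr_bits; } cfg[] = {
          { 16, 16 },    /* the case above: 256 extra wires */
          { 16, 12 },
          { 32, 16 },
      };
      for (int i = 0; i < 3; i++)
          printf("%2d PEs x %2d address bits = %3d extra pins per chip\n",
                 cfg[i].pes_per_chip, cfg[i].addr_bits,
                 cfg[i].pes_per_chip * cfg[i].addr_bits);
      return 0;
  }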
 
> I am sure that a similar thing could be done to other SIMD
> architectures.  [...]

There are examples of massively-parallel SIMD architectures that
support indirect addressing.  One of them is BLITZEN, an extension of
MPP that permits the contents of the PE shift register to be used as a
local modification to the global address.  The wiring problem is solved
by placing PEs and memories on the same chip, although this approach
limits the size of local memory so that very fast I/O to external
memory is required.  The current BLITZEN design places 128 PEs, each
with 1K of local memory, per chip.
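
I don't have the bit-level details at hand, so take this as a
rough C model of the effect rather than of what the chip actually
does -- each PE combines the broadcast address with its own
register before the local memory access:

  /* BLITZEN-style local address modification, roughly modeled.
     Whether the real hardware ORs, adds, or substitutes bits I
     won't swear to; this just shows the effect. */
  #define N_PES  128        /* PEs per chip in the current design */
  #define MEM_SZ 1024       /* 1K of local memory per PE          */

  unsigned      shift_reg[N_PES];       /* per-PE shift register  */
  unsigned char mem[N_PES][MEM_SZ];
  unsigned char acc[N_PES];

  void load_local_mod(unsigned global_addr)
  {
      for (int pe = 0; pe < N_PES; pe++) {
          unsigned eff = (global_addr + shift_reg[pe]) % MEM_SZ;
          acc[pe] = mem[pe][eff];   /* each PE sees its own address */
      }
  }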


> :James P. Salsman (jps@CAT.CMU.EDU)

Jan Prins (prins@cs.unc.edu)
Dept. of Computer Science 
UNC - Chapel Hill

Blevins, Davis, Heaton, Reif, "BLITZEN: A Highly Integrated Massively
  Parallel Machine", Frontiers Mass. Par. Comp., 1988.

Heaton, Blevins, "BLITZEN: A VLSI Array Processing Chip", IEEE CICC, 1989.