[comp.arch] vectors on cheaper machines

jmd@granite.dec.com (John Danskin) (03/05/88)

Hi.

I have some thoughts about vector architectures on mainframe
machines  that  I  thought  I  would  expose to the friendly
breezes of the network. In this article, I make some  fairly
terrifying  assumptions about 'vector architectures by main-
stream computer manufacturers'. Feel free to correct  me  if
any are wrong.

Recently there has been  a  trend  for  mainstream  computer
manufacturers    to    put    vector    units    on    their
mainframes/workstations to try to edge into  the  scientific
market now held by Cray/CDC etc. Often, the vector instruction
sets are more or less copied from the Cray instruction set, and
feature a small number (8 or 16) of relatively long (64-element)
vector registers.
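
To keep the later arithmetic concrete, here is the register
model I have in mind, as a C sketch. The names and numbers
(NREGS, VLEN, vec_unit) are mine and purely illustrative, not
any vendor's actual layout.

    /* Hypothetical Cray-style vector register file: a few long
     * registers plus a mask for conditional element updates. */
    #define NREGS 8                    /* vector registers      */
    #define VLEN  64                   /* elements per register */

    typedef struct {
        double        v[NREGS][VLEN];  /* register elements     */
        unsigned char mask[VLEN];      /* per-element enable    */
        int           vl;              /* active vector length  */
    } vec_unit;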

Some problems vectorize naturally into long vector operations.
For a second, larger class of problems, a different approach is
helpful:

If your problem is parallel enough that it could be profitably
attacked by an N-processor MIMD machine, and you have enough
mask registers, then you can simulate the N MIMD processors
with one vector unit whose registers are N elements long. Of
course there is a performance hit due to masked-out cycles on
the vector unit, but for many applications with significant
parallelism this is unimportant when held up against the cost
advantages (better code density, one sequencer, deep
pipelining) that a SIMD vector unit has over a comparable MIMD
unit.
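
A minimal C sketch of the masking trick, against the
hypothetical vec_unit model above (masked_step and the two-pass
encoding are my own invention). Each element plays one "virtual
processor", and a data-dependent branch turns into two masked
sweeps; the second sweep is exactly where the masked-out cycles
go.

    /* Per-element branch "out = (a > 0) ? a + b : a - b" run
     * on one SIMD unit: the unit spends a slot per element on
     * each arm whether the element is enabled or not. */
    void masked_step(vec_unit *vu, const double a[],
                     const double b[], double out[])
    {
        for (int i = 0; i < vu->vl; i++)    /* compute mask */
            vu->mask[i] = (a[i] > 0.0);

        for (int i = 0; i < vu->vl; i++)    /* then-arm     */
            if (vu->mask[i])  out[i] = a[i] + b[i];

        for (int i = 0; i < vu->vl; i++)    /* else-arm     */
            if (!vu->mask[i]) out[i] = a[i] - b[i];
    }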

Of course, the number of processing elements is usually much
less than the vector length. This stems both from the need for
deep pipelines to achieve peak vector performance and from the
need to conserve $$ when building these things.
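
A toy cost model makes the pipelining half of this concrete (my
own illustrative formula and numbers, not measured data).

    /* Cycles for one n-element vector op on `lanes` processing
     * elements behind a pipe `depth` stages deep. With depth=10
     * and lanes=2: n=64 costs 42 cycles (0.66/element) while
     * n=8 costs 14 (1.75/element) -- long registers amortize
     * the pipeline fill even when lanes << n. */
    int vec_op_cycles(int n, int lanes, int depth)
    {
        return depth + (n + lanes - 1) / lanes;  /* ceil(n/lanes) */
    }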

Now, the problem with this style of programming is that the
memory traffic generated by the vector unit grows in proportion
to the performance increase. A vector unit with 16 64-element
registers, running as fast as 8 scalar machines each with 16
registers, generates as much memory traffic as those 8 machines
combined. This makes normal caches pretty much useless, and in
fact it can't be done without going to highly interleaved
memory ---> the Cray solution.
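
A back-of-envelope sizing sketch, with invented numbers, shows
why the interleaving has to be deep.

    /* If the vector unit issues refs_per_cycle memory references
     * every cycle and a bank is busy bank_busy_cycles per access,
     * you need at least their product in independent banks just
     * to keep up. E.g. 4 refs/cycle * 16-cycle banks -> 64-way
     * interleave: the memory system Cray pays for. */
    int banks_needed(int refs_per_cycle, int bank_busy_cycles)
    {
        return refs_per_cycle * bank_busy_cycles;
    }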

Unfortunately for  the  aforementioned  mainstream  computer
manufacturers,  highly interleaved memory is expensive, very
fast busses are expensive, and it costs money to design  all
of  this  stuff when it isn't really your design center. So,
what can be done without building a Cray?

Well, most of these vector units are 'low performance' units;
i.e., their registers don't really need to be 64 elements long
to deliver their performance. So why not use more, shorter
registers, say 64 registers of 16 elements each? We know that
machines with 64 registers generate far fewer memory references
than machines with 16 registers, so for an inexpensive
architecture (<$17M) it seems that shorter registers would be a
win.
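
The register-count argument is the familiar blocking one. Here
is a C sketch of the effect, with scalar code standing in for
vector code and a block size picked purely for illustration:
the 16 partial sums stay resident across the whole k loop, so
loads per multiply-add drop from 2 to 1/2.

    /* 4x4 register-blocked matrix multiply (assumes n % 4 == 0).
     * acc[][] stands in for 16 live registers: with too few
     * registers it spills to memory every iteration, with
     * enough it stays put. */
    void mm_blocked(int n, const double *A, const double *B,
                    double *C)
    {
        for (int i = 0; i < n; i += 4)
            for (int j = 0; j < n; j += 4) {
                double acc[4][4] = {{0.0}};
                for (int k = 0; k < n; k++)
                    for (int bi = 0; bi < 4; bi++)
                        for (int bj = 0; bj < 4; bj++)
                            acc[bi][bj] +=
                                A[(i+bi)*n + k] * B[k*n + (j+bj)];
                for (int bi = 0; bi < 4; bi++)
                    for (int bj = 0; bj < 4; bj++)
                        C[(i+bi)*n + (j+bj)] = acc[bi][bj];
            }
    }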

An alternative design is to allow unaligned vector instructions
on a large homogeneous register file. Then the
programmer/compiler can decide how best to allocate registers
and minimize memory references. This appears to be the tack
adopted by Ardent.
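
Here is how I picture that scheme, as a C sketch; the
(base, length) descriptor encoding is my own guess for
illustration, not Ardent's actual instruction set.

    /* One big homogeneous element file; a vector operand is a
     * (base, length) window into it, so the compiler can carve
     * out whatever mix of long and short vectors a loop needs. */
    #define FILE_ELEMS 1024
    static double rf[FILE_ELEMS];            /* the element file */

    typedef struct { int base, len; } vreg;  /* operand window   */

    void vadd(vreg d, vreg s1, vreg s2)      /* d = s1 + s2      */
    {
        for (int i = 0; i < d.len; i++)
            rf[d.base + i] = rf[s1.base + i] + rf[s2.base + i];
    }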

Can anybody discuss the tradeoffs I've mentioned in a more
informed manner? I would really appreciate some deeper insights
into these problems.

-- 
John Danskin				| decwrl!jmd
DEC Workstation Systems Engineering	| (415)853-6724 
100 Hamilton Avenue			| My comments are my own.
Palo Alto, CA  94306			| I do not speak for DEC.