[comp.arch] Vectorizing division in a do loop

gsg0384@uxa.cso.uiuc.edu (08/17/89)

Hi,

I've heard that vector machines are more expensive than multi-cpu parallel
machines.  I've got two questions about vector machines.

1.  For compiler design, I think vector machine architecture is easier.
    Is this true?

2. Our machine is Ardent Titan.  Each cpu has 64-register length vector
   registers.  The problem is that this machine does not vectorize do loops
   with division.  How much harder is to implement division than the other
   three operations, + - x?  Is this a hardware limitation?

I'd rather have a vector machine than a multi-cpu parallel machine for 
my application.  I just want more people in computer industries to pay
more attention to vector machines.

Hugh

mccalpin@masig3.ocean.fsu.edu (John D. McCalpin) (08/18/89)

In the above referenced message, Hugh ??? <gsg0384@uxa.cso.uiuc.edu> writes:

>I've heard that vector machines are more expensive than multi-cpu parallel
>machines.  I've got two questions about vector machines.

    There is no reason, in principle, why this should be true.
    However, almost all of the machines designed around vector pipelines
    are quite expensive, while there are a fair number of parallel
    boxes becoming available.  
    This may change if Ardent and/or Stellar make any headway in the
    workstation market.

>1.  For compiler design, I think vector machine architecture is easier.
>    Is this true?

    For operations that map onto the vector hardware, this is likely to
    be true.  There is also considerably more experience in the industry
    at vectorizing scientific codes than parallelizing them.  One of
    the reasons for this is that "vectorizing" means basically one thing,
    while "parallelizing" means many different things depending on the
    model (MIND or SIMD), the size of the pieces, the communication
    and synchronization options, etc.....

>2. Our machine is Ardent Titan.  Each cpu has 64-register length vector
>   registers.  The problem is that this machine does not vectorize do loops
>   with division.  How much harder is to implement division than the other
>   three operations, + - x?  Is this a hardware limitation?

    The Ardent Titan uses the Weitek 2264/2265 pipelined floating-point
    adder and multiplier chips.  Divide is not implemented in the hardware,
    but is iterated using one or both of the chips (I don't remember 
    which way Ardent does it).
    Pipelined divide units are not likely to be found on anything in
    the Titan's price range.

>I'd rather have a vector machine than a multi-cpu parallel machine for 
>my application.  I just want more people in computer industries to pay
>more attention to vector machines.

    I'll second that motion!

--
John D. McCalpin - mccalpin@masig1.ocean.fsu.edu - mccalpin@nu.cs.fsu.edu
		   mccalpin@delocn.udel.edu

frazier@oahu.cs.ucla.edu (Greg Frazier) (08/18/89)

In article <MCCALPIN.89Aug17175558@masig3.ocean.fsu.edu> mccalpin@masig3.ocean.fsu.edu (John D. McCalpin) writes:
+>In the above referenced message, Hugh ??? <gsg0384@uxa.cso.uiuc.edu> writes:
+>>I've heard that vector machines are more expensive than multi-cpu parallel
+>>machines.  I've got two questions about vector machines.
+>
+>    There is no reason, in principle, why this should be true.

I think the reason behind this is that multicomputers can
use off-the-shelf processors in their processing nodes.
Vector machines, on the other hand, require custom parts.
Greg Frazier	    o	Internet: frazier@CS.UCLA.EDU
CS dept., UCLA	   /\	UUCP: ...!{ucbvax,rutgers}!ucla-cs!frazier
	       ----^/----
		   /

slackey@bbn.com (Stan Lackey) (08/18/89)

In article <112400003@uxa.cso.uiuc.edu> gsg0384@uxa.cso.uiuc.edu writes:
>I've heard that vector machines are more expensive than multi-cpu parallel
>machines.  I've got two questions about vector machines.

It depends upon what you're comparing.  A Cray is more expensive than
a Sequent.  I know of vector machines that are cheaper than simple
single-processor scalar machines.  You statement may be a result of a
perception that "for a fixed MIPS, parallel processing is cheaper than
a single processor."  Which is a probably a valid statement, if your
application gets anywhere near linear speedup.

>1.  For compiler design, I think vector machine architecture is easier.
>    Is this true?

Today your statement is true.  The reason is that vector machines have
been around much much longer than parallel machines (in an industrial
sense) and there has been many more manyears of development put into
automatic vectorization.  Now that usable parallel machines (i.e,
shared global memory) exist, my guess is that compilation technology
for them will catch up to exploit the massive performance capacity.

>2. Our machine is Ardent Titan.  Each cpu has 64-register length vector
>   registers.  The problem is that this machine does not vectorize do loops
>   with division.  How much harder is to implement division than the other
>   three operations, + - x?  Is this a hardware limitation?

I am guessing, but it sounds like there is no vector divide instruction.
If their divider is a successive approximation type, especially if it
uses the multiplier, which is likely, it cannot be pipelined anyway.
BTW the Alliant has a standalone divider gate array.  Only one is used 
in scalar instructions.  But for vectors there are two of them, which
operate on alternate operand pairs.

>I'd rather have a vector machine than a multi-cpu parallel machine for 
>my application.  I just want more people in computer industries to pay
>more attention to vector machines.

For a vector machine you'll need to go to the minisupercomputer guys,
I'm afraid.  It looks like the suppliers of single-chip micros are not
to the point where they are (willing or able?) to include them.  Intel
has a first step with the dual instruction issue capability if the
i860, but they're not there yet; also their data cache is far too
small for it to be effective.
:-) Stan

mkraiesk@midas.UUCP (Mark Kraieski) (08/25/89)

in article <112400003@uxa.cso.uiuc.edu>, gsg0384@uxa.cso.uiuc.edu says:
> Nf-ID: #N:uxa.cso.uiuc.edu:112400003:000:698
> Nf-From: uxa.cso.uiuc.edu!gsg0384    Aug 16 21:38:00 1989
> 
> 
> 
> Hi,
> 
> I've heard that vector machines are more expensive than multi-cpu parallel
> machines.  I've got two questions about vector machines.

I was involved in the compiler work for Gould's NP1 vector computer and
would like to shed some light on these questions.

> 
> 1.  For compiler design, I think vector machine architecture is easier.
>     Is this true?

Not that not all code can be vectorized, in fact, vectorization is just
a subset of all parallel optimizations.  A vector compiler must handle
all the normal scalar operations PLUS the vector operations.  Also, unless
the vectorization is syntax driven, a vectorizer (such as those from PSR
and KAI) must be used to determine when it is safe to vectorize.

> 
> 2. Our machine is Ardent Titan.  Each cpu has 64-register length vector
>    registers.  The problem is that this machine does not vectorize do loops
>    with division.  How much harder is to implement division than the other
>    three operations, + - x?  Is this a hardware limitation?

I am not familiar with Ardent's box but of the 4 operations, division
takes the most time.  Our original architecture used a recipricol/multiply
instead of divide since we could chain the operations and do it faster.
My guess is that the gates required to do a fast divide were not available
on the Titan so they just punted.  But note that a compiler could
vectorize the portion of the expression that doesn't do division for a
potential speed up.  For example, given the array operation
	A = (B+C+D) / E
we can make 2 loops, one vector and one scalar:
        A = B+C+D
        A = A/E
Depending on the vector loop overhead and the number of elements (and
size of cache if we want to get technical), the partially vectorized
loop can outperform an unvectorized version.  Note that vector operations
are not faster than scalar - a multiply takes the same in either case.
But for scalar code we have to execute multiple instructions to set us
up for the next operation while vector code has just one instruction
fetch and no branches (if the array is less than the vector length).

> 
> I'd rather have a vector machine than a multi-cpu parallel machine for 
> my application.  I just want more people in computer industries to pay
> more attention to vector machines.
> 
> Hugh

I'd rather have a multi-cpu system where each cpu has vector capabilities
so my fine grain parallelism is done within a node but other non-vector-
izable code could still benefit from multiple processors.

                                | "Far better it is to dare mighty things, to
Mark E. Kraieski                | win glorious triumphs, even though checkered
Encore Computer Corp., MS#404   | by failure, than to take rank with those
Computer Performance Evaluation | poor spirits who neither enjoy much nor
6901 West Sunrise Blvd.         | suffer much, because they live in the gray
Ft. Lauderdale, FL 33313-4499   | twilight that knows not victory or defeat."
(305) 587-2900                  |                       -- Theodore Roosevelt