[comp.parallel] Virtual processor ratio

rjc@cs.ucla.edu (Robert Collins) (03/16/90)

In article <8374@hubcap.clemson.edu> bcsaic!carroll@beaver.cs.washington.edu (Jeff Carroll) writes:
>
>	Since my friendly neighborhood "experienced CM hacker" is not
>currently available for consultation, I'm asking the net at large for a
>clarification of something that puzzles me. (I'm not a CM hacker, but we
>did take a close look at a CM when we went computer shopping some time
>ago.)
>	I've read a couple of statements now to the effect that
>efficiency improves as virtual processor ratio increases. This seems
>counterintuitive, as I would expect each virtual processor to add
>(superlinearly) to the overhead of "processor management".
>	Would someone please explain?

If your front-end (scalar) processing time exceeds the parallel processing
time during a given sequence, then going to a higher VP ratio (more work
for the CM for a given amount of front-end work) gives you better than
linear speedup.  Unless you have a very fast front end, you can get into
a situation like this at a VP ratio of 1.  I have long code sequences that
will keep our CM only 75% busy even when driven by a 16-MIPS Sun4/330
(and only cranking out Paris calls--no scalar computation), although
typically this is not a big problem with a fast front end.
Things were really ugly when we had a VAX 8350 front end (somewhere
around 1 MIPS for a CM process).  In general, you want the bottleneck of
the computation to be the CM.  Get the fastest front end you can.
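
To make that concrete, here is a back-of-the-envelope model (the numbers
are invented for illustration, not measured Paris timings): each Paris
call costs a fixed front-end time to issue, the CM side of the call grows
with the VP ratio, and the FIFO lets the two overlap.

/* Back-of-the-envelope model of CM utilization vs. VP ratio.
 * All constants are hypothetical, for illustration only.              */
#include <stdio.h>

int main(void)
{
    double t_fe = 10.0;        /* front-end time to issue one Paris call (made up) */
    double t_cm_per_vp = 3.0;  /* CM time per call per unit of VP ratio (made up)  */
    int vpr;

    printf(" VPR   CM busy   time/element\n");
    for (vpr = 1; vpr <= 32; vpr *= 2) {
        double t_cm = t_cm_per_vp * vpr;             /* CM work grows with the VP ratio  */
        double t_call = (t_fe > t_cm) ? t_fe : t_cm; /* FIFO overlaps front end and CM   */
        double busy = t_cm / t_call;                 /* fraction of time the CM is busy  */
        double per_elem = t_call / vpr;              /* wall-clock time per data element */
        printf("%4d   %7.2f   %12.3f\n", vpr, busy, per_elem);
    }
    return 0;
}

Once the CM time per call passes the front-end time, the CM stays fully
busy and the wall-clock time per data element keeps dropping, which is
where the better-than-linear behavior comes from.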

Now, assuming that you have the CM running full out, you can get better
than linear speedup by increasing the VP ratio for some operations--in
particular, floating point (with the floating-point accelerators).  The
more operations you push through the pipeline, the less overhead you
have per op for filling and emptying the pipeline.  Other operations
scale worse than linearly (for example, routing at certain VP ratios).
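
The pipeline effect by itself is easy to model (the constants below are
invented for illustration, not measured FPA numbers): a fixed fill/drain
cost gets spread over however many elements the VP ratio pushes through.

/* Pipeline amortization: a fixed fill/drain cost spread over VPR elements.
 * The constants are invented for illustration, not FPA measurements.       */
#include <stdio.h>

int main(void)
{
    double fill = 20.0;      /* cycles to fill and drain the pipeline (made up) */
    double per_elem = 2.0;   /* cycles per element once the pipeline is full    */
    int vpr;

    for (vpr = 1; vpr <= 64; vpr *= 2) {
        double total = fill + per_elem * vpr;
        printf("VPR %2d: %6.1f cycles, %5.2f cycles/element\n",
               vpr, total, total / vpr);
    }
    return 0;
}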

As far as I know, there is no real "processor management" overhead that
increases with VP ratio.  I think you just loop that many more times in
the sequencer (or wherever the VP stuff is handled).

Hope this helps clear up the confusion,
rob


-------------------------------------------------------------------------------
rjc@cs.ucla.edu	                    C++/Paris on the CM2:  The only way to fly.
-------------------------------------------------------------------------------

barmar@Think.COM (Barry Margolin) (03/16/90)

In article <8374@hubcap.clemson.edu> bcsaic!carroll@beaver.cs.washington.edu (Jeff Carroll) writes:
>	I've read a couple of statements now to the effect that
>efficiency improves as virtual processor ratio increases. This seems
>counterintuitive, as I would expect each virtual processor to add
>(superlinearly) to the overhead of "processor management".
>	Would someone please explain?

The efficiency gain is because there is some per-instruction overhead that
is amortized across the virtual processors.  The comparison must be made
between ways of implementing a program that operates on more data elements
than there are PEs.  If the size of your application is larger than the
number of physical processors then there are two ways you can choose to
implement it: 1) iterate in the application code, or 2) use virtual
processors (I'm ignoring CMIS for this discussion).  In the former case
each iteration involves executing front-end code, pushing PARIS
instructions down the FIFO, and decoding the PARIS into CM microcode.  With
VP's, however, these steps are done once; the iteration occurs in the CM
microcontroller after the instruction has been decoded.  At high VP ratios
this overhead amortization can really add up.
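
As a sketch of the two alternatives (the cost names and numbers are mine,
chosen only to show the shape of the tradeoff, not real timings):
iterating in the front-end code pays the issue-and-decode cost on every
pass, while the VP version pays it once and lets the microcontroller loop.

/* Sketch comparing "iterate in the application" with "let the VPs iterate".
 * The cost constants are hypothetical, chosen to show the shape of the
 * tradeoff, not to match real PARIS or sequencer timings.                   */
#include <stdio.h>

int main(void)
{
    double t_frontend = 8.0;  /* front-end work to build one PARIS call            */
    double t_fifo     = 2.0;  /* pushing the instruction down the FIFO             */
    double t_decode   = 5.0;  /* decoding the PARIS into CM microcode              */
    double t_exec     = 3.0;  /* executing one physical-PE-sized slice of the work */
    int    k          = 16;   /* data set is k times larger than the machine       */

    /* 1) iterate in application code: pay everything k times */
    double t_loop = k * (t_frontend + t_fifo + t_decode + t_exec);

    /* 2) virtual processors: issue and decode once, the microcontroller loops k times */
    double t_vp = t_frontend + t_fifo + t_decode + k * t_exec;

    printf("application loop: %6.1f\n", t_loop);
    printf("VP ratio %2d:      %6.1f\n", k, t_vp);
    return 0;
}
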
--
Barry Margolin, Thinking Machines Corp.

barmar@think.com
{uunet,harvard}!think!barmar

rrt@romeo.cs.duke.edu (Russell R. Tuck) (03/16/90)

In article <8374@hubcap.clemson.edu> bcsaic!carroll@beaver.cs.washington.edu (Jeff Carroll) writes:
>	I've read a couple of statements now to the effect that
>efficiency improves as virtual processor ratio increases. This seems
>counterintuitive, as I would expect each virtual processor to add
>(superlinearly) to the overhead of "processor management".

There are two unavoidable costs to virtual processors that I know of.
(1) Physical PE memory is shared by virtual processors, so memory available
    per virtual processor is inversely proportional to VPR.
(2) Each operation has to be done VPR times, so it takes VPR times as long.

There are also two kinds of processing overhead that I know of.
(a) Any PE addresses have to be adjusted for each repetition of the operation.
    This is just adding a constant, I think.
(b) Every operation, regardless of VPR, has to be sent from host to sequencer.
Note that all the overhead is linear (1,2,a) or constant (b) in VPR.

Here's the explanation for improved efficiency at high VPR as I understand it.
Host to sequencer overhead dominates execution time for operations with VPR=1.
The sequencer handles (2) and (a), so (b) is invariant (or nearly so) with
respect to VPR.  (1) affects whether programs run out of PE memory, but not
the speed of those that don't.  So larger VPRs get to amortize (b) over more 
work, and so make better use of the available PE cycles.
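
In symbols (my notation, with invented letters): write w for the useful
work per virtual processor per operation, a for the per-repetition address
adjustment (a), and b for the host-to-sequencer cost (b).  Then, roughly,

\[
T(\mathrm{VPR}) = b + \mathrm{VPR}\,(w + a),
\qquad
E(\mathrm{VPR}) = \frac{\mathrm{VPR}\cdot w}{b + \mathrm{VPR}\,(w + a)}
\;\longrightarrow\; \frac{w}{w + a}
\quad \text{as } \mathrm{VPR} \to \infty,
\]

so efficiency climbs toward an asymptote set by (a) alone, and (b) only
matters when the VPR is small.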
 
PS- I over-generalized when I said recently that MIMD folks blithely assume
    constant-time parallel memory access.  Some do, but many others don't, as
    a friend working on the BBN Butterfly here at Duke reminded me.  Sorry.
 
Russ Tuck                              rrt@cs.duke.edu
Computer Science Department            ...!{decvax|mcnc}!duke!rrt
Duke University                        
Durham, NC 27706, USA                  (919) 660-6527

art@cs.bu.edu (Al Thompson) (03/16/90)

In article <8374@hubcap.clemson.edu> bcsaic!carroll@beaver.cs.washington.edu (Jeff Carroll) writes:
|
|	Since my friendly neighborhood "experienced CM hacker" is not
|currently available for consultation, I'm asking the net at large for a
|clarification of something that puzzles me. (I'm not a CM hacker, but we
|did take a close look at a CM when we went computer shopping some time
|ago.)
|	I've read a couple of statements now to the effect that
|efficiency improves as virtual processor ratio increases. This seems
|counterintuitive, as I would expect each virtual processor to add
|(superlinearly) to the overhead of "processor management".
|	Would someone please explain?

That is in fact a correct statement.  What happens is that the VPs are
really abstractions.  Each physical PE has its memory segmented in such a
way that each VP uses a different address base.  Thus each physical PE
behaves as if it were doing the work of several PEs.  The gain in
efficiency comes when you realize that the instructions need only be
broadcast once and decoded once for the physical processors.  That means
that when you are using VPs you only get charged once for this overhead.
From that comes the statement that VPs are "more efficient" than physical
processors.  So, the more VPs you have, the better the overall gain.
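
A caricature of the addressing in C (the layout and names here are
invented to illustrate the idea, not the actual microcode scheme): each
physical PE's memory is cut into VPR equal slabs, and the same field in
virtual processor k lives at the field's offset plus k times the slab
size.

/* Caricature of VP memory segmentation within one physical PE.  The
 * layout and names are invented to illustrate the idea, not the real
 * microcode scheme.                                                    */
#include <stdio.h>

#define PE_MEM_BITS 65536u   /* bit-addressable memory per physical PE (CM-2-ish figure) */

/* Address of a field for virtual processor vp, given the per-VP slab
 * size and the field's offset within a slab.                           */
static unsigned vp_field_addr(unsigned vp, unsigned slab_bits, unsigned field_offset)
{
    return vp * slab_bits + field_offset;
}

int main(void)
{
    unsigned vpr = 8;                    /* virtual processor ratio               */
    unsigned slab = PE_MEM_BITS / vpr;   /* each VP gets 1/VPR of the PE's memory */
    unsigned vp;

    for (vp = 0; vp < vpr; vp++)         /* the PE does each VP's work in turn    */
        printf("VP %u: field at bit address %u\n",
               vp, vp_field_addr(vp, slab, 128));
    return 0;
}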

billo@nova.npac.syr.edu (Bill O) (03/17/90)

In article <8374@hubcap.clemson.edu> bcsaic!carroll@beaver.cs.washington.edu (Jeff Carroll) writes:
>

>	I've read a couple of statements now to the effect that
>efficiency improves as virtual processor ratio increases. This seems
>counterintuitive, as I would expect each virtual processor to add
>(superlinearly) to the overhead of "processor management".

Other respondents have mentioned that higher VP ratios usually lead
to higher CM utilization figures and reduced CM instruction overhead.
It should also be mentioned that for many CM instructions, especially
floating point, higher VP ratios mean more opportunities to keep the
CM's internal pipelines full.  Fuller pipelines mean more computation
per second.

-- 
Bill O'Farrell, Northeast Parallel Architectures Center at Syracuse University
(billo@nova.npac.syr.edu)

slim@cis.ohio-state.edu (Scott Whitman) (03/19/90)

Regarding the virtual processor comments: right now I am working on
theorizing the nearly optimal ratio of virtual processors to real
processors for a particular application.  In this case, I am working on a
BBN Butterfly and the virtual processors are really just tasks.  In any
case, I am trying to come up with a theory for determining a priori what
a nearly optimal ratio would be, based on the size of the input data and
the number of processors one is working with (i.e., the best ratio may
change depending on how many processors are assigned to the problem).

In general, I believe that an increase in this ratio is worthwhile in order
to reduce the end-effects problem.  However, a point of no return is
reached when the overhead of starting additional tasks exceeds the gains
from reducing the end effects.  The nearly optimal ratio lies just before
that point.

Anyway, we are trying to determine a good heuristic for choosing this
ratio.  Right now I am basing it more or less on the input data size, the
overhead of starting a new task, and the fact that the dataset grows
somewhat (through duplication) as the ratio increases.

The problem is that I don't believe this model is very accurate.  For two
inputs of the same size, the times will not necessarily be nearly the
same.  It is roughly true, though, that a larger input size results in a
longer overall time, and our model bases its heuristic on that fact.
Obviously the machine parameters also make a great deal of difference,
but for the moment we are ignoring them.
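
For what it's worth, the kind of toy model I have in mind looks like this
(the constants are placeholders, not measured Butterfly numbers): charge
each extra task its startup cost, credit it with a smaller end-effect
term, and scan ratios for the minimum.

/* Toy model for picking a task-to-processor ratio: per-task startup cost
 * versus the end effect (load imbalance) of having few, large tasks.
 * All constants are placeholders, not measured Butterfly numbers.         */
#include <stdio.h>

int main(void)
{
    double n_items = 100000.0;  /* input data size                      */
    double t_item  = 1.0;       /* work per data item (arbitrary units) */
    double t_start = 500.0;     /* overhead of starting one task        */
    int    p       = 32;        /* processors assigned to the problem   */
    int    r, best_r = 1;
    double best_t = 1e30;

    for (r = 1; r <= 64; r++) {
        double ideal   = (n_items / p) * t_item;       /* perfectly balanced work        */
        double end_eff = (n_items / (r * p)) * t_item; /* ~one task's worth of idle time */
        double total   = ideal + end_eff + r * t_start;
        if (total < best_t) { best_t = total; best_r = r; }
    }
    printf("nearly optimal ratio ~ %d  (model time %.0f)\n", best_r, best_t);
    return 0;
}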

What I'd like to know is if anyone has any comments regarding this strategy
and also if there are any papers on this subject matter.

		Advthanksance,


		-Scott Whitman


--
 Scott Whitman, Department of Computer and Information Science
 slim@cis.ohio-state.edu    or   ...!{pyramid,killer}!osu-cis!slim
The Ohio State University; 2036 Neil Ave. Columbus OH USA 43210-1277