[comp.parallel] Superlinear

kale@m.cs.uiuc.edu (L. V. Kale') (12/06/88)

Here is one more way / reason that leads to the famed superlinear speedups.

We observed superlinear speedups on some simple divide-and-conquer
programs running on our "Chare Kernel" system (**) on iPSC/2 hypercubes.
Further investigation showed the cause to be memory usage.
In particular, the memory allocation scheme we used has the property that
its performance tends to degrade somewhat as the number of allocations
(with subsequent deallocations) grows: one has to search longer for a
block of the right size because of fragmentation.
With an increasing number of processors, the number of allocations done
on each processor decreases, which reduces the time spent in memory
management.
As our dynamic load balancing scheme is very good :-) we get the
computation nicely distributed, and presto: superlinear speedup.
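
To make the effect concrete, here is a toy first-fit search (my own
sketch, not the chare kernel's actual allocator): the cost of satisfying
a request is the length of the walk down the free list, and the free
list is longest when one processor does all the allocations.

/* Toy sketch (not the chare kernel's allocator) of why allocation cost
 * grows with the number of outstanding allocations: a first-fit search
 * walks the free list, and a more fragmented list means a longer walk. */
#include <stdio.h>
#include <stdlib.h>

struct block { size_t size; struct block *next; };

/* First-fit: walk the free list until a big-enough block turns up.
 * Returns the number of blocks examined, the cost we care about here. */
static int first_fit(struct block *free_list, size_t want)
{
    int examined = 0;
    for (struct block *b = free_list; b != NULL; b = b->next) {
        examined++;
        if (b->size >= want)
            break;
    }
    return examined;
}

int main(void)
{
    /* Build free lists of increasing length out of small fragments
     * (16..112 bytes), then ask for 128 bytes: nothing fits, so the
     * whole list is walked before falling back to fresh memory. */
    for (int nblocks = 64; nblocks <= 4096; nblocks *= 8) {
        struct block *list = NULL;
        for (int i = 0; i < nblocks; i++) {
            struct block *b = malloc(sizeof *b);
            b->size = 16 * (1 + i % 7);
            b->next = list;
            list = b;
        }
        printf("%4d fragments: %d blocks examined for a 128-byte request\n",
               nblocks, first_fit(list, 128));
        while (list) { struct block *n = list->next; free(list); list = n; }
    }
    return 0;
}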

In a sense, this is simply a consequence of having more memory.
But then, that is part of having more processors.
The question is: having paid P times the real dollar cost of one
processor for P processors, can one get a speedup of more than P?
And in this case, the answer is yes...

I won't get into the controversy about whether superlinear speedups are real
or not. Just reporting an observation, for now.

** "The Chare Kernel Language for Parallel Programming: A perspective",
   L.V. Kale and W.W. Shu, TR UIUCDCS-R-88-1451.
(If interested in receiving copies of this and related papers, write
 to me: L.V. Kale, Dept. of Computer Science, University of Illinois,
 1304 W. Springfield Ave., Urbana, IL 61801, or kale@cs.uiuc.edu)

kale@m.cs.uiuc.edu (L. V. Kale') (12/07/88)

Wait a second..
There was a parsing ambiguity in my earlier posting about superlinear
speedups. The Tech Report is not about such speedups, rather about the
chare kernel, a "machine independent parallel programming system".
(More on that later).
The superlinear speedup in specific cases was a very recent
observation while running programs on this system on an Intel
hypercube with 64 processors.
I have no papers documenting the superlinear speedups, yet.
(In fact, I am not sure there will ever be.)

	L.V. Kale (kale@cs.uiuc.edu)

franco%alberta.uucp@RELAY.CS.NET (Franco Carlacci) (12/07/88)

You can also try 

R. Janben, "A note on superlinear speedup",
Parallel Computing 4 (1987) 211-213.

franco

$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$
$                                     ob. quote.                        $
$ Franco Carlacci                       " never spit in the well,       $
$ Dept. of Computing Science              you may have to drink from    $
$ U. of. Alberta                          it one day"                   $
$ franco@alberta.uucp  or alberta!franco                                $
$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$

dfk@romeo.cs.duke.edu (David F. Kotz) (12/07/88)

In article <3755@hubcap.UUCP>, robison@m.cs.uiuc.edu (Arch Robison) writes:
> I'm looking for references (either pro or con) on the following claim:
> 
> 	A p-processor parallel processor can never exhibit speedup
> 	of greater than p.

Try these, all short: 

D. Parkinson, "Parallel Efficiency can be greater than unity",
Parallel Computing 3 (1986) 261-262. 

V. Faber, Olaf M. Lubeck and Andrew B. White, Jr, "Superlinear speedup
of an efficient sequential algorithm is not possible", Parallel
Computing 3 (1986) 259-260. 

"Isomorphic Computers, Ltd", International Journal of Parallel
Programming, Vol 16, No 2, 1987, pp 179-182.

David Kotz
Department of Computer Science, Duke University, Durham, NC 27706
ARPA:	dfk@cs.duke.edu
CSNET:	dfk@duke        
UUCP:	decvax!duke!dfk

mmh@uunet.UU.NET (Matthew Huntbach) (12/09/88)

In article <3755@hubcap.UUCP>, robison@m.cs.uiuc.edu (Arch Robison) writes:
> I'm looking for references (either pro or con) on the following claim:
> 
> 	A p-processor parallel processor can never exhibit speedup
> 	of greater than p.

Depends on the algorithm. In parallel heuristic search superlinear
speedup can happen easily. The reference I have to hand is:

G-J. Li and B. W. Wah, "How to cope with anomalies in parallel
approximate branch-and-bound algorithms", in National Conference on
Artificial Intelligence (AAAI-84), pp. 212-215.

though I think there was something in a recent CACM, possibly by the 
same authors.

As a simple demonstration, suppose you are searching a tree with one branch
which your heuristic tells you is promising, but which after a lot of
computation (say 11 time units) does not yield a result, and another branch 
which your heuristic tells you is less promising but which does in fact yield
a result after a small amount of computation (say 1 time unit).

On a single processor you could search the whole large branch first 
before turning to the small branch and finding your solution: total time
12 units.

With two processors you could assign one branch to each processor. Let us 
assume there is a time delay of one unit shipping the less promising 
branch to a remote processor, and a time delay of another unit getting the
result back. You would still get your result in 3 units of time.

So 2 processors give you a speedup of 4. Q.E.D.
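
Spelling out that arithmetic (a toy calculation with the constants from
the example above, nothing more):

/* The schedule from the example above, spelled out with its constants.
 * This is just the arithmetic of the example, not a real search. */
#include <stdio.h>

int main(void)
{
    const int promising   = 11;  /* big branch: searched fully, no result */
    const int unpromising = 1;   /* small branch: yields the answer       */
    const int ship        = 1;   /* one unit each way to the remote node  */

    /* One processor, depth-first: exhaust the promising branch first,
     * then find the answer in the small branch. */
    int seq = promising + unpromising;

    /* Two processors: P0 grinds away on the promising branch and never
     * produces an answer; P1 receives the small branch after 'ship',
     * searches it, and ships the answer back. */
    int par = ship + unpromising + ship;

    printf("sequential %d, parallel %d, speedup %d\n", seq, par, seq / par);
    return 0;
}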

halldors@paul.rutgers.edu (Magnus M Halldorsson) (12/10/88)

In article <3801@hubcap.UUCP> mcvax!ivax.doc.imperial.ac.uk!mmh@uunet.UU.NET (Matthew Huntbach) writes:

> Depends on the algorithm. In parallel heuristic search superlinear
> speedup can happen easily.
>... 
> As a simple demonstration, suppose you are searching a tree with one branch
> which your heuristic tells you is promising, but which after a lot of
> computation (say 11 time units) does not yield a result, and another branch 
> which your heuristic tells you is less promising but which does in fact yield
> a result after a small amount of computation (say 1 time unit).
> 
> On a single processor you could search the whole large branch first 
> before turning to the small branch and finding your solution: total time
> 12 units.

But you can always simulate the parallel processors by traversing the
tree breadth-first. That will require a queue and extra memory, but no
more than the memory used by all the parallel processors combined.
  For this algorithm, the parallel version would be an example of
AND-parallelism, while for the standard sequential algorithm, it would
be an example of OR-parallelism.
  The moral is, you can only get superlinear speedup if the
corresponding sequential algorithm you're comparing with is itself
suboptimal.
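
A toy sketch of that simulation (illustration only, not anyone's actual
code): the tree is the two-branch example from the earlier posting, and
the peak queue length stands in for the combined memory of the parallel
processors.

/* Simulating the parallel search on one processor by traversing the
 * tree breadth-first: a queue holds the frontier, and its peak length
 * is the extra memory the simulation needs. */
#include <stdio.h>
#include <stdlib.h>

struct node {
    int          is_goal;
    int          nkids;
    struct node *kids[2];
};

static struct node *mk(int is_goal, struct node *a, struct node *b)
{
    struct node *n = malloc(sizeof *n);
    n->is_goal = is_goal;
    n->nkids   = (a != NULL) + (b != NULL);
    n->kids[0] = a;
    n->kids[1] = b;
    return n;
}

int main(void)
{
    /* Deep "promising" branch with no goal, shallow branch with one. */
    struct node *deep = NULL;
    for (int i = 0; i < 11; i++)
        deep = mk(0, deep, NULL);
    struct node *root = mk(0, deep, mk(1, NULL, NULL));

    /* Breadth-first traversal with an explicit queue (plenty big for
     * this toy tree). */
    struct node *queue[64];
    int head = 0, tail = 0, expanded = 0, peak = 0;
    queue[tail++] = root;
    while (head < tail) {
        if (tail - head > peak)
            peak = tail - head;           /* frontier size ~ memory used */
        struct node *n = queue[head++];
        expanded++;
        if (n->is_goal) {
            printf("goal after expanding %d nodes, peak frontier %d\n",
                   expanded, peak);
            return 0;
        }
        for (int k = 0; k < n->nkids; k++)
            queue[tail++] = n->kids[k];
    }
    printf("no goal in the tree\n");
    return 0;
}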

Magnus

vera@portia.stanford.edu (James Vera) (12/10/88)

In article <3801@hubcap.UUCP> mcvax!ivax.doc.imperial.ac.uk!mmh@uunet.UU.NET (Matthew Huntbach) writes:
 >In article <3755@hubcap.UUCP>, robison@m.cs.uiuc.edu (Arch Robison) writes:
 >> I'm looking for references (either pro or con) on the following claim:
 >> 
 >> 	A p-processor parallel processor can never exhibit speedup
 >> 	of greater than p.
 >
 >Depends on the algorithm. In parallel heuristic search superlinear
 >speedup can happen easily. The reference I have to hand is:
 >[...]
 >As a simple demonstration, suppose you are searching a tree with one branch
 >which your heuristic tells you is promising, but which after a lot of
 >computation (say 11 time units) does not yield a result, and another branch 
 >which your heuristic tells you is less promising but which does in fact yield
 >a result after a small amount of computation (say 1 time unit).
 >
 >On a single processor you could search the whole large branch first 
 >before turning to the small branch and finding your solution: total time
 >12 units.
 >
 >With two processors you could assign one branch to each processor. Let us 
 >assume there is a time delay of one unit shipping the less promising 
 >branch to a remote processor, and a time delay of another unit getting the
 >result back. You would still get your result in 3 units of time.
 >
 >So 2 processors give you a speedup of 4. Q.E.D.

One has to be careful not to compare apples to oranges.  Here you are
comparing two different algorithms.  Just because you run the same
program on each of your many processors as you ran on your
uniprocessor does not mean that the global algorithm is the same in
both cases.

In the case described above, the uniprocessor is running an algorithm
that says "exhaustively search a branch until you find an answer or it
fails to yield a result", while the multiprocessor is running an
algorithm that says "search N (the number of processors) branches
simultaneously until you find an answer, replacing branches that fail
to yield a result with new branches."

To realistically compare the performance increase of the multiprocessor
over the uniprocessor you should run the second algorithm on the
uniprocessor.  This could be done by specifying N and looping over N
branches one level at a time.
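
A sketch of that loop (toy code, with each branch reduced to how many
steps it needs and whether it ends in an answer):

/* Running the "parallel" algorithm on one processor: interleave the N
 * branch searches one level at a time and stop as soon as any branch
 * yields an answer. */
#include <stdio.h>

struct branch { int steps_left; int has_answer; };

int main(void)
{
    /* The two branches from the earlier example: 11 fruitless steps
     * versus 1 step that yields the answer. */
    struct branch b[2] = { { 11, 0 }, { 1, 1 } };
    int n = 2, elapsed = 0;

    for (;;) {
        int live = 0;
        for (int i = 0; i < n; i++) {
            if (b[i].steps_left == 0)
                continue;                 /* branch already exhausted */
            live++;
            elapsed++;                    /* one uniprocessor step    */
            if (--b[i].steps_left == 0 && b[i].has_answer) {
                printf("answer from branch %d after %d steps\n", i, elapsed);
                return 0;
            }
        }
        if (live == 0) {
            printf("no answer after %d steps\n", elapsed);
            return 0;
        }
    }
}

On this input the interleaved uniprocessor finds the answer after 2
steps, against 3 units for the two-processor version, so the honest
speedup is well under 2.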


-- 
James S. Vera      |      Internet	     |Standard Disclaimers
Stanford University|vera@portia.stanford.edu |Blah Blah Blah Blah
Bellcore           |vera2%mruxb@bellcore.arpa|vvv My Cutesy Quote vvv
"When I was young it seemed that life was so wonderful..." - Supertramp

gld@cunixd.cc.columbia.edu (Gary L Dare) (12/10/88)

In article <3801@hubcap.UUCP> mmh@ivax.doc.imperial.ac.uk wrote:
>As a simple demonstration,.... [had to edit out excessive included text]

>So 2 processors give you a speedup of 4. Q.E.D.

Wait!  You still have to search your promising branch thoroughly on the
first processor; that's 11 steps, not 1.  The fastest turnaround time
if you have each branch ALREADY residing on separate processors is 11
if they are searched simultaneously; master-slave, it's 13.
-- 
~~~~~~~~~~~~~~~~~~~~~~~~ je me souviens ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 Gary L. Dare                           > gld@eevlsi.ee.columbia.EDU
                                        > gld@cunixd.cc.columbia.EDU 	
 "Free Canada, Trade Mulroney!"         > gld@cunixc.BITNET

mgv@uunet.UU.NET (Marco Valtorta) (12/10/88)

If you search breadth-first, instead of depth-first,
you will return a solution in 2 units of time on a serial
machine.

Should I conclude that the speedup in parallel search is
always less than 1? :-)

Marco Valtorta			usenet: ...!ncrae!usceast!mgv
Department of Computer Science	csnet: mgv@cs.scarolina.edu
University of South Carolina	tel.: (1)(803)777-4641
Columbia, SC 29208		tlx: 805038 USC
U.S.A.				fax: (1)(803)777-3065
usenet from Europe: ...!mcvax!uunet!ncrlnk!ncrae!usceast!mgv

mmh@uunet.uu.net (Matthew Huntbach) (12/13/88)

What I was assuming was that you have an algorithm where any solution is
acceptable and, on the whole, a depth-first search leads to solutions most
quickly. It just happens that in this particular example the heuristics
lead you down the wrong path. I assumed the two processors still searched
depth-first, so a completely breadth-first search would not give a
quicker solution.

I agree that bringing in parallelism could be said to have altered the
algorithm by introducing some breadth-first search into what was a purely
depth-first search. But this is what is generally meant by the term
"superlinear speedup".

sru@uunet.UU.NET (Steve Rush) (12/14/88)

  There has been a lot of discussion recently about superlinear speedup,
so I thought I would add my personal experience.
In article <3756@hubcap.UUCP>, kale@m.cs.uiuc.edu (L. V. Kale') writes:
> Here is one more way / reason that leads to the famed superlinear speedups.

The Background :
  I am doing work on graphics algorithms using Transputers, and one of my
algorithms is for rotating images which are stored using 'stripcodes'
('objects' are defined using the coordinates of the left end, along with a
colour and a length). The algorithm rotates the image about a
user-specified point and then regenerates the new strips.
  To parallelize this, the screen is split into horizontal bands with one
band for each processor. For efficiency, each processor has an array of
linked lists, one for each scanline of the image.

Results :
  This program took more than 4 times as long to run on one processor as
it did on 4 processors (that is the definition of superlinear speedup,
isn't it?).
  The reason for this (I had to explain it to my supervisor somehow) is
that the newly generated strips had to be added to the correct linked list,
and merged with the strips already there. In the single-processor system
this required searching through the whole list, while in the 4-processor
system each processor had a smaller list to search.
  The time spent at the end of the algorithm sending the strips back to
the correct processors, plus the time to merge them, came out to less
than the time saved by inserting the strips into the shorter lists.
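
  The per-strip cost at issue is the usual one for keeping a scanline
list ordered (the sketch below uses made-up field names and assumes
ordering by left end; it is not the actual Transputer code): each
insertion walks the list to find its place, so the total insertion work
grows roughly as the square of the list length, and splitting the strips
over four shorter lists cuts that work by about four even before it is
shared among four processors.

/* Sketch of inserting a strip into a scanline list kept ordered by its
 * left end.  The walk is proportional to the current list length, which
 * is the cost that shrinks when the strips are spread over more
 * processors.  Field names are illustrative only. */
#include <stdio.h>
#include <stdlib.h>

struct strip {
    int           x;        /* left end   */
    int           len;      /* run length */
    int           colour;
    struct strip *next;
};

static void insert_strip(struct strip **head, struct strip *s)
{
    struct strip **p = head;
    while (*p != NULL && (*p)->x < s->x)   /* walk: ~ list length */
        p = &(*p)->next;
    s->next = *p;
    *p = s;
}

int main(void)
{
    struct strip *line = NULL;
    int xs[] = { 40, 10, 70, 25 };

    for (int i = 0; i < 4; i++) {
        struct strip *s = malloc(sizeof *s);
        s->x = xs[i];  s->len = 5;  s->colour = 1;  s->next = NULL;
        insert_strip(&line, s);
    }
    for (struct strip *s = line; s != NULL; s = s->next)
        printf("strip at x=%d, length %d\n", s->x, s->len);
    return 0;
}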

  Although you can always claim that my single-processor version was not
efficient, the principle still holds (and I claim that the program WAS
efficient for a single processor).  The only complaints I can think of are:
"a linked list is not efficient" (well, I can't think of anything better
that wouldn't still show the behaviour I have described), or "you could
have the 4 linked lists per scanline on the single processor" (this is
cheating by running different algorithms on the two systems; in that case
I should have 4 lists per scanline on every processor, which takes you
back to the case above).

  Has anyone got any other comments? Have I missed something obvious? 
I would be interested if anyone has anything to say. (P.S. Sorry about the
length of this posting).

                                             Steve R

---
Steve Rush			JANET: sru@uk.ac.exeter.cs 
Computer Science Dept.		UUCP:  ukc!expya!sru
University of Exeter

vera@portia.stanford.edu (James Vera) (12/15/88)

In article <3881@hubcap.UUCP> mcvax!cs.exeter.ac.uk!sru@uunet.UU.NET (Steve Rush) writes:
> [...]
>OR "You could have the 4 linked lists per scanline on the single
>processor". This is cheating by running different algorithms on the two
>systems, in this case I should have 4 lists per scanline on every
>processor, which takes you back to the case above.)
> [...]

No, this would not be cheating.  It is you who is running different
algorithms. In the multiprocessor case you are running an algorithm
which says "search 4 linked lists simultaneously", and in the uniprocessor
case you are running a different global algorithm, which says "search
1 linked list".

It does not seem possible that n processors could run faster than n
times one processor for a given global algorithm.  One need only time
slice the runs of the n processors on the 1 processor, ignoring the
very real effects of cache effectiveness, etc ...




-- 
James S. Vera      |      Internet	     |Standard Disclaimers
Stanford University|vera@portia.stanford.edu |Blah Blah Blah Blah
Bellcore           |vera2%mruxb@bellcore.arpa|vvv My Cutesy Quote vvv
"When I was young it seemed that life was so wonderful..." - Supertramp

bzs@EDDIE.MIT.EDU (Barry Shein) (12/19/88)

I think it's unfortunate to immediately toss out non-CPU effects; it
often tosses out the baby with the bathwater. A while back, using
parallel make on an Encore Multimax, I observed a 50X speedup over
doing exactly the same build on a Vax750. Sheer processor comparison
only predicted about a 15X speedup (i.e. a 6-processor NS32332 Encore,
about 12 MIPS, vs. a 750, about 0.75 MIPS). I think you would find this
sort of result on any decent MIMD (both of them :-)

I suppose you can argue that this is somehow cheating (?), but watching
a compile finish in 2 minutes instead of 100 minutes impressed me. (This
was before I worked at Encore, and I really had to build the system of
100+ C modules on both machines for practical reasons; when I saw the
time difference I ran it again on both systems several times to see if
it was just an accident. Nope, repeatedly about a 50X difference.) It is
an unfortunate theory that dismisses that extra 3X speedup, no? The
whole system is parallelized, not just the ALUs (see the next paragraph
as to why I say ALUs and not CPUs).

As to the breadth-first problem, where one might simulate this on a
uniprocessor: again we run into a practical problem. If the search
runs depth-first on N processors, then the only way you can simulate it
on a uniprocessor is by saving and restoring a fair amount of context,
or by using a processor and algorithm which, for example, keep a lot of
state in registers, probably requiring dozens of registers to be
available, which a parallel system couldn't take advantage of in the
same context. Spread over several CPUs, though, one does have dozens of
registers available. Some of this has to be taken into account.

So, in short, although it sounds good I suspect that these
super-linear speedups will be impossible to shoot down on a similar
uniprocessor, in practice. I realize that's unsatisfying to those who
like to ignore reality but it seems such practicalities are what
people trying to do real computations are concerned with.

I'm always astounded at the energy people will put into "theoretical"
arguments that parallel processing doesn't work (or doesn't work very
well, or is too complicated to utilize) while people around them are
sitting there making it work just fine.

	-Barry Shein, ||Encore||

aglew@mcdurb.Urbana.Gould.COM (12/19/88)

>It does not seem possible that n processors could run faster than n
>times one processor for a given global algorithm.  One need only time
>slice the runs of the n processors on the 1 processor, ignoring the
>very real effects of cache effectiveness, etc ...

And as I am fond of pointing out, timeslicing => context switch costs,
and reduction of context switch costs is one very real source of
linear super-unitary speedup.

david@june.cs.washington.edu (David Callahan) (12/20/88)

In article <3935@hubcap.UUCP> aglew@mcdurb.Urbana.Gould.COM writes:
>>It does not seem possible that n processors could run faster than n
>>times one processor for a given global algorithm.  One need only time
>>slice the runs of the n processors on the 1 processor, ignoring the
>>very real effects of cache effectiveness, etc ...
>
>And as I am fond of pointing out, timeslicing => context switch costs,
>and reduction of context switch costs is one very real source of
>linear super-unitary speedup.

Yeah, but setting a lower bound on time-slice size will put an
upper bound on context switch overhead as a percentage of execution
time. Many of the examples discussed recently have not given
any argument to believe that the parallel algorithm is asymptotically
super-linear, but seem to exploit increases in fast memory, such
as registers.

How many examples of super-linear speedup can be explained
by increases in memory bandwidth due to more registers
and more cache rather than increases in CPU horsepower?

David Callahan


aglew@mcdurb.Urbana.Gould.COM (12/21/88)

>How many examples of super-linear speedup can be explained
>by increases in memory-bandwidth due to more registers
>and more cache rather than increases in CPU horsepower?
>
>David Callahan

Probably many, if not most. But then, look: we know the approximate
shape of the curve that describes size of fast memory vs. performance,
right? I.e., for small amounts of fast memory the ratio
performance/memory is super-unitary; actually, this curve goes up quite
a way if the fast memory speed remains constant. 'Course, at some point
it turns back over, or else caches wouldn't work.

So, more processors => more fast memory, and if we're at the right
point in the curve we get a super-unitary payoff. But, you say,
why not put all that memory on a single processor? OK, but then
the memory speed falls off, typically by a logarithmic factor
(it's easier to put small amounts of fast memory on lots of processors
than it is to put a large amount of fast memory on a single processor).
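
A toy model of that trade-off (all the constants and the log factor are
illustrative assumptions, not measurements): each of P processors works
on 1/P of the data with full-speed fast memory, while the lone processor
gets P times the fast memory but pays a log-factor access-time penalty.
The hit rate is taken to be the same on both sides, since the ratio of
fast memory to data touched is the same.

/* Toy model: P processors with small, full-speed fast memories versus
 * one processor with P times the fast memory running slower by a log
 * factor.  Constants are made up for illustration only. */
#include <stdio.h>
#include <math.h>

#define WORK    1.0e6   /* memory references in the whole job      */
#define HIT     0.8     /* assumed fraction of references that hit */
#define T_SLOW  10.0    /* cost of a reference that misses         */

/* Average cost of one reference, given the fast-memory access time. */
static double per_ref(double t_fast)
{
    return HIT * t_fast + (1.0 - HIT) * T_SLOW;
}

int main(void)
{
    for (int p = 2; p <= 64; p *= 2) {
        double t_big = 1.0 + log2((double) p);    /* bigger, hence slower */
        double seq   = WORK * per_ref(t_big);     /* one processor        */
        double par   = (WORK / p) * per_ref(1.0); /* P processors         */
        printf("P=%2d  speedup %6.2f  (unitary would be %2d)\n",
               p, seq / par, p);
    }
    return 0;
}

Set t_big to 1.0 (fast memory speed independent of size) and the speedup
collapses to exactly P, which is the sensitivity to the scaling factors
mentioned below.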

Oh, your earlier argument about limiting context switch frequency
has too limited a concept of "context". There is also context within
a single job, as when you switch from processing matrix
A to matrix B, and then back again. Such context switches are part
of your algorithm, not artifacts of the multiprogramming environment
you are working in.

I have earlier posted a small conceptual "proof" that super-unitary
speedup is possible in "context-rich" problems. In that "proof" it
is possible to play around with the scaling factors to see under
what circumstances super-unitary speedup is possible. For example,
if fast memory speed stays constant for all sizes, super-unitary
speedup is much less likely than if fast memory speed scales
logarithmically.