[comp.arch] Superlinear Speedup

hankd@ee.ecn.purdue.edu (Hank Dietz) (05/04/90)

In article <1990May1.154558.24009@cs.rochester.edu> crowl@cs.rochester.edu (Lawrence Crowl) writes:
>In article <49622@lanl.gov> ryg@lanl.gov (Richard S Grandy) writes:
>>I've got a glossy from SGI that shows the POWER center performance on 
>>LINPACK (100x100, coded) as:
>>		1 CPU	3.8 DP MFLOPS
>>		4 CPU	16  DP MFLOPS
>>		8 CPU	28  DP MFLOPS
>>Does this mean that with 4 cpus you really get GREATER than a linear speedup?? 
>
>A common trap in measuring parallel speedup is to use a "loaded" CPU.  For
>instance, the single processor might have been serving interrupts, clock
>icons, etc.  The result is that the performance for the single processor case
>is artificially low.  After taking care, the numbers usually turn out slightly
>less than linear.  The second trap is measuring with a relatively coarse
>clock.  For speedup that is near linear, barely missing a clock tick can lead
>to measurements that indicate superlinear speedup.
			       ^^^^^^^^^^^

Not really.  Superlinear means better than O(n) for n processors...  hence,
even though the case of n=1 might be a bit strange, even k*n with k>1 isn't
superlinear.  For example, HJ Siegel has pointed out that for appropriate
pipelined SIMD machines (read: PASM) the overlap of the CU and PEs can
achieve an apparent speedup of 2*n...  but that is O(n), i.e., linear.

I do research in optimizing/parallelizing compilers and the first benchmarks
I did all yielded >n speedup.  Why?  BECAUSE WHEN A PROGRAM IS PARALLELIZED,
IT ISN'T THE SAME PROGRAM -- it simply produces the same results.  In other
words, it can be transformed to take advantage of various side benefits of
parallel hardware & execution:

[1]	Increased register, cache, and page frame space:  if 1 processor can
	keep K things in registers, N processors might keep K*N in registers.
	The program is different because it doesn't spill registers, etc.

[2]	Different management of cache and page frames:  for example, using
	the LRU policy, having N separate K-element LRU queues can yield
	better performance than one N*K-element queue.  We tend to forget
	that even if the same space is available, a management policy change
	can dramatically alter performance.  Proof left to the reader.  ;-)

[3]	Introduction of constants and other code simplifications:  for
	example, with N processors a loop involving a[i] might not need the
	usual indexing operation, because for each processor i is a
	compile-time constant.  This is rather like unraveling every loop,
	but unraveling loops can have negative impacts on the memory
	reference pattern which might not surface relative to a parallel
	machine.  This also happens in non-looping code.

[4]	In general, parallelization transformations restructure code so that
	locality is enhanced, but only in true parallel execution.  This is
	due to the fact that live ranges overlapping in time on a parallel
	machine does not imply that they interfere.  This is also hard to
	explain in detail, so I guess it must be intuitively obvious.  ;-)
	It has to do primarily with the effects of asynchrony....

etc.
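The claim in [2] can be checked with a toy simulation -- a sketch in
(modern, not period) Python, with invented stream shapes chosen purely
for illustration.  Processor A loops over a small working set while
processor B streams through fresh data; two private 4-entry LRU queues
beat one shared 8-entry queue, even though total space is identical:

```python
from collections import OrderedDict

class LRU:
    """A k-entry fully associative cache with LRU replacement."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()
        self.hits = 0
    def touch(self, key):
        if key in self.entries:
            self.hits += 1
            self.entries.move_to_end(key)       # mark most recently used
        else:
            if len(self.entries) >= self.capacity:
                self.entries.popitem(last=False)  # evict least recent
            self.entries[key] = True

# Processor A loops over a 4-element working set; processor B streams
# through always-fresh data, two references for each of A's.
trace = []
for t in range(1000):
    trace.append(('A', 'a%d' % (t % 4)))
    trace.append(('B', 'b%d' % (2 * t)))
    trace.append(('B', 'b%d' % (2 * t + 1)))

shared = LRU(8)                        # one 8-entry queue
split = {'A': LRU(4), 'B': LRU(4)}     # two 4-entry queues, same total space

for cpu, addr in trace:
    shared.touch(addr)
    split[cpu].touch(addr)

print(shared.hits)                        # 0: B's stream evicts A's set
print(split['A'].hits + split['B'].hits)  # 996: A's set stays resident
```

In the shared queue, eleven distinct blocks arrive between successive
touches of any a[i], so A's working set is always evicted before reuse;
the private queue holds it forever after four cold misses.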

Note that the benefits due to, for example, [1] and [3] do not
interfere -- hence they can actually multiply.  In summary, you
can get greater than O(n) speedup -- true superlinear speedup -- because of
such combined effects.  My claim is, however, that you really are not
comparing equivalent programs.

						-hankd@ecn.purdue.edu

sccowan@watmsg.uwaterloo.ca (S. Crispin Cowan) (05/04/90)

In article <1990May3.203405.23456@ecn.purdue.edu> hankd@ee.ecn.purdue.edu (Hank Dietz) writes:
>In article <1990May1.154558.24009@cs.rochester.edu> crowl@cs.rochester.edu (Lawrence Crowl) writes:
>>In article <49622@lanl.gov> ryg@lanl.gov (Richard S Grandy) writes:
>>>I've got a glossy from SGI that shows the POWER center performance on 
>>>LINPACK (100x100, coded) as:
>>>		1 CPU	3.8 DP MFLOPS
>>>		4 CPU	16  DP MFLOPS
>>>Does this mean that with 4 cpus you really get GREATER than a linear speedup?? 
>I do research in optimizing/parallelizing compilers and the first benchmarks
>I did all yielded >n speedup.  Why?  BECAUSE WHEN A PROGRAM IS PARALLELIZED,
>IT ISN'T THE SAME PROGRAM -- it simply produces the same results.  In other
>words, it can be transformed to take advantage of various side benefits of
>parallel hardware & execution:

This all looks like a complicated way of noticing the (somewhat subtle)
point that a parallel machine with n processors _also_ has something
approximating n times the bandwidth to memory, because it has n times
the cache space with n fetch cycles and n centres of locality.

>						-hankd@ecn.purdue.edu

----------------------------------------------------------------------
(S.) Crispin Cowan, CS grad student, University of Waterloo
Office:		DC3548	x3934		Home phone: 570-2517
Post Awful:	60 Overlea Drive, Kitchener, N2M 1T1
UUCP:		watmath!watmsg!sccowan
Domain:		sccowan@watmsg.waterloo.edu

"You can have peace.  Or you can have freedom.  Don't ever count on
having both at once."
	-Lazarus Long
"You can't separate peace from freedom because no one can be at peace
unless he has his freedom."
	-Malcolm X

csimmons@jewel.oracle.com (Charles Simmons) (06/02/90)

In article <1990May3.203405.23456@ecn.purdue.edu>,
hankd@ee.ecn.purdue.edu (Hank Dietz) writes:
> Not really.  Superlinear means better than O(n) for n processors...  hence,
> even though the case of n=1 might be a bit strange, even k*n with k>1 isn't
> superlinear.  For example, HJ Siegel has pointed out that for appropriate
> pipelined SIMD machines (read: PASM) the overlap of the CU and PEs can
> achieve an apparent speedup of 2*n...  but that is O(n), i.e., linear.

I'm curious as to what extent researchers have observed real superlinear
speedups.  For example, consider a chess program.  On an old-fashioned
single processor architecture, the program will examine multiple
alternatives one at a time.  The results of each alternative can trim
the amount of searching required in subsequent alternatives.  (The
alpha-beta cutoffs become closer together.)  Now we might imagine that
a version of the program written for a parallel machine might be able
to benefit from examining multiple alternatives in parallel.  For example,
if one alternative is highly advantageous, it might cause the alpha-beta
cutoff values to get lots closer lots faster.
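The bound-sharing effect described here can be sketched in Python on an
invented toy game tree (this is a hypothetical illustration, not any
real chess program): if another "processor" has already proved one line
worth 8, seeding alpha with that score lets later searches cut off sooner.

```python
from math import inf

def alphabeta(node, alpha, beta, maximizing, stats):
    """Plain alpha-beta over a nested-list game tree; leaves are scores."""
    stats['nodes'] += 1
    if isinstance(node, int):
        return node
    if maximizing:
        value = -inf
        for child in node:
            value = max(value, alphabeta(child, alpha, beta, False, stats))
            alpha = max(alpha, value)
            if alpha >= beta:
                break                   # beta cutoff
        return value
    else:
        value = inf
        for child in node:
            value = min(value, alphabeta(child, alpha, beta, True, stats))
            beta = min(beta, value)
            if beta <= alpha:
                break                   # alpha cutoff
        return value

# Max to move; each sublist is a set of opponent (min) replies.
tree = [[8, 9], [2, 12, 5], [7, 14, 6]]

cold = {'nodes': 0}
v1 = alphabeta(tree, -inf, inf, True, cold)    # no prior bound
seeded = {'nodes': 0}
v2 = alphabeta(tree, 8, inf, True, seeded)     # another "processor" already
                                               # proved a line worth 8
print(v1, v2, cold['nodes'], seeded['nodes'])  # 8 8 8 7
```

Same answer, fewer nodes visited -- the tighter window prunes a leaf
that the cold search had to examine.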

-- Chuck

sls@beaner.cs.wisc.edu (Steve Scott) (06/05/90)

In article <1990Jun2.080658.12651@oracle.com> csimmons@oracle.com writes:
>I'm curious as to what extent researchers have observed real superlinear
>speedups.  For example, consider a chess program.  On an old-fashioned
>single processor architecture, the program will examine multiple
>alternatives one at a time.  The results of each alternative can trim
>the amount of searching required in subsequent alternatives.  (The
>alpha-beta cutoffs become closer together.)  Now we might imagine that
>a version of the program written for a parallel machine might be able
>to benefit from examining multiple alternatives in parallel.  For example,
>if one alternative is highly advantageous, it might cause the alpha-beta
>cutoff values to get lots closer lots faster.
>
>-- Chuck

Not that this is related to architecture,  but alpha-beta search actually
displays just the opposite sort of behavior.  Since pruning is so effective,
exploring multiple branches of the search tree in parallel results in a
tremendous amount of wasted search because the cutoffs from previous
branches are not available yet.  Many people (myself included) have
attempted to alter the algorithm in various ways to lessen this effect, but
no one has been able to approach even linear speedup.
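The wasted search is easy to see on an invented toy tree (a Python
sketch, not any real parallel search): give each root branch to its own
"processor" with a full window, so no cutoffs flow between branches,
and count the nodes visited against an ordinary serial alpha-beta.

```python
from math import inf

def alphabeta(node, alpha, beta, maximizing, stats):
    """Plain alpha-beta over a nested-list game tree; leaves are scores."""
    stats['nodes'] += 1
    if isinstance(node, int):
        return node
    if maximizing:
        value = -inf
        for child in node:
            value = max(value, alphabeta(child, alpha, beta, False, stats))
            alpha = max(alpha, value)
            if alpha >= beta:
                break                   # beta cutoff
        return value
    else:
        value = inf
        for child in node:
            value = min(value, alphabeta(child, alpha, beta, True, stats))
            beta = min(beta, value)
            if beta <= alpha:
                break                   # alpha cutoff
        return value

# Max to move; each sublist is a set of opponent (min) replies.
tree = [[8, 9], [2, 12, 5], [7, 14, 6]]

serial = {'nodes': 0}
alphabeta(tree, -inf, inf, True, serial)   # cutoffs flow branch to branch

parallel = {'nodes': 0}
for branch in tree:                        # each branch on its own processor,
    alphabeta(branch, -inf, inf, False, parallel)  # no shared bounds yet

print(serial['nodes'], parallel['nodes'])  # 8 11: parallel does more work
```

Even counting the root node against the serial search, the independent
branch searches touch more of the tree in total -- they re-explore
leaves that the serially propagated bounds would have pruned.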

--Steve