[comp.arch] Superlinear speed-up

jms@central.cis.upenn.edu (Jonathan M. Smith) (06/16/90)

Re: Super-linear speedup:

The references are:

%T Superlinear speedup of an efficient sequential
algorithm is not possible
%A V. Faber
%A Olaf M. Lubeck
%A Andrew B. White, Jr.
%J Parallel Computing
%I North-Holland
%V 3
%D 1986
%P 259-260

and:

%T Parallel efficiency can be greater than unity
%A D. Parkinson
%J Parallel Computing
%I North-Holland
%V 3
%D 1986
%P 261-262

Note the interesting juxtaposition of the articles in the same issue
of the same journal. It's nice to know editors have a sense of humor.

In any case, Chuck Simmons mentions the idea of speed-up from trying
alternatives in parallel. I attacked this in my thesis, and illustrated
several of the ideas in the much more compact paper in ICPP:

%T Exploring ``Multiple Worlds'' in Parallel
%A Jonathan M. Smith
%A Gerald Q. Maguire, Jr.
%B Proceedings, International Conference on Parallel Processing
%C St. Charles, Illinois
%V II, Software
%P 239-245
%I The Pennsylvania State University Press
%X CUCS-436-89
%D August 8-12, 1989

The thesis is:

%T Concurrent Execution
of Mutually Exclusive Alternatives
%A Jonathan M. Smith
%R Ph.D. Thesis
%X Technical Report CUCS-442-89
%I Columbia University Computer Science Department
%O also available from UMI
%D May, 1989

The basic conclusion is: If you have alternatives with varying
execution times, and the variation can be characterized in some
reasonable way (e.g., a probability distribution function of a random
variable defined over the execution times of the alternatives), you
can get a speed-up by exploring the alternatives in parallel. The
speed-up may come at some expense in throughput (but if the iron is
idle anyway, who cares) and can be estimated from the distribution.
The analysis is completely general, and applies to other quantities
which can be described as random variables with PDFs, e.g., errors
and failures. I implemented the scheme on an IBM ACE multiprocessor
using the Jenkins-Traub rootfinder (complex polynomials with complex
coefficients) on some polynomials of degrees 19 to 43 and it worked
pretty decently. Numbers for a 2-head Ardent Titan are in the ICPP
paper. Numbers for the ACE are in the thesis (which really focuses
on building the system).
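To make the "estimated from the distribution" point concrete, here is a
small sketch (not the thesis code; the exponential distribution and all
parameters are assumptions for illustration) that estimates the expected
speed-up of racing k alternatives and keeping the first to finish:

```python
import random

def expected_times(mean=1.0, k=2, trials=100_000, seed=42):
    # Draw execution times for k alternatives per input; compare the
    # expected time of committing to one alternative against the
    # expected time of the fastest of k run in parallel.
    rng = random.Random(seed)
    seq_total = 0.0   # commit to a single alternative
    par_total = 0.0   # race k alternatives, take the minimum
    for _ in range(trials):
        samples = [rng.expovariate(1.0 / mean) for _ in range(k)]
        seq_total += samples[0]
        par_total += min(samples)
    return seq_total / trials, par_total / trials

seq, par = expected_times()
print(seq / par)   # estimated expected speed-up from the distribution
```

For i.i.d. exponential times the minimum of k samples has mean mean/k,
so the estimated speed-up comes out near k; a heavier-tailed or
bimodal distribution (as in the X() example below) can do much better.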

On superlinearity (which I discuss briefly in the ICPP paper and
somewhat more extensively in the thesis): The reason that the speed-up
can be super-linear (ignoring the timesharing method, which anybody
who's measured a context switch knows is suspicious) is that the only
reasonable way to discuss speedup when any of a set of alternatives
can be executed is in terms of the *expected* execution time. Since
the probability that at least one processor hits a "good" case (fast)
increases, and the probability that every processor hits a "bad" case
(slow) decreases, according to the probability distribution function,
it is easy to see that you can get good speedups. Viz., you have two
algorithms for X(). One takes O(1) on half the inputs and O(N**3) on
the other half. The other algorithm has complementary run-times on
the input set. Now, on average, either algorithm alone takes
(O(N**3)+O(1))/2 = O(N**3). But in parallel, we just run both, keep
the first to finish, and kill the other: O(1). So the speedup (in
this admittedly contrived example) is O(N**3)/O(1) on a two-processor
system!

							-Jonathan