charlie@jupiter.ucsc.edu (Charlie McDowell) (08/22/88)
Over the past several years I recall having read or heard of several
instances of "superlinear" speedup using parallel processors. If you know
of any instances of "superlinear" performance measures on a parallel
machine, I would appreciate hearing about them. They can be either short
unpublished results or pointers to articles in which the results are
mentioned. Please respond by email to charlie@jupiter.ucsc.edu.

Charlie McDowell
Computer and Information Sciences
University of California, Santa Cruz
(408) 429-4772
aboulang@BBN.COM (Albert Boulanger) (08/24/88)
See the two short communications in Parallel Computing 3 (1986):
"Superlinear speedup of an efficient sequential algorithm is not
possible", V. Faber et al., pp. 259-260, and "Parallel efficiency can be
greater than unity", D. Parkinson, pp. 261-262.

Albert Boulanger
BBN Labs Inc.
ABoulanger@bbn.com (arpa)
Phone: (617) 873-3891
gusciora@sam.cs.cmu.edu (George Gusciora) (12/12/88)
Why is everyone so surprised about superlinear speedup? Sometimes an
n-processor machine is simply better suited to a particular algorithm than
the same machine with fewer than n processors.

Example: I've got a neural network program running on the Warp (a
10-processor MIMD machine). The algorithm basically uses 9 cells to
perform the error propagation and one cell to update the connections.
Each of the 9 cells achieves about 8 MFLOPS and the 10th cell is well
underutilized, giving a total performance of about 72 MFLOPS out of a
possible 100. Now, if I ran this on a 5-cell Warp, only 4 cells would
perform error propagation (instead of 9) and I'd get only about 32 MFLOPS
out of a possible 50. The net result is that the program runs more than
twice as fast with just twice as many processors. Superlinear speedup.
Several programs behave like this.

That was a simple example. It's easy to say "I'm really comparing a
9-processor machine with a 4-processor machine (because the last
processor is so underutilized), and a 9/4 speedup is expected." So what.
Nearly every algorithm on any machine wastes machine cycles; I picked the
neural network example just to illustrate this. Superlinear speedup is
only impossible when both machines are running at 100 percent
utilization.

- George Gusciora
  Carnegie Mellon University
  (412) 268 - 7553
  gusciora@sam.cs.cmu.edu
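The arithmetic behind the Warp example, as a minimal sketch (the 8
MFLOPS-per-cell figure is George's; treating the workload as fixed and
perfectly divisible among the propagation cells is an assumption):

    # Idealized model of the 10-cell vs. 5-cell Warp comparison above.
    WORK = 1.0  # total error-propagation work, in arbitrary units

    def run_time(total_cells):
        prop_cells = total_cells - 1      # one cell is reserved for updates
        return WORK / (8.0 * prop_cells)  # 8 MFLOPS per propagation cell

    print(run_time(5) / run_time(10))     # 72/32 = 2.25 > 2: superlinear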
anand@amax.npac.syr.edu (Anand Rangachari) (12/12/88)
This may seem obvious, but in order to measure the speedup of a program
it is necessary to be able to run the given program with differing
numbers of processors.
I have seen some programs developed on the Encore Multimax here and
have been appalled to find that the number of processors used by some
programs is hardwired in, i.e., nothing less than major surgery would be
needed to do a proper speedup measurement.
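A sketch of the alternative: take the processor count from the command
line, so the same program can be timed at p = 1, 2, ..., n. This is a
hypothetical Python illustration, not one of the Multimax programs:

    import sys
    import time
    from multiprocessing import Pool

    def work(chunk):
        # Stand-in for the real computation.
        return sum(i * i for i in chunk)

    if __name__ == "__main__":
        nprocs = int(sys.argv[1])  # processor count is a run-time parameter
        n = 10_000_000
        chunks = [range(i, n, nprocs) for i in range(nprocs)]
        t0 = time.time()
        with Pool(nprocs) as pool:
            total = sum(pool.map(work, chunks))
        # One (p, time) point on the speedup curve.
        print(nprocs, total, time.time() - t0)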
Also, I note that when the Sandia group achieved their prodigious
speedups, they were unable to measure the single-processor case. Two
reasons were mentioned for this. First, with a speedup of, say, 500
on 1000 processors, a program taking 10 minutes to run would need 5000
minutes on a single processor. Second, it may not be possible to fit the
entire problem on a single processor in the case of distributed-memory
machines with limited memory per processor.
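(In symbols: T_1 = S(p) x T_p = 500 x 10 min = 5000 min, more than three
days of dedicated single-processor time.)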
R. Anand
Internet: anand@amax.npac.syr.edu
Bitnet: ranand@sunrise

eugene@eos.arc.nasa.gov (Eugene Miya) (12/13/88)
In article <3841@hubcap.UUCP> gusciora@sam.cs.cmu.edu (George Gusciora)
writes:
>Why is everyone so surprised about superlinear speedup?

The above coming from a person at the university where it was originally
written up. "Speedup," let's face it, is a dumb measure. Users, given
less than half a chance, would not care about the n-1 cases leading up to
maximum performance; they only care about the nth, "highest performance"
case. Only computer scientists care immediately about the progression. A
couple of years ago (I can't remember the paper), someone wrote that
computists don't induct: we don't completely study all there is to know
about 2 processors before moving to 3 processors, etc. We don't learn a
whole which is transferable between cases.

George also noted the Warp. The best I've heard for a 10-processor system
is 85 MFLOPS on a matrix multiply (at CMU). The problem isn't
parallelism, it's the sequential stuff (within and without the "parallel
streams").

Another gross generalization from

--eugene miya, NASA Ames Research Center, eugene@aurora.arc.nasa.gov
  resident cynic at the Rock of Ages Home for Retired Hackers:
  "Mailers?! HA!", "If my mail does not reach you, please accept my apology."
  {uunet,hplabs,ncar,decwrl,allegra,tektronix}!ames!aurora!eugene
  "Send mail, avoid follow-ups. If enough, I'll summarize."

If you want a cute little exercise, work a week without using the phrase
"parallel processing." Then excise the other common terms: massive
parallelism, multiprocess, multitask, etc. Have fun. You will learn
something.
toulouse@iro.umontreal.ca (Michel Toulouse) (11/08/90)
Does anybody have references to papers that claim to get theoretical or
empirical superlinear speedup (speedup greater than the number of
processors) for a parallel algorithm or experiment on a parallel machine?
I am collecting such references; once I have a good core of them, I will
post the list to this newsgroup. Thanks.

Michel Toulouse
Universite de Montreal
email: toulouse@iro.umontreal.ca
toulouse@iro.umontreal.ca (Michel Toulouse) (11/21/90)
Two weeks ago, I sent a message to this newsgroup asking
for references about papers claiming superlinear speedup. Since
then, I have received many references and have also enjoyed reading
some very interesting comments.
My first plan was to take a close look at every paper and
add my own comments. But this could take a lot of time, and many of
those who sent me references and comments expressed interest
in receiving this list of references. So I post it now, with one
brief comment.
The traditional argument against superlinear speedup is that
"if a process Q runs on p processors for t time, then there
is a sequential machine that will simulate Q in at most kpt + k'
time steps, for k and k' constants. So the speedup is by a
factor of p only." As was mentioned in many comments, there are
superlinear results based on advantages like more physical memory in the
parallel machine, or obtained by heuristically guided search. These are
not, in my view, convincing arguments against the traditional
scepticism. The first case is clearly not convincing, since
you can always add more memory to the sequential machine. In the
second case, if you partition the search space equally among the p
processors, you can achieve a very big speedup (as mentioned in one
of the comments I received). But I think this, from a theoretical point
of view, is meaningless, since the speedup depends only on factors like
the partition of the search space, the heuristic used, your random
function, etc.
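In symbols (a rendering of the quoted argument, writing T_seq for the best
sequential time and t for the parallel time on p processors):

\[
S(p) \;=\; \frac{T_{seq}}{t}
     \;\le\; \frac{kpt + k'}{t}
     \;=\; kp + \frac{k'}{t}
     \;=\; O(p),
\]

since the best sequential program can do no worse than the round-robin
simulation of the p instruction streams.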
However, related to heuristics in search spaces, this paper: "The
performance of cooperative processes", B.A. Huberman, Physica D 42, 1990,
pp. 38-47, claims superlinear speedup based not on partitioning the
search space but on "hints" that cooperating processes can use to
eliminate uninteresting subspaces. The paper gives a formal explanation
of this possible superlinear speedup. Whether it is a counter-argument to
the argument above, I don't know for the moment.
Volume 42 of Physica D concentrates entirely on nonlinear phenomena;
the discussion of emergent computation is treated extensively. I found
many interesting papers: one about a complete characterization of Turing
machines with a restricted number of states (no problem, you can still
teach that the halting problem is undecidable), and also many papers on
neural computation. I think this Physica survey of emergent computation
is very serious work (as is almost always the case in this journal).
At least for me, this issue introduces an approach to parallel
computation that might more seriously challenge the argument against
superlinear speedup. As I said before, I haven't had time to look at
every paper; I'm anxious to find challengers to the traditional argument.
I thank very much everyone who sent me references and comments.
I am still very interested in receiving comments on this subject or
related ones like emergent computation.
Michel Toulouse
Universite de Montreal
email: toulouse@iro.umontreal.ca
-------------------------------------------------------------------------
@inproceedings{RAO88fsttcs,
AUTHOR = "Nageshwara Rao, V. and Kumar, Vipin",
TITLE = "Superlinear Speedup in State-Space Search",
BOOKTITLE = "Proceedings of the 1988 Foundations of Software Technology
and Theoretical Computer Science",
YEAR = "1988",
MONTH = "December",
NOTE = "Lecture Notes in Computer Science number 338, Springer-Verlag"
}
@techreport{RAO90suplin,
AUTHOR = "V. Nageshwara Rao and Vipin Kumar",
TITLE = "On the Efficiency of Parallel Depth-First Search",
YEAR = "1990",
NUMBER = "89-41",
INSTITUTION = "Computer Science Department, University of Minnesota",
NOTE = "Submitted for publication"
}
-------------------------------------------------------------------------
@article{lihudak89
,key={LiKai}
,author={Li, K. and Hudak, P.}
,title={Memory Coherence in Shared Virtual Memory Systems}
,journal={ACM Transactions on Computer Systems}
,volume=7
,number=4
,month=Nov
,year=1989
,pages={321-359}
}
@phdthesis{liphd
,key={li}
,author={Li, K.}
,title={Shared Virtual Memory on Loosely Coupled Multiprocessors}
,school={Yale University, Department of Computer Science}
,year=1986
}
-------------------------------------------------------------------------
Charlie McDowell and I (David Helmbold) have a paper
in the IEEE Transactions on Parallel and Distributed Systems,
Vol. 1, No. 2 (April 1990), titled "Modeling Speedup (n) Greater
than n". Our references include:
@techreport{sandia,
AUTHOR = "Benner, R.E. and J.L. Gustafson and R.E. Montry",
TITLE = "Development and analysis of scientific application programs
on a 1024-processor hypercube",
INSTITUTION = "Sandia National Laboratories",
YEAR = "1988",
MONTH = "February",
NUMBER = "SAND 88-0317",
ADDRESS = "Albuquerque, N.M.",
note = ""}
@article{Gus88,
AUTHOR = "Gustafson, J.L.",
TITLE = "Reevaluating {A}mdahl's Law",
JOURNAL = "CACM",
YEAR = "1988",
volume = "31",
number = "5",
pages = "532-533",
month = "May",
note = ""}
@article{EZL89,
AUTHOR = "D.L. Eager and J. Zahorjan and E.D. Lazowska",
TITLE = "Speedup versus Efficiency in Parallel Systems",
JOURNAL = "IEEE Trans. on Comp.",
YEAR = "1989",
volume = "38",
number = "3",
pages = "408-423",
month = "March",
note = ""}
@inproceedings{SF89,
AUTHOR = "C.B. Stunkel and W.K. Fuchs",
TITLE = "Analysis of Hypercube cache performance using address
traces generated by {TRAPEDS}",
BOOKTITLE = "Proc. 1989 International Conference on Parallel Processing",
YEAR = "1989",
pages = "I-33--I-40",
note = ""}
%
@article{LS84,
AUTHOR = "T. Lai and S. Sahni",
TITLE = "Anomalies in parallel branch-and-bound algorithms",
JOURNAL = "CACM",
YEAR = "1984",
volume = "27",
number = "6",
pages = "594-602",
month = "June",
note = ""}
% Ohio State and Univ of Minnesota
@article{Wei82,
AUTHOR = "B. W. Weide",
TITLE = "Modeling unusual behavior of parallel algorithms",
JOURNAL = "IEEE Trans. on Computers",
YEAR = "1982",
volume = "C-31",
number = "11",
pages = "1126-1130",
month = "November",
note = ""}
% Ohio State
@article{LW86,
AUTHOR = "G.-J. Li and B. W. Wah",
TITLE = "Coping with anomalies in parallel branch-and-bound algorithms",
JOURNAL = "IEEE Trans. on Comp.",
YEAR = "1986",
volume = "C-35",
number = "6",
pages = "568-573",
month = "June",
note = ""}
% was Purdue now Univ of Illinois
@inproceedings{PH88,
AUTHOR = "B. R. Preiss and V. C. Hamacher",
TITLE = "Semi-static dataflow",
BOOKTITLE = "Proc. 1988 Inter. Conf. on Parallel Processing",
YEAR = "1988",
pages = "127-134",
note = ""}
% Univ of Waterloo and Univ of Toronto
@article{FLW86,
AUTHOR = "V. Faber and O. M. Lubeck and A. B. White Jr.",
TITLE = "Superlinear speedup of an efficient sequential algorithm is not
possible",
JOURNAL = "Parallel Computing",
YEAR = "1986",
volume = "3",
pages = "259-260",
note = ""}
% LANL
@article{Par86,
AUTHOR = "D. Parkinson",
TITLE = "Parallel efficiency can be greater than unity",
JOURNAL = "Parallel Computing",
YEAR = "1986",
volume = "3",
pages = "261-262",
note = ""}
% DAP support unit, Queen Mary College, London
@book{GSS87,
AUTHOR = "E. F. Gehringer and D. P Siewiorek and Z. Segall",
TITLE = "Parallel processing, the {Cm}* experience",
PUBLISHER = "Digital press",
pages = "227-231",
YEAR = "1987"}
@inproceedings{Li88,
AUTHOR = "K. Li",
TITLE = "{IVY}: a shared virtual memory system for parallel computing",
BOOKTITLE = "Proc. 1988 Inter. Conf. on Parallel Processing",
YEAR = "1988",
pages = "94-101",
note = ""}
% Princeton
@inproceedings{MG85,
AUTHOR = "R. Mehrotra and E. F. Gehringer",
TITLE = "Superlinear speedup through randomized algorithms",
BOOKTITLE = "Proc. 1985 Inter. Conf. on Parallel Processing",
YEAR = "1985",
pages = "291-300",
note = ""}
% North Carolina State
@article{Kor82,
AUTHOR = "W. A. Kornfeld",
TITLE = "Combinatorially implosive algorithms",
JOURNAL = "CACM",
YEAR = "1982",
volume = "25",
number = "10",
pages = "734-738",
month = "October",
note = ""}
% MIT
@techreport{Fla87,
AUTHOR = "H. P. Flatt",
TITLE = "Further comments on a model for parallel processing",
INSTITUTION = "IBM PASC G320-3503",
YEAR = "1987",
note = ""}
@article{Miy88,
AUTHOR = "E. Miya",
TITLE = "Suggestion on Superlinear speed up terminology",
JOURNAL = "Network news posting",
YEAR = "1988",
volume = "comp.parallel",
number = "318",
month = "December",
note = ""}
% NASA AMES
@inproceedings{Smi81,
AUTHOR = "B. J. Smith",
TITLE = "Architecture and applications of the {HEP} multiprocessor computer",
BOOKTITLE = "Real-Time signal processing IV",
YEAR = "1981",
volume = "298",
pages = "241-248",
note = ""}
%
@article{San86,
AUTHOR = "J. Sanguinetti",
TITLE = "Performance of a message-based multiprocessor",
JOURNAL = "Computer",
YEAR = "1986",
volume = "19",
number = "9",
month = "September",
pages = "47-55",
note = ""}
% just slightly superunity on an {ELXSI}, due to reduced {OS} overhead
@article{Min70,
AUTHOR = "M. Minsky",
TITLE = "Form and Content in Computer Science, {ACM Turing Lecture}",
JOURNAL = "J. ACM",
YEAR = "1970",
volume = "17",
pages = "197-215",
note = ""}
% the log(n) Minsky conjecture
@article{Lip88,
AUTHOR = "J. Lipkis",
TITLE = "Superlinear performance of a monte carlo code on the {NYU}
{Ultracomputer}",
JOURNAL = "Personal communication",
YEAR = "1988",
note = ""}
%
@article{Bar88,
AUTHOR = "J. Barton",
TITLE = "Superlinear performance of the {SAXPY} kernel on the {Silicon}
{Graphics} {4D120/GTX}",
JOURNAL = "Personal communication",
YEAR = "1988",
note = ""}
%
@article{GM85,
AUTHOR = "D. H. Grit and J. R. McGraw",
TITLE = "Programming Divide and Conquer for a {MIMD} Machine",
JOURNAL = "Software Practice and Experience",
YEAR = "1985",
month = " January",
volume = "15",
number = "1",
pages = "41-53",
note = ""}
@article{Amd67,
AUTHOR = "G. Amdahl",
TITLE = "Validity of the single processor approach to achieving large
scale computing capabilities",
JOURNAL = "Proceedings AFIPS Comp. Conf.",
YEAR = "1967",
volume = "30",
pages = "483-485",
note = ""}
-------------------------------------------------------------------------
\bibitem{Kuma88c}
Kumar, Vipin, Ramesh, K., and Nageshwara Rao, V.,
``Parallel Best-First Search of State-Space Graphs: A Summary of Results'',
{\em AAAI 1988}: 122-127.
\bibitem{Kuma89a}
Kumar, Vipin, and Nageshwara Rao, V.,
``Load Balancing on the Hypercube Architecture'',
{\em Proc. 4th Conf. on Hypercubes, Concurrent Computers and Applications},
March 1989.
\bibitem{Nels90b}
Nelson, Peter C., and Toptsis, Anestis A.,
``Artificial Intelligence, Parallel Processing, and Bidirectional
Heuristic Search'',
Submitted.
\bibitem{Nels90c}
Nelson, Peter C., and Toptsis, Anestis A.,
``Superlinear Speedup in Multiprocessor Bidirectional Heuristic
State-Space Search'',
Submitted.
\bibitem{Nels90d}
Nelson, Peter C., and Toptsis, Anestis A.,
``Wave-Shaping in Parallel Bidirectional Heuristic State-Space Search'',
Submitted.
\bibitem{Rao87a}
Nageshwara Rao, V., Kumar, Vipin, and Ramesh, K.,
``A Parallel Implementation of Iterative-Deepening-A*'',
{\em AAAI 1987}: 178-182.
\bibitem{Rao87d}
Nageshwara Rao, V., Kumar, Vipin, and Ramesh, K.,
``Parallel Heuristic Search on a Shared Memory Multiprocessor'',
{\em Tech. Report AI Lab TR-87-45},
Univ. of Texas at Austin, January 1987.
\bibitem{Rao88b}
Nageshwara Rao, V., and Kumar, Vipin,
``Concurrent Insertions and Deletions in a Priority Queue'',
{\em 1988 Inter. Conf. on Parallel Processing}, Vol. 3 (1988): 207-210.
-------------------------------------------------------------------------
In the CONPAR 90 - VAPP IV proceedings you'll find:
M. Cosnard and J.-L. Philippe,
"Achieving Superlinear Speedups for the Multiple Polynomial Quadratic
Sieve Factoring Algorithm on a Distributed Memory Multiprocessor",
pp. 863-874.
The proceedings have been published by Springer-Verlag, Germany, and
may be ordered through bookshops or, with a 30% reduction, directly from:
**********************************************************
* Institut fuer Informatik | phone : +41 61 321 99 67    *
* Mittlere Strasse 142     | fax   : +41 61 321 99 15    *
* 4000 Basel               | email : waser@urz.unibas.ch *
* Switzerland              |                             *
**********************************************************
-------------------------------------------------------------------------
Li & Wah "How to cope with anomalies in parallel approximate
branch-andf-bound search". AAAI-84 (National Conf. on AI).
McBurney & Sleep "Transputer based experimanets with the ZAPP
architecture" In PARLE-87 (Parallel Architectures and Languages,
Europe) conference, Springer LNCS 258.
The former is a theoretical analysis, the latter gives some
practical results. In both cases the superlinear is speedup is
due to a single processor doing a long search of a tree with no
solutions, while in a multiprocessor system this tree is
searched in parallel with another in which a solution is easily
found.
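A toy model of that effect (the node counts are hypothetical; the point
is the shape of the anomaly, not the numbers):

    # Sequential depth-first search exhausts the barren subtree before
    # trying the one that contains the solution.
    BARREN = 1_000_000  # nodes in the subtree with no solution
    DEPTH = 10          # nodes expanded before the solution is reached

    seq_nodes = BARREN + DEPTH  # one processor: barren subtree first
    par_nodes = DEPTH           # two processors: one subtree each; the
                                # search stops when either finds a solution

    print(seq_nodes / par_nodes)  # ~100001x "speedup" on only 2 processors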
-------------------------------------------------------------------------
David P. Helmbold and Charles E. McDowell, "Modeling Speedup (n) Greater
than n", IEEE Trans. on Parallel and Distributed Systems, Vol. 1, No. 2,
April 1990, pp. 250-256.