charlie@jupiter.ucsc.edu (Charlie McDowell) (08/22/88)
Over the past several years I recall having read or heard of several
instances of "superlinear" speedup using parallel processors. If you know
of any instances of "superlinear" performance measures on a parallel
machine, I would appreciate hearing about them. They can be either short
unpublished results or pointers to articles in which the results are
mentioned. Please respond by email to charlie@jupiter.ucsc.edu.

Charlie McDowell
Computer and Information Sciences
University of California, Santa Cruz
(408) 429-4772
aboulang@BBN.COM (Albert Boulanger) (08/24/88)
See the two short communications in Parallel Computing 3 (1986):
"Superlinear speedup of an efficient sequential algorithm is not
possible", V. Faber et al., pp. 259-260, and "Parallel efficiency can be
greater than unity", D. Parkinson, pp. 261-262.

Albert Boulanger
BBN Labs Inc.
ABoulanger@bbn.com (arpa)
Phone: (617) 873-3891
gusciora@sam.cs.cmu.edu (George Gusciora) (12/12/88)
Why is everyone so surprised about superlinear speedup? Sometimes an
n-processor machine is simply better suited to a particular algorithm than
the same machine with fewer than n processors.

Example: I've got a neural network program running on the Warp (a
10-processor MIMD machine). The algorithm basically uses 9 cells to
perform the error propagation and one cell to update the connections.
Each of the 9 cells achieves about 8 MFLOPS and the 10th cell is well
underutilized, giving a total performance of about 72 MFLOPS out of a
possible 100. Now, if I ran this on a 5-cell Warp, only 4 cells would
perform error propagation (instead of 9) and I'd get only about 32 MFLOPS
out of a possible 50. The net result is that the program runs more than
twice as fast with just twice as many processors. Superlinear speedup.
Several programs behave like this.

That was a simple example. It's easy to say "I'm really comparing a
9-processor machine with a 4-processor machine (because the last
processor is so underutilized), and a 9/4 speedup is expected." So what.
Nearly every algorithm on any machine wastes machine cycles; I picked the
neural network example just to illustrate this. Superlinear speedup is
only impossible when both machines are running at 100 percent
utilization.

- George Gusciora
  Carnegie Mellon University
  (412) 268 - 7553
  gusciora@sam.cs.cmu.edu
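The arithmetic behind the Warp example, as a minimal sketch (the 8
MFLOPS-per-cell figure is George's; treating the workload as fixed and
perfectly divisible among the propagation cells is an assumption):

    # Idealized model of the 10-cell vs. 5-cell Warp comparison above.
    WORK = 1.0  # total error-propagation work, in arbitrary units

    def run_time(total_cells):
        prop_cells = total_cells - 1      # one cell is reserved for updates
        return WORK / (8.0 * prop_cells)  # 8 MFLOPS per propagation cell

    print(run_time(5) / run_time(10))     # 72/32 = 2.25 > 2: superlinear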
anand@amax.npac.syr.edu (Anand Rangachari) (12/12/88)
This may seem obvious, but in order to measure the speedup of a program
it is necessary to be able to run the given program with differing
numbers of processors.
I have seen some programs developed on the Encore Multimax here and
have been appalled to find that the number of processors used by some
programs is hardwired in, i.e., nothing less than major surgery would be
needed to do a proper speedup measurement.
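A sketch of the alternative: take the processor count from the command
line, so the same program can be timed at p = 1, 2, ..., n. This is a
hypothetical Python illustration, not one of the Multimax programs:

    import sys
    import time
    from multiprocessing import Pool

    def work(chunk):
        # Stand-in for the real computation.
        return sum(i * i for i in chunk)

    if __name__ == "__main__":
        nprocs = int(sys.argv[1])  # processor count is a run-time parameter
        n = 10_000_000
        chunks = [range(i, n, nprocs) for i in range(nprocs)]
        t0 = time.time()
        with Pool(nprocs) as pool:
            total = sum(pool.map(work, chunks))
        # One (p, time) point on the speedup curve.
        print(nprocs, total, time.time() - t0)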
Also, I note that when the Sandia group achieved their prodigious
speedups, they were unable to measure the single-processor case. Two
reasons were mentioned for this. First, with a speedup of, say, 500
on 1000 processors, a program taking 10 minutes to run would need 5000
minutes on a single processor. Second, it may not be possible to fit the
entire problem on a single processor in the case of distributed-memory
machines with limited memory per processor.
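(In symbols: T_1 = S(p) x T_p = 500 x 10 min = 5000 min, more than three
days of dedicated single-processor time.)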
R. Anand
Internet: anand@amax.npac.syr.edu
Bitnet: ranand@sunrise

eugene@eos.arc.nasa.gov (Eugene Miya) (12/13/88)
In article <3841@hubcap.UUCP> gusciora@sam.cs.cmu.edu (George Gusciora)
writes:
>Why is everyone so surprised about superlinear speedup?

The above coming from a person at the university where it was originally
written up. "Speedup," let's face it, is a dumb measure. Users, given
less than half a chance, would not care about the n-1 cases leading up to
maximum performance; they only care about the nth, "highest performance"
case. Only computer scientists care immediately about the progression. A
couple of years ago (I can't remember the paper), someone wrote that
computists don't induct: we don't completely study all there is to know
about 2 processors before moving to 3 processors, etc. We don't learn a
whole which is transferable between cases.

George also noted the Warp. The best I've heard for a 10-processor system
is 85 MFLOPS on a matrix multiply (at CMU). The problem isn't
parallelism, it's the sequential stuff (within and without the "parallel
streams").

Another gross generalization from

--eugene miya, NASA Ames Research Center, eugene@aurora.arc.nasa.gov
  resident cynic at the Rock of Ages Home for Retired Hackers:
  "Mailers?! HA!", "If my mail does not reach you, please accept my apology."
  {uunet,hplabs,ncar,decwrl,allegra,tektronix}!ames!aurora!eugene
  "Send mail, avoid follow-ups. If enough, I'll summarize."

If you want a cute little exercise, work a week without using the phrase
"parallel processing." Then excise the other common terms: massive
parallelism, multiprocess, multitask, etc. Have fun. You will learn
something.
toulouse@iro.umontreal.ca (Michel Toulouse) (11/08/90)
Does anybody have references to papers that claim to get theoretical or
empirical superlinear speedup (speedup greater than the number of
processors) for a parallel algorithm or experiment on a parallel machine?
I am collecting such references; once I have a good core of them, I will
post the list to this newsgroup. Thanks.

Michel Toulouse
Universite de Montreal
email: toulouse@iro.umontreal.ca
toulouse@iro.umontreal.ca (Michel Toulouse) (11/21/90)
Two weeks ago, I sent a message to this newsgroup asking
for references about papers claiming superlinear speedup. Since
then, I have received many references and have also enjoyed reading
some very interesting comments.
My first plan was to take a close look at every paper and
add my own comments. But this could take a lot of time, and many of
those who sent me references and comments expressed interest
in receiving this list of references. So I post it now, with one
brief comment.
The traditional argument against superlinear speedup is that
"if a process Q runs on p processors for t time, then there
is a sequential machine that will simulate Q in at most kpt + k'
time steps, for k and k' constants. So the speedup is by a
factor of p only." As was mentioned in many comments, there are
superlinear results based on advantages like more physical memory in the
parallel machine, or obtained by heuristically guided search. These are
not, in my view, convincing arguments against the traditional
scepticism. The first case is clearly not convincing, since
you can always add more memory to the sequential machine. In the
second case, if you partition the search space equally among the p
processors, you can achieve a very big speedup (as mentioned in one
of the comments I received). But I think this, from a theoretical point
of view, is meaningless, since the speedup depends only on factors like
the partition of the search space, the heuristic used, your random
function, etc.
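In symbols (a rendering of the quoted argument, writing T_seq for the best
sequential time and t for the parallel time on p processors):

\[
S(p) \;=\; \frac{T_{seq}}{t}
     \;\le\; \frac{kpt + k'}{t}
     \;=\; kp + \frac{k'}{t}
     \;=\; O(p),
\]

since the best sequential program can do no worse than the round-robin
simulation of the p instruction streams.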
However, related to heuristics in search spaces, this paper: "The
performance of cooperative processes", B.A. Huberman, Physica D 42, 1990,
pp. 38-47, claims superlinear speedup based not on partitioning the
search space but on "hints" that cooperating processes can use to
eliminate uninteresting subspaces. The paper gives a formal explanation
of this possible superlinear speedup. Whether it is a counter-argument to
the argument above, I don't know for the moment.
Volume 42 of Physica D concentrates entirely on nonlinear phenomena;
the discussion of emergent computation is treated extensively. I found
many interesting papers: one about a complete characterization of Turing
machines with a restricted number of states (no problem, you can still
teach that the halting problem is undecidable), and also many papers on
neural computation. I think this Physica survey of emergent computation
is very serious work (as is almost always the case in this journal).
At least for me, this issue introduces an approach to parallel
computation that might more seriously challenge the argument against
superlinear speedup. As I said before, I haven't had time to look at
every paper; I'm anxious to find challengers to the traditional argument.
I thank very much everyone who sent me references and comments.
I am still very interested in receiving comments on this subject or
related ones like emergent computation.
Michel Toulouse
Universite de Montreal
email: toulouse@iro.umontreal.ca
-------------------------------------------------------------------------
@inproceedings{RAO88fsttcs,
AUTHOR = "Nageshwara Rao, V. and Kumar, Vipin",
TITLE = "Superlinear Speedup in State-Space Search",
BOOKTITLE = "Proceedings of the 1988 Foundations of Software Technology
and Theoretical Computer Science",
YEAR = "1988",
MONTH = "December",
NOTE = "Lecture Notes in Computer Science number 338, Springer-Verlag"
}
@techreport{RAO90suplin,
AUTHOR = "V. Nageshwara Rao and Vipin Kumar",
TITLE = "On the Efficiency of Parallel Depth-First Search",
YEAR = "1990",
NUMBER = "89-41",
INSTITUTION = "Computer Science Department, University of Minnesota",
NOTE = "Submitted for publication"
}
-------------------------------------------------------------------------
@article{lihudak89
,key={LiKai}
,author={Li, K. and Hudak, P.}
,title={Memory Coherence in Shared Virtual Memory Systems}
,journal={ACM Transactions on Computer Systems}
,volume=7
,number=4
,month=Nov
,year=1989
,pages={321-359}
}
@phdthesis{liphd
,key={li}
,author={Li, K.}
,title={Shared Virtual Memory on Loosely Coupled Multiprocessors}
,school={Yale University, Department of Computer Science}
,year=1986
}
-------------------------------------------------------------------------
Charlie McDowell and I (David Helmbold) have a paper
in the IEEE Transactions on Parallel and Distributed Systems,
Vol. 1, No. 2 (April 1990), titled "Modeling Speedup (n) Greater
than n". Our references include:
@techreport{sandia,
AUTHOR = "Benner, R.E. and J.L. Gustafson and R.E. Montry",
TITLE = "Development and analysis of scientific application programs
on a 1024-processor hypercube",
INSTITUTION = "Sandia National Laboratories",
YEAR = "1988",
MONTH = "February",
NUMBER = "SAND 88-0317",
ADDRESS = "Albuquerque, N.M.",
note = ""}
@article{Gus88,
AUTHOR = "Gustafson, J.L.",
TITLE = "Reevaluating {A}mdahl's Law",
JOURNAL = "CACM",
YEAR = "1988",
volume = "31",
number = "5",
pages = "532-533",
month = "May",
note = ""}
@article{EZL89,
AUTHOR = "D.L. Eager and J. Zahorjan and E.D. Lazowska",
TITLE = "Speedup versus Efficiency in Parallel Systems",
JOURNAL = "IEEE Trans. on Comp.",
YEAR = "1989",
volume = "38",
number = "3",
pages = "408-423",
month = "March",
note = ""}
@inproceedings{SF89,
AUTHOR = "C.B. Stunkel and W.K. Fuchs",
TITLE = "Analysis of Hypercube cache performance using address
traces generated by {TRAPEDS}",
BOOKTITLE = "Proc. 1989 International Conference on Parallel Processing",
YEAR = "1989",
pages = "I-33--I-40",
note = ""}
%
@article{LS84,
AUTHOR = "T. Lai and S. Sahni",
TITLE = "Anomalies in parallel branch-and-bound algorithms",
JOURNAL = "CACM",
YEAR = "1984",
volume = "27",
number = "6",
pages = "594-602",
month = "June",
note = ""}
% Ohio State and Univ of Minnesota
@article{Wei82,
AUTHOR = "B. W. Weide",
TITLE = "Modeling unusual behavior of parallel algorithms",
JOURNAL = "IEEE Trans. on Computers",
YEAR = "1982",
volume = "C-31",
number = "11",
pages = "1126-1130",
month = "November",
note = ""}
% Ohio State
@article{LW86,
AUTHOR = "G.-J. Li and B. W. Wah",
TITLE = "Coping with anomalies in parallel branch-and-bound algorithms",
JOURNAL = "IEEE Trans. on Comp.",
YEAR = "1986",
volume = "C-35",
number = "6",
pages = "568-573",
month = "June",
note = ""}
% was Purdue now Univ of Illinois
@inproceedings{PH88,
AUTHOR = "B. R. Preiss and V. C. Hamacher",
TITLE = "Semi-static dataflow",
BOOKTITLE = "Proc. 1988 Inter. Conf. on Parallel Processing",
YEAR = "1988",
pages = "127-134",
note = ""}
% Univ of Waterloo and Univ of Toronto
@article{FLW86,
AUTHOR = "V. Faber and O. M. Lubeck and A. B. White Jr.",
TITLE = "Superlinear speedup of an efficient sequential algorithm is not
possible",
JOURNAL = "Parallel Computing",
YEAR = "1986",
volume = "3",
pages = "259-260",
note = ""}
% LANL
@article{Par86,
AUTHOR = "D. Parkinson",
TITLE = "Parallel efficiency can be greater than unity",
JOURNAL = "Parallel Computing",
YEAR = "1986",
volume = "3",
pages = "261-262",
note = ""}
% DAP support unit, Queen Mary College, London
@book{GSS87,
AUTHOR = "E. F. Gehringer and D. P Siewiorek and Z. Segall",
TITLE = "Parallel processing, the {Cm}* experience",
PUBLISHER = "Digital press",
pages = "227-231",
YEAR = "1987"}
@inproceedings{Li88,
AUTHOR = "K. Li",
TITLE = "{IVY}: a shared virtual memory system for parallel computing",
BOOKTITLE = "Proc. 1988 Inter. Conf. on Parallel Processing",
YEAR = "1988",
pages = "94-101",
note = ""}
% Princeton
@inproceedings{MG85,
AUTHOR = "R. Mehrotra and E. F. Gehringer",
TITLE = "Superlinear speedup through randomized algorithms",
BOOKTITLE = "Proc. 1985 Inter. Conf. on Parallel Processing",
YEAR = "1985",
pages = "291-300",
note = ""}
% North Carolina State
@article{Kor82,
AUTHOR = "W. A. Kornfeld",
TITLE = "Combinatorially implosive algorithms",
JOURNAL = "CACM",
YEAR = "1982",
volume = "25",
number = "10",
pages = "734-738",
month = "October",
note = ""}
% MIT
@techreport{Fla87,
AUTHOR = "H. P. Flatt",
TITLE = "Further comments on a model for parallel processing",
INSTITUTION = "IBM PASC G320-3503",
YEAR = "1987",
note = ""}
@article{Miy88,
AUTHOR = "E. Miya",
TITLE = "Suggestion on Superlinear speed up terminology",
JOURNAL = "Network news posting",
YEAR = "1988",
volume = "comp.parallel",
number = "318",
month = "December",
note = ""}
% NASA AMES
@inproceedings{Smi81,
AUTHOR = "B. J. Smith",
TITLE = "Architecture and applications of the {HEP} multiprocessor computer",
BOOKTITLE = "Real-Time signal processing IV",
YEAR = "1981",
volume = "298",
pages = "241-248",
note = ""}
%
@article{San86,
AUTHOR = "J. Sanguinetti",
TITLE = "Performance of a message-based multiprocessor",
JOURNAL = "Computer",
YEAR = "1986",
volume = "19",
number = "9",
month = "September",
pages = "47-55",
note = ""}
% just slightly superunity on an {ELXSI}, due to reduced {OS} overhead
@article{Min70,
AUTHOR = "M. Minsky",
TITLE = "Form and Content in Computer Science, {ACM Turing Lecture}",
JOURNAL = "J. ACM",
YEAR = "1970",
volume = "17",
pages = "197-215",
note = ""}
% the log(n) Minsky conjecture
@article{Lip88,
AUTHOR = "J. Lipkis",
TITLE = "Superlinear performance of a monte carlo code on the {NYU}
{Ultracomputer}",
JOURNAL = "Personal communication",
YEAR = "1988",
note = ""}
%
@article{Bar88,
AUTHOR = "J. Barton",
TITLE = "Superlinear performance of the {SAXPY} kernel on the {Silicon}
{Graphics} {4D120/GTX}",
JOURNAL = "Personal communication",
YEAR = "1988",
note = ""}
%
@article{GM85,
AUTHOR = "D. H. Grit and J. R. McGraw",
TITLE = "Programming Divide and Conquer for a {MIMD} Machine",
JOURNAL = "Software Practice and Experience",
YEAR = "1985",
month = " January",
volume = "15",
number = "1",
pages = "41-53",
note = ""}
@article{Amd67,
AUTHOR = "G. Amdahl",
TITLE = "Validity of the single processor approach to achieving large
scale computing capabilities",
JOURNAL = "Proceedings AFIPS Comp. Conf.",
YEAR = "1967",
volume = "30",
pages = "483-485",
note = ""}
-------------------------------------------------------------------------
\bibitem{Kuma88c}
Kumar, Vipin, Ramesh, K., and Nageshwara Rao, V.,
``Parallel Best-First Search of State-Space Graphs: A Summary of Results'',
{\em AAAI 1988}: 122-127.
\bibitem{Kuma89a}
Kumar, Vipin, and Nageshwara Rao, V.,
``Load Balancing on the Hypercube Architecture'',
{\em Proc. 4th Conf. on Hypercubes, Concurrent Computers and Applications},
March 1989.
\bibitem{Nels90b}
Nelson, Peter C., and Toptsis, Anestis A.,
``Artificial Intelligence, Parallel Processing, and Bidirectional
Heuristic Search'',
Submitted.
\bibitem{Nels90c}
Nelson, Peter C., and Toptsis, Anestis A.,
``Superlinear Speedup in Multiprocessor Bidirectional Heuristic
State-Space Search'',
Submitted.
\bibitem{Nels90d}
Nelson, Peter C., and Toptsis, Anestis A.,
``Wave-Shaping in Parallel Bidirectional Heuristic State-Space Search'',
Submitted.
\bibitem{Rao87a}
Nageshwara Rao, V., Kumar, Vipin, and Ramesh, K.,
``A Parallel Implementation of Iterative-Deepening-A*'',
{\em AAAI 1987}: 178-182.
\bibitem{Rao87d}
Nageshwara Rao, V., Kumar, Vipin, and Ramesh, K.,
``Parallel Heuristic Search on a Shared Memory Multiprocessor'',
{\em Tech. Report AI Lab TR-87-45},
Univ. of Texas at Austin, January 1987.
\bibitem{Rao88b}
Nageshwara Rao, V., and Kumar, Vipin,
``Concurrent Insertions and Deletions in a Priority Queue'',
{\em 1988 Inter. Conf. on Parallel Processing}, Vol. 3 (1988): 207-210.
-------------------------------------------------------------------------
In the CONPAR 90 - VAPP IV proceedings you'll find:
M. Cosnard and J.-L. Philippe,
"Achieving Superlinear Speedups for the Multiple Polynomial Quadratic
Sieve Factoring Algorithm on a Distributed Memory Multiprocessor",
pp. 863-874.
The proceedings have been published by Springer-Verlag, Germany, and
may be ordered through bookshops or, with a 30% reduction, directly from:
**********************************************************
* Institut fuer Informatik | phone : +41 61 321 99 67    *
* Mittlere Strasse 142     | fax   : +41 61 321 99 15    *
* 4000 Basel               | email : waser@urz.unibas.ch *
* Switzerland              |                             *
**********************************************************
-------------------------------------------------------------------------
Li & Wah "How to cope with anomalies in parallel approximate
branch-andf-bound search". AAAI-84 (National Conf. on AI).
McBurney & Sleep "Transputer based experimanets with the ZAPP
architecture" In PARLE-87 (Parallel Architectures and Languages,
Europe) conference, Springer LNCS 258.
The former is a theoretical analysis, the latter gives some
practical results. In both cases the superlinear is speedup is
due to a single processor doing a long search of a tree with no
solutions, while in a multiprocessor system this tree is
searched in parallel with another in which a solution is easily
found.
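A toy model of that effect (the node counts are hypothetical; the point
is the shape of the anomaly, not the numbers):

    # Sequential depth-first search exhausts the barren subtree before
    # trying the one that contains the solution.
    BARREN = 1_000_000  # nodes in the subtree with no solution
    DEPTH = 10          # nodes expanded before the solution is reached

    seq_nodes = BARREN + DEPTH  # one processor: barren subtree first
    par_nodes = DEPTH           # two processors: one subtree each; the
                                # search stops when either finds a solution

    print(seq_nodes / par_nodes)  # ~100001x "speedup" on only 2 processors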
-------------------------------------------------------------------------
David P. Helmbold and Charles E. McDowell, "Modeling Speedup (n) Greater
than n", IEEE Trans. on Parallel and Distributed Systems, Vol. 1, No. 2,
April 1990, pp. 250-256.