charlie@jupiter.ucsc.edu (Charlie McDowell) (08/22/88)
Over the past several years I recall having read or heard of several instances of "superlinear" speedup using parallel processors. If you know of any instances of "superlinear" performance measures on a parallel machine, I would appreciate hearing about them. They can be either short unpublished results or pointers to articles in which the results are mentioned. Please respond by email to charlie@jupiter.ucsc.edu.

Charlie McDowell
Computer and Information Sciences
University of California, Santa Cruz
(408) 429-4772
aboulang@BBN.COM (Albert Boulanger) (08/24/88)
See the two short communications in Parallel Computing 3 (1986): "Superlinear speedup of an efficient sequential algorithm is not possible", V. Faber et al., pp. 259-260, and "Parallel efficiency can be greater than unity", D. Parkinson, pp. 261-262.

Albert Boulanger
BBN Labs Inc.
ABoulanger@bbn.com (arpa)
Phone: (617) 873-3891
gusciora@sam.cs.cmu.edu (George Gusciora) (12/12/88)
Why is everyone so surprised about superlinear speedup? Sometimes an n-processor machine is simply better suited to a particular algorithm than the same machine with fewer than n processors.

Example: I've got a neural network program running on the Warp (a 10-processor MIMD machine). The algorithm basically uses 9 cells to perform the error propagation and one cell to update the connections. Each of the 9 cells achieves about 8 MFLOPS and the 10th cell is badly underutilized (giving a total performance of about 72 MFLOPS out of a possible 100). Now, if I ran this on a 5-cell Warp, only 4 cells would be performing error propagation (instead of 9) and I'd get only about 32 MFLOPS out of a possible 50. The net result is that the program runs more than twice as fast with just twice as many processors. Superlinear speedup. Several programs behave like this.

That was a simple example. It's easy to say "I'm really comparing a 9-processor machine with a 4-processor machine (because the last processor is so underutilized), so a 9/4 speedup is expected." So what? Nearly every algorithm on any machine wastes machine cycles; I picked the neural network example just to illustrate this. Superlinear speedup is only impossible when both machines are running at full 100 percent utilization.

- George Gusciora
Carnegie Mellon University
(412) 268-7553
gusciora@sam.cs.cmu.edu
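[George's arithmetic, as a minimal sketch. Python is used purely for illustration; the cell counts and the 8-MFLOPS-per-busy-cell figure come from his post, and warp_throughput is a hypothetical helper, not Warp code.]

    def warp_throughput(cells, mflops_per_busy_cell=8.0):
        # One cell updates the connections; the remaining cells do the
        # error propagation at roughly 8 MFLOPS each.
        return (cells - 1) * mflops_per_busy_cell

    ten_cell  = warp_throughput(10)  # 9 * 8 = 72 MFLOPS (of a possible 100)
    five_cell = warp_throughput(5)   # 4 * 8 = 32 MFLOPS (of a possible 50)
    print(ten_cell / five_cell)      # 2.25 > 2: superlinear from 5 to 10 cells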
anand@amax.npac.syr.edu (Anand Rangachari) (12/12/88)
This may seem obvious, but in order to measure the speedup of a program it is necessary to be able to run the given program with differing numbers of processors. I have seen some programs developed on the Encore Multimax here and have been appalled to find that the number of processors used is hardwired in; i.e., nothing less than major surgery would be needed to do a proper speedup measurement.

Also, I note that when the Sandia group achieved their prodigious speedups, they were unable to measure the single-processor case. Two reasons were mentioned for this. First, with a speedup of, say, 500 on 1000 processors, a program taking 10 minutes to run would need 5000 minutes on a single processor. Second, it may not be possible to fit the entire problem on a single processor, in the case of distributed machines with limited memory per processor.

R. Anand
Internet: anand@amax.npac.syr.edu
Bitnet: ranand@sunrise
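[A minimal sketch of the kind of program the post asks for, with the processor count a command-line parameter rather than hardwired. Python and the toy kernel are illustrative assumptions, not the Encore code.]

    import sys
    import time
    from multiprocessing import Pool

    def kernel(chunk):
        # Stand-in for the real computation.
        return sum(i * i for i in chunk)

    if __name__ == "__main__":
        nprocs = int(sys.argv[1])  # run with 1, 2, 4, ... for a speedup curve
        n = 10_000_000
        chunks = [range(i, n, nprocs) for i in range(nprocs)]
        t0 = time.perf_counter()
        with Pool(nprocs) as pool:
            total = sum(pool.map(kernel, chunks))
        print(nprocs, "processors:", time.perf_counter() - t0, "s, result", total)

The Sandia baseline, by contrast, is infeasible by simple arithmetic: a 10-minute run at a speedup of 500 implies roughly 500 x 10 = 5000 minutes on one processor.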
eugene@eos.arc.nasa.gov (Eugene Miya) (12/13/88)
In article <3841@hubcap.UUCP> gusciora@sam.cs.cmu.edu (George Gusciora) writes:
>Why is everyone so surprised about superlinear speedup?

The above coming from a person at the university where it was originally written up. "Speedup," let's face it, is a dumb measure. Users, given less than half a chance, would not care about the n-1 cases leading up to maximum performance; they only care about the nth, highest-performance case. Only computer scientists care immediately about the progression. A couple of years ago (I can't remember the paper) someone wrote that computists don't induct: we don't completely study all there is to know about 2 processors before moving to 3 processors, etc. We don't learn a whole which is transferable between cases.

George also noted the Warp. The best I've heard for a 10-processor system is 85 MFLOPS on a matrix multiply (at CMU). The problem isn't parallelism, it's the sequential stuff (within and without the "parallel streams").

Another gross generalization from

--eugene miya, NASA Ames Research Center, eugene@aurora.arc.nasa.gov
resident cynic at the Rock of Ages Home for Retired Hackers:
"Mailers?! HA!", "If my mail does not reach you, please accept my apology."
{uunet,hplabs,ncar,decwrl,allegra,tektronix}!ames!aurora!eugene
"Send mail, avoid follow-ups. If enough, I'll summarize."

If you want a cute little exercise, work a week without using the phrase "parallel processing." Then excise the other common terms: massive parallelism, multiprocess, multitask, etc. Have fun. You will learn something.
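[A compact way to state Eugene's "sequential stuff" point is the bound usually credited to Amdahl, whose 1967 paper appears in the reference lists later in this thread. With f the fraction of the work that can be parallelized and p processors:]

    \[ S(p) \;=\; \frac{1}{(1-f) + f/p} \;\le\; \frac{1}{1-f} \]

so the serial fraction, not the processor count, caps the achievable speedup.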
toulouse@iro.umontreal.ca (Michel Toulouse) (11/08/90)
Does anybody have references to papers that claim theoretical or empirical superlinear speedup (speedup greater than the number of processors) for a parallel algorithm or an experiment on a parallel machine? I am collecting such references; once I have a good core of them, I will post the list to this newsgroup. Thanks.

Michel Toulouse
Universite de Montreal
email: toulouse@iro.umontreal.ca
toulouse@iro.umontreal.ca (Michel Toulouse) (11/21/90)
Two weeks ago, I sent a message to this newsgroup asking for references to papers claiming superlinear speedup. Since then I have received many references and have also enjoyed reading some very interesting comments. My first plan was to take a close look at every paper and add my own comments, but that may take a long time, and many of those who sent me references and comments expressed interest in receiving the list. So I am posting it now, with this one comment.

The traditional argument against superlinear speedup is that if a process Q runs on p processors for time t, then there is a sequential machine that will simulate Q in at most kpt + k' time steps, for constants k and k'; so the speedup is by a factor of p only. (This argument is restated compactly just below.) As was mentioned in many comments, there are superlinear results based on advantages like more physical memory in the parallel machine, or obtained by heuristically guided search. In my view these are not convincing arguments against the traditional scepticism. The first case is clearly not convincing, since you can always add more memory to the sequential machine. In the second case, if you partition the search space equally among the p processors, you can achieve very big speedups (as mentioned in one of the comments I received), but from a theoretical point of view I think this is meaningless, since the speedup depends only on factors like the partition of the search space, the heuristic used, your random function, and so on.

However, with regard to heuristics in search spaces, this paper: "The performance of cooperative processes", B. A. Huberman, Physica D 42 (1990), pp. 38-47, claims superlinear speedup based not on partitioning the search space but on "hints" that cooperating processes can use to eliminate uninteresting subspaces. The paper gives a formal explanation of this possible superlinear speedup. Whether it is a counterargument to the argument above, I don't know yet. Volume 42 of Physica D concentrates on nonlinear phenomena; emergent computation is treated extensively, and I found many interesting papers: one giving a complete characterization of Turing machines with a restricted number of states (don't worry, you can still teach that the halting problem is undecidable), and several on neural computation. I think this Physica D survey of emergent computation is very serious work (as is almost always the case in this journal). At least for me, this issue introduces an approach to parallel computation that might challenge the traditional argument more seriously. As I said before, I haven't had time to look at every paper; I'm anxious to find challengers to the traditional argument.

I thank everyone who sent me references and comments, and I'm still very interested in receiving comments on this subject or related ones like emergent computation.

Michel Toulouse
Universite de Montreal
email: toulouse@iro.umontreal.ca
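[The traditional simulation argument, restated in symbols. This is a standard sketch, not taken from any one of the papers below; T_p is the parallel time on p processors, and k, k' are the constants from the note above, with a single processor simulating the p instruction streams round-robin:]

    \[ T_1 \;\le\; k\,p\,T_p + k'
       \quad\Longrightarrow\quad
       S(p) \;=\; \frac{T_1}{T_p} \;\le\; k\,p + \frac{k'}{T_p} \;=\; O(p) \]

Any claimed speedup beyond a constant factor of p must therefore come from something the simulation abstracts away, such as total memory or cache, the order in which a search space is explored, or reduced OS overhead.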
-------------------------------------------------------------------------
@inproceedings{RAO88fsttcs, AUTHOR = "Nageshwara Rao, V. and Kumar, Vipin", TITLE = "Superlinear Speedup in State-Space Search", BOOKTITLE = "Proceedings of the 1988 Foundations of Software Technology and Theoretical Computer Science", YEAR = "1988", MONTH = "December", NOTE = "Lecture Notes in Computer Science 338, Springer-Verlag" }
@techreport{RAO90suplin, AUTHOR = "V. Nageshwara Rao and Vipin Kumar", TITLE = "On the Efficiency of Parallel Depth-First Search", YEAR = "1990", NUMBER = "89-41", INSTITUTION = "Computer Science Department, University of Minnesota", NOTE = "Submitted for publication" }
-------------------------------------------------------------------------
@article{lihudak89, AUTHOR = "Li, K. and Hudak, P.", TITLE = "Memory Coherence in Shared Virtual Memory Systems", JOURNAL = "ACM Transactions on Computer Systems", VOLUME = "7", NUMBER = "4", MONTH = "November", YEAR = "1989", PAGES = "321-359" }
@phdthesis{liphd, AUTHOR = "Li, K.", TITLE = "Shared Virtual Memory on Loosely Coupled Multiprocessors", SCHOOL = "Yale University, Department of Computer Science", YEAR = "1986" }
-------------------------------------------------------------------------
Charlie McDowell and I (David Helmbold) have a paper in IEEE Transactions on Parallel and Distributed Systems, Vol. 1, No. 2 (April 1990), titled "Modeling Speedup(n) Greater than n". Our references include:

@techreport{sandia, AUTHOR = "Benner, R. E. and J. L. Gustafson and R. E. Montry", TITLE = "Development and analysis of scientific application programs on a 1024-processor hypercube", INSTITUTION = "Sandia National Laboratories", YEAR = "1988", MONTH = "February", NUMBER = "SAND 88-0317", ADDRESS = "Albuquerque, N.M." }
@article{Gus88, AUTHOR = "Gustafson, J. L.", TITLE = "Reevaluating {A}mdahl's Law", JOURNAL = "CACM", YEAR = "1988", VOLUME = "31", NUMBER = "5", PAGES = "532-533", MONTH = "May" }
@article{EZL89, AUTHOR = "D. L. Eager and J. Zahorjan and E. D. Lazowska", TITLE = "Speedup versus Efficiency in Parallel Systems", JOURNAL = "IEEE Trans. on Comp.", YEAR = "1989", VOLUME = "38", NUMBER = "3", PAGES = "408-423", MONTH = "March" }
@inproceedings{SF89, AUTHOR = "C. B. Stunkel and W. K. Fuchs", TITLE = "Analysis of Hypercube cache performance using address traces generated by {TRAPEDS}", BOOKTITLE = "Proc. 1989 International Conference on Parallel Processing", YEAR = "1989", PAGES = "I-33--I-40" }
@article{LS84, AUTHOR = "T. Lai and S. Sahni", TITLE = "Anomalies in parallel branch-and-bound algorithms", JOURNAL = "CACM", YEAR = "1984", VOLUME = "27", NUMBER = "6", PAGES = "594-602", MONTH = "June" } % Ohio State and Univ of Minnesota
@article{Wei82, AUTHOR = "B. W. Weide", TITLE = "Modeling unusual behavior of parallel algorithms", JOURNAL = "IEEE Trans. on Computers", YEAR = "1982", VOLUME = "C-31", NUMBER = "11", PAGES = "1126-1130", MONTH = "November" } % Ohio State
@article{LW86, AUTHOR = "G.-J. Li and B. W. Wah", TITLE = "Coping with anomalies in parallel branch-and-bound algorithms", JOURNAL = "IEEE Trans. on Comp.", YEAR = "1986", VOLUME = "C-35", NUMBER = "6", PAGES = "568-573", MONTH = "June" } % was Purdue, now Univ of Illinois
@inproceedings{PH88, AUTHOR = "B. R. Preiss and V. C. Hamacher", TITLE = "Semi-static dataflow", BOOKTITLE = "Proc. 1988 Inter. Conf. on Parallel Processing", YEAR = "1988", PAGES = "127-134" } % Univ of Waterloo and Univ of Toronto
@article{FLW86, AUTHOR = "V. Faber and O. M. Lubeck and A. B. White Jr.", TITLE = "Superlinear speedup of an efficient sequential algorithm is not possible", JOURNAL = "Parallel Computing", YEAR = "1986", VOLUME = "3", PAGES = "259-260" } % LANL
@article{Par86, AUTHOR = "D. Parkinson", TITLE = "Parallel efficiency can be greater than unity", JOURNAL = "Parallel Computing", YEAR = "1986", VOLUME = "3", PAGES = "261-262" } % DAP support unit, Queen Mary College, London
@book{GSS87, AUTHOR = "E. F. Gehringer and D. P. Siewiorek and Z. Segall", TITLE = "Parallel Processing: the {Cm}* Experience", PUBLISHER = "Digital Press", PAGES = "227-231", YEAR = "1987" }
@inproceedings{Li88, AUTHOR = "K. Li", TITLE = "{IVY}: a shared virtual memory system for parallel computing", BOOKTITLE = "Proc. 1988 Inter. Conf. on Parallel Processing", YEAR = "1988", PAGES = "94-101" } % Princeton
@inproceedings{MG85, AUTHOR = "R. Mehrotra and E. F. Gehringer", TITLE = "Superlinear speedup through randomized algorithms", BOOKTITLE = "Proc. 1985 Inter. Conf. on Parallel Processing", YEAR = "1985", PAGES = "291-300" } % North Carolina State
@article{Kor82, AUTHOR = "W. A. Kornfeld", TITLE = "Combinatorially implosive algorithms", JOURNAL = "CACM", YEAR = "1982", VOLUME = "25", NUMBER = "10", PAGES = "734-738", MONTH = "October" } % MIT
@techreport{Fla87, AUTHOR = "H. P. Flatt", TITLE = "Further comments on a model for parallel processing", INSTITUTION = "IBM PASC", NUMBER = "G320-3503", YEAR = "1987" }
@article{Miy88, AUTHOR = "E. Miya", TITLE = "Suggestion on Superlinear speed up terminology", JOURNAL = "Network news posting", YEAR = "1988", VOLUME = "comp.parallel", NUMBER = "318", MONTH = "December" } % NASA Ames
@inproceedings{Smi81, AUTHOR = "B. J. Smith", TITLE = "Architecture and applications of the {HEP} multiprocessor computer", BOOKTITLE = "Real-Time Signal Processing IV", YEAR = "1981", VOLUME = "298", PAGES = "241-248" }
@article{San86, AUTHOR = "J. Sanguinetti", TITLE = "Performance of a message-based multiprocessor", JOURNAL = "Computer", YEAR = "1986", VOLUME = "19", NUMBER = "9", MONTH = "September", PAGES = "47-55" } % just slightly superunity on an {ELXSI}, due to reduced {OS} overhead
@article{Min70, AUTHOR = "M. Minsky", TITLE = "Form and Content in Computer Science ({ACM Turing Lecture})", JOURNAL = "J. ACM", YEAR = "1970", VOLUME = "17", PAGES = "197-215" } % the log(n) Minsky conjecture
@article{Lip88, AUTHOR = "J. Lipkis", TITLE = "Superlinear performance of a Monte Carlo code on the {NYU} {Ultracomputer}", JOURNAL = "Personal communication", YEAR = "1988" }
@article{Bar88, AUTHOR = "J. Barton", TITLE = "Superlinear performance of the {SAXPY} kernel on the {Silicon} {Graphics} {4D120/GTX}", JOURNAL = "Personal communication", YEAR = "1988" }
@article{GM85, AUTHOR = "D. H. Grit and J. R. McGraw", TITLE = "Programming Divide and Conquer for a {MIMD} Machine", JOURNAL = "Software Practice and Experience", YEAR = "1985", MONTH = "January", VOLUME = "15", NUMBER = "1", PAGES = "41-53" }
@article{Amd67, AUTHOR = "G. Amdahl", TITLE = "Validity of the single processor approach to achieving large scale computing capabilities", JOURNAL = "Proceedings AFIPS Comp. Conf.", YEAR = "1967", VOLUME = "30", PAGES = "483-485" }
-------------------------------------------------------------------------
\bibitem{Kuma88c} Kumar, Vipin, Ramesh, K., and Nageshwara Rao, V., ``Parallel Best-First Search of State-Space Graphs: A Summary of Results'', {\em AAAI 1988}: 122-127.
\bibitem{Kuma89a} Kumar, Vipin, and Nageshwara Rao, V., ``Load Balancing on the Hypercube Architecture'', {\em Proc. 4th Conf. on Hypercubes, Concurrent Computers and Applications}, March 1989.
\bibitem{Nels90b} Nelson, Peter C., and Toptsis, Anestis A., ``Artificial Intelligence, Parallel Processing, and Bidirectional Heuristic Search'', submitted for publication.
\bibitem{Nels90c} Nelson, Peter C., and Toptsis, Anestis A., ``Superlinear Speedup in Multiprocessor Bidirectional Heuristic State-Space Search'', submitted for publication.
\bibitem{Nels90d} Nelson, Peter C., and Toptsis, Anestis A., ``Wave-Shaping in Parallel Bidirectional Heuristic State-Space Search'', submitted for publication.
\bibitem{Rao87a} Nageshwara Rao, V., Kumar, Vipin, and Ramesh, K., ``A Parallel Implementation of Iterative-Deepening-A*'', {\em AAAI 1987}: 178-182.
\bibitem{Rao87d} Nageshwara Rao, V., Kumar, Vipin, and Ramesh, K., ``Parallel Heuristic Search on a Shared Memory Multiprocessor'', {\em Tech. Report AI Lab TR-87-45}, Univ. of Texas at Austin, January 1987.
\bibitem{Rao88b} Nageshwara Rao, V., and Kumar, Vipin, ``Concurrent Insertions and Deletions in a Priority Queue'', {\em 1988 Inter. Conf. on Parallel Processing}, Vol. 3 (1988): 207-210.
-------------------------------------------------------------------------
In the Conpar90-Vapp 4 proceedings you will find: M. Cosnard and J.-L. Philippe, "Achieving Superlinear Speedups for the Multiple Polynomial Quadratic Sieve Factoring Algorithm on a Distributed Memory Multiprocessor", pp. 863-874. The proceedings are published by Springer-Verlag, Germany, and may be ordered through bookshops or, with a 30% reduction, directly from:

Institut fuer Informatik        phone: +41 61 321 99 67
Mittlere Strasse 142            fax:   +41 61 321 99 15
4000 Basel                      email: waser@urz.unibas.ch
Switzerland
-------------------------------------------------------------------------
Li & Wah, "How to cope with anomalies in parallel approximate branch-and-bound search", AAAI-84 (National Conf. on AI).
McBurney & Sleep, "Transputer-based experiments with the ZAPP architecture", in PARLE-87 (Parallel Architectures and Languages, Europe), Springer LNCS 258.

The former is a theoretical analysis; the latter gives some practical results. In both cases the superlinear speedup is due to a single processor doing a long search of a tree with no solutions, while in a multiprocessor system this tree is searched in parallel with another in which a solution is easily found (a toy numeric sketch of this effect follows at the end of this compilation).
-------------------------------------------------------------------------
David P. Helmbold and Charles E. McDowell, "Modeling Speedup(n) Greater than n", IEEE Trans. on Parallel and Distributed Systems, Vol. 1, No. 2, April 1990, pp. 250-256.
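[A toy numeric sketch, in Python, of the search anomaly described by Li & Wah and McBurney & Sleep above; the node counts are invented purely for illustration.]

    barren_cost   = 1000  # nodes expanded in the subtree with no solution
    fruitful_cost = 10    # nodes expanded before a solution turns up elsewhere

    t_seq = barren_cost + fruitful_cost  # one processor searches the barren subtree first
    t_par = fruitful_cost                # two processors, one per subtree; stop at the first solution

    print(t_seq / t_par)  # 101.0 on 2 processors: a "superlinear" anomaly

Whether this counts as genuine superlinear speedup or as an artifact of the sequential search order is exactly the dispute running through the references above.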