kale@m.cs.uiuc.edu (L. V. Kale') (12/06/88)
Here is one more way / reason that leads to the famed superlinear speedups. We observed superlinear speedups on some simple divide-and-conquer programs running on our "Chare Kernel" system (**) on the iPSC/2 hypercubes. Further investigation showed the cause to be memory usage. In particular, the memory allocation scheme we used has the property that its performance tends to degrade somewhat as the number of allocations (with subsequent deallocations) grows: one has to search longer for a block of the right size due to fragmentation. With an increasing number of processors, the number of allocations done on each processor decreases, thus reducing the time spent in memory management. As our dynamic load balancing scheme is very good :-) we get the computation nicely distributed, and presto: superlinear speedup.

In a sense, this is simply a consequence of having more memory. But then, that is part of having more processors. The question is: having paid real dollar costs P times those of one processor (for P processors), can one get a speedup of more than P? And in this case, the answer is yes... I won't get into the controversy about whether superlinear speedups are real or not. Just reporting an observation, for now.

** "The Chare Kernel Language for Parallel Programming: A Perspective", L.V. Kale and W.W. Shu, TR UIUCDCS-R-88-1451. (If interested in receiving copies of this and related papers, write to me: L.V. Kale, Dept. of Computer Science, University of Illinois, 1304 W. Springfield Ave., Urbana, IL 61801, or kale@cs.uiuc.edu.)
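P.S. To make the memory-management effect concrete, here is a minimal sketch in C of a first-fit free-list search. It is only a toy model, not our actual allocator; the point is just that the number of blocks examined per allocation grows as the free list fragments, so fewer allocations per processor means less time spent searching.

    /* Toy first-fit allocator search -- an assumed model for
     * illustration, not the Chare Kernel's actual memory manager. */
    #include <stddef.h>

    struct block {
        size_t size;
        struct block *next;
    };

    static struct block *free_list;

    /* Return the first free block large enough for 'want'.  The
     * longer (more fragmented) the free list, the more blocks we
     * touch before finding a fit. */
    struct block *first_fit(size_t want)
    {
        struct block **prev = &free_list;
        struct block *b;

        for (b = free_list; b != NULL; prev = &b->next, b = b->next) {
            if (b->size >= want) {
                *prev = b->next;    /* unlink it and hand it out */
                return b;
            }
        }
        return NULL;                /* caller must get fresh memory */
    }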
kale@m.cs.uiuc.edu (L. V. Kale') (12/07/88)
Wait a second... There was a parsing ambiguity in my earlier posting about superlinear speedups. The Tech Report is not about such speedups, but rather about the chare kernel, a "machine independent parallel programming system". (More on that later.) The superlinear speedup in specific cases was a very recent observation while running programs on this system on an Intel hypercube with 64 processors. I have no papers documenting the superlinear speedups yet. (In fact, I am not sure there ever will be.)

L.V. Kale (kale@cs.uiuc.edu)
franco%alberta.uucp@RELAY.CS.NET (Franco Carlacci) (12/07/88)
You can also try R. Janßen, "A note on superlinear speedup", Parallel Computing 4 (1987) 211-213.

franco

$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$
$ ob. quote.                                                            $
$ Franco Carlacci                " never spit in the well,              $
$ Dept. of Computing Science       you may have to drink from           $
$ U. of. Alberta                   it one day"                          $
$ franco@alberta.uucp or alberta!franco                                 $
$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$
dfk@romeo.cs.duke.edu (David F. Kotz) (12/07/88)
In article <3755@hubcap.UUCP>, robison@m.cs.uiuc.edu (Arch Robison) writes:
> I'm looking for references (either pro or con) on the following claim:
>
>     A p-processor parallel processor can never exhibit speedup
>     of greater than p.

Try these, all short:

D. Parkinson, "Parallel Efficiency can be greater than unity", Parallel Computing 3 (1986) 261-262.

V. Faber, Olaf M. Lubeck and Andrew B. White, Jr., "Superlinear speedup of an efficient sequential algorithm is not possible", Parallel Computing 3 (1986) 259-260.

"Isomorphic Computers, Ltd", International Journal of Parallel Programming, Vol 16, No 2, 1987, pp 179-182.

David Kotz
Department of Computer Science, Duke University, Durham, NC 27706
ARPA: dfk@cs.duke.edu   CSNET: dfk@duke   UUCP: decvax!duke!dfk
mmh@uunet.UU.NET (Matthew Huntbach) (12/09/88)
In article <3755@hubcap.UUCP>, robison@m.cs.uiuc.edu (Arch Robison) writes:
> I'm looking for references (either pro or con) on the following claim:
>
>     A p-processor parallel processor can never exhibit speedup
>     of greater than p.

Depends on the algorithm. In parallel heuristic search superlinear speedup can happen easily. The reference I have to hand is:

G-J. Li and B.W. Wah. How to cope with anomalies in parallel approximate branch-and-bound algorithms. In National Conference on Artificial Intelligence (AAAI-84), pp. 212-215.

though I think there was something in a recent CACM, possibly by the same authors.

As a simple demonstration, suppose you are searching a tree with one branch which your heuristic tells you is promising, but which after a lot of computation (say 11 time units) does not yield a result, and another branch which your heuristic tells you is less promising but which does in fact yield a result after a small amount of computation (say 1 time unit).

On a single processor you could search the whole large branch first before turning to the small branch and finding your solution: total time 12 units.

With two processors you could assign one branch to each processor. Let us assume there is a time delay of one unit shipping the less promising branch to a remote processor, and a time delay of another unit getting the result back. You would still get your result in 3 units of time.

So 2 processors give you a speedup of 4. Q.E.D.
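A few lines of C make the arithmetic explicit; the costs are just the assumed unit figures from the example above, not measurements:

    /* Timing model for the two-branch example.  All costs are the
     * assumed unit figures from the text. */
    #include <stdio.h>

    int main(void)
    {
        int big = 11, small = 1;        /* branch search costs       */
        int ship = 1, reply = 1;        /* messages to/from proc 2   */

        int seq = big + small;          /* heuristic tries big first */
        int par = ship + small + reply; /* small branch wins remotely,
                                           while proc 1 is still deep
                                           in the big branch */

        printf("sequential: %d units\n", seq);          /* 12 */
        printf("parallel:   %d units\n", par);          /* 3  */
        printf("speedup:    %g\n", (double)seq / par);  /* 4  */
        return 0;
    }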
halldors@paul.rutgers.edu (Magnus M Halldorsson) (12/10/88)
In article <3801@hubcap.UUCP> mcvax!ivax.doc.imperial.ac.uk!mmh@uunet.UU.NET (Matthew Huntbach) writes:
> Depends on the algorithm. In parallel heuristic search superlinear
> speedup can happen easily.
> ...
> As a simple demonstration, suppose you are searching a tree with one
> branch which your heuristic tells you is promising, but which after a
> lot of computation (say 11 time units) does not yield a result, and
> another branch which your heuristic tells you is less promising but
> which does in fact yield a result after a small amount of computation
> (say 1 time unit).
>
> On a single processor you could search the whole large branch first
> before turning to the small branch and finding your solution: total
> time 12 units.

But you can always simulate the parallel processors by traversing the tree breadth-first. That will require a queue of suspended branches and extra memory, but no more than the memory used by all the parallel processors combined. For this algorithm, the parallel version would be an example of AND-parallelism, while for the standard sequential algorithm, it would be an example of OR-parallelism.

The moral is, you can only get superlinear speedup if the corresponding sequential algorithm you're comparing with is suboptimal.

Magnus
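P.S. A sketch of the simulation in C: interleave the p branch searches on one processor, round-robin, one unit of work at a time. (step() is a made-up hook standing for "advance branch i by one unit and report whether it found a solution"; the suspended branch states hidden behind it are the extra memory.) The sequential time is then at most p times the parallel time, so the speedup can never exceed p.

    /* One processor simulating p depth-first searches by time
     * slicing.  step() is hypothetical: it advances branch i by one
     * unit of work and returns nonzero on finding a solution. */
    extern int step(int branch);

    int simulate(int p)
    {
        int t = 0;
        for (;;) {
            int i;
            for (i = 0; i < p; i++) {
                t++;                /* one unit of sequential time */
                if (step(i))
                    return t;       /* at most p * parallel time   */
            }
        }
    }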
vera@portia.stanford.edu (James Vera) (12/10/88)
In article <3801@hubcap.UUCP> mcvax!ivax.doc.imperial.ac.uk!mmh@uunet.UU.NET (Matthew Huntbach) writes:
>In article <3755@hubcap.UUCP>, robison@m.cs.uiuc.edu (Arch Robison) writes:
>> I'm looking for references (either pro or con) on the following claim:
>>
>>     A p-processor parallel processor can never exhibit speedup
>>     of greater than p.
>
>Depends on the algorithm. In parallel heuristic search superlinear
>speedup can happen easily. The reference I have to hand is:
>[...]
>As a simple demonstration, suppose you are searching a tree with one
>branch which your heuristic tells you is promising, but which after a
>lot of computation (say 11 time units) does not yield a result, and
>another branch which your heuristic tells you is less promising but
>which does in fact yield a result after a small amount of computation
>(say 1 time unit).
>
>On a single processor you could search the whole large branch first
>before turning to the small branch and finding your solution: total
>time 12 units.
>
>With two processors you could assign one branch to each processor.
>Let us assume there is a time delay of one unit shipping the less
>promising branch to a remote processor, and a time delay of another
>unit getting the result back. You would still get your result in 3
>units of time.
>
>So 2 processors give you a speedup of 4. Q.E.D.

One has to be careful not to compare apples to oranges. Here you are comparing two different algorithms. Just because you run the same program on each of your many processors as you ran on your uniprocessor does not mean that your global algorithm is consistent. In the case described above, the uniprocessor is running an algorithm that says "exhaustively search a branch until you find an answer or it fails to yield a result", while the multiprocessor is running an algorithm that says "search N (the number of processors) branches simultaneously until you find an answer, replacing branches that fail to yield a result with new branches."

To realistically compare the performance increase of the multiprocessor over the uniprocessor you should run the second algorithm on the uniprocessor. This could be done by specifying N and looping over N branches one level at a time.

--
James S. Vera      | Internet                 | Standard Disclaimers
Stanford University| vera@portia.stanford.edu | Blah Blah Blah Blah
Bellcore           | vera2%mruxb@bellcore.arpa| vvv My Cutesy Quote vvv
"When I was young it seemed that life was so wonderful..." - Supertramp
gld@cunixd.cc.columbia.edu (Gary L Dare) (12/10/88)
In article <3801@hubcap.UUCP> mmh@ivax.doc.imperial.ac.uk wrote:
>As a simple demonstration, ....
[had to edit out excessive included text]
>So 2 processors give you a speedup of 4. Q.E.D.

Wait! You still have to search your promising branch thoroughly on the first processor; that's 11 steps, not 1. The fastest turnaround time, if you have each branch ALREADY residing on separate processors, is 11 if they are searched simultaneously; master-slave, it's 13.
--
~~~~~~~~~~~~~~~~~~~~~~~~ je me souviens ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Gary L. Dare
> gld@eevlsi.ee.columbia.EDU
> gld@cunixd.cc.columbia.EDU      "Free Canada, Trade Mulroney!"
> gld@cunixc.BITNET
mgv@uunet.UU.NET (Marco Valtorta) (12/10/88)
If you search breadth-first, instead of depth-first, you will return a solution in 2 units of time on a serial machine. Should I conclude that the speedup in parallel search is always less than 1? :-)

Marco Valtorta                     usenet: ...!ncrae!usceast!mgv
Department of Computer Science     csnet:  mgv@cs.scarolina.edu
University of South Carolina       tel.:   (1)(803)777-4641
Columbia, SC 29208                 tlx:    805038 USC
U.S.A.                             fax:    (1)(803)777-3065
usenet from Europe: ...!mcvax!uunet!ncrlnk!ncrae!usceast!mgv
mmh@uunet.uu.net (Matthew Huntbach) (12/13/88)
What I was assuming was that you have an algorithm where any solution is acceptable and, on the whole, a depth-first search leads to solutions quickest. It just happens that in this particular example the heuristics lead you down the wrong path. I assumed the two processors still searched depth-first, so a completely breadth-first search would not give a quicker solution.

I agree that bringing in parallelism could be said to have altered the algorithm by introducing some breadth-first search into what was a purely depth-first search. But this is what is generally meant by the term "superlinear speedup".
sru@uunet.UU.NET (Steve Rush) (12/14/88)
There has been a lot of discussion recently about superlinear speedup, so I thought I would add my personal experiences.

In article <3756@hubcap.UUCP>, kale@m.cs.uiuc.edu (L. V. Kale') writes:
> Here is one more way / reason that leads to the famed superlinear speedups.

The Background: I am doing work on graphics algorithms using Transputers, and one of my algorithms is for rotating images which are stored using 'stripcodes' ('objects' are defined using the coordinates of the left end, along with a colour and a length). The algorithm rotates the image about a user-specified point and then regenerates the new strips. To parallelize this, the screen is split into horizontal bands, with one band for each processor. For efficiency, each processor has an array of linked lists, one for each scanline of the image.

Results: This program took more than 4 times as long to run on one processor as it did on 4 processors (that is the definition of superlinear speedup, isn't it?). The reason for this (I had to explain it to my supervisor somehow) is that the newly generated strips had to be added to the correct linked list and merged with the strips already there. In the single-processor system this required searching through the whole list, while in the 4-processor system each processor had a smaller list to search. The time spent at the end of the algorithm sending the strips back to the correct processors, plus the time to merge the strips, came out less than the time that was saved by inserting the strips into the shorter lists.

Although you can always claim that my single-processor version was not efficient, the principle still holds (and I claim that the program WAS efficient for a single processor). The only complaints I can think of are:

"A linked list is not efficient." Well, I can't think of anything better that wouldn't still show the behaviour I have described.

OR "You could have the 4 linked lists per scanline on the single processor." This is cheating by running different algorithms on the two systems; in that case I should have 4 lists per scanline on every processor, which takes you back to the case above.

Has anyone got any other comments? Have I missed something obvious? I would be interested if anyone has anything to say. (P.S. Sorry about the length of this posting.)

Steve R
---
Steve Rush                  JANET: sru@uk.ac.exeter.cs
Computer Science Dept.      UUCP:  ukc!expya!sru
University of Exeter
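P.P.S. For what it's worth, the insertion that dominates the time looks roughly like this in C (a sketch with made-up field names; the real strips carry a colour as well, and the merging of adjacent strips is omitted):

    /* Insert a strip into one scanline's list, kept sorted by the
     * left x coordinate.  The scan below is the cost that shrinks
     * when each processor owns only a band of the screen: a quarter
     * of the strips means a quarter of the nodes walked, on average. */
    struct strip {
        int x, len;                  /* left end and length */
        struct strip *next;
    };

    void insert_strip(struct strip **line, struct strip *s)
    {
        while (*line != NULL && (*line)->x < s->x)
            line = &(*line)->next;   /* O(list length) search */
        s->next = *line;
        *line = s;                   /* merge step omitted */
    }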
vera@portia.stanford.edu (James Vera) (12/15/88)
In article <3881@hubcap.UUCP> mcvax!cs.exeter.ac.uk!sru@uunet.UU.NET (Steve Rush) writes:
> [...]
> OR "You could have the 4 linked lists per scanline on the single
> processor." This is cheating by running different algorithms on the
> two systems; in that case I should have 4 lists per scanline on every
> processor, which takes you back to the case above.
> [...]

No, this would not be cheating. It is you who is running different algorithms. In the multiprocessor case you are running an algorithm which says "search 4 linked lists simultaneously", and in the uniprocessor case you are running a different global algorithm, which says "search 1 linked list".

It does not seem possible that n processors could run faster than n times one processor for a given global algorithm. One need only time slice the runs of the n processors on the 1 processor, ignoring the very real effects of cache effectiveness, etc...

--
James S. Vera      | Internet                 | Standard Disclaimers
Stanford University| vera@portia.stanford.edu | Blah Blah Blah Blah
Bellcore           | vera2%mruxb@bellcore.arpa| vvv My Cutesy Quote vvv
"When I was young it seemed that life was so wonderful..." - Supertramp
bzs@EDDIE.MIT.EDU (Barry Shein) (12/19/88)
I think it's unfortunate to immediately toss out non-CPU effects; that often tosses out the baby with the bathwater.

A while back, using parallel make on an Encore Multimax, I observed a 50X speedup over doing exactly the same build on a Vax 750. Sheer processor comparison only predicted about a 15X speedup (i.e., a 6-CPU NS32332 Encore, about 12 MIPS, vs a 750, about .75 MIPS). I think you would find this sort of result on any decent MIMD (both of them :-)

I suppose you can argue that this is somehow cheating (?), but watching a compile finish in 2 minutes instead of 100 minutes impressed me. (This was before I worked at Encore, and I really had to build the 100+ C-module system on both machines for practical reasons; when I saw the time difference I ran it again on both systems several times to see if it was just an accident. Nope, repeatedly about a 50X difference.) It is an unfortunate theory that dismisses that extra 3X speedup, no? The whole system is parallelized, not just the ALUs (see the next paragraph as to why I say ALUs and not CPUs).

As to the breadth-first problem, where one might simulate this on a uniprocessor, again we run into a practical problem. If the search runs depth-first on N processors, then the only way you can simulate it on a uni is by saving and restoring a fair amount of context, or by using a processor and algorithm which, for example, keeps a lot of state in registers on the uni, probably requiring dozens of registers to be available, which a parallel system couldn't take advantage of in the same context. On several CPUs one does have dozens of registers available. Some of this has to be taken into account.

So, in short, although it sounds good, I suspect that these super-linear speedups will be impossible to shoot down on a similar uniprocessor, in practice. I realize that's unsatisfying to those who like to ignore reality, but it seems such practicalities are what people trying to do real computations are concerned with.

I'm always astounded at the energy people will put into "theoretical" arguments that parallel processing doesn't work (or doesn't work very well, or is too complicated to utilize) while people around them are sitting there making it work just fine.

	-Barry Shein, ||Encore||
aglew@mcdurb.Urbana.Gould.COM (12/19/88)
>It does not seem possible that n processors could run faster than n
>times one processor for a given global algorithm. One need only time
>slice the runs of the n processors on the 1 processor, ignoring the
>very real effects of cache effectiveness, etc ...

And as I am fond of pointing out, timeslicing => context switch costs, and reduction of context switch costs is one very real source of linear super-unitary speedup.
david@june.cs.washington.edu (David Callahan) (12/20/88)
In article <3935@hubcap.UUCP> aglew@mcdurb.Urbana.Gould.COM writes:
>>It does not seem possible that n processors could run faster than n
>>times one processor for a given global algorithm. One need only time
>>slice the runs of the n processors on the 1 processor, ignoring the
>>very real effects of cache effectiveness, etc ...
>
>And as I am fond of pointing out, timeslicing => context switch costs,
>and reduction of context switch costs is one very real source of
>linear super-unitary speedup.

Yeah, but setting a lower bound on time-slice size will put an upper bound on context switch overhead as a percentage of execution time.

Many of the examples discussed recently have not given any arguments to believe that the parallel algorithm is asymptotically super-linear, but seem to exploit increases in fast memory, such as registers. How many examples of super-linear speedup can be explained by increases in memory bandwidth due to more registers and more cache, rather than increases in CPU horsepower?

David Callahan
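P.S. To spell out the first point (the symbols are mine, the numbers arbitrary): if every slice of s work units costs c extra units for the switch, the overhead fraction is c/(s+c), so bounding s from below bounds the overhead from above. A sketch in C:

    /* Overhead model: each slice of s work units pays c extra units
     * for the context switch, so overhead = c / (s + c).  All the
     * numbers are arbitrary, for illustration only. */
    #include <stdio.h>

    int main(void)
    {
        double c = 0.1;     /* switch cost, in work units */
        double s;

        for (s = 1.0; s <= 1000.0; s *= 10.0)
            printf("slice %7.0f -> overhead %6.3f%%\n",
                   s, 100.0 * c / (s + c));
        return 0;
    }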
aglew@mcdurb.Urbana.Gould.COM (12/21/88)
>How many examples of super-linear speedup can be explained
>by increases in memory bandwidth due to more registers
>and more cache, rather than increases in CPU horsepower?
>
>David Callahan

Probably many, if not most. But then, look - we know the approximate shape of the curve that describes size of fast memory vs. performance, right? I.e., for small amounts of fast memory the ratio performance/memory is super-unitary; actually, this curve goes up quite a way if the fast memory speed remains constant. 'Course, at some point it turns back over, or else caches wouldn't work.

So, more processors => more fast memory, and if we're at the right point in the curve we get a super-unitary payoff. But, you say, why not put all that memory on a single processor? OK - but then the memory speed falls off, typically by a logarithmic factor (it's easier to put small amounts of fast memory on lots of processors than it is to put a large amount of fast memory on a single processor).

Oh, your earlier argument about limiting context switch frequency has too limited a concept of "context". There is also the context within a single job, like when you switch from processing matrix A to matrix B, and then back again. Such context switches are part of your algorithm, not artifacts of the multiprogramming environment you are working in.

I have earlier posted a small conceptual "proof" that super-unitary speedup is possible in "context-rich" problems. In that "proof" it is possible to play around with the scaling factors to see under what circumstances super-unitary speedup is possible. For example, if fast memory speed stays constant for all sizes, super-unitary speedup is much less likely than if fast memory speed scales logarithmically.
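Here is a toy model of that curve in C. The hit-rate formula, the memory times, and the sizes are all invented parameters, chosen only to show the shape: p processors bring p times the fast memory, so each one's 1/p slice of the working set fits better, and at the steep part of the curve the payoff is super-unitary.

    /* Toy model: p processors bring p times the fast memory, so each
     * one's slice of the working set fits better.  The hit-rate
     * formula and all parameters are invented, for illustration. */
    #include <stdio.h>

    static double per_ref(double fast_mem, double working_set)
    {
        double t_fast = 1.0, t_slow = 10.0;
        double h = fast_mem / working_set;   /* assumed hit rate */
        if (h > 1.0)
            h = 1.0;
        return h * t_fast + (1.0 - h) * t_slow;
    }

    int main(void)
    {
        double m = 16.0;    /* fast memory per processor */
        double W = 256.0;   /* total working set         */
        int p;

        for (p = 1; p <= 16; p *= 2) {
            double t1 = per_ref(m, W);       /* one CPU, whole set    */
            double tp = per_ref(m, W / p);   /* each CPU, a 1/p slice */
            printf("p=%2d  speedup %6.2f  (linear would be %2d)\n",
                   p, p * t1 / tp, p);
        }
        return 0;
    }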