hankd@ee.ecn.purdue.edu (Hank Dietz) (05/04/90)
In article <1990May1.154558.24009@cs.rochester.edu> crowl@cs.rochester.edu (Lawrence Crowl) writes:
>In article <49622@lanl.gov> ryg@lanl.gov (Richard S Grandy) writes:
>>I've got a glossy from SGI that shows the POWER center performance on
>>LINPACK (100x100, coded) as:
>>      1 CPU    3.8 DP MFLOPS
>>      4 CPU   16   DP MFLOPS
>>      8 CPU   28   DP MFLOPS
>>Does this mean that with 4 cpus you really get GREATER than a linear speedup??
>
>A common trap in measuring parallel speedup is to use a "loaded" CPU.  For
>instance, the single processor might have been serving interrupts, clock
>icons, etc.  The result is that the performance for the single processor case
>is artificially low.  After taking care, the numbers usually turn out slightly
>less than linear.  The second trap is measuring with a relatively coarse
>clock.  For speedup that is near linear, barely missing a clock tick can lead
>to measurements that indicate superlinear speedup.
                                        ^^^^^^^^^^^
Not really.  Superlinear means better than O(n) for n processors... hence,
even though the case of n=1 might be a bit strange, even k*n | k>1 isn't
superlinear.  For example, HJ Siegel has pointed out that for appropriate
pipelined SIMD machines (read: PASM) the overlap of the CU and PEs can
achieve an apparent speedup of 2*n... but that is O(n), i.e., linear.

I do research in optimizing/parallelizing compilers, and the first benchmarks
I did all yielded >n speedup.  Why?  BECAUSE WHEN A PROGRAM IS PARALLELIZED,
IT ISN'T THE SAME PROGRAM -- it simply produces the same results.  In other
words, it can be transformed to take advantage of various side benefits of
parallel hardware & execution:

[1] Increased register, cache, and page frame space: if 1 processor can keep
    K things in registers, N processors might keep K*N in registers.  The
    program is different because it doesn't spill registers, etc.

[2] Different management of cache and page frames: for example, under the LRU
    policy, having N separate K-element LRU queues can yield better
    performance than one N*K-element queue.  We tend to forget that even if
    the same space is available, a management policy change can dramatically
    alter performance.  Proof left to the reader.  ;-)

[3] Introduction of constants and other code simplifications: for example,
    with N processors a loop involving a[i] might not need the usual indexing
    operation, because for each processor i is a compile-time constant.  This
    is rather like unraveling every loop, but unraveling loops can have
    negative impacts on the memory reference pattern which might not surface
    on a parallel machine.  This also happens in non-looping code.

[4] In general, parallelization transformations restructure code so that
    locality is enhanced, but only in true parallel execution.  This is
    because live ranges overlapping in time on a parallel machine does not
    imply that they interfere.  This is also hard to explain in detail, so I
    guess it must be intuitively obvious.  ;-)  It has to do primarily with
    the effects of asynchrony...

etc.  Note that the benefits due to, for example, [1] and [3] do not
interfere -- hence they can actually multiply.

In summary, you can get greater than O(n) speedup -- true superlinear
speedup -- because of such combined effects.  My claim is, however, that you
really are not comparing equivalent programs.

                                        -hankd@ecn.purdue.edu
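The claim in [2] can be checked with a small simulation.  Here is a sketch in
Python; the access traces, cache sizes, and names (`LRUCache`, the page
labels) are invented purely for illustration.  The idea: one processor
streaming through fresh pages can flush another processor's small working set
out of a shared LRU queue, whereas private per-processor queues of the same
total size keep each working set intact:

```python
from collections import OrderedDict

class LRUCache:
    """Fixed-capacity cache with least-recently-used replacement."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.slots = OrderedDict()
        self.hits = 0

    def access(self, key):
        if key in self.slots:
            self.slots.move_to_end(key)         # mark most recently used
            self.hits += 1
        else:
            if len(self.slots) >= self.capacity:
                self.slots.popitem(last=False)  # evict the LRU entry
            self.slots[key] = True

# Processor A re-uses two pages; processor B streams through fresh pages.
a_pages = ['a1', 'a2'] * 15                 # 30 accesses, working set of 2
b_pages = ['b%d' % i for i in range(60)]    # 60 accesses, never re-used

# Interleave one A access with two B accesses (B scans twice as fast).
trace = []
a, b = iter(a_pages), iter(b_pages)
for _ in range(30):
    trace += [next(a), next(b), next(b)]

# One shared 4-entry queue: B's scan evicts A's pages before they recur.
shared = LRUCache(4)
for page in trace:
    shared.access(page)

# Two private 2-entry queues (same total space): A's working set survives.
priv_a, priv_b = LRUCache(2), LRUCache(2)
for page in trace:
    (priv_a if page.startswith('a') else priv_b).access(page)

print('shared  (1 x 4 frames) hits:', shared.hits)                # 0
print('private (2 x 2 frames) hits:', priv_a.hits + priv_b.hits)  # 28
```

Same total number of frames, different management policy, dramatically
different hit rate -- which is all point [2] requires.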
sccowan@watmsg.uwaterloo.ca (S. Crispin Cowan) (05/04/90)
In article <1990May3.203405.23456@ecn.purdue.edu> hankd@ee.ecn.purdue.edu (Hank Dietz) writes:
>In article <1990May1.154558.24009@cs.rochester.edu> crowl@cs.rochester.edu (Lawrence Crowl) writes:
>>In article <49622@lanl.gov> ryg@lanl.gov (Richard S Grandy) writes:
>>>I've got a glossy from SGI that shows the POWER center performance on
>>>LINPACK (100x100, coded) as:
>>>     1 CPU    3.8 DP MFLOPS
>>>     4 CPU   16   DP MFLOPS
>>>Does this mean that with 4 cpus you really get GREATER than a linear speedup??
>I do research in optimizing/parallelizing compilers and the first benchmarks
>I did all yielded >n speedup.  Why?  BECAUSE WHEN A PROGRAM IS PARALLELIZED,
>IT ISN'T THE SAME PROGRAM -- it simply produces the same results.  In other
>words, it can be transformed to take advantage of various side benefits of
>parallel hardware & execution:

This all looks like a complicated way of noticing the (somewhat subtle)
point that a parallel machine with n processors _also_ has something
approximating n times the bandwidth to memory, because it has n times the
cache space, with n fetch cycles and n centres of locality.

>                                       -hankd@ecn.purdue.edu
----------------------------------------------------------------------
(S.) Crispin Cowan, CS grad student, University of Waterloo
Office:         DC3548  x3934           Home phone: 570-2517
Post Awful:     60 Overlea Drive, Kitchener, N2M 1T1
UUCP:           watmath!watmsg!sccowan
Domain:         sccowan@watmsg.waterloo.edu

"You can have peace.  Or you can have freedom.  Don't ever count on having
both at once."
        -Lazarus Long
"You can't separate peace from freedom because no one can be at peace
unless he has his freedom."
        -Malcolm X
csimmons@jewel.oracle.com (Charles Simmons) (06/02/90)
In article <1990May3.203405.23456@ecn.purdue.edu>, hankd@ee.ecn.purdue.edu (Hank Dietz) writes:
> Not really.  Superlinear means better than O(n) for n processors... hence,
> even though the case of n=1 might be a bit strange, even k*n | k>1 isn't
> superlinear.  For example, HJ Siegel has pointed out that for appropriate
> pipelined SIMD machines (read: PASM) the overlap of the CU and PEs can
> achieve an apparent speedup of 2*n... but that is O(n), i.e., linear.

I'm curious as to what extent researchers have observed real superlinear
speedups.  For example, consider a chess program.  On an old-fashioned
single-processor architecture, the program will examine multiple
alternatives one at a time.  The results of each alternative can trim the
amount of searching required in subsequent alternatives.  (The alpha-beta
cutoffs become closer together.)  Now we might imagine that a version of
the program written for a parallel machine might be able to benefit from
examining multiple alternatives in parallel.  For example, if one
alternative is highly advantageous, it might cause the alpha-beta cutoff
values to get lots closer lots faster.

-- Chuck
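The cutoff-tightening effect Chuck describes is visible even in plain
sequential alpha-beta.  A sketch in Python (the toy game tree and its leaf
scores are made up for illustration): when the strongest alternative happens
to be examined first, the bounds tighten early and later siblings are cut
off almost immediately.

```python
import math

def alphabeta(node, alpha, beta, maximizing, stats):
    """Textbook alpha-beta over a nested-list tree; counts leaves examined."""
    if not isinstance(node, list):      # leaf: static evaluation
        stats['leaves'] += 1
        return node
    if maximizing:
        value = -math.inf
        for child in node:
            value = max(value, alphabeta(child, alpha, beta, False, stats))
            alpha = max(alpha, value)
            if alpha >= beta:           # cutoff: prune remaining children
                break
        return value
    else:
        value = math.inf
        for child in node:
            value = min(value, alphabeta(child, alpha, beta, True, stats))
            beta = min(beta, value)
            if alpha >= beta:           # cutoff
                break
        return value

# Root is a max node; each child is a min node over three leaf scores.
strong_first = [[9, 8, 7], [2, 1, 0], [3, 2, 1]]  # best alternative first
weak_first   = [[3, 2, 1], [2, 1, 0], [9, 8, 7]]  # same tree, worst ordering

results = {}
for name, tree in [('strong-first', strong_first),
                   ('weak-first', weak_first)]:
    stats = {'leaves': 0}
    score = alphabeta(tree, -math.inf, math.inf, True, stats)
    results[name] = (score, stats['leaves'])
    print(name, '-> score', score, ', leaves examined:', stats['leaves'])
```

Both orderings return the same score (7), but the strong-first ordering
examines 5 leaves against 8 for the weak-first ordering: an early good
result makes every later alternative cheaper.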
sls@beaner.cs.wisc.edu (Steve Scott) (06/05/90)
In article <1990Jun2.080658.12651@oracle.com> csimmons@oracle.com writes:
>I'm curious as to what extent researchers have observed real superlinear
>speedups.  For example, consider a chess program.  On an old-fashioned
>single-processor architecture, the program will examine multiple
>alternatives one at a time.  The results of each alternative can trim
>the amount of searching required in subsequent alternatives.  (The
>alpha-beta cutoffs become closer together.)  Now we might imagine that
>a version of the program written for a parallel machine might be able
>to benefit from examining multiple alternatives in parallel.  For example,
>if one alternative is highly advantageous, it might cause the alpha-beta
>cutoff values to get lots closer lots faster.
>
>-- Chuck

Not that this is related to architecture, but alpha-beta search actually
displays just the opposite sort of behavior.  Since pruning is so
effective, exploring multiple branches of the search tree in parallel
results in a tremendous amount of wasted search, because the cutoffs from
previous branches are not available yet.  Many people (myself included)
have attempted to alter the algorithm in various ways to lessen this
effect, but no one has been able to approach even linear speedup.

--Steve
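The wasted work Steve describes can be made concrete by counting leaf
evaluations under the two regimes.  A sketch in Python (the toy tree is
invented for illustration): sequential alpha-beta lets each branch inherit
the alpha bound established by its predecessors, while a speculative
"parallel" search must start every branch with the full window, just as n
processors would before any sibling result is available.

```python
import math

def search_min_node(leaf_scores, alpha, beta, stats):
    """Search one min node (a list of leaf scores) under a given window."""
    value = math.inf
    for score in leaf_scores:
        stats['leaves'] += 1
        value = min(value, score)
        beta = min(beta, value)
        if alpha >= beta:       # cutoff: the max player already has better
            break
    return value

tree = [[9, 8, 7], [2, 1, 0], [3, 2, 1]]   # root max node over min nodes

# Sequential alpha-beta: each branch benefits from earlier branches' alpha.
seq = {'leaves': 0}
alpha = -math.inf
for child in tree:
    alpha = max(alpha, search_min_node(child, alpha, math.inf, seq))

# Speculative parallel search: every branch starts with the full window,
# because no sibling's cutoff value exists yet when the search begins.
par = {'leaves': 0}
best = max(search_min_node(child, -math.inf, math.inf, par)
           for child in tree)

print('sequential leaves examined: ', seq['leaves'])   # 5
print('speculative leaves examined:', par['leaves'])   # 9
assert alpha == best    # same answer either way -- only the work differs
```

On this tiny tree the speculative version does 9 leaf evaluations where the
sequential version needs 5; with 3 processors that is well short of linear
speedup even with perfect load balance, which is the heart of the problem.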