moler@dana.UUCP (09/22/87)
Here is the eqn|troff source for a note about Amdahl's Law that I wrote
while at the Intel hypercube group.
  --Cleve Moler
    Dana Computers, Sunnyvale, CA
    na.moler@score.stanford.edu, or ...!hplabs!dana!moler

.ds xl Amdahl\'s Law
.hy 0
.uf 0
.ta .5i 1.5i
.EQ
delim @@
.EN
.tr ~
.de pp
.sp
.ti +.3i
..
.de hd
'sp 2
.if \\n%-1 'tl '\*(xl, page %'
.if !\\n%-1 'tl '\n(mo/\n(dy/\n(yr'
'sp 3
.if !\\n%-1 'ns
..
.de fo
'bp
..
.wh 0 hd
.wh -7 fo
\ 
.sp 6
.ce 2
A Closer Look at
Amdahl's Law
.sp 2
.ce 3
Cleve Moler
Intel Scientific Computers
Beaverton, Oregon
.sp 2
.pp
Amdahl's Law, which dates from 1967 [1], simply says that a parallel
computation cannot run any faster than its inherently sequential portion.
If 5 percent of an algorithm cannot be parallelized, then the parallel
computation cannot run more than 20 times faster than the sequential
computation, no matter how many processors are used.
When expressed in such terms, the law is clearly valid.
.pp
But Amdahl's Law only talks about a single computation.
In this note we take a closer look at Amdahl's Law by considering a
sequence of problems of increasing size.
We argue that the sequential portion of an algorithm may actually decrease
as the problem size increases, and hence that parallel computation can
become increasingly efficient.
.pp
To make these notions a little more precise, let
@T sub p@ = time to solve a problem on @p@ processors,
and let the \fIspeedup\fR be
@S sub p ~ = ~ { T sub 1 } over { T sub p } @.
The goal of parallel computation is to approach linear speedup,
@ S sub p ~~ approx ~~ p @.
Let @ alpha @ denote the Amdahl fraction, which is the fraction of the
algorithm which is ``not parallelizable''.
Then simple algebra produces
@T sub p ~ = ~ { 1 ~-~ alpha } over p T sub 1 ~~+~~ alpha T sub 1 @
and
@ S sub p ~ = ~ p over { 1 ~+~ (p - 1) alpha } ~~ <= ~~ 1 over alpha @
for all @ p @.
This is Amdahl's Law.
.pp
Now consider a sequence of problems of increasing size, @ n @.
The parameter @ n @ might represent the length of a text to be analyzed,
or the number of atoms in a chemical compound, or the number of cells in
a finite element mesh.
It should not be confused with the number of processors, @ p @.
We claim that for many algorithms, the Amdahl fraction depends upon the
problem size, and so we make the following definition:
an \fIeffective\fR parallel algorithm has
.br
@ alpha (n) ~~ -> ~~ 0 @ as @ n ~~ -> ~~ inf @.
This does not mean that the absolute time spent in the non-parallel part
of the computation goes to zero, but only that it becomes negligible when
compared to the overall time.
Now
@ S sub p ~ = ~ p over { 1 ~+~ (p - 1) alpha (n) } @.
Hence, for effective algorithms, if we fix @ p @ and let
@ n ~~ -> ~~ inf @, then @ S sub p ~~ -> ~~ p @.
.pp
This says that for a fixed machine, linear speedup can be approached by
using appropriate algorithms on larger and larger problems.
Speedup is limited more by memory size than by the number of processors.
.pp
Do effective algorithms exist?
In our experience with distributed-memory, message-passing multiprocessors
and medium- to coarse-grained parallelism, we have found that most
algorithms for engineering and scientific problems are effective by our
definition.
.pp
We offer one simple example to illustrate these ideas.
Let @A@ be a large matrix, distributed by rows across multiple processors,
and let @x@ be a vector, stored in one processor.
We wish to compute the matrix-vector product @ y ~~=~~ Ax @.
The following algorithm accomplishes the task.
.sp
.nf
    Send @x@ to all other processors.
    All processors simultaneously:
        for @i ~ \(*e @ my rows:  @y sub i ~~=~~ A sub i cdot x @
    Accumulate @y@ in the destination processor.
.fi
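.pp
A minimal C sketch of these three steps follows.
It uses MPI message passing, which stands in for whatever message-passing
library is actually at hand, and the partition arrays rows and first
(each processor's row count and first row index) are assumptions made only
for this illustration.
.sp
.nf
/*
 * Sketch of the row-distributed matrix-vector product y = A x described
 * above, using MPI message passing.  The arrays rows[] and first[]
 * (each processor's row count and first global row) are illustrative
 * assumptions; they describe how A was distributed.
 */
#include <mpi.h>
#include <stdlib.h>

void matvec(int n, int root, int *rows, int *first,
            const double *A_local,   /* my rows of A, stored by row      */
            double *x,               /* data on root, buffer elsewhere   */
            double *y,               /* result, accumulated on root      */
            MPI_Comm comm)
{
    int me;
    MPI_Comm_rank(comm, &me);

    /* Step 1: send x to all other processors. */
    MPI_Bcast(x, n, MPI_DOUBLE, root, comm);

    /* Step 2: all processors simultaneously, for i in my rows,
       form the dot products y_i = A_i . x. */
    int m = rows[me];
    double *y_local = malloc(m * sizeof(double));
    for (int i = 0; i < m; i++) {
        double s = 0.0;
        for (int j = 0; j < n; j++)
            s += A_local[i * n + j] * x[j];
        y_local[i] = s;
    }

    /* Step 3: accumulate y in the destination processor. */
    MPI_Gatherv(y_local, m, MPI_DOUBLE,
                y, rows, first, MPI_DOUBLE, root, comm);

    free(y_local);
}
.fi
The broadcast and the gather correspond to the two @ n ~ log ~ p @
communication terms in the work estimate below.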
.pp
To analyze the computation time and the resulting speedup, let @n@ be the
order of the matrix and let @p@ be the number of processors.
Furthermore, let @m@ be the maximum number of rows in any processor.
Since we are not assuming that @n@ is exactly divisible by @p@, we have
@ m ~ = ~ \(lc ~ n/p ~ \(rc ~~ <= ~~ n/p ~+~ 1 @.
We must include in the Amdahl fraction all of the parallel overhead
associated with distributing the data @x@ and accumulating the result @y@,
as well as the load imbalance which results from possibly unequal numbers
of rows on the processors.
.pp
The amount of work is easily estimated.
If we assume that the processor interconnect topology contains a hypercube,
then the complexity of the communication tasks grows like @ log ~ p @.
Other topologies would have a different dependence upon @p@, but would lead
to the same general conclusions.
.sp
.ta 2.5i
.nf
Step	Work
Send @x@	@ n ~ log ~ p @
@m@ dot products	@ m ~ n @
Accumulate @y@	@ n ~ log ~ p @
Total	@ O ( m ~ n ) ~ + ~ O ( n ~ log ~ p ) @
	= @ O ( { n sup 2 } over p ) ~ + ~ O ( n ) ~ + ~ O ( n ~ log ~ p ) @
.fi
.sp
The first term, which dominates for large @n@, describes the parallel
portion of the computation.
The second term accounts for the load imbalance and the third term
describes the communication costs.
It is not quite accurate to say that these latter terms correspond to the
``sequential'' portion of the algorithm, but they do represent the losses
of efficiency in the parallel computation.
.pp
It is not hard to see that
@ alpha ( n ) ~~=~~ O( { 1 ~+~ log ~ p } over n ) ~~ -> ~~ 0 @
as @ n ~~ -> ~~ inf @.
Hence, the distributed matrix-vector multiplication algorithm is effective.
.sp
.in .25i
.ti -.25i
[1] G. M. Amdahl, ``Validity of the single processor approach to achieving
large scale computing capabilities,'' \fIProc. AFIPS Comput. Conf.\fR,
vol. 30, 1967.
.in -.25i
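.pp
As a rough numeric check of the effectiveness claim above, the short C
program below tabulates the predicted speedup
@ S sub p ~ = ~ p over { 1 ~+~ (p - 1) alpha (n) } @
using the model @ alpha (n) ~ = ~ ( 1 ~+~ log sub 2 ~ p ) / n @;
the constant factor of one and the choice @ p ~ = ~ 64 @ are arbitrary
assumptions made only for the illustration.
.sp
.nf
.eo
/*
 * Rough numeric check: evaluate the speedup formula
 *   S_p = p / (1 + (p - 1) * alpha(n))
 * with the model alpha(n) = (1 + log2(p)) / n suggested by the estimate
 * above.  The constant factor and p = 64 are illustrative assumptions.
 */
#include <math.h>
#include <stdio.h>

int main(void)
{
    int p = 64;                          /* fixed number of processors */
    for (int n = 100; n <= 1000000; n *= 10) {
        double alpha = (1.0 + log2((double)p)) / n;
        double s = p / (1.0 + (p - 1) * alpha);
        printf("n = %7d   alpha = %.2e   S_p = %6.2f\n", n, alpha, s);
    }
    return 0;
}
.ec
.fi
The printed table shows @ S sub p @ climbing toward @ p @ as @ n @ grows
with @ p @ held fixed.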