moler@dana.UUCP (09/22/87)
Here is the eqn|troff source for a note about Amdahl's Law that I wrote
while at the Intel hypercube group.
  --Cleve Moler
    Dana Computers, Sunnyvale, CA
    na.moler@score.stanford.edu, or ...!hplabs!dana!moler

.ds xl Amdahl\'s Law
.hy 0
.uf 0
.ta .5i 1.5i
.EQ
delim @@
.EN
.tr ~
.de pp
.sp
.ti +.3i
..
.de hd
'sp 2
.if \\n%-1 'tl '\*(xl, page %'
.if !\\n%-1 'tl '\n(mo/\n(dy/\n(yr'
'sp 3
.if !\\n%-1 'ns
..
.de fo
'bp
..
.wh 0 hd
.wh -7 fo
\ 
.sp 6
.ce 2
A Closer Look at
Amdahl's Law
.sp 2
.ce 3
Cleve Moler
Intel Scientific Computers
Beaverton, Oregon
.sp 2
.pp
Amdahl's Law, which dates from 1967 [1], simply says that a parallel
computation cannot run any faster than its inherently sequential portion.
If 5 percent of an algorithm cannot be parallelized, then the parallel
computation cannot run more than 20 times faster than the sequential
computation, no matter how many processors are used.
When expressed in such terms, the law is clearly valid.
.pp
But Amdahl's Law only talks about a single computation.
In this note we take a closer look at Amdahl's Law by considering a
sequence of problems of increasing size.
We argue that the sequential portion of an algorithm may actually decrease
as the problem size increases, and hence that parallel computation can
become increasingly efficient.
.pp
To make these notions a little more precise, let
@T sub p@ = time to solve a problem on @p@ processors,
and let the \fIspeedup\fR be
@S sub p ~ = ~ { T sub 1 } over { T sub p } @.
The goal of parallel computation is to approach linear speedup,
@ S sub p ~~ approx ~~ p @.
Let @ alpha @ denote the Amdahl fraction, which is the fraction of the
algorithm which is ``not parallelizable''.
Then simple algebra produces
@T sub p ~ = ~ { 1 ~-~ alpha } over p T sub 1 ~~+~~ alpha T sub 1 @
and
@ S sub p ~ = ~ p over { 1 ~+~ (p - 1) alpha } ~~ <= ~~ 1 over alpha @
for all @ p @.
This is Amdahl's Law.
.pp
Now consider a sequence of problems of increasing size, @ n @.
The parameter @ n @ might represent the length of a text to be analyzed,
or the number of atoms in a chemical compound, or the number of cells in
a finite element mesh.
It should not be confused with the number of processors, @ p @.
We claim that for many algorithms, the Amdahl fraction depends upon the
problem size, and so we make the following definition:
an \fIeffective\fR parallel algorithm has
.br
@ alpha (n) ~~ -> ~~ 0 @ as @ n ~~ -> ~~ inf @.
This does not mean that the absolute time spent in the non-parallel part
of the computation goes to zero, but only that it becomes negligible when
compared to the overall time.
Now
@ S sub p ~ = ~ p over { 1 ~+~ (p - 1) alpha (n) } @.
Hence, for effective algorithms, if we fix @ p @ and let
@ n ~~ -> ~~ inf @, then @ S sub p ~~ -> ~~ p @.
.pp
This says that for a fixed machine, linear speedup can be approached by
using appropriate algorithms on larger and larger problems.
Speedup is limited more by memory size than by the number of processors.
.pp
Do effective algorithms exist?
In our experience with distributed-memory, message-passing multiprocessors
and medium- to coarse-grained parallelism, we have found that most
algorithms for engineering and scientific problems are effective by our
definition.
.pp
We offer one simple example to illustrate these ideas.
Let @A@ be a large matrix, distributed by rows across multiple processors,
and let @x@ be a vector, stored in one processor.
We wish to compute the matrix-vector product @ y ~~=~~ Ax @.
The following algorithm accomplishes the task.
.sp
.nf
    Send @x@ to all other processors.
    All processors simultaneously:
        for @i ~ \(*e @ my rows:  @y sub i ~~=~~ A sub i cdot x @
    Accumulate @y@ in the destination processor.
.fi
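.pp
A minimal C sketch of these three steps follows.
It uses MPI message passing, which stands in for whatever message-passing
library is actually at hand, and the partition arrays rows and first
(each processor's row count and first row index) are assumptions made only
for this illustration.
.sp
.nf
/*
 * Sketch of the row-distributed matrix-vector product y = A x described
 * above, using MPI message passing.  The arrays rows[] and first[]
 * (each processor's row count and first global row) are illustrative
 * assumptions; they describe how A was distributed.
 */
#include <mpi.h>
#include <stdlib.h>

void matvec(int n, int root, int *rows, int *first,
            const double *A_local,   /* my rows of A, stored by row      */
            double *x,               /* data on root, buffer elsewhere   */
            double *y,               /* result, accumulated on root      */
            MPI_Comm comm)
{
    int me;
    MPI_Comm_rank(comm, &me);

    /* Step 1: send x to all other processors. */
    MPI_Bcast(x, n, MPI_DOUBLE, root, comm);

    /* Step 2: all processors simultaneously, for i in my rows,
       form the dot products y_i = A_i . x. */
    int m = rows[me];
    double *y_local = malloc(m * sizeof(double));
    for (int i = 0; i < m; i++) {
        double s = 0.0;
        for (int j = 0; j < n; j++)
            s += A_local[i * n + j] * x[j];
        y_local[i] = s;
    }

    /* Step 3: accumulate y in the destination processor. */
    MPI_Gatherv(y_local, m, MPI_DOUBLE,
                y, rows, first, MPI_DOUBLE, root, comm);

    free(y_local);
}
.fi
The broadcast and the gather correspond to the two @ n ~ log ~ p @
communication terms in the work estimate below.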
.pp
To analyze the computation time and the resulting speedup, let @n@ be the
order of the matrix and let @p@ be the number of processors.
Furthermore, let @m@ be the maximum number of rows in any processor.
Since we are not assuming that @n@ is exactly divisible by @p@, we have
@ m ~ = ~ \(lc ~ n/p ~ \(rc ~~ <= ~~ n/p ~+~ 1 @.
We must include in the Amdahl fraction all of the parallel overhead
associated with distributing the data @x@ and accumulating the result @y@,
as well as the load imbalance which results from possibly unequal numbers
of rows on the processors.
.pp
The amount of work is easily estimated.
If we assume that the processor interconnect topology contains a hypercube,
then the complexity of the communication tasks grows like @ log ~ p @.
Other topologies would have a different dependence upon @p@, but would lead
to the same general conclusions.
.sp
.ta 2.5i
.nf
Step	Work
Send @x@	@ n ~ log ~ p @
@m@ dot products	@ m ~ n @
Accumulate @y@	@ n ~ log ~ p @
Total	@ O ( m ~ n ) ~ + ~ O ( n ~ log ~ p ) @
	= @ O ( { n sup 2 } over p ) ~ + ~ O ( n ) ~ + ~ O ( n ~ log ~ p ) @
.fi
.sp
The first term, which dominates for large @n@, describes the parallel
portion of the computation.
The second term accounts for the load imbalance and the third term
describes the communication costs.
It is not quite accurate to say that these latter terms correspond to the
``sequential'' portion of the algorithm, but they do represent the losses
of efficiency in the parallel computation.
.pp
It is not hard to see that
@ alpha ( n ) ~~=~~ O( { 1 ~+~ log ~ p } over n ) ~~ -> ~~ 0 @
as @ n ~~ -> ~~ inf @.
Hence, the distributed matrix-vector multiplication algorithm is effective.
.sp
.in .25i
.ti -.25i
[1] G. M. Amdahl, ``Validity of the single processor approach to achieving
large scale computing capabilities,'' \fIProc. AFIPS Comput. Conf.\fR,
vol. 30, 1967.
.in -.25i
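.pp
As a rough numeric check of the effectiveness claim above, the short C
program below tabulates the predicted speedup
@ S sub p ~ = ~ p over { 1 ~+~ (p - 1) alpha (n) } @
using the model @ alpha (n) ~ = ~ ( 1 ~+~ log sub 2 ~ p ) / n @;
the constant factor of one and the choice @ p ~ = ~ 64 @ are arbitrary
assumptions made only for the illustration.
.sp
.nf
.eo
/*
 * Rough numeric check: evaluate the speedup formula
 *   S_p = p / (1 + (p - 1) * alpha(n))
 * with the model alpha(n) = (1 + log2(p)) / n suggested by the estimate
 * above.  The constant factor and p = 64 are illustrative assumptions.
 */
#include <math.h>
#include <stdio.h>

int main(void)
{
    int p = 64;                          /* fixed number of processors */
    for (int n = 100; n <= 1000000; n *= 10) {
        double alpha = (1.0 + log2((double)p)) / n;
        double s = p / (1.0 + (p - 1) * alpha);
        printf("n = %7d   alpha = %.2e   S_p = %6.2f\n", n, alpha, s);
    }
    return 0;
}
.ec
.fi
The printed table shows @ S sub p @ climbing toward @ p @ as @ n @ grows
with @ p @ held fixed.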