kumar@uunet.UU.NET (Vipin Kumar) (03/30/88)
In article <1223@hubcap.UUCP>, uiucdcs!pur-ee!uiucdcsm.cs.uiuc.edu!grunwald@uunet.uu.net writes:
> Basically, they took problems well suited to parallelism (wave equation,
> finite difference methods for beam stress and one other one) and did good
> implementations.
>
> For a fixed problem size, they got from 550 -> 650 speedup. If you fix the
> problem size *per processor* (i.e. the problem gets bigger as you throw
> more processors at it), the ``scaled speedup'' peaks at 1020.
>
> This is for total execution time -- loading, running & unloading.
>
> The scaled speedup measure is justified because many problem sizes are
> simply constrained by time, not by the problem. In those cases, you just
> want the thing done within a reasonable time. If you can run a larger
> version of the problem in that time, you're happy about it.
>
> The 550 -> 650 speedup indicates that the serial code was about .15%,
> which is much better than Amdahl conjectured to be possible.

Clearly, it is fair to increase the problem size to test speedup on more
processors. If the problem size is fixed, then the speedup will at best
saturate at some point. (Each program has some (however small) fixed
sequential component, which puts a ceiling on the obtainable speedup.)
Actually, for many parallel algorithms it is possible to obtain linear
speedup for arbitrarily large N (the number of processors) by simply
increasing the problem size.

A more interesting question to ask is: how fast should the problem size
(measured in terms of sequential execution time) grow as a function of N
to maintain linear speedup (i.e., constant efficiency)? In other words,
what is the iso-efficiency function of the parallel formulation? (If the
problem size needs to grow as f(N) to maintain constant efficiency, then
the iso-efficiency function of the parallel formulation is f(N).) The
iso-efficiency function precisely characterizes the scalability of the
parallel formulation.
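As a concrete illustration, here is a small numeric sketch under a hypothetical overhead model (not drawn from the Sandia measurements): suppose the parallel time is the sequential work divided by P plus a coordination cost of c*log2(P) per run. Holding efficiency constant then forces the problem size to grow as O(P log P), which is the iso-efficiency function of this model.

```python
# Sketch of iso-efficiency under an assumed overhead model.
# Assume sequential time t(N) = N, and parallel time
#   T_par(N, P) = N / P + c * log2(P)
# so that efficiency E = t(N) / (P * T_par) = N / (N + c * P * log2(P)).
# Solving E = const for N shows N must grow like P * log2(P):
# the iso-efficiency function of this (hypothetical) formulation
# is O(P log P).

import math

def efficiency(n, p, c=10.0):
    """Efficiency of the assumed formulation on p processors."""
    t_par = n / p + c * math.log2(p)
    return n / (p * t_par)

def problem_size_for_efficiency(p, e=0.8, c=10.0):
    """Problem size needed to hold efficiency at e on p processors."""
    # From E = N / (N + c*P*log2(P)), N = E/(1-E) * c * P * log2(P).
    return e / (1 - e) * c * p * math.log2(p)

for p in (2, 16, 128, 1024):
    n = problem_size_for_efficiency(p)
    print(f"P={p:5d}  required N ~ {n:10.0f}  efficiency = {efficiency(n, p):.3f}")
```

The constant c and the log2(P) overhead term are assumptions chosen for illustration; a real formulation's overhead term, and hence its iso-efficiency function, must be derived from its communication pattern.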
For example, if the iso-efficiency function is linear (i.e., if the
problem size only needs to grow linearly with the number of processors
to maintain constant efficiency), then the parallel formulation is
highly scalable. Parallel formulations of many problems with data-level
parallelism fall in this category. On the other hand, if the
iso-efficiency function is exponential, then the formulation is not good
for large N (because the problem size would have to be extremely large
to effectively utilize parallel machines with large N). For example, the
iso-efficiency function of a commonly used parallel formulation of
quicksort (divide the list sequentially, and then sort each sublist in
parallel) is exponential.

As another example, the iso-efficiency function of the parallel
formulation of the 0/1 Knapsack problem presented by Lee, Shragowitz and
Sahni at the 1987 Parallel Processing Conference (pages 699-706) is
O(N log N), which is almost linear; hence this parallel formulation is
highly scalable.

If we knew the iso-efficiency function of the parallel formulation
developed at Sandia Labs, then we could predict how well it would do on
larger hypercubes.

Vipin
-----
PS. The concept of iso-efficiency is introduced in the following papers:

Parallel Depth-First Search on the Ring Architecture, by Vipin Kumar,
V. Nageshwara Rao and K. Ramesh. To appear in the proceedings of the
1988 Parallel Processing Conference.

Parallel Depth-First Search on Multiprocessors, Part II: Analysis, by
Vipin Kumar and V. Nageshwar Rao. To appear in the International Journal
of Parallel Programming.

The second paper presents a work distribution scheme (with applications
in depth-first search and divide-and-conquer algorithms) that has an
O(N log N) iso-efficiency function for suitable parallel architectures.
Using this scheme, we get linear speedup for over 100 processors on the
BBN Butterfly. Another formulation presented in the paper gives linear
speedup on the Intel hypercube for 100 processors.
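The gap between a nearly linear and an exponential iso-efficiency function is easy to see numerically. A minimal sketch (illustrative functions with constants omitted; these are not the actual functions of the quicksort or knapsack formulations cited above):

```python
# How required problem size grows with processor count P under three
# illustrative iso-efficiency functions (constant factors omitted).
import math

iso = {
    "linear (ideal)":     lambda p: float(p),
    "P log P":            lambda p: p * math.log2(p),
    "exponential (2^P)":  lambda p: 2.0 ** p,
}

for p in (4, 16, 64):
    row = "  ".join(f"{name}: {f(p):.3g}" for name, f in iso.items())
    print(f"P = {p:3d}  {row}")
```

Even at P = 64 the exponential case demands a problem roughly 2^64 times the base size, which is why an exponential iso-efficiency function rules a formulation out for large machines.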
kruskal@mimsy.UUCP (Clyde P. Kruskal) (04/06/88)
In article <1252@hubcap.UUCP> ut-sally!kumar@uunet.UU.NET (Vipin Kumar) writes:
>
> Clearly, it is fair to increase the problem size to test speedup
> on more processors. If the problem size is fixed, then
> the speedup will at best saturate at some point. (Each program
> has some (however small) fixed sequential component, which
> puts a ceiling on the obtainable speedup.)
> Actually for many parallel algorithms, it is possible to obtain linear
> speedup for arbitrarily large N (the number of processors)
> by simply increasing the problem size.
>
> A more interesting question to ask is: how fast should
> the problem size (measured in terms of sequential execution time)
> grow as a function of N to maintain linear speedup (i.e., constant
> efficiency)? In other words, what is the iso-efficiency function
> of the parallel formulation? (If the problem size needs to grow as
> f(N) to maintain constant efficiency, then the iso-efficiency function
> of the parallel formulation is f(N).)
>
> [Discussion of iso-efficiency deleted]

Basically I agree with everything written here ... but as long as the
issue has arisen I want to add some historical information and talk
about some relevant recent work.

The idea that how fast the efficiency grows is an important concept
appears in my thesis [1]. This was actually joint work with Allan
Gottlieb and later appeared in JACM [2]. The idea also appears in a
paper of Vitter and Simons [3]. Some practical aspects are discussed in
a paper from a NATO summer school [4]; in particular, it talks about
Amdahl's Law and other alleged bounds on parallel speedup.

Recently, Larry Rudolph, Marc Snir, and I have returned to this subject
and obtained the following (theoretical) results [5]:

We define an algorithm to be in the class PE (parallel efficient) if the
parallel time for a problem of size N on P processors is

    T(P,N) = O(t(N)/P + P^k) ,

where t(N) is the time of the ("best") sequential algorithm.
I.e., PE captures the problems with parallel algorithms that achieve
optimal speedup over the optimal sequential time for all problems whose
execution times are relatively (polynomially) large with respect to the
number of processors. Notice that it is sequential time t(N) we compare
to rather than problem size N; this is relevant for problems that have
superpolynomial execution times. (Actually, for other reasons, we define
PE somewhat differently, but this definition is equivalent.) We argue
that this is the concept of parallel performance that users of parallel
machines desire.

We prove that the class PE is invariant under the shared memory models
(PRAMs), independent of whether the machines are synchronous or
asynchronous, SIMD or MIMD, or have small or large latency for the
processors to communicate. We also show that PE is invariant under the
completely connected memory module machines for probabilistic
algorithms.

We define the class PE* of problems with almost efficient algorithms.
This class is invariant across PRAMs and down to the memory module
machines with the ability to route quickly, such as Hypercube and
Butterfly machines.

The paper is quite long: there are other definitions and results, but
enough said for now.

[1] C. P. Kruskal, Upper and Lower Bounds on the Performance of Parallel
    Algorithms, Ph.D. Thesis, NYU, 1981.

[2] A. Gottlieb and C. P. Kruskal, Complexity results for permuting data
    and other computations on parallel processors. Journal of the
    Association for Computing Machinery 31 (1984) 193-209.

[3] J. S. Vitter and R. A. Simons, New classes for parallel complexity:
    a study of unification and other complete problems for P. IEEE
    Transactions on Computers TC-35 (1986) 403-418.

[4] C. P. Kruskal, Performance bounds on parallel processors: an
    optimistic view. Control Flow and Data Flow: Concepts of Distributed
    Programming, NATO ASI series, Series F: Computer and System
    Sciences, Vol. 14, Manfred Broy (ed.), Springer-Verlag (1985)
    331-344.

[5] C. P. Kruskal, L. Rudolph, and M. Snir, A complexity theory of
    efficient parallel algorithms, Technical Report RC 13572 (#60702),
    3/4/88, IBM T. J. Watson Research Center, 1988. An extended abstract
    will appear in ICALP, July 88 (Intl. Colloquium on Automata,
    Languages and Programming).
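As a footnote, the PE bound above can be checked numerically for a model algorithm. The following sketch uses an assumed timing model (it is not the construction of Kruskal, Rudolph, and Snir): take t(N) = N log N, give the parallel version a hypothetical P log P coordination cost, and test T(P,N) <= C * (t(N)/P + P^k).

```python
# Sketch: numerically checking the PE condition
#   T(P,N) <= C * (t(N)/P + P**k)
# for an assumed timing model (constants C and k chosen for illustration).
import math

def t_seq(n):
    # assumed sequential time, e.g. a comparison sort: t(N) = N log2 N
    return n * math.log2(n)

def t_par(n, p):
    # assumed parallel time: perfectly split work + P log P coordination
    return t_seq(n) / p + p * math.log2(p)

def within_pe_bound(n, p, k=2, c=4.0):
    return t_par(n, p) <= c * (t_seq(n) / p + p ** k)

# The bound holds across a range of problem and machine sizes:
print(all(within_pe_bound(n, p)
          for n in (2**10, 2**16, 2**20)
          for p in (2, 32, 512)))  # prints True
```

The point of the P^k slack term is visible here: the P log P coordination cost is swallowed by P^k, so this model algorithm is "parallel efficient" even though its speedup is not optimal when P is large relative to the problem.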
kumar@uunet.UU.NET (Vipin Kumar) (04/12/88)
In article <1317@hubcap.UUCP>, kruskal@mimsy.UUCP (Clyde P. Kruskal) writes:
>
> We define an algorithm to be in the class PE (parallel efficient)
> if the parallel time for a problem of size N on P processors is
>     T(P,N) = O(t(N)/P + P^k) ,
> where t(N) is the time of the ("best") sequential algorithm.
> I.e., PE captures the problems with parallel algorithms
> that achieve optimal speedup over the optimal sequential time
> for all problems whose execution times are relatively (polynomially)
> large with respect to the number of processors. Notice that
> it is sequential time t(N) we compare to rather than problem size N;
> this is relevant for problems that have superpolynomial execution
> times. (Actually, for other reasons, we define PE somewhat
> differently, but this definition is equivalent.) We argue that this
> is the concept of parallel performance that users of parallel
> machines desire.
>
> We prove that the class PE is invariant under the shared
> memory models (PRAMs), independent of whether the machines
> are synchronous or asynchronous, SIMD or MIMD, or have small
> or large latency for the processors to communicate.
> We also show that PE is invariant under the completely
> connected memory module machines for probabilistic algorithms.

According to the above definition of PE, any parallel algorithm with a
polynomial iso-efficiency function is a parallel efficient algorithm.
This definition of PE appears to be too weak. The ideal parallel
formulation should have a linear iso-efficiency function (O(P), where P
is the number of processors).

In references [1] and [2], we have three different parallel formulations
of depth-first search on the ring architecture, with iso-efficiency
functions of O(k^P), O(P^3 log P), and O(P^2 log P), respectively.
Clearly the first formulation is not in PE (its iso-efficiency function
is exponential).
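To make the gap between these three iso-efficiency functions concrete, a small sketch (the constant factors are arbitrary, and k = 1.5 is an assumed base for the exponential case; the real constants depend on the formulations in [1] and [2]):

```python
# Sketch: relative problem sizes required for constant efficiency under
# the three ring formulations' iso-efficiency functions,
#   O(k^P), O(P^3 log P), O(P^2 log P)
# (illustrative constants; k = 1.5 is an assumption).
import math

def f_exponential(p, k=1.5):
    return k ** p

def f_cubic_log(p):
    return p ** 3 * math.log2(p)

def f_quadratic_log(p):
    return p ** 2 * math.log2(p)

for p in (8, 64, 256):
    print(f"P={p:3d}  k^P: {f_exponential(p):.3g}  "
          f"P^3 log P: {f_cubic_log(p):.3g}  "
          f"P^2 log P: {f_quadratic_log(p):.3g}")
```

At P = 256 the O(P^2 log P) formulation needs a problem only 256 times larger than the O(P^3 log P) one needs at the same efficiency, while the exponential formulation is hopeless, which is the sense in which a better iso-efficiency function makes a formulation "far superior".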
The second and third formulations are in PE, but the third formulation
is far superior to the second, as it has a better iso-efficiency
function (see [1] and [2]).

The choice of architecture is also very important. For a shared memory
machine (with message combining), we have a parallel depth-first search
formulation whose iso-efficiency function is O(P log P), which is close
to linear (and is significantly better than O(P^2 log P)).

I think that the users of parallel machines should look for parallel
formulations whose iso-efficiency functions are linear (or almost
linear).

Vipin
-----
References:

[1] Parallel Depth-First Search on the Ring Architecture, by Vipin
    Kumar, V. Nageshwara Rao and K. Ramesh. To appear in the
    proceedings of the 1988 Parallel Processing Conference.

[2] Parallel Depth-First Search on Multiprocessors, Part II: Analysis,
    by Vipin Kumar and V. Nageshwar Rao. To appear in the International
    Journal of Parallel Programming.