[comp.parallel] Breakthrough at Sandia? + Iso-efficiency function

kumar@uunet.UU.NET (Vipin Kumar) (03/30/88)

In article <1223@hubcap.UUCP>, uiucdcs!pur-ee!uiucdcsm.cs.uiuc.edu!grunwald@uunet.uu.net writes:
> Basically, they took problems well suited to parallelism (wave equation,
> finite difference methods for beam stress and one other one) and did good
> implementations.
> 
> For a fixed problem size, they got from 550 -> 650 speedup. If you fix the
> problem size *per processor* (i.e., the problem gets bigger as you throw
> more processors at it), the ``scaled speedup'' peaks at 1020.
> 
> This is for total execution time -- loading, running & unloading.
> 
> The scaled speedup measure is justified because many problem sizes are
> simply constrained by time, not by the problem. In those cases, you just
> want the thing done within a reasonable time. If you can run a larger version
> of the problem in that time, you're happy about it.
> 
> The 550 -> 650 speedup indicates that the serial fraction of the code was
> about 0.15%, which is much better than Amdahl conjectured to be possible.
> 

Clearly, it is fair to increase the problem size when testing speedup
on more processors.  If the problem size is fixed, then
the speedup will at best saturate at some point.  (Each program
has some fixed sequential component, however small, which
puts a ceiling on the obtainable speedup.)
Actually, for many parallel algorithms it is possible to obtain linear
speedup for arbitrarily large N (the number of processors)
simply by increasing the problem size.
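
To make the two measures concrete, here is a rough sketch of the
fixed-size (Amdahl) and scaled speedup curves under a simple two-term
execution-time model.  The serial fraction s is only illustrative
(roughly the 0.15% quoted above), not a measured Sandia value.

def amdahl_speedup(s, n_proc):
    """Fixed problem size: a serial fraction s caps speedup at 1/s."""
    return 1.0 / (s + (1.0 - s) / n_proc)

def scaled_speedup(s, n_proc):
    """Problem grows with n_proc; the serial part stays fixed."""
    return s + (1.0 - s) * n_proc

s = 0.0015   # illustrative serial fraction, not a measured value
for n in (64, 256, 1024, 4096):
    print(f"N={n:5d}  fixed-size={amdahl_speedup(s, n):7.1f}  "
          f"scaled={scaled_speedup(s, n):7.1f}")

With s = 0.0015 the fixed-size speedup can never exceed about 667,
while the scaled speedup at N = 1024 comes out close to the 1020
figure quoted.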

A more interesting question to ask is: how fast should
the problem size (measured in terms of sequential execution time) 
grow as a function of N  to maintain linear speedup (i.e., constant
efficiency)?  In other words, what is the iso-efficiency function
of the parallel formulation? (If the problem size needs to grow as
f(N) to maintain constant efficiency, then the iso-efficiency function
of the parallel formulation is f(N).) 
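
As a rough numerical illustration of this definition, the sketch below
assumes the usual decomposition of parallel time into useful work W
(the problem size, measured as sequential execution time) plus a total
overhead T_o(W, N), where N is the number of processors, and finds the
smallest W that keeps efficiency at a target value.  The overhead model
used here (N log N communication) is invented purely for illustration.

from math import log2

def overhead(W, N):
    return N * log2(N)            # hypothetical overhead growing as N log N

def efficiency(W, N):
    return W / (W + overhead(W, N))

def isoefficiency_point(N, target_eff=0.8):
    """Smallest W (doubling + bisection) whose efficiency reaches target_eff."""
    hi = 1.0
    while efficiency(hi, N) < target_eff:
        hi *= 2.0
    lo = hi / 2.0
    for _ in range(50):
        mid = (lo + hi) / 2.0
        if efficiency(mid, N) >= target_eff:
            hi = mid
        else:
            lo = mid
    return hi

for N in (16, 64, 256, 1024):
    print(f"N={N:5d}  W needed for 80% efficiency: {isoefficiency_point(N):,.0f}")

For this particular overhead model the required W grows as
Theta(N log N), so the iso-efficiency function is N log N.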
 

The iso-efficiency function precisely characterizes the scalability
of the parallel formulation.
For example, if the iso-efficiency function is linear 
(i.e., if the problem size only needs to grow linearly
with the number of processors to maintain constant efficiency),
then the parallel formulation is highly scalable.  Parallel formulations
of many problems with data-level parallelism fall in this category.
On the other hand, if the iso-efficiency function is exponential, 
then the formulation is not good for large N (because the
problem size would have to be extremely large to effectively
utilize parallel machines with large N). For example,
the iso-efficiency function of a commonly used parallel 
formulation of quick sort (divide the list sequentially, 
and then sort each sublist in parallel) is exponential.
As another example, the iso-efficiency function of the
parallel formulation of the 0/1 Knapsack problem presented by
Lee, Shragowitz and Sahni at the 1987 Parallel Processing Conference
(pages 699-706) is O(N log N), which is almost linear;
hence that parallel formulation is highly scalable.
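
To see why the sequential partitioning is so damaging, here is a rough
model (with arbitrary constants) of that quicksort formulation: one
processor does all the partitioning down to N sublists, which are then
sorted in parallel.  Efficiency barely improves even as the list grows
by orders of magnitude, which is another way of saying the problem size
must grow exponentially in N to hold efficiency constant.

from math import log2

def t_seq(n):
    return n * log2(n)                       # expected-case sequential quicksort

def t_par(n, N):
    split = n * log2(N)                      # sequential partitioning into N pieces
    sort = (n / N) * log2(max(n / N, 2))     # parallel sort of the sublists
    return split + sort

def efficiency(n, N):
    return t_seq(n) / (N * t_par(n, N))

N = 64
for n in (10**4, 10**6, 10**9, 10**12):
    print(f"list size {n:.0e}: efficiency on {N} processors = {efficiency(n, N):.3f}")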

If we know the iso-efficiency function of the parallel
formulation developed at Sandia Labs, then we could predict
how well it would do on larger hypercubes.  
 
Vipin
----- 
 
 
PS.  The concept of iso-efficiency is introduced in the
following papers:
 
Parallel Depth First Search on the Ring Architecture
by Vipin Kumar, V. Nageshwara Rao and K. Ramesh.
To appear in the proc. of 1988 Parallel Processing Conference.
 
 
Parallel Depth-First Search on Multiprocessors, Part II: Analysis
by Vipin Kumar and V. Nageshwar Rao.
To appear in International Journal of Parallel Programming.
 
The second paper presents a work distribution scheme (with
applications in depth-first search and divide-and-conquer
algorithms) that has O(N log N) iso-efficiency function
for suitable parallel architectures.  Using this scheme,
we get linear speedup for over 100 processors on the BBN Butterfly.
Another formulation presented in the paper gives linear speedup 
on the Intel hypercube for 100 processors. 

kruskal@mimsy.UUCP (Clyde P. Kruskal) (04/06/88)

In article <1252@hubcap.UUCP> ut-sally!kumar@uunet.UU.NET (Vipin Kumar) writes:
>
>Clearly, it is fair to increase the problem size to test speedup
>on more processor.  If the problem size is fixed, then 
>the speedup will at best saturate at some point. (Each program
>has some (however small) fixed sequential component, which
>puts a ceiling on the obtainable speedup).
>Actually for many parallel algorithms, it is possible to obtain linear speedup
>for arbitrarily large N (the number of processors)
>by simply increasing the problem size.
>
>A more interesting question to ask is: how fast should
>the problem size (measured in terms of sequential execution time) 
>grow as a function of N  to maintain linear speedup (i.e., constant
>efficiency)?  In other words, what is the iso-efficiency function
>of the parallel formulation? (If the problem size needs to grow as
>f(N) to maintain constant efficiency, then the iso-efficiency function
>of the parallel formulation is f(N).) 
> 
	Discussion of iso-efficiency deleted


Basically I agree with everything written here ...
but as long as the issue has arisen I want to add some
historical information and talk about some relevant recent work. 

The idea that how fast the efficiency grows is an important
concept appears in my thesis [1].  This was actually joint work
with Allan Gottlieb and later appeared in JACM [2].
The idea also appears in a paper of Vitter and Simons [3].
Some practical aspects are discussed in a paper
from a NATO summer school [4]; in particular it talks
about Amdahl's Law and other alleged bounds on parallel speedup.

Recently, Larry Rudolph, Marc Snir, and I have returned
to this subject and obtained the following (theoretical) results [5]:

We define an algorithm to be in the class PE (parallel efficient)
if the parallel time for a problem of size N on P processors is
		T(P,N) = O(t(N)/P + P^k) ,
where t(N) is the time of the ("best") sequential algorithm.
I.e., PE captures the problems with parallel algorithms
that achieve optimal speedup over the optimal sequential time
for all problems whose execution times are relatively (polynomially)
large with respect to the number of processors.  Notice that
it is the sequential time t(N) we compare against, rather than the
problem size N; this is relevant for problems that have superpolynomial
execution times.
(Actually, for other reasons, we define PE somewhat differently,
but this definition is equivalent.)  We argue that this is the concept
of parallel performance that users of parallel machines desire.
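
One operational way to read the definition (not from the paper; the
time models and the choice of k below are hypothetical examples) is to
pick a candidate k and check that the ratio T(P,N) / (t(N)/P + P^k)
stays bounded as P and N grow together:

def t_seq(N):
    return N * N                        # e.g. an O(N^2) sequential algorithm

def t_par(P, N):
    return (N * N) / P + 10 * P         # perfect split plus an O(P) overhead term

def pe_ratio(P, N, k=1):
    return t_par(P, N) / (t_seq(N) / P + P ** k)

for P, N in [(16, 1000), (256, 10000), (4096, 100000)]:
    print(f"P={P:5d}  N={N:7d}  ratio={pe_ratio(P, N):.2f}")

A bounded ratio over the whole range is consistent with membership in
PE for that k; the example algorithm here satisfies it with k = 1.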

We prove that the class PE is invariant under the shared
memory models (PRAMs), independent of whether the machines
are synchronous or asynchronous, SIMD or MIMD, or have small
or large latency for the processors to communicate.
We also show that PE is invariant under the completely
connected memory module machines for probabilistic algorithms.

We define the class PE* of problems with almost efficient algorithms.
This class is invariant across PRAMs and down to the memory module
machines with the ability to route quickly such as Hypercube
and Butterfly machines.

The paper is quite long: there are other definitions and results,
but enough said for now.

[1]	C. P. Kruskal, Upper and Lower Bounds on the Performance
	of Parallel Algorithms, Ph.D. Thesis, NYU, 1981.

[2]	A. Gottlieb and C. P. Kruskal,
	Complexity results for permuting data and other
	computations on parallel processors.
	Journal of the Association for Computing Machinery 31
	(1984) 193-209.

[3]	J. S. Vitter and R. A. Simons,
	New classes for parallel complexity: a study of unification and other
	complete problems for P.
	IEEE Transactions on Computers TC-35
	(1986) 403-418.
	
[4]	C. P. Kruskal,
	Performance bounds on parallel processors: an optimistic view.
	Control Flow and Data Flow: Concepts of Distributed Programming,
	NATO ASI series, Series F: Computer and System Sciences, Vol. 14,
	Manfred Broy (ed.), Springer-Verlag (1985) 331-344.
	
[5]	C. P. Kruskal, L. Rudolph, and M. Snir,
	A complexity theory of efficient parallel algorithms,
	Technical Report (RC 13572 (#60702) 3/4/88),
	IBM T. J. Watson Research Center, 1988.

	An extended abstract will appear in ICALP, July 88
	(Intl. Conf. on Automata Lang. and Programming).

kumar@uunet.UU.NET (Vipin Kumar) (04/12/88)

In article <1317@hubcap.UUCP>, kruskal@mimsy.UUCP (Clyde P. Kruskal) writes:
> 
> 
> We define an algorithm to be in the class PE (parallel efficient)
> if the parallel time for a problem of size N on P processors is
> 		T(P,N) = O(t(N)/P + P^k) ,
> where t(N) is the time of the ("best") sequential algorithm.
> I.e., PE captures the problems with parallel algorithms
> that achieve optimal speedup over the optimal sequential time
> for all problems whose execution times are relatively (polynomially)
> large with respect to the number of processors.  Notice that
> it is sequential time t(N) we compare to rather problem size N; this
> is relevant for problems that have super polynomial executions times.
> (Actually, for other reasons, we define PE somewhat differently
> but this definition is equivalent.)  We argue that this is the concept
> of parallel performance that users of parallel machine desire.
> 
> We prove that the class PE is invariant under the shared
> memory models (PRAMs), independent of whether the machines
> are synchronous or asynchronous, SIMD or MIMD, or have small
> or large latency for the processors to communicate.
> We also show that PE is invariant under the completely
> connected memory module machines for probabilistic algorithms.
> 

According to the above definition of PE, any parallel algorithm
with a polynomial iso-efficiency function is a parallel efficient algorithm.
This definition of PE appears to be too weak.  The ideal parallel
formulation should have a linear iso-efficiency function (O(P), where
P is the number of processors).

In references [1] and [2], we present 3 different parallel formulations
of depth-first search on the ring architecture with iso-efficiency
functions of O(k^P), O(P^3 log P), and O(P^2 log P), respectively.
Clearly the first formulation is not in PE (its iso-efficiency
function is exponential). The second and third formulations are
in PE, but the third formulation is far superior to the
second, as it has a better iso-efficiency function. (See [1] and [2]).

The choice of architecture is also very important.  For a shared
memory machine (with message combining), we have a parallel
depth-first search formulation whose iso-efficiency function
is O(P log P), which is close to linear (and is significantly
better than O(P^2 log P)).
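
The practical difference between these iso-efficiency functions is easy
to see by tabulating how fast the problem size must grow (constants set
to 1, and k = 2 for the exponential case; both choices are only
illustrative):

from math import log2

funcs = {
    "O(2^P)":       lambda P: 2.0 ** P,
    "O(P^3 log P)": lambda P: P ** 3 * log2(P),
    "O(P^2 log P)": lambda P: P ** 2 * log2(P),
    "O(P log P)":   lambda P: P * log2(P),
}

for P in (8, 16, 32, 64):
    row = "  ".join(f"{name}={f(P):.3g}" for name, f in funcs.items())
    print(f"P={P:3d}  {row}")

Even at P = 64 the exponential formulation needs a problem roughly
10^19 times the base size, while the O(P log P) formulation needs only
a few hundred times the work that keeps one processor busy.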


I think that the users of parallel machines should look for
parallel formulations whose iso-efficiency functions are
linear (or almost linear).


Vipin
-----

References:

[1] Parallel Depth First Search on the Ring Architecture
by Vipin Kumar, V. Nageshwara Rao and K. Ramesh.
To appear in the proc. of 1988 Parallel Processing Conference.


[2] Parallel Depth-First Search on Multiprocessors, Part II: Analysis
by Vipin Kumar and V. Nageshwar Rao.
To appear in International Journal of Parallel Programming.