[comp.parallel] Summary: Difficult Scheduling Examples

carl@lemon.lcs.mit.edu (Carl Waldspurger) (07/03/90)

Here is a summary of the responses to my request for difficult
scheduling problems.  Many thanks to all those who responded.

		- Carl


My original post:

> Hi.  I'm interested in techniques for managing computational
> resources in parallel systems, such as load-balancing policies
> and user-level scheduling mechanisms.
> 
> I would like to start amassing a collection of parallel
> programs that are difficult to schedule; i.e. programs that
> perform poorly using conventional scheduling techniques, and
> programs that require sophisticated partitioning and scheduling
> mechanisms in order to perform well.
> 
> If you have any parallel programs that meet this description,
> or can provide me with pointers into relevant literature, I would
> like to hear from you.
> 
> Please send e-mail to carl@lemon.lcs.mit.edu; I will post a
> follow-up summary of any information that I receive.


*** Response #1

From: Gary L Dare <gld@cunixd.cc.columbia.edu>
Subject: Re: Difficult Scheduling Examples Wanted

Check out any literature on parallel circuit simulators.  Highly
irregular, and quite dependent on the topology of a specific design.  A
real nasty for any regular parallel machine structure or scheduler.

gld

~~~~~~~~~~~~~~~~~~~~~~~ Je me souviens ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Gary L. Dare				"An independent Quebec
> gld@cunixD.cc.columbia.EDU		 in a strong Canada!"
> gld@cunixc.BITNET			 - Yvon Deschamps, comedian


*** Response #2

From: reiher@onyx.jpl.nasa.gov (Peter Reiher)
Subject: Re: Difficult Scheduling Examples Wanted

Are you familiar with the concepts of virtual time and Time Warp?  
These terms refer to an unusual form of synchronization based entirely on
rollback, with no blocking.  So far, the method has been used with great
success in parallelizing discrete event simulations.  Other synchronization
methods have not been very successful in dealing with this sort of program.
If you need them, I can get you many references on the Time Warp Operating
System, developed at JPL.
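
For readers unfamiliar with the idea, here is a toy single-process
sketch of rollback-based synchronization in the Time Warp style (a
hypothetical illustration, not the JPL Time Warp Operating System;
real Time Warp also sends anti-messages to cancel output, which is
omitted here): events are executed optimistically in timestamp order,
state is checkpointed, and a straggler event forces rollback and replay.

```python
class TimeWarpProcess:
    """Toy sketch of optimistic (Time Warp style) event processing:
    execute events speculatively, checkpoint state, and roll back on
    a straggler.  Anti-message cancellation is omitted."""

    def __init__(self):
        self.lvt = 0                 # local virtual time
        self.state = 0               # toy state: running sum of event values
        self.snapshots = [(0, 0)]    # (timestamp, state) checkpoints
        self.processed = []          # executed events, kept for replay

    def execute(self, ts, value):
        self.state += value
        self.lvt = ts
        self.snapshots.append((ts, self.state))

    def receive(self, ts, value):
        if ts >= self.lvt:
            # normal case: event is in the local future, execute optimistically
            self.execute(ts, value)
            self.processed.append((ts, value))
        else:
            # straggler: restore the last checkpoint before ts ...
            while len(self.snapshots) > 1 and self.snapshots[-1][0] >= ts:
                self.snapshots.pop()
            self.lvt, self.state = self.snapshots[-1]
            # ... then replay the straggler plus all later events, in order
            redo = sorted([e for e in self.processed if e[0] >= ts]
                          + [(ts, value)])
            self.processed = [e for e in self.processed if e[0] < ts]
            for ets, ev in redo:
                self.execute(ets, ev)
                self.processed.append((ets, ev))
```

Note there is no blocking anywhere: the process never waits for a
message, it just pays for mis-speculation with rollback work.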


			Peter Reiher
			reiher@onyx.jpl.nasa.gov
			. . . cit-vax!elroy!jato!jade!reiher

*** Response #3

From: Richard A. Huff <rahuff@hplabsz.hpl.hp.com>
Subject: Scheduling

I'm not sure if this is the type of problem you're looking for
(since "scheduling" is a little ambiguous), but try this:

Given a simple triply nested matrix multiply, how can we get the
compiler to determine an appropriate blocking strategy that
minimizes the cache<-->memory traffic?  The amount of blocking
should depend on the floating-point add/multiply latencies and
the number of available fp registers.

This might not appear to be a scheduling problem, but do note that
a) block matrix methods are REQUIRED for decent performance on CPUs
   with caches
b) the blocking and scheduling algorithms need to be tightly coupled

In the end, we'll probably just write some runtime libraries in
hand coded assembly, but it would be nice to get the compiler to
do some of this automatically.
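
A minimal sketch of the blocking idea (in Python rather than the
hand-coded assembly mentioned above; the block size bs is a placeholder
that a compiler would tune to the cache size, fp latencies, and
register count):

```python
def blocked_matmul(A, B, n, bs):
    """Cache-blocked (tiled) matrix multiply: each bs x bs tile of A, B,
    and C is reused while resident in cache, cutting cache<-->memory
    traffic roughly by a factor of bs versus the naive triple loop.
    A and B are n x n lists of lists; bs need not divide n."""
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, bs):              # tile origins for i
        for kk in range(0, n, bs):          # tile origins for k
            for jj in range(0, n, bs):      # tile origins for j
                # multiply one tile pair into the corresponding C tile
                for i in range(ii, min(ii + bs, n)):
                    for k in range(kk, min(kk + bs, n)):
                        a = A[i][k]
                        for j in range(jj, min(jj + bs, n)):
                            C[i][j] += a * B[k][j]
    return C
```

The three outer loops only reorder the same n^3 multiply-adds, so the
result is identical to the naive loop nest; only the memory access
pattern changes.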

Enjoy,

Richard Huff
rahuff@hplabsz.hpl.hp.com


*** Response #4

From: schwan%wayward@gatech.edu (Karsten Schwan)
Subject: difficult parallel programs

Here are a couple of references of my own...basically from the domain
of optimization methods...what they have in common is dynamically
decomposed domains....Karsten

@inproceedings{Sch88b,
	Key="Schwan and Blake",
	Author="Karsten Schwan and John Gawkowski and Ben Blake",
	Title="Process and Workload Migration for a Parallel 
               Branch-And-Bound Algorithm on a Hypercube Multicomputer",
	Booktitle="Third Conference on Hypercube Concurrent Computers
                   and Applications, Pasadena, CA", 
	Organization="ACM, Jet Propulsion Laboratories",
	Month="Jan.",
	year=1988,
	pages="1520-1530"}

@article{Sch89a,
	Key="Schwan et al.",
	Author="Karsten Schwan and Ben Blake and Win Bo and John Gawkowski",
	Title="Global Data and Control in Multicomputers: Operating System 
               Primitives and Experimentation with a Parallel
               Branch-and-Bound Algorithm",
	Journal="Concurrency: Practice and Experience",
	Publisher="Wiley and Sons",
	Pages="191-218",
	Year=1989,
	Month="Dec."}


*** Response #5

From: Lynn Sutherland <lynn@noah.arc.ab.ca>
Subject: Unbalanced Programs

I am working on a Master's thesis that compares a number of load
balancing strategies.  Instead of using real programs, I decided
to generate sample programs that exhibit varying amounts of load
imbalance.  The programs that I generate are basically trees with
different average branching factors and amounts of "skewedness" and
density.  I am not sure how closely these trees model real programs,
but I can argue empirically that they are similar to typical chess
or production system programs.
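
A rough sketch of one way to generate such skewed trees (the skew
model and all parameter names here are hypothetical, not Sutherland's
actual generator): each node spawns up to `branch` children, with the
probability of expanding a child falling off from left to right, so
the leftmost subtrees are systematically heavier.

```python
import random

def gen_tree(depth, branch, skew, rng):
    """Return the total work (node count) of a randomly generated task
    tree.  skew=0.0 gives a full, balanced tree; larger skew starves
    the rightmost children, producing the load imbalance described."""
    if depth == 0:
        return 1                      # leaf = one unit of work
    work = 1
    for child in range(branch):
        # child 0 is always expanded; later children less and less often
        if rng.random() < max(0.0, 1.0 - skew * child):
            work += gen_tree(depth - 1, branch, skew, rng)
    return work

rng = random.Random(42)
balanced = gen_tree(8, 2, 0.0, rng)   # skew 0: full binary tree
skewed = gen_tree(8, 2, 0.9, rng)     # heavy skew: near-linear chain
```

Comparing the two work totals gives a crude measure of how much
imbalance a load-balancing strategy would have to absorb.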

If you get a list of "typical unbalanced programs" I would appreciate
a copy, just so I know what kinds of programs my system is missing.

Lynn Sutherland
Alberta Research Council
lynn@noah.arc.ab.ca


*** Response #6

From: george%avocet@cs.utah.edu (Lal George)
Subject: Re: Difficult Scheduling Examples Wanted

I have submitted a paper to ICPP related to the management of 
computational resources. I enclose the abstract in case this should be
of interest to you.

\begin{center} \bf Abstract \end{center}
{\sl
	An efficient scheduling strategy for shared-memory
multiprocessors is described. The rapid dissemination of tasks to
available processors and ready queues is crucial to the performance of
any parallel system: these overheads determine the attainable speedup,
and poor techniques for addressing them can lead to severe degradation
in performance, particularly at high processor counts. This work has
been conducted in the context of a parallel functional language, CoF,
where the parallelism is usually fine grained and the efficient
assignment of tasks to processors is even more important. In such
systems, observing strict queue semantics (i.e., FIFO) is not
essential, which allows for very efficient algorithms such as the one
described here. On the BBN GP1000, our technique was superior in
performance to a centralized queue and has the potential of performing
well on a fully configured GP1000.
}
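
One way to picture the relaxed-FIFO, decentralized dissemination the
abstract describes (a hypothetical toy, not the CoF scheduler itself):
each processor keeps a local ready queue and, when its own queue is
empty, takes work from a random victim. Because strict FIFO order is
not required, most dequeues hit the local queue instead of contending
on one central lock.

```python
import random
from collections import deque

def schedule(tasks, num_procs, seed=0):
    """Toy decentralized scheduler: tasks are scattered across
    per-processor ready queues; an idle processor steals from a random
    victim.  Returns the execution order and the number of steals.
    Relaxed (non-FIFO) ordering is what makes the local/steal split
    legal.  All names here are illustrative, not CoF's."""
    rng = random.Random(seed)
    queues = [deque() for _ in range(num_procs)]
    for i, t in enumerate(tasks):              # scatter tasks round-robin
        queues[i % num_procs].append(t)
    done, steals, remaining = [], 0, len(tasks)
    while remaining:
        for p in range(num_procs):
            if queues[p]:
                done.append(queues[p].popleft())    # cheap local dequeue
                remaining -= 1
            else:
                victim = rng.randrange(num_procs)   # steal on empty queue
                if queues[victim]:
                    done.append(queues[victim].pop())
                    remaining -= 1
                    steals += 1
    return done, steals
```

A centralized scheme would funnel every one of these dequeues through
a single shared queue, which is exactly the bottleneck the abstract
reports at high processor counts.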

*** End of Responses