[comp.parallel] Opinions on Debugging Parallel Programs

utter@tcgould.tn.cornell.edu (Paula Sue Utter) (03/16/89)

One of the hot topics of research today is how to debug parallel
programs, both on shared memory multiprocessors and on distributed
memory machines.  Often it seems these debugger systems are developed
more for ease of implementation rather than for providing maximum
utility and ease of use.  What I'd like to do is get some opinions on
just what sort of features a good debugger system for parallel programs
should provide.
 
 
The kind of information I'm looking for includes:
 
Is it harder to write and debug new parallel programs, or to parallelize
"dusty deck" serial programs?
 
What are the most common bugs you have encountered during parallel
programming development and production runs (e.g., unintentional change
to a shared variable, etc.).
 
What methods have you used in your attempts to debug parallel programs?
Of these, which were most successful?
 
What types of tools do you think would be helpful in developing and
debugging parallel programs?  (For example, would it be helpful to
observe sequential execution within each process executing in parallel?)
 
What important events or features should be displayed when representing
parallel program execution?  (Some suggestions might be synchronization
mechanisms, interprocess communication patterns, updates to shared
variables, etc.)
 
When working with parallel programs, people often employ graphical
representations that reflect their mental model of the problem at hand.
Could you give me a verbal description of the way you envision such
things as:
          Parallel execution
          Interprocess communication
          Synchronization schemes
 
Since many people now include performance evaluation and improvement as
part of the debugging process when dealing with parallel programs, what
type of information would be useful in this area?
 
If you have used any existing parallel debugger systems, either
commercial or experimental, could you name them and give me some
feedback on their usefulness?
 
 
I'd really appreciate any opinions you have on this matter.  Please send
responses to utter@tcgould.tn.cornell.edu.  If I get enough responses,
I'll compile the results and post them here.  Thanks in advance.
 
Sue Utter
Technology Integration Group
Cornell National Supercomputer Facility

hammonds@riacs.edu (Steve Hammond) (03/17/89)

>One of the hot topics of research today is how to debug parallel
>programs, both on shared memory multiprocessors and on distributed
>memory machines.  Often it seems these debugger systems are developed
>more for ease of implementation rather than for providing maximum
>utility and ease of use.  What I'd like to do is get some opinions on
>just what sort of features a good debugger system for parallel programs
>should provide.
> 
> 
>The kind of information I'm looking for includes:
> 
>Is it harder to write and debug new parallel programs, or to parallelize
>"dusty deck" serial programs?

It is not clear to me how you measure "harder".  Do you mean harder
to get *something* running or harder to squeeze that last cpu second
out of the code?  I have written parallel algorithms from scratch
and parallelized sequential codes (not really dusty deck stuff since
it was pretty well written fortran code and the application was
amenable to ||'sm).  If good "software engineering" techniques
are applied then neither is very hard to program, i.e., get working
code.  To me, the hard part is the thought that goes into problem
partitioning and algorithm design before your hands ever touch the
keyboard.

The best way that I have found to get code (|| and sequential)
working is to get a kernel running and then incrementally add pieces
to it until you have worked up a complete system.  I have just
finished coding an iterative solver for large sparse linear
systems (arising from discretized PDEs) on a Sequent Balance 21000.
Now I am moving to the Connection Machine.

>What are the most common bugs you have encountered during parallel
>programming development and production runs (e.g., unintentional change
>to a shared variable, etc.).

The most common bug that I have run into is a synchronization
problem (on the sequent, an MIMD machine), one process modifying
a shared variable before it should.  It is difficult to explain
in just a few lines so I will leave it at that.

>What methods have you used in your attempts to debug parallel programs?
>Of these, which were most successful?

On the sequent, I used pdbx.  It really wasn't that helpful
because most of the errors I tried to find were due to
timing and often I could not find the error since breakpoints
set at the end of procedures synchronized the code and made it
run differently than under normal operating conditions.
Mostly I just started littering my code with barriers until
the problem went away and then I would start removing them
until the problem surfaced again.  That pointed to the timing problem,
which usually resulted from problem partitioning, etc.
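
A minimal sketch of the barrier trick, using Python threads rather than
anything Sequent-specific.  The names (run_phases, use_barrier) are made
up for illustration.  Each worker writes its own slot and then reads a
neighbour's; with the barrier in place nobody reads before everyone has
written, and taking it out brings the race back.

```python
import threading


def run_phases(n_workers, use_barrier=True):
    """Each worker writes its own slot, then reads its neighbour's slot."""
    shared = [None] * n_workers   # slots written in phase 1
    seen = [None] * n_workers     # what each worker read in phase 2
    barrier = threading.Barrier(n_workers)

    def worker(rank):
        shared[rank] = rank * 10              # phase 1: produce
        if use_barrier:
            barrier.wait()                    # wait until all have written
        neighbour = (rank + 1) % n_workers
        seen[rank] = shared[neighbour]        # phase 2: consume

    threads = [threading.Thread(target=worker, args=(r,))
               for r in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return seen
```

With use_barrier=True, run_phases(4) returns [10, 20, 30, 0].  With
use_barrier=False a reader can see None instead, depending on timing,
which is the kind of symptom the barrier hunt is meant to localize.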

>What types of tools do you think would be helpful in developing and
>debugging parallel programs?  (For example, would it be helpful to
>observe sequential execution within each process executing in parallel?)

I think a useful tool would be something that captured the order
of "events" to make a MIMD program have a repeatable order of
execution.  When I am debugging I want a deterministic sequence
of events.  For example, I want processes to finish tasks in the same order.
I believe something like this was being worked on at U. Rochester.
I think that one of the people involved was Tom LeBlanc if you
want to check into it.  It was being developed on their 128 node butterfly.
Anyone know the status of this?
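
The record-then-replay idea can be sketched in a few lines.  This is only
a toy in Python, assuming each worker makes exactly one ordered update;
OrderLog and run are hypothetical names, and it is not a description of
the Rochester system.

```python
import threading


class OrderLog:
    """Records the order of shared updates, or enforces a recorded order."""

    def __init__(self, replay=None):
        self.order = []                       # worker ids, in update order
        self.replay = list(replay) if replay is not None else None
        self.cond = threading.Condition()

    def _my_turn(self, worker_id):
        pos = len(self.order)
        return pos < len(self.replay) and self.replay[pos] == worker_id

    def ordered(self, worker_id, action):
        """Run action() under the lock; in replay mode, wait for our turn."""
        with self.cond:
            if self.replay is not None:
                self.cond.wait_for(lambda: self._my_turn(worker_id))
            action()
            self.order.append(worker_id)
            self.cond.notify_all()


def run(n_workers, log):
    """Start n_workers threads that each append their id to a shared trace."""
    trace = []
    threads = [
        threading.Thread(target=log.ordered,
                         args=(i, lambda i=i: trace.append(i)))
        for i in range(n_workers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return trace
```

A first run with OrderLog() records whatever order happened; passing
that order back as OrderLog(replay=...) makes a second run repeat it.
If the replay list does not match the workers that actually run, the
sketch simply waits forever, so a real tool would need a timeout.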

>Since many people now include performance evaluation and improvement as
>part of the debugging process when dealing with parallel programs, what
>type of information would be useful in this area?

Perhaps for a shared memory machine one would be interested in
bus contention or hot spots in memory.

>If you have used any existing parallel debugger systems, either
>commercial or experimental, could you name them and give me some
>feedback on their usefulness?

I have used pdbx.  It is not truly useful.  It does give one
the capability to stop all processes and let them execute
one at a time.  Basically, it was just dbx running with
multiple processes.

>Sue Utter
>Technology Integration Group
>Cornell National Supercomputer Facility


    Steve


-- 

 Steve Hammond  * Parallel Systems Division * RIACS * NASA Ames Research Center

midkiff@uicsrd.csrd.uiuc.edu (Sam Midkiff) (03/18/89)

>
>I think a useful tool would be something that captured the order
>of "events" to make a MIMD program have a repeatable order of
>execution.  When I am debugging I want a deterministic sequence
>of events.  For example, I want processes to finish tasks in the same order.
>I believe something like this was being worked on at U. Rochester.
>I think that one of the people involved was Tom LeBlanc if you
>want to check into it.  It was being developed on their 128 node butterfly.
>Anyone know the status of this?
>

Todd Allen and Sanjoy Ghosh, with Profs. David Padua and Perry Emrath at CSRD, 
U of Illinois have been working on a system which captures potential dependence
violations and allows them to be analyzed after the program execution.  Existing
synchronization is taken into account to reduce the amount of trace information
needed for the post-execution analyses.  Two papers covering this work are:


Todd R. Allen and David Padua, "Debugging Parallel Fortran on a Shared Memory 
Machine", Proc. of 1987 Int'l. Conf. on Parallel Processing, St. Charles, IL,
pages 721-727. 

Perry Emrath and Sanjoy Ghosh and David Padua, "Event Synchronization Analysis 
for Debugging Parallel Programs", Subm. for publ. to 1989 Int'l. Conf. on 
Parallel Processing, St. Charles, IL

Both are also CSRD tech reports.  The authors can be reached at CSRD, 104 S 
Wright St., 305 Talbot Lab, U of Il., Urbana, IL., 61802

news%glimmer%twitch@att.att.com (03/30/89)

In article <4821@hubcap.UUCP>, hammonds@riacs (Steve Hammond) writes:
>I think a useful tool would be something that captured the order
>of "events" to make a MIMD program have a repeatable order of
>execution.  When I am debugging I want a deterministic sequence
>of events.  For example, I want processes to finish tasks in the same order.
>I believe something like this was being worked on at U. Rochester.
>I think that one of the people involved was Tom LeBlanc if you
>want to check into it.  It was being developed on their 128 node butterfly.
>Anyone know the status of this?

BBN now has an event logger and display system for the Butterfly,
called gist.  You capture events by logging them locally in each
processor.  Then collect the logs and post-process them with a nice
interactive graphical front-end that lets you display selected event
types, selected processors, various time grains, etc.  Don't know if it
builds on the Rochester work or is home grown.
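
The log-locally, merge-later scheme is easy to sketch.  The Python below
is an illustration only: the Event fields and the function names are
assumptions, not gist's actual format or interface, and it assumes the
per-processor clocks are close enough to compare.

```python
import heapq
from typing import NamedTuple


class Event(NamedTuple):
    time: float   # local timestamp (assumed comparable across processors)
    proc: int     # processor that logged the event
    kind: str     # event type, e.g. "send", "recv", "lock"


def merge_logs(logs):
    """Merge per-processor logs, each already time-ordered, into one timeline."""
    return list(heapq.merge(*logs))


def select(timeline, kinds=None, procs=None):
    """Keep only the chosen event types and/or processors (None = keep all)."""
    return [e for e in timeline
            if (kinds is None or e.kind in kinds)
            and (procs is None or e.proc in procs)]
```

Each processor appends to its own list with no locking; merge_logs is
run once afterwards, and select plays the role of the display filters.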

>>If you have used any existing parallel debugger systems, either
>>commercial or experimental, could you name them and give me some
>>feedback on their usefulness?

Though I haven't used the Butterfly version, we built a simulator for
the Monarch that used a similar mechanism, enhanced to catch things
like switch and memory contention (and optionally make them into
events).  Very nice for elucidating algorithms and problems with their
memory referencing behavior.

Our view (well, one view here) on debugging parallel programs is that
it takes three steps:

1.  Debug it on one processor.
2.  Debug it on two processors.
3.  Run it on N processors.  Very few bugs show up here, though you
may run into hot spots and strange N-processor race effects.
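
As a rough sketch of the three steps, one can run the same job with 1,
2, and N workers and compare each result against a serial answer.  The
Python below assumes a toy workload (a sum of squares over integers);
parallel_sum_squares and check_scaling are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor


def parallel_sum_squares(data, n_workers):
    """Split data into n_workers strided chunks and sum the squares of each."""
    chunks = [data[i::n_workers] for i in range(n_workers)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(lambda c: sum(x * x for x in c), chunks))
    return sum(partials)


def check_scaling(data, worker_counts=(1, 2, 8)):
    """Report, per worker count, whether the parallel result matches serial."""
    expected = sum(x * x for x in data)
    return {n: parallel_sum_squares(data, n) == expected
            for n in worker_counts}
```

A False at 1 worker points to a plain logic bug, a False first seen at 2
points to the partitioning or sharing, and anything that only fails at N
is the hot-spot or race territory mentioned in step 3.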
--
/jr
jr@bbn.com or bbn!jr
C'mon big money!