[comp.parallel] Parallel Debugging Survey Summary

utter@tcgould.tn.cornell.edu (Paula Sue Utter) (05/20/89)

At last, here is a summary of the responses I received to my posting of
several questions concerning parallel debugging.  I got 19 responses;
several were from people working in the area of debugging.  I summarized
the answers to the various questions rather than posting individual
responses.  Since I'm fairly new to posting to newsgroups, I wasn't sure
if this was proper etiquette (hopefully it is).
 
I'd like to thank everyone who took the time to respond.  Here at the
Cornell National Supercomputer Facility we're facing the task of trying
to provide truly useful parallel debugging tools to our users.  It was
great to get input from experienced parallel programmers about the kind
of features they've noticed were needed.
 
********************************************************************
********************************************************************
>Is it harder to write and debug new parallel programs, or to parallelize
>"dusty deck" serial programs?
 
The general consensus is that it is harder to parallelize "dusty deck"
serial programs, for a number of reasons.  The primary problem is that
the serial programs were designed with no thought to parallelism.
Additionally, one has to deal with bad programming practice and often
very little knowledge of what the code really does.
 
******************************************************************
******************************************************************
>What are the most common bugs you have encountered during parallel
>programming development and production runs (e.g., unintentional change
>to a shared variable, etc.).
 
For shared memory machines, the most common bug involves errors in
access to shared variables.  Either a value is private when it should be
shared, or shared when it should be private.  This situation can be a
symptom of either improper data sharing or faulty synchronization around
critical sections.
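
To make this concrete, here is a minimal sketch of the classic
shared-variable race, written with POSIX threads purely as an
illustration; the same bug arises under whatever parallel constructs
a given machine provides:

    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 4
    #define NITER    100000

    long counter = 0;     /* shared, but updated without protection */
    pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    void *worker(void *arg)
    {
        (void)arg;
        for (int i = 0; i < NITER; i++) {
            counter++;    /* read-modify-write race: updates get lost */
            /* correct version: protect the critical section
             *   pthread_mutex_lock(&lock);
             *   counter++;
             *   pthread_mutex_unlock(&lock);
             */
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t[NTHREADS];
        for (int i = 0; i < NTHREADS; i++)
            pthread_create(&t[i], NULL, worker, NULL);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(t[i], NULL);
        printf("counter = %ld\n", counter);   /* expect 400000 */
        return 0;
    }

Run a few times, the racy version rarely reaches 400000, and the
amount lost varies from run to run; that nondeterminism is exactly
what makes these bugs hard to reproduce under a debugger.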
 
Bugs reported for distributed memory machines involve errors in message
traffic, such as mismatched messages, where a process is expecting one
type of message but receives another.
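
A thread-based analogue of that mismatch is sketched below; the
mailbox and tag names are invented for illustration.  On a real
message-passing machine the same bug shows up as a receive that never
matches any send, and the receiving process simply hangs:

    #include <pthread.h>

    enum { TAG_DATA = 1, TAG_DONE = 2 };

    struct mailbox {
        pthread_mutex_t lock;
        pthread_cond_t  ready;
        int             tag;          /* 0 means empty */
        int             payload;
    } box = { PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, 0, 0 };

    void *sender(void *arg)
    {
        (void)arg;
        pthread_mutex_lock(&box.lock);
        box.tag = TAG_DATA;           /* sender posts a DATA message */
        box.payload = 42;
        pthread_cond_broadcast(&box.ready);
        pthread_mutex_unlock(&box.lock);
        return NULL;
    }

    int main(void)
    {
        pthread_t t;
        pthread_create(&t, NULL, sender, NULL);

        pthread_mutex_lock(&box.lock);
        while (box.tag != TAG_DONE)   /* bug: expects DONE, gets DATA */
            pthread_cond_wait(&box.ready, &box.lock);  /* hangs here */
        pthread_mutex_unlock(&box.lock);

        pthread_join(t, NULL);
        return 0;
    }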
 
A difficulty with debugging parallel programs is that the symptom
of the problem may show up on another processor, rather than on
the one where the problem actually occurred.  Also, the symptom
can be masked by the action of another processor before the error
can be investigated with a debugger.
 
******************************************************************
******************************************************************
>What methods have you used in your attempts to debug parallel programs?
>Of these, which were most successful?
 
The most common method mentioned was to serialize particular routines
which are normally run in parallel; thus the internal synchronization
within the routine can be tested.
 
Other methods employed were the use of print statements, partial
correctness checking techniques, code splitting, and tracking the data
being passed (in a message-passing environment).  Various debuggers were
also mentioned; a summary of user experience is given under a later
question.
 
Generally, it was recommended that during program development, a
parallel program should first be run on only one processor, then on two,
and then scaled up to the desired number of processors.
The importance of designing good test cases with the goal of exercising
all possible execution paths was also pointed out.
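
A minimal sketch of that style of testing is given below;
parallel_sum is a hypothetical stand-in for the routine under test,
and the one-worker run serves as the reference answer for the larger
runs:

    #include <pthread.h>
    #include <stdio.h>

    #define N 1000000

    static double data[N];

    struct slice { double *v; int lo, hi; double partial; };

    void *sum_slice(void *arg)
    {
        struct slice *s = arg;
        s->partial = 0.0;
        for (int i = s->lo; i < s->hi; i++)
            s->partial += s->v[i];
        return NULL;
    }

    double parallel_sum(double *v, int n, int nworkers)
    {
        pthread_t tid[16];
        struct slice sl[16];
        double total = 0.0;
        for (int w = 0; w < nworkers; w++) {
            sl[w].v  = v;
            sl[w].lo = w * n / nworkers;
            sl[w].hi = (w + 1) * n / nworkers;
            pthread_create(&tid[w], NULL, sum_slice, &sl[w]);
        }
        for (int w = 0; w < nworkers; w++) {
            pthread_join(tid[w], NULL);
            total += sl[w].partial;
        }
        return total;
    }

    int main(void)
    {
        for (int i = 0; i < N; i++)
            data[i] = (double)(i % 7);

        double reference = parallel_sum(data, N, 1);  /* serial run */
        for (int w = 2; w <= 8; w *= 2) {
            double got = parallel_sum(data, N, w);
            printf("%d workers: %s\n", w,
                   got == reference ? "matches" : "DIFFERS");
        }
        return 0;
    }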
 
******************************************************************
******************************************************************
>What types of tools do you think would be helpful in developing and
>debugging parallel programs?  (For example, would it be helpful to
>observe sequential execution within each process executing in parallel?)
 
A tool which allows examination of the sequence of events which occurred
during execution would be helpful.  Events involving interprocess
communication (either shared variable access or message passing) appear
to be the most interesting.  Being able to debug a single process within
a multi-process program would also be useful.  Another idea involved
code splitting tools, which aid in creating good partial execution
environments.
 
As a general comment, it was noted that any type of tool should
be scalable, i.e., it should allow debugging of programs running on
any number of processors.
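
One plausible shape for such a tool, again sketched with POSIX
threads, is a per-process event log written with no shared locking
and merged off-line; because each worker only ever touches its own
buffer, the same scheme works for any number of processors:

    #include <pthread.h>
    #include <stdio.h>
    #include <time.h>

    #define NTHREADS  4
    #define MAXEVENTS 128

    struct event { struct timespec when; int thread; const char *what; };

    static struct event log_buf[NTHREADS][MAXEVENTS];
    static int          log_len[NTHREADS];

    void log_event(int me, const char *what)
    {
        struct event *e = &log_buf[me][log_len[me]++];
        clock_gettime(CLOCK_MONOTONIC, &e->when);
        e->thread = me;
        e->what   = what;
    }

    void *worker(void *arg)
    {
        int me = (int)(long)arg;
        log_event(me, "enter region");
        /* ... real work would go here ... */
        log_event(me, "leave region");
        return NULL;
    }

    int main(void)
    {
        pthread_t t[NTHREADS];
        for (long i = 0; i < NTHREADS; i++)
            pthread_create(&t[i], NULL, worker, (void *)i);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(t[i], NULL);

        /* dump each log; a post-processor would merge them by time */
        for (int i = 0; i < NTHREADS; i++)
            for (int j = 0; j < log_len[i]; j++)
                printf("%ld.%09ld thread %d: %s\n",
                       (long)log_buf[i][j].when.tv_sec,
                       log_buf[i][j].when.tv_nsec,
                       log_buf[i][j].thread, log_buf[i][j].what);
        return 0;
    }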
 
******************************************************************
******************************************************************
>What important events or features should be displayed when representing
>parallel program execution?  (Some suggestions might be synchronization
>mechanisms, interprocess communication patterns, updates to shared
>variables, etc.)
 
As mentioned in the previous question, interprocess communication
is of primary interest.
 
******************************************************************
******************************************************************
> When working with parallel programs, people often employ graphical
> representations that reflect their mental model of the problem at hand.
> Could you give me a verbal description of the way you envision such
> things as:
>           Parallel execution
>           Interprocess communication
>           Synchronization schemes
 
 
One suggestion was to represent parallel execution by showing different
pools of data associated with each processor, along with the
communications lines between processors.  Interprocess communications
would then be illustrated by sending messages down those lines.
Synchronization could be displayed as the organization of those messages
in time.
 
Another suggestion from someone in Europe utilized Petri nets
as a means of modeling synchronization.
 
******************************************************************
******************************************************************
>Since many people now include performance evaluation and improvement as
>part of the debugging process when dealing with parallel programs, what
>type of information would be useful in this area?
 
Suggestions included information on bus contention and memory access hot
spots, detection of excessive spin waiting at synchronization points,
and various message-passing statistics (number of messages sent and
received, on both a global and per-processor basis, memory channel
utilization, receive queue statistics, delay times between message send
and message receive, and 'snapshots' of message loading and process
loading across message-driven systems).  Probably the most basic
requirement would be access to a good clock.
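
As a rough illustration of the spin-waiting item, here is a sketch of
a simple test-and-set lock (built on GCC's __sync builtins)
instrumented to count failed acquisition attempts per thread; the
counts point directly at the hottest synchronization points:

    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 4

    static volatile int busy = 0;        /* 0 = free, 1 = held */
    static long shared_work = 0;
    static long spin_count[NTHREADS];    /* failed lock attempts */

    static void spin_lock(int me)
    {
        while (__sync_lock_test_and_set(&busy, 1))
            spin_count[me]++;            /* record every wasted spin */
    }

    static void spin_unlock(void)
    {
        __sync_lock_release(&busy);
    }

    static void *worker(void *arg)
    {
        int me = (int)(long)arg;
        for (int i = 0; i < 100000; i++) {
            spin_lock(me);
            shared_work++;               /* tiny critical section */
            spin_unlock();
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t[NTHREADS];
        for (long i = 0; i < NTHREADS; i++)
            pthread_create(&t[i], NULL, worker, (void *)i);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(t[i], NULL);

        for (int i = 0; i < NTHREADS; i++)
            printf("thread %d spun %ld times\n", i, spin_count[i]);
        printf("shared_work = %ld (expect %d)\n",
               shared_work, NTHREADS * 100000);
        return 0;
    }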
 
******************************************************************
******************************************************************
>If you have used any existing parallel debugger systems, either
>commercial or experimental, could you name them and give me some
>feedback on their usefulness?
 
Various debuggers were mentioned:
 
PDBX on the Sequent was described as being "just dbx running with
multiple processes".  It isn't of much help with timing errors,
apparently, because setting breakpoints perturbs the normal sequence
of events which one is trying to examine.
 
DECON on the Intel iPSC/2 was mentioned as available, but had not been
used.
 
Parasight is a debugger developed by Ilya Gertner, Ziya Aral, and Greg
Schaffer at Encore for shared memory multiprocessors.  According to the
source, there are "*many* new and 'neat' ideas in here which make
debugging parallel programs not only doable, but easy".
 
BBN supplies the GIST system, which is based on some of LeBlanc and
Mellor-Crummey's work at U. of Rochester.  Events are recorded by
individual processors.  A post-processor then merges the logs into a
single execution history which can be graphically displayed.
Apparently, this system is especially useful for performance debugging.
 
An experimental system is being developed at CSRD, U. of Illinois, by
Sanjoy Ghosh, Todd Allen, Dave Padua, and Perry Emrath which provides
execution traces in areas where potential nondeterminism has been
detected.  References for this system are listed in the accompanying
bibliography.
 
Pi (Process inspector), developed by T. A. Cargill, was also mentioned.
The system was described as being a "killer symbolic debugger which
features a far more functional UI than dbx".  References are included
in the next section.
 
******************************************************************
*          References
******************************************************************
 
There are two very good places to start looking for information about
parallel debuggers.  One is a survey done by Charlie McDowell and
others at the University of California, Santa Cruz.  The second is the
proceedings of a workshop on parallel and distributed debugging held
at the U. of Wisconsin in May 1988.  Below are the references for
these:
 
Charles E. McDowell, David P. Helmbold, and Anil K. Sahai.
A Survey of Debugging Tools for Concurrent Programs.
University of California, Santa Cruz, Technical Report UCSC-CRL-87-22,
Computer Research Laboratory, December, 1987.
 
Proceedings of the ACM SIGPLAN and SIGOPS Workshop on Parallel and
Distributed Debugging, Madison, Wisconsin.  In SIGPLAN Notices,
Vol. 24, No. 1, pp. 89-99, January, 1989.
 
Below are other references various people sent in:
 
Todd R. Allen and David Padua, "Debugging Parallel Fortran on a Shared
Memory Machine", Proc. of the 1987 Int'l. Conf. on Parallel Processing,
St. Charles, IL, pages 721-727.
 
Perry Emrath, Sanjoy Ghosh, and David Padua, "Event Synchronization
Analysis for Debugging Parallel Programs", submitted for publication
to the 1989 Int'l. Conf. on Parallel Processing, St. Charles, IL.
 
Both are also CSRD tech reports.  The authors can be reached at CSRD,
104 S. Wright St., 305 Talbot Lab, U. of Illinois, Urbana, IL 61802.
 
Common bugs are addressed in the following article:
James R. McGraw and Timothy S. Axelrod, "Exploiting Multiprocessors:
Issues and Options" in Robert G. Babb II (ed.), Programming Parallel
Processors, Addison-Wesley, 1988.
 
"A Debugger for Parallel Processes," J.H.  Griffin, H.J.  Wasserman,
L.P.  McGavran, _Software Practice & Experience_, Vol.  18, No.  12, pp.
1179- 1190, Dec.  1988.
 
"Proceedings of the ACM SIGPLAN and SIGOPS Workshop on Parallel and
Distributed Debugging," May 5-6, 1988, University of Wisconsin.  _ACM
SIGPLAN Notices_, Vol.  24, No.  1, 1989.
 
"Pi: A Case Study in Object-Oriented Programming," T.A. Cargill, _OOPSLA
1986 Proceedings_, September, 1986.
 
"The Feel of Pi," T.A. Cargill, _Proceedings Winter USENIX Meeting_,
Denver CO, Jan. 1986.
 
Taylor, Richard N.  1983.  A general-purpose algorithm for analyzing
concurrent programs.  CACM 26/5 (May): 362-376.

Taylor, Richard N.  1983.  Complexity of analyzing the synchronization
structure of concurrent programs.  Acta Informatica, 19, 57-83.
 
Taylor, Richard N. and Leon J. Osterweil.  1980.  Anomaly detection in
concurrent software by static data flow analysis.  IEEE Trans. Soft. Eng.
SE-6/3 (May):  265-277.
 
Gordon, Aaron J. and Raphael A. Finkel.  1986.  TAP: A tool to find
timing errors in distributed programs.  In Proceedings of the Workshop
on Software Testing, Banff, Canada, July 1986, 154-163.  IEEE Computer
Society Press.