utter@tcgould.tn.cornell.edu (Paula Sue Utter) (05/20/89)
At last, here is a summary of the responses I received to my posting of several questions concerning parallel debugging. I got 19 responses; several were from people working in the area of debugging. I summarized the answers to the various questions rather than posting individual responses. Since I'm fairly new to posting to newsgroups, I wasn't sure if this was proper etiquette (hopefully it is). I'd like to thank everyone who took the time to respond.

Here at the Cornell National Supercomputer Facility we're facing the task of trying to provide truly useful parallel debugging tools to our users. It was great to get input from experienced parallel programmers about the kind of features they've noticed were needed.

********************************************************************
********************************************************************
>Is it harder to write and debug new parallel programs, or to parallelize
>"dusty deck" serial programs?

The general consensus is that it is harder to parallelize "dusty deck" serial programs, for a number of reasons. The primary problem is that the serial programs were designed with no thought to parallelism. Additionally, one has to deal with bad programming practice and often very little knowledge of what the code really does.

******************************************************************
******************************************************************
>What are the most common bugs you have encountered during parallel
>programming development and production runs (e.g., unintentional change
>to a shared variable, etc.)?

For shared memory machines, the most common bug involves errors in access to shared variables. Either a value is private when it should be shared, or shared when it should be private. This situation can be a symptom of either improper data sharing or faulty synchronization around critical sections.
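To illustrate the shared-variable bug, here is a hypothetical sketch (mine, not from any respondent): several threads perform a read-modify-write on one shared counter. Without mutual exclusion around the critical section, updates can be lost; with the lock, the final value is deterministic.

```python
import threading

def increment_many(counter, lock, n):
    # The read-modify-write below is the critical section; without the
    # lock, two threads can read the same old value and lose an update.
    for _ in range(n):
        with lock:
            counter[0] += 1

counter = [0]           # shared variable (one object visible to all threads)
lock = threading.Lock()
threads = [threading.Thread(target=increment_many, args=(counter, lock, 10000))
           for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter[0])       # 40000 with the lock; often less if the lock is removed
```

The symmetric bug (a value shared when it should be private) would show up here as threads overwriting each other's scratch data; the fix is per-thread local storage rather than a lock.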
Bugs reported for distributed memory machines involve errors in message traffic, such as mismatched messages, where a process is expecting one type of message but receives another.

A difficulty with debugging parallel programs is that the symptom of the problem may show up on another processor, rather than on the one where the problem actually occurred. Also, the symptom can be masked by the action of another processor before the error can be investigated with a debugger.

******************************************************************
******************************************************************
>What methods have you used in your attempts to debug parallel programs?
>Of these, which were most successful?

The most common method mentioned was to serialize particular routines which are normally run in parallel; thus the internal synchronization within the routine can be tested. Other methods employed were the use of print statements, partial correctness checking techniques, code splitting, and tracking data being passed (in a message-passing environment). Various debuggers were also mentioned; a summary of user experience is given under a later question.

Generally, it was recommended that during program development, the parallel program should always be run on only one processor initially, then two, then up to the desired number of processors. The importance of designing good test cases with the goal of exercising all possible execution paths was also pointed out.

******************************************************************
******************************************************************
>What types of tools do you think would be helpful in developing and
>debugging parallel programs?  (For example, would it be helpful to
>observe sequential execution within each process executing in parallel?)

A tool which allows examination of the sequence of events which occurred during execution would be helpful.
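A minimal version of such an event-sequence tool can be sketched as per-process logs merged afterwards. This is a hypothetical illustration of the idea, not a description of any tool named by respondents: each process records (timestamp, process id, event) tuples locally during the run, and a post-processing step merges the logs into one ordered history.

```python
import time

def record(log, pid, event):
    # Each process appends to its own log; no shared state is touched
    # during the run, so the instrumentation perturbs timing very little.
    log.append((time.monotonic(), pid, event))

def merge_histories(logs):
    # Post-processing step: merge per-process logs into a single
    # execution history ordered by timestamp.
    return sorted((e for log in logs for e in log), key=lambda e: e[0])

log_a, log_b = [], []
record(log_a, 0, "send msg")
record(log_b, 1, "recv msg")
history = merge_histories([log_a, log_b])
print([(pid, event) for _, pid, event in history])
# [(0, 'send msg'), (1, 'recv msg')]
```

A real tool would need a clock that is meaningful across processors; `time.monotonic` stands in for that here.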
Events involving interprocess communications (either shared variable access or message passing) appear to be most interesting. Being able to debug a single process within a multi-process program would also be useful. Another idea involved code splitting tools, which aid in creating good partial execution environments.

As a general comment, it was noted that any type of tool should be scalable, i.e., it should allow debugging of programs running on any number of processors.

******************************************************************
******************************************************************
>What important events or features should be displayed when representing
>parallel program execution?  (Some suggestions might be synchronization
>mechanisms, interprocess communication patterns, updates to shared
>variables, etc.)

As mentioned in the previous question, interprocess communication is of primary interest.

******************************************************************
******************************************************************
> When working with parallel programs, people often employ graphical
> representations that reflect their mental model of the problem at hand.
> Could you give me a verbal description of the way you envision such
> things as:
>   Parallel execution
>   Interprocess communication
>   Synchronization schemes

One suggestion was to represent parallel execution by showing different pools of data associated with each processor, along with the communications lines between processors. Interprocess communications would then be illustrated by sending messages down those lines. Synchronization could be displayed as the organization of those messages in time. Another suggestion from someone in Europe utilized Petri nets as a means of modeling synchronization.
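The Petri net idea can be made concrete with a few lines. The sketch below is my own hypothetical example: places hold tokens, a transition fires only when all of its input places have tokens, and mutual exclusion is modeled by a single "mutex" token that two processes compete for.

```python
def fire(places, inputs, outputs):
    # A transition fires only if every input place holds a token;
    # firing consumes one token per input and produces one per output.
    if all(places[p] > 0 for p in inputs):
        for p in inputs:
            places[p] -= 1
        for p in outputs:
            places[p] += 1
        return True
    return False

# Mutual exclusion net: two processes compete for one 'mutex' token.
places = {"p1_ready": 1, "p2_ready": 1, "mutex": 1,
          "p1_crit": 0, "p2_crit": 0}
fire(places, ["p1_ready", "mutex"], ["p1_crit"])            # P1 enters
entered = fire(places, ["p2_ready", "mutex"], ["p2_crit"])  # P2 blocked
print(entered, places["p1_crit"])   # False 1
```

Displaying such a net graphically, with tokens moving as the program runs, is one way to render a synchronization scheme visible.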
******************************************************************
******************************************************************
>Since many people now include performance evaluation and improvement as
>part of the debugging process when dealing with parallel programs, what
>type of information would be useful in this area?

Suggestions included bus contention, memory access hot spots, detecting excessive spin waiting at synchronization points, and various message-passing statistics (number of messages sent and received, on both a global and per processor basis, memory channel utilization, receive queue statistics, delay times between message send and message receive, and 'snapshots' of message loading and process loading across message-driven systems). Probably the most basic requirement would be access to a good clock.

******************************************************************
******************************************************************
>If you have used any existing parallel debugger systems, either
>commercial or experimental, could you name them and give me some
>feedback on their usefulness?

Various debuggers were mentioned:

PDBX on the Sequent was described as being "just dbx running with multiple processes". It isn't of much help with timing errors, apparently, because setting breakpoints perturbs the normal sequence of events which one is trying to examine.

DECON on the Intel iPSC/2 was mentioned as available, but had not been used.

Parasight is a debugger developed by Ilya Gertner, Ziya Aral, and Greg Schaffer at Encore for shared memory multiprocessors. According to the source, there are "*many* new and 'neat' ideas in here which make debugging parallel programs not only doable, but easy".

BBN supplies the GIST system, which is based on some of LeBlanc and Mellor-Crummey's work at the U. of Rochester. Events are recorded by individual processors.
A post-processor then merges the logs into a single execution history which can be graphically displayed. Apparently, this system is especially useful for performance debugging.

An experimental system is being developed at CSRD, U. of Illinois, by Sanjoy Ghosh, Todd Allen, Dave Padua, and Perry Emrath which provides execution traces in areas where potential nondeterminism has been detected. References for this system are listed in the accompanying bibliography.

Pi (Process Inspector), developed by T. A. Cargill, was also mentioned. The system was described as a "killer symbolic debugger which features a far more functional UI than dbx". References are included in the next section.

******************************************************************
* References
******************************************************************

There are two very good places to start looking for information about parallel debuggers. One is a survey done by Charlie McDowell and others at the University of California, Santa Cruz. The second is the proceedings of a workshop on parallel and distributed debugging held at the U. of Wisconsin in May, 1988. Below are the references for these:

Charles E. McDowell, David P. Helmbold, and Anil K. Sahai, "A Survey of Debugging Tools for Concurrent Programs," Technical Report UCSC-CRL-87-22, Computer Research Laboratory, University of California, Santa Cruz, December 1987.

"Proceedings of the ACM SIGPLAN and SIGOPS Workshop on Parallel and Distributed Debugging" (May 5-6, 1988, University of Wisconsin, Madison), in ACM SIGPLAN Notices, Vol. 24, No. 1, pp. 89-99, January 1989.

Below are other references various people sent in:

Todd R. Allen and David Padua, "Debugging Parallel Fortran on a Shared Memory Machine," Proc. of the 1987 Int'l. Conf. on Parallel Processing, St. Charles, IL, pp. 721-727.

Perry Emrath, Sanjoy Ghosh, and David Padua, "Event Synchronization Analysis for Debugging Parallel Programs," submitted for publication to the 1989 Int'l. Conf. on Parallel Processing, St. Charles, IL.

Both of the above are also CSRD tech reports. The authors can be reached at CSRD, 104 S. Wright St., 305 Talbot Lab, U. of Il., Urbana, IL 61802.

Common bugs are addressed in the following article:

James R. McGraw and Timothy S. Axelrod, "Exploiting Multiprocessors: Issues and Options," in Robert G. Babb II (ed.), Programming Parallel Processors, Addison-Wesley, 1988.

J. H. Griffin, H. J. Wasserman, and L. P. McGavran, "A Debugger for Parallel Processes," Software Practice & Experience, Vol. 18, No. 12, pp. 1179-1190, Dec. 1988.

T. A. Cargill, "Pi: A Case Study in Object-Oriented Programming," OOPSLA 1986 Proceedings, September 1986.

T. A. Cargill, "The Feel of Pi," Proceedings of the Winter USENIX Meeting, Denver, CO, Jan. 1986.

Richard N. Taylor, "A general-purpose algorithm for analyzing concurrent programs," CACM, Vol. 26, No. 5 (May 1983), pp. 362-376.

Richard N. Taylor, "Complexity of analyzing the synchronization structure of concurrent programs," Acta Informatica, Vol. 19 (1983), pp. 57-83.

Richard N. Taylor and Leon J. Osterweil, "Anomaly detection in concurrent software by static data flow analysis," IEEE Trans. Soft. Eng., Vol. SE-6, No. 3 (May 1980), pp. 265-277.

Aaron J. Gordon and Raphael A. Finkel, "TAP: A tool to find timing errors in distributed programs," in Proceedings of the Workshop on Software Testing, Banff, Canada, July 1986, pp. 154-163, IEEE Computer Society Press.