utter@tcgould.tn.cornell.edu (Paula Sue Utter) (03/16/89)
One of the hot topics of research today is how to debug parallel programs, both on shared memory multiprocessors and on distributed memory machines. Often it seems these debugger systems are developed more for ease of implementation rather than for providing maximum utility and ease of use. What I'd like to do is get some opinions on just what sort of features a good debugger system for parallel programs should provide.

The kind of information I'm looking for includes:

Is it harder to write and debug new parallel programs, or to parallelize "dusty deck" serial programs?

What are the most common bugs you have encountered during parallel programming development and production runs (e.g., unintentional change to a shared variable, etc.).

What methods have you used in your attempts to debug parallel programs? Of these, which were most successful?

What types of tools do you think would be helpful in developing and debugging parallel programs? (For example, would it be helpful to observe sequential execution within each process executing in parallel?)

What important events or features should be displayed when representing parallel program execution? (Some suggestions might be synchronization mechanisms, interprocess communication patterns, updates to shared variables, etc.)

When working with parallel programs, people often employ graphical representations that reflect their mental model of the problem at hand. Could you give me a verbal description of the way you envision such things as:

    Parallel execution
    Interprocess communication
    Synchronization schemes

Since many people now include performance evaluation and improvement as part of the debugging process when dealing with parallel programs, what type of information would be useful in this area?

If you have used any existing parallel debugger systems, either commercial or experimental, could you name them and give me some feedback on their usefulness?

I'd really appreciate any opinions you have on this matter.
Please send responses to utter@tcgould.tn.cornell.edu. If I get enough responses, I'll compile the results and post them here.

Thanks in advance.

Sue Utter
Technology Integration Group
Cornell National Supercomputer Facility
hammonds@riacs.edu (Steve Hammond) (03/17/89)
>One of the hot topics of research today is how to debug parallel
>programs, both on shared memory multiprocessors and on distributed
>memory machines. Often it seems these debugger systems are developed
>more for ease of implementation rather than for providing maximum
>utility and ease of use. What I'd like to do is get some opinions on
>just what sort of features a good debugger system for parallel programs
>should provide.
>
>
>The kind of information I'm looking for includes:
>
>Is it harder to write and debug new parallel programs, or to parallelize
>"dusty deck" serial programs?

It is not clear to me how you measure "harder". Do you mean harder to get *something* running, or harder to squeeze that last CPU second out of the code? I have written parallel algorithms from scratch and parallelized sequential codes (not really dusty deck stuff, since it was pretty well written Fortran code and the application was amenable to parallelism). If good "software engineering" techniques are applied, then neither is very hard to program, i.e., to get working code. To me, the hard part is the thought that goes into problem partitioning and algorithm design before your hands ever touch the keyboard. The best way that I have found to develop code (parallel or sequential) is to get a kernel working and then incrementally add pieces to it until you have worked up a running system. I have just finished coding an iterative solver for large sparse linear systems (arising from discretized PDEs) on a Sequent Balance 21000. Now I am moving to the Connection Machine.

>What are the most common bugs you have encountered during parallel
>programming development and production runs (e.g., unintentional change
>to a shared variable, etc.).

The most common bug that I have run into is a synchronization problem (on the Sequent, an MIMD machine): one process modifying a shared variable before it should. It is difficult to explain in just a few lines, so I will leave it at that.
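The bug described above, one process modifying a shared variable before the others are done with it, can be illustrated with a small sketch. This is not Steve's solver: the function names (`relax_parallel`, `relax_sequential`), the ring-averaging update, and the use of Python threads are all assumptions made for illustration. Each worker reads its neighbours' current values, waits at a barrier, and only then writes its own entries.

```python
import threading

def relax_sequential(values, sweeps):
    """Reference result: every entry becomes the mean of its two ring neighbours."""
    cur = list(values)
    n = len(cur)
    for _ in range(sweeps):
        cur = [(cur[(i - 1) % n] + cur[(i + 1) % n]) / 2 for i in range(n)]
    return cur

def relax_parallel(values, sweeps, workers):
    """Same update with `workers` threads sharing one list (hypothetical example)."""
    shared = list(values)
    n = len(shared)
    barrier = threading.Barrier(workers)

    def worker(w):
        mine = range(w, n, workers)  # indices owned by this worker
        for _ in range(sweeps):
            # Phase 1: read the neighbours' values from the current sweep.
            new = {i: (shared[(i - 1) % n] + shared[(i + 1) % n]) / 2 for i in mine}
            barrier.wait()  # nobody writes until everybody has finished reading
            # Phase 2: publish this worker's new values.
            for i, v in new.items():
                shared[i] = v
            barrier.wait()  # nobody starts the next sweep until all writes are done

    threads = [threading.Thread(target=worker, args=(w,)) for w in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shared

if __name__ == "__main__":
    start = [0.0, 8.0, 0.0, 4.0, 0.0, 2.0, 0.0, 6.0]
    print(relax_parallel(start, 5, 4) == relax_sequential(start, 5))
```

Dropping the first `barrier.wait()` reintroduces the race: a fast worker can overwrite an entry that a slower neighbour has not read yet, and the result then depends on timing.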
>What methods have you used in your attempts to debug parallel programs?
>Of these, which were most successful?

On the Sequent, I used pdbx. It really wasn't that helpful, because most of the errors I tried to find were due to timing. Often I could not find the error, since breakpoints set at the end of procedures synchronized the code and made it run differently than under normal operating conditions. Mostly I just started littering my code with barriers until the problem went away, and then I would start removing them until the problem surfaced again. That pointed to the timing problem, which usually resulted from problem partitioning, etc.

>What types of tools do you think would be helpful in developing and
>debugging parallel programs? (For example, would it be helpful to
>observe sequential execution within each process executing in parallel?)

I think a useful tool would be something that captured the order of "events" to make an MIMD program have a repeatable order of execution. When I am debugging I want a deterministic sequence of events. For example, I want processes to finish tasks in the same order. I believe something like this was being worked on at U. Rochester. I think that one of the people involved was Tom LeBlanc, if you want to check into it. It was being developed on their 128-node Butterfly. Anyone know the status of this?

>Since many people now include performance evaluation and improvement as
>part of the debugging process when dealing with parallel programs, what
>type of information would be useful in this area?

Perhaps for a shared memory machine one would be interested in bus contention or hot spots in memory.

>If you have used any existing parallel debugger systems, either
>commercial or experimental, could you name them and give me some
>feedback on their usefulness?

I have used pdbx. It is not truly useful. It does give one the capability to stop all processes and let them execute one at a time.
Basically, it was just dbx running with multiple processes.

>Sue Utter
>Technology Integration Group
>Cornell National Supercomputer Facility

Steve
--
Steve Hammond * Parallel Systems Division * RIACS * NASA Ames Research Center
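The wish in this post for a repeatable order of "events" can be sketched as a record-and-replay wrapper. This is only an illustration of the general idea and makes no claim about how the Rochester work or any real tool does it; the class `EventOrder`, the helper `run`, and the event names such as `w0:t1` are all hypothetical. In record mode the order in which threads reach their events is logged; in replay mode each thread blocks until its event is next in the recorded order.

```python
import threading

class EventOrder:
    """Record the global order of named events, or force a recorded order."""

    def __init__(self, replay=None):
        self._cond = threading.Condition()
        self._replay = list(replay) if replay is not None else None
        self._pos = 0   # index of the next event expected during replay
        self.log = []   # order in which events actually happened

    def _is_next(self, name):
        # Past the end of the recording, events are allowed to run freely.
        return self._pos >= len(self._replay) or self._replay[self._pos] == name

    def event(self, name):
        with self._cond:
            if self._replay is not None:
                # Block until this event is the next one in the recorded order.
                self._cond.wait_for(lambda: self._is_next(name))
                self._pos += 1
                self._cond.notify_all()
            self.log.append(name)

def run(order, workers=4, tasks=3):
    """Start `workers` threads that each report `tasks` events; return the log."""
    def worker(i):
        for k in range(tasks):
            order.event(f"w{i}:t{k}")

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return order.log

if __name__ == "__main__":
    recorded = run(EventOrder())                 # whatever order the scheduler chose
    replayed = run(EventOrder(replay=recorded))  # the same order, forced
    print(replayed == recorded)
```

The sketch assumes event names are unique and that the recorded order keeps each thread's own events in program order; a recording that violates this would block forever during replay.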
midkiff@uicsrd.csrd.uiuc.edu (Sam Midkiff) (03/18/89)
>
>I think a useful tool would be something that captured the order
>of "events" to make a MIMD program have a repeatable order of
>execution. When I am debugging I want a deterministic sequence
>of events. For example, I want processes to finish tasks in the same order.
>I believe something like this was being worked on at U. Rochester.
>I think that one of the people involved was Tom LeBlanc if you
>want to check into it. It was being developed on their 128 node butterfly.
>Anyone know the status of this?
>

Todd Allen and Sanjoy Ghosh, with Profs. David Padua and Perry Emrath at CSRD, U of Illinois, have been working on a system which captures potential dependence violations and allows them to be analyzed after the program execution. Existing synchronization is taken into account to reduce the amount of trace information needed for the post-execution analyses.

Two papers covering this work are:

Todd R. Allen and David Padua, "Debugging Parallel Fortran on a Shared Memory Machine", Proc. of 1987 Int'l. Conf. on Parallel Processing, St. Charles, IL, pages 721-727.

Perry Emrath and Sanjoy Ghosh and David Padua, "Event Synchronization Analysis for Debugging Parallel Programs", Subm. for publ. to 1989 Int'l. Conf. on Parallel Processing, St. Charles, IL.

Both are also CSRD tech reports. The authors can be reached at CSRD, 104 S Wright St., 305 Talbot Lab, U of Il., Urbana, IL 61802.
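As a toy illustration of post-execution analysis of a trace, and emphatically not the CSRD system (whose design is described only in the papers cited above), the sketch below flags pairs of accesses that no synchronization orders. It assumes a hypothetical trace format in which barriers are the only synchronization and every access records how many barriers its thread had passed (its "epoch"). Two accesses to the same variable in the same epoch, from different threads, with at least one write, are reported as a potential conflict.

```python
from collections import defaultdict
from typing import NamedTuple

class Access(NamedTuple):
    thread: int
    epoch: int      # number of barriers this thread had passed (assumed trace field)
    var: str
    is_write: bool

def potential_conflicts(trace):
    """Return (var, epoch, thread_a, thread_b) for each unordered conflicting pair."""
    by_slot = defaultdict(list)
    for a in trace:
        by_slot[(a.var, a.epoch)].append(a)

    conflicts = []
    for (var, epoch), accesses in sorted(by_slot.items()):
        for i, a in enumerate(accesses):
            for b in accesses[i + 1:]:
                # Same variable, same epoch: no barrier separates the two accesses.
                if a.thread != b.thread and (a.is_write or b.is_write):
                    conflicts.append((var, epoch, a.thread, b.thread))
    return conflicts

if __name__ == "__main__":
    trace = [
        Access(0, 0, "x", True),   # thread 0 writes x ...
        Access(1, 0, "x", False),  # ... while thread 1 reads it: reported
        Access(0, 1, "y", True),
        Access(1, 2, "y", False),  # separated from the write by a barrier: not reported
    ]
    print(potential_conflicts(trace))
```

Accesses in different epochs are treated as ordered by the intervening barrier, so they never need to be compared with each other.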
news%glimmer%twitch@att.att.com (03/30/89)
In article <4821@hubcap.UUCP>, hammonds@riacs (Steve Hammond) writes:
>I think a useful tool would be something that captured the order
>of "events" to make a MIMD program have a repeatable order of
>execution. When I am debugging I want a deterministic sequence
>of events. For example, I want processes to finish tasks in the same order.
>I believe something like this was being worked on at U. Rochester.
>I think that one of the people involved was Tom LeBlanc if you
>want to check into it. It was being developed on their 128 node butterfly.
>Anyone know the status of this?

BBN now has an event logger and display system for the Butterfly, called gist. You capture events by logging them locally in each processor. Then you collect the logs and post-process them with a nice interactive graphical front-end that lets you display selected event types, selected processors, various time grains, etc. I don't know if it builds on the Rochester work or is home grown.

>>If you have used any existing parallel debugger systems, either
>>commercial or experimental, could you name them and give me some
>>feedback on their usefulness?

Though I haven't used the Butterfly version, we built a simulator for the Monarch that used a similar mechanism, enhanced to catch things like switch and memory contention (and optionally make them into events). Very nice for elucidating algorithms and problems with their memory referencing behavior.

Our view (well, one view here) on debugging parallel programs is that it takes three steps:

1. Debug it on one processor.
2. Debug it on two processors.
3. Run it on N processors. Very few bugs show up here, though you may run into hot spots and strange N-processor race effects.

--
/jr
jr@bbn.com or bbn!jr
C'mon big money!
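The log-locally, merge-later scheme described in this post can be sketched in a few lines. This is not gist and says nothing about its real interface; the names `Event`, `LocalLog`, `merge_logs`, and `select` are hypothetical, and a shared `time.perf_counter` clock is assumed, which only makes sense for threads inside one process. Each processor appends to its own log with no locking, and the logs are merged into one timeline afterwards and filtered by event type, processor, or time window.

```python
import heapq
import time
from typing import NamedTuple

class Event(NamedTuple):
    when: float
    proc: int
    kind: str
    detail: str

class LocalLog:
    """Per-processor log; only its owner appends, so no locking is needed."""

    def __init__(self, proc, clock=time.perf_counter):
        self.proc = proc
        self._clock = clock
        self.events = []

    def log(self, kind, detail=""):
        self.events.append(Event(self._clock(), self.proc, kind, detail))

def merge_logs(logs):
    """Merge per-processor logs (each already in time order) into one timeline."""
    return list(heapq.merge(*(log.events for log in logs),
                            key=lambda e: (e.when, e.proc)))

def select(events, kinds=None, procs=None, start=None, end=None):
    """Keep events matching the given kinds, processors, and [start, end) window."""
    out = []
    for e in events:
        if kinds is not None and e.kind not in kinds:
            continue
        if procs is not None and e.proc not in procs:
            continue
        if start is not None and e.when < start:
            continue
        if end is not None and e.when >= end:
            continue
        out.append(e)
    return out

if __name__ == "__main__":
    logs = [LocalLog(p) for p in range(2)]
    logs[0].log("send", "to 1")
    logs[1].log("recv", "from 0")
    for e in select(merge_logs(logs), kinds={"send", "recv"}):
        print(e)
```

On a real distributed-memory machine the per-node clocks would not agree, so the timestamps would need some form of correction before merging; the sketch ignores that.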