masticol@lento.rutgers.edu (Steve Masticola) (07/31/90)
The following are the responses to my query (about a month ago) asking for anecdotes about deadlock and deadlock avoidance in practice. Several who replied included all or part of the original posting, so I won't reproduce it here. - Steve Masticola (masticol@athos.rutgers.edu) -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=- From: Richard P. LaRowe <rpl@cs.duke.edu> In article <9499@hubcap.clemson.edu> you write: >I'd like to get in touch with people who have made written programs >that deadlocked (or failed to terminate because some thread could not >synchronize). What I'd like to know is: > >- The language and synchronization primitives you used; >- A description of the problem; >- How you resolved the problem; >- Whether you think your solution might be part of a general set of > programming conventions that could be used to write programs that do > not deadlock. > >Thanks for your help! I'll summarize responses and post them here. > >- Steve Masticola (masticol@athos.rutgers.edu) My work involves alot of OS kernel hacking on a BBN GP1000 (running DUnX, our own kernel based on BBN's nX). Aside from an occasional stupid mistake (e.g. forgetting to unlock something I should have), most of my problems have been related to one of two things: 1) sequential bugs ... my favorite is when I incorrectly indexed into an array, accidently setting a spinlock in the process (this little beauty took a week to track down). This was written in C, by the way. A language that checked subscripts would have made the problem alot easier to diagnose. I've had a few problems of this variety. What makes them difficult to track down is that I immediately start looking for race conditions and such, completely ignoring the "simple sequential stuff". 2) Problems with processor interrupt priorities. Things get real messy when you have to obtain locks when running at different priorities. For example, some VM operations require that remote processors flush their TLB's. The processor requesting the flush is necessarily running at a high priority (processing a page fault), and also holds some locks (can't just go mucking around with the VM data structures). If another processor tries locking something the first already has, while it is running at high (low, depending how you look at it) priority, it won't receive the flush interrupt from the first, and deadlock results. Luckily, the set of such possiblities is fairly small, and is mostly resolved by optimism and checking. In other words, there are certain places where a failed lock will check to see if someone else is waiting for him (e.g., check to see if there's a node waiting for him to respond to some interrupt). Seems like it's all working now ... 8-) I don't suspect either of these stories lends itself immediately to any programming conventions. I suppose the approach used in #2 could be applied just about anywhere, but there are probably cleaner solutions for cases in which only locks are being dealt with (e.g. ordered resources, etc.). I'd be interested in hearing about you're doing with this info. Sounds like it could be leading to (or part of) some interesting research ... -Rick LaRowe (rpl@cs.duke.edu) -=-=- From: Duke Briscoe <briscoe-duke@cs.yale.edu> In article <9499@hubcap.clemson.edu> you write: >I'd like to get in touch with people who have made written programs >that deadlocked (or failed to terminate because some thread could not >synchronize). What I'd like to know is: > >- The language and synchronization primitives you used; >- A description of the problem; >- How you resolved the problem; Actually, we very rarely had problems with deadlock, which I think points out the efficacy of the methods we used, although to some extent it reflects the nature of our particular programs. The language was an object-oriented language called Possum which was implemented on top of BBN Butterfly Lisp. I designed and implemented the language 4 years ago, and I wrote a paper about it, but I didn't adequately pursue getting the paper published because I had left MITRE where the work was done to become a grad student here at Yale. There were two classes of objects which had locks, one class had a single lock for the whole object, the other had a lock for each instance variable of an object. Using multiple inheritance, we could mix any other class with one or both of these lockable classes. However, our programming discipline would never use both types of locks for a single object. The locks could either be locked to allow multiple readers or a single writer. We usually used programming constructs which clearly delimited the code text over which a lock was held; this avoided problems with having certain flows of control forget to unlock a lock. More primitive locking constructs could be used if you couldn't bound the duration of holding a lock within one textual block. >- Whether you think your solution might be part of a general set of > programming conventions that could be used to write programs that do > not deadlock. When locking more than one instance variable (iv) of an object, we followed a discipline of only acquiring locks to the ivs in the order in which they were named in the class declaration. We changed the ordering in the class declaration according to what the needs of all the object's methods were. This is a technique which I think has been called the "ordered resource" method; you could probably find a reference in an operating system textbook. It worked effectively for us since it turned out that we rarely needed for a task to keep locks on the instance variables of more than one object at a time, and we were always able to order the instance variables in a way that would satisfy our needs. This may be typical of object oriented programs, but I guess it might not work out so well for all cases. Duke Briscoe briscoe@cs.yale.edu -=-=- From: rsb@mcc.com (Richard S. Brice) If you are looking for examples on deadlock, I suggest you teach a class on system simulation. Ask students to simulate almost any kind of system that can deadlock and you'll get all the examples you need. Even those who know about and understand deadlock danger, avoidance, detection etc. will fall into the trap sooner or later. -=-=- From: rsb@mcc.com (Richard S. Brice) I assume that you've read the classic papers on deadlock avoidance, prevention, detection etc which flooded the literature in the late 60's and early 70's. I'd be surprised to discover there's much new to learn, unleas you're looking for implementation techniques for some of those ideas which are simple and efficient. -=-=- From: pcg@compsci.aberystwyth.ac.uk I'd like to get in touch with people who have made written programs that deadlocked (or failed to terminate because some thread could not synchronize). What I'd like to know is: - The language and synchronization primitives you used; - A description of the problem; - How you resolved the problem; - Whether you think your solution might be part of a general set of programming conventions that could be used to write programs that do not deadlock. Written in C with sempahores A driver for a kind of filesystem under UNIX. It keeps a cache of blocks, some of them data blocks, some of them index blocks. When a data block needs writing to the disc, the index block mapping it must be consulted; if the cache is full of "dirty" blocks, you cannot read in the data block. The solution is either to keep separate caches for data and index blocks, or to have an emergency reserve of one or two block slots. It is a classic resource deadlock, like the famous one involving spooling input and output on the same partition. The only way out is either separate resource pools, either static, or more dynamically with a banker algorithm. -=-=- From: John Vaudin <arthur@flame.warwick.ac.uk> Hi, I have been writing a simple store and forward message passing router for an 8x8 array of Transputers. The idea was to link the router in with other C programs to provide message passing. I encountered two deadlock problems doing this. The first was that the router itself deadlocked. This was cured by using a routing algorithm that ensured no cycles existed in the router. I used a very simple rule, there are probably better ways of doing it.. The array is square and four connected. My routing rules always route north/south THEN east/west. This means no node ever routes a message coming in from the east/west onto its north/south outputs. That plus the fact that messages never get routed back the way they came means that there are no cycles hence no deadlock. The other problem was that the router interacted with the programs it was linked with such that the two together deadlocked. This was caused by dependancies being generated by the router because of the sequential way it delivers messages. That is if a process is not ready to recieve an incoming message then any messages behind it in the router's buffer cannot be delivered, thus any processes waiting for those messages are now waiting for the process that the first message in the queue is bein sent to. The approach here was to get rid of the FIFO queue inherant in the router. Now all messages are recieved by a buffer process on each node which can check to see if the process it is being sent to is waiting or not. If so the message is delivered, otherwise it is discarded and a resend message is sent back to the sender. In this way messages can always be delivered to a waiting process regardless of any other processes on that node. I think getting rid of cycles is pretty standard if it can be done, but it is not always feasible. The rest of the time I just use buffers ( with a non FIFO behaviour) which essentially stretches any cycles, and allows them to be detected ie full buffers. I am not very happy with the system I have built. It does work, but it uses a variety of techniques to do so in a rather ad hoc fashion. I have failed to find any useful help on how to write deadlock free programs. If you have any good references please let me know. John Vaudin C.S. Dept. Warwick University U.K. arthur@uk.ac.warwick.cs -=-=- Sender: xeroxmailer!ALEX_HASSAN@ail.co.uk Steve, I write programs regularly using the Strand88 programming language. Strand88 is interesting from the point of view of synchronisation for the following reasons: - When writing a Strand program, no explicit synchronisation code is required. This is because the language is based on data flow via single-assignment variables. - It follows that it is not possible to write a syntactically correct Strand program that deadlocks or experiences any other form of synchronisation error. - In fact, what Strand does is to represent synchronisation errors as program errors - i.e. they are generally obvious and can be tested for by a simple syntax checker, rather than being ephemeral quantities resulting in obscure behaviour dependent on timing. Example: The following Strand code implements a producer-consumer pair of processes. Note that the only explicit reference to the parallel machine is the @ notation used to post procedure calls to the appropriate processor. The other required communication is handled implicitly. ------------------------------------------------------ -compile(free). -exports([pc/1]). % This procedure starts the two processes, returning the final result pc(R) :- produce(100, L)@1, % on processor 1 consume(L, R)@2. % on processor 2 % This procedure generates a list of N integers, 1 to N produce(N, L) :- N > 0 | L := [N|L1], N1 is N - 1, produce(N1, L1). produce(0, L) :- L := []. % This procedure consumes a list of integers, summing the elements consume([N|L], R) :- R is R1 + N, consume(L, R1). consume([], R) :- R := 0. ------------------------------------------------------ I think it will be interesting to see what responses you get to your request, as it is often quite hard to convince people how nasty synchronisation problems can be - that is until they encounter one! If you want to know more about Strand88 here is our US office address: Strand Software Technologies Inc., 15220 N. W. Greenbrier Parkway, Suite 350, Beaverton, OR 97006. Tel: 503-690-9830. Strand88 is currently implemented for shared memory machines (Sequent, Encore), distributed memory machines (various Transputer machines, iPSC/2) and workstation networks (Sun). All Strand programs are portable at the binary level and can interface to C and FORTRAN. Regards, Alex Hassan (alex@ail.co.uk), Strand Software Technologies. -=-=- From: David F. Kotz <dfk@cs.duke.edu> Well, I have experienced deadlock many times. - The language and synchronization primitives you used; I am working in a BBN GP1000 using the Uniform System, a library of memory and process management functions to make programming the GP1000 easier than in pure Mach (or previously, in pure Chrysalis). The US provides a simple spinlock mechanism that you can use for synchronization, and the hardware/OS provides atomic FetchAndOp functions. Also, the US has constructs (called Generators) to create multiple tasks and wait for them to complete. It is trivial to write a shared-memory program using these synchronization mechanisms that will not deadlock and will do what you want. But it will also probably run slowly. To get higher performance you need to forget about using Generators except to start a task on each node, and then use your own methods to do fine-grained synchronizations. This is how I program and how I tend to get deadlock. I use shared memory locations as synchronization variables both as plain locks and in other ways (counters, state flags, data structures). - A description of the problem; I don't have any specific problem in mind. My errors usually fall into two categories: blatant mistakes and race conditions. In the mistake category there is the common mistake of forgetting to unlock a lock under certain circumstances. More subtle problems involve counters: some processes increment the counter, and some decrement the counter, at different times and at multiple and widely-separated parts of the code, and in time. The counter's value may go up and down. But in some cases I forget to decrement or forget to increment, or perhaps mistakenly do so because some logic errors in examining other state variables. The counter is left in an incorrect state. Deadlock may occur when a process waits for the counter to go to zero, say, but it never does because it's off by one; if the other processes are waiting for this one, then you have deadlock. These can arise due to incredibly complicated series of events that are hard to track. The real problem is an overly-complex system: a large asynchronous program where processes are working in several parts of the code simultaneously, synchronized through many data structures and state variables. The exact interplay of these variables is difficult to track in the programmer's head, and in fact would be nearly impossible for anyone else (especially compilers) to figure out. - How you resolved the problem; Usually the problem was in detecting the error: deadlock is often not observed until long after the perpetrator has left. The counter is -1, and a process expects it to fall from postive down to zero, and all other processes are idle. Now what? Long ago, some process did an extra decrement on the variable, but when, and why? Often I put in some tracing to track the changes to this variable, and see what's going on. The solution is very specific to the problem and often really nasty. The fix is generally simple: fix the logic error. - Whether you think your solution might be part of a general set of programming conventions that could be used to write programs that do not deadlock. No, because my synchronizations are always really hairy and I use ad hoc methods. Some programming constructs or conventions could help, but as with most, they may be easy, but performance suffers. You want higher performance, you play the game. In many ways, it is the same as correct sequential programming: you think about preconditions and postconditions, invariants, etc. But I have never found a regular way think through a set of operations to convince myself that it works and will not deadlock. This is especially difficult as several synchronization mechanisms begin to interact with each other.... Good luck. -=-=- From: David F. Kotz <dfk@cs.duke.edu> I suppose I didn't mean fine-grained task sizes; probably a poor choice of words. I was thinking of it in terms of synchronization: ie, more locks with small scope instead of fewer locks with large scope. Also, using other concurrency mechanisms (such as fetch&add) to allow greater concurrency than might be available with simple locked critical sections. For example, a recent application started out with one large critical section, controlled by a lock. Worked fine, but was slow. So I opened it up by allowing more processes into the code at one time, using "finer-grained" locks and other methods to control the concurrency. Now all of a sudden it was more complicated and prone to mistakes - including deadlock. [Then I started hacking on the local/remote memory accesses, by manually replicating and distributing data to all processes. This added another level of complexity and more mistakes.] Certainly, automated management of the NUMA memory issues would have helped. (A little bit of bias shows here, since I am part of a group here that is studying this.) Of course, it may not have done as well as I did, but would have been easier. dave
hawley@uunet.UU.NET (David John Hawley) (07/31/90)
In article <9917@hubcap.clemson.edu> masticol@lento.rutgers.edu (Steve Masticola) compiled: >Sender: xeroxmailer!ALEX_HASSAN@ail.co.uk > >I write programs regularly using the Strand88 programming language. Strand88 >is interesting from the point of view of synchronisation for the following >reasons: > >- When writing a Strand program, no explicit synchronisation code is required. >This is because the language is based on data flow via single-assignment >variables. > >- It follows that it is not possible to write a syntactically correct Strand >program that deadlocks or experiences any other form of synchronisation error. Explicit or not, the syncronization code is still there. More importantly, the dependencies are still there. I program in KL1, a sibling of Strand. I personally testify that both ... >- In fact, what Strand does is to represent synchronisation errors as program >errors - i.e. they are generally obvious and can be tested for by a simple >syntax checker, rather than being ephemeral quantities resulting in obscure >behaviour dependent on timing. ... stupid mistakes (forgetting to instantiate a variable, close a stream, etc) and more erudite mistakes (race conditions) are possible in KL1. I don't think Strand can be so different. I would be delighted if someone would correct me. >I think it will be interesting to see what responses you get to your request, >as it is often quite hard to convince people how nasty synchronisation problems >can be - that is until they encounter one! I agree whole-heartedly ;-) On a more positive note, I also believe that the concurrent logic languages have a lot to offer in terms of simplicity and clarity, compared to traditional ''sequential + concurrent_primitives'' approach. --------------------------- David Hawley, ICOT, 4th Lab csnet: hawley%icot.jp@relay.cs.net uucp:{enea,inria,mit-eddie,ukc}!icot!hawley ICOT, 1-4-28 Mita, Minato-ku, Tokyo 108 JAPAN. TEL/FAX {81-3-456-}2514/1618