[comp.parallel] Deadlock: Summary of responses

masticol@lento.rutgers.edu (Steve Masticola) (07/31/90)

The following are the responses to my query (about a month ago) asking
for anecdotes about deadlock and deadlock avoidance in practice.
Several who replied included all or part of the original posting, so I
won't reproduce it here.

- Steve Masticola (masticol@athos.rutgers.edu)

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

From: Richard P. LaRowe <rpl@cs.duke.edu>

In article <9499@hubcap.clemson.edu> you write:
>I'd like to get in touch with people who have made written programs
>that deadlocked (or failed to terminate because some thread could not
>synchronize). What I'd like to know is:
>
>- The language and synchronization primitives you used;
>- A description of the problem;
>- How you resolved the problem;
>- Whether you think your solution might be part of a general set of
>  programming conventions that could be used to write programs that do
>  not deadlock. 
>
>Thanks for your help! I'll summarize responses and post them here.
>
>- Steve Masticola (masticol@athos.rutgers.edu)

My work involves alot of OS kernel hacking on a BBN GP1000 (running
DUnX, our own kernel based on BBN's nX).  Aside from an occasional
stupid mistake (e.g. forgetting to unlock something I should have),
most of my problems have been related to one of two things:

1) sequential bugs ... my favorite is when I incorrectly indexed
   into an array, accidently setting a spinlock in the process
   (this little beauty took a week to track down).  This was written 
   in C, by the way. A language that checked subscripts would have
   made the problem alot easier to diagnose.  I've had a few problems 
   of this variety.  What makes them difficult to track down is that
   I immediately start looking for race conditions and such, 
   completely ignoring the "simple sequential stuff".  

2) Problems with processor interrupt priorities.  Things get real
   messy when you have to obtain locks when running at different
   priorities.  For example, some VM operations require that remote
   processors flush their TLB's.  The processor requesting the
   flush is necessarily running at a high priority (processing a
   page fault), and also holds some locks (can't just go mucking
   around with the VM data structures).  If another processor tries
   locking something the first already has, while it is running
   at high (low, depending how you look at it) priority, it won't
   receive the flush interrupt from the first, and deadlock results.

   Luckily, the set of such possiblities is fairly small, and is
   mostly resolved by optimism and checking.  In other words, there
   are certain places where a failed lock will check to see if someone
   else is waiting for him (e.g., check to see if there's a node
   waiting for him to respond to some interrupt).  Seems like it's
   all working now ... 8-)

I don't suspect either of these stories lends itself immediately to
any programming conventions.  I suppose the approach used in #2
could be applied just about anywhere, but there are probably cleaner
solutions for cases in which only locks are being dealt with (e.g.
ordered resources, etc.).

I'd be interested in hearing about you're doing with this info. 
Sounds like it could be leading to (or part of) some interesting research ...

-Rick LaRowe (rpl@cs.duke.edu)

-=-=-

From: Duke Briscoe <briscoe-duke@cs.yale.edu>

In article <9499@hubcap.clemson.edu> you write:
>I'd like to get in touch with people who have made written programs
>that deadlocked (or failed to terminate because some thread could not
>synchronize). What I'd like to know is:
>
>- The language and synchronization primitives you used;
>- A description of the problem;
>- How you resolved the problem;

Actually, we very rarely had problems with deadlock, which I think
points out the efficacy of the methods we used, although to some
extent it reflects the nature of our particular programs.  The
language was an object-oriented language called Possum which was
implemented on top of BBN Butterfly Lisp.  I designed and implemented
the language 4 years ago, and I wrote a paper about it, but I didn't
adequately pursue getting the paper published because I had left MITRE
where the work was done to become a grad student here at Yale.  There
were two classes of objects which had locks, one class had a single
lock for the whole object, the other had a lock for each instance
variable of an object.  Using multiple inheritance, we could mix any
other class with one or both of these lockable classes.  However, our
programming discipline would never use both types of locks for a
single object.  The locks could either be locked to allow multiple
readers or a single writer.  We usually used programming constructs
which clearly delimited the code text over which a lock was held; this
avoided problems with having certain flows of control forget to unlock
a lock.  More primitive locking constructs could be used if you
couldn't bound the duration of holding a lock within one textual
block.

>- Whether you think your solution might be part of a general set of
>  programming conventions that could be used to write programs that do
>  not deadlock. 

When locking more than one instance variable (iv) of an object, we
followed a discipline of only acquiring locks to the ivs in the order
in which they were named in the class declaration.  We changed the
ordering in the class declaration according to what the needs of all
the object's methods were.  This is a technique which I think has been
called the "ordered resource" method; you could probably find a
reference in an operating system textbook.  It worked effectively for
us since it turned out that we rarely needed for a task to keep locks
on the instance variables of more than one object at a time, and we
were always able to order the instance variables in a way that would
satisfy our needs.  This may be typical of object oriented programs,
but I guess it might not work out so well for all cases.

Duke Briscoe
briscoe@cs.yale.edu

-=-=-

From: rsb@mcc.com (Richard S. Brice)

If you are looking for examples on deadlock, I suggest you teach a class
on system simulation.  Ask students to simulate almost any kind of
system that can deadlock and you'll get all the examples you need.  Even
those who know about and understand deadlock danger, avoidance, detection
etc. will fall into the trap sooner or later.

-=-=-

From: rsb@mcc.com (Richard S. Brice)

I assume that you've read the classic papers on deadlock avoidance, 
prevention, detection etc which flooded the literature in the late
60's and early 70's.  I'd be surprised to discover there's much
new to learn, unleas you're looking for implementation techniques
for some of those ideas which are simple and efficient.

-=-=-

From: pcg@compsci.aberystwyth.ac.uk

   I'd like to get in touch with people who have made written programs
   that deadlocked (or failed to terminate because some thread could not
   synchronize). What I'd like to know is:

   - The language and synchronization primitives you used;
   - A description of the problem;
   - How you resolved the problem;
   - Whether you think your solution might be part of a general set of
     programming conventions that could be used to write programs that do
     not deadlock. 

Written in C with sempahores

A driver for a kind of filesystem under UNIX. It keeps a cache of blocks,
some of them data blocks, some of them index blocks. When a data block
needs writing to the disc, the index block mapping it must be consulted;
if the cache is full of "dirty" blocks, you cannot read in the data
block.

The solution is either to keep separate caches for data and index
blocks, or to have an emergency reserve of one or two block slots.

It is a classic resource deadlock, like the famous one involving
spooling input and output on the same partition. The only way out is
either separate resource pools, either static, or more dynamically with
a banker algorithm.

-=-=-

From: John Vaudin <arthur@flame.warwick.ac.uk>

Hi,
	I have been writing a simple store and forward message passing router 
for an 8x8 array of Transputers. The idea was to link the router in with 
other C programs to provide message passing. 

I encountered two deadlock problems doing this. The first was that the router
itself deadlocked. This was cured by using a routing algorithm that ensured
no cycles existed in the router. I used a very simple rule, there are 
probably better ways of doing it.. The array is square and four connected.
My routing rules always route north/south THEN east/west. This means no
node ever routes a message coming in from the east/west onto its north/south
outputs. That plus the fact that messages never get routed back the way they
came means that there are no cycles hence no deadlock. 

The other problem was that the router interacted with the programs it was
linked with such that the two together deadlocked. This was caused by 
dependancies being generated by the router because of the sequential 
way it delivers messages. That is if a process is not ready to recieve
an incoming message then any messages behind it in the router's buffer
cannot be delivered, thus any processes waiting for those messages are
now waiting for the process that the first message in the queue is bein
sent to. The approach here was to get rid of the FIFO queue inherant in
the router. Now all messages are recieved by a buffer process on each
node which can check to see if the process it is being sent to is waiting
or not. If so the message is delivered, otherwise it is discarded and
a resend message is sent back to the sender. In this way messages can
always be delivered to a waiting process regardless of any other processes
on that node. 

I think getting rid of cycles is pretty standard if it can be done, but
it is not always feasible. The rest of the time I just use buffers ( with
a non FIFO behaviour) which essentially stretches any cycles, and allows
them to be detected ie full buffers. 

I am not very happy with the system I have built. It does work, but it
uses a variety of techniques to do so in a rather ad hoc fashion.
I have failed to find any useful help on how to write deadlock free programs. 
If you have any good references please let me know.


John Vaudin C.S. Dept. Warwick University U.K.  

arthur@uk.ac.warwick.cs

-=-=-

Sender: xeroxmailer!ALEX_HASSAN@ail.co.uk

Steve,

I write programs regularly using the Strand88 programming language. Strand88  
is interesting from the point of view of synchronisation for the following  
reasons:

- When writing a Strand program, no explicit synchronisation code is required.  
This is because the language is based on data flow via single-assignment  
variables.

- It follows that it is not possible to write a syntactically correct Strand  
program that deadlocks or experiences any other form of synchronisation error.

- In fact, what Strand does is to represent synchronisation errors as program  
errors - i.e. they are generally obvious and can be tested for by a simple  
syntax checker, rather than being ephemeral quantities resulting in obscure  
behaviour dependent on timing.

Example:

The following Strand code implements a producer-consumer pair of processes.  
Note that the only explicit reference to the parallel machine is the @ notation  
used to post procedure calls to the appropriate processor. The other required  
communication is handled implicitly.

------------------------------------------------------
-compile(free).
-exports([pc/1]).

% This procedure starts the two processes, returning the final result
pc(R) :-
	produce(100, L)@1,	% on processor 1
	consume(L, R)@2.	% on processor 2

% This procedure generates a list of N integers, 1 to N
produce(N, L) :-
	N > 0 |
	L := [N|L1], N1 is N - 1, produce(N1, L1).

produce(0, L) :-
	L := [].

% This procedure consumes a list of integers, summing the elements
consume([N|L], R) :-
	R is R1 + N, consume(L, R1).

consume([], R) :-
	R := 0.
------------------------------------------------------

I think it will be interesting to see what responses you get to your request,  
as it is often quite hard to convince people how nasty synchronisation problems  
can be - that is until they encounter one!

If you want to know more about Strand88 here is our US office address:
Strand Software Technologies Inc., 15220 N. W. Greenbrier Parkway, Suite  
350, Beaverton, OR 97006. Tel: 503-690-9830.

Strand88 is currently implemented for shared memory machines (Sequent, Encore),  
distributed memory machines (various Transputer machines, iPSC/2) and workstation  
networks (Sun). All Strand programs are portable at the binary level and  
can interface to C and FORTRAN.

Regards,

Alex Hassan (alex@ail.co.uk),
Strand Software Technologies.

-=-=-

From: David F. Kotz <dfk@cs.duke.edu>

Well, I have experienced deadlock many times. 

- The language and synchronization primitives you used;
I am working in a BBN GP1000 using the Uniform System, a library of
memory and process management functions to make programming the GP1000
easier than in pure Mach (or previously, in pure Chrysalis). The US
provides a simple spinlock mechanism that you can use for
synchronization, and the hardware/OS provides atomic FetchAndOp
functions. Also, the US has constructs (called Generators) to create
multiple tasks and wait for them to complete.

It is trivial to write a shared-memory program using these
synchronization mechanisms that will not deadlock and will do what you
want. But it will also probably run slowly. To get higher performance
you need to forget about using Generators except to start a task on
each node, and then use your own methods to do fine-grained
synchronizations. This is how I program and how I tend to get
deadlock. I use shared memory locations as synchronization variables
both as plain locks and in other ways (counters, state flags, data
structures).

- A description of the problem;
I don't have any specific problem in mind. My errors usually fall into
two categories: blatant mistakes and race conditions. In the mistake
category there is the common mistake of forgetting to unlock a lock
under certain circumstances. More subtle problems involve counters:
some processes increment the counter, and some decrement the counter,
at different times and at multiple and widely-separated parts of the
code, and in time. The counter's value may go up and down. But in some
cases I forget to decrement or forget to increment, or perhaps
mistakenly do so because some logic errors in examining other state
variables. The counter is left in an incorrect state. Deadlock may
occur when a process waits for the counter to go to zero, say, but it
never does because it's off by one; if the other processes are waiting
for this one, then you have deadlock. These can arise due to
incredibly complicated series of events that are hard to track.

The real problem is an overly-complex system: a large asynchronous
program where processes are working in several parts of the code
simultaneously, synchronized through many data structures and state
variables.  The exact interplay of these variables is difficult to
track in the programmer's head, and in fact would be nearly impossible
for anyone else (especially compilers) to figure out.

- How you resolved the problem; 
Usually the problem was in detecting the error: deadlock is often not
observed until long after the perpetrator has left. The counter is -1,
and a process expects it to fall from postive down to zero, and all
other processes are idle. Now what? Long ago, some process did an
extra decrement on the variable, but when, and why? Often I put in
some tracing to track the changes to this variable, and see what's
going on. The solution is very specific to the problem and often
really nasty. The fix is generally simple: fix the logic error. 

- Whether you think your solution might be part of a general set of
  programming conventions that could be used to write programs that do
  not deadlock. 
No, because my synchronizations are always really hairy and I use ad
hoc methods. Some programming constructs or conventions could help,
but as with most, they may be easy, but performance suffers. You want
higher performance, you play the game. In many ways, it is the same as
correct sequential programming: you think about preconditions and
postconditions, invariants, etc. But I have never found a regular way
think through a set of operations to convince myself that it works and
will not deadlock. This is especially difficult as several
synchronization mechanisms begin to interact with each other....

Good luck.

-=-=-

From: David F. Kotz <dfk@cs.duke.edu>

I suppose I didn't mean fine-grained task sizes; probably a poor
choice of words. I was thinking of it in terms of synchronization: ie,
more locks with small scope instead of fewer locks with large scope.
Also, using other concurrency mechanisms (such as fetch&add) to allow
greater concurrency than might be available with simple locked
critical sections. For example, a recent application started out with
one large critical section, controlled by a lock.  Worked fine, but
was slow. So I opened it up by allowing more processes into the code
at one time, using "finer-grained" locks and other methods to control
the concurrency. Now all of a sudden it was more complicated and
prone to mistakes - including deadlock. [Then I started hacking on the 
local/remote memory accesses, by manually replicating and distributing
data to all processes. This added another level of complexity and more
mistakes.] 

Certainly, automated management of the NUMA memory issues would have
helped. (A little bit of bias shows here, since I am part of a group
here that is studying this.) Of course, it may not have done as well
as I did, but would have been easier. 

dave

hawley@uunet.UU.NET (David John Hawley) (07/31/90)

In article <9917@hubcap.clemson.edu> masticol@lento.rutgers.edu (Steve Masticola) compiled:

>Sender: xeroxmailer!ALEX_HASSAN@ail.co.uk
>
>I write programs regularly using the Strand88 programming language. Strand88  
>is interesting from the point of view of synchronisation for the following  
>reasons:
>
>- When writing a Strand program, no explicit synchronisation code is required.
>This is because the language is based on data flow via single-assignment
>variables.
>
>- It follows that it is not possible to write a syntactically correct Strand  
>program that deadlocks or experiences any other form of synchronisation error.

Explicit or not, the syncronization code is still there.
More importantly, the dependencies are still there.

I program in KL1, a sibling of Strand.
I personally testify that both ...

>- In fact, what Strand does is to represent synchronisation errors as program
>errors - i.e. they are generally obvious and can be tested for by a simple  
>syntax checker, rather than being ephemeral quantities resulting in obscure  
>behaviour dependent on timing.

... stupid mistakes (forgetting to instantiate a variable, close a stream, etc)
and more erudite mistakes (race conditions) are possible in KL1.
I don't think Strand can be so different.
I would be delighted if someone would correct me.

>I think it will be interesting to see what responses you get to your request,
>as it is often quite hard to convince people how nasty synchronisation problems
>can be - that is until they encounter one!

I agree whole-heartedly ;-)

On a more positive note, I also believe that the concurrent logic languages
have a lot to offer in terms of simplicity and clarity, compared to
traditional ''sequential + concurrent_primitives'' approach.

---------------------------
David Hawley, ICOT, 4th Lab
csnet: hawley%icot.jp@relay.cs.net uucp:{enea,inria,mit-eddie,ukc}!icot!hawley
ICOT, 1-4-28 Mita, Minato-ku, Tokyo 108 JAPAN. TEL/FAX {81-3-456-}2514/1618