[comp.sys.isis] How to report ISIS bugs

ken@gvax.cs.cornell.edu (Ken Birman) (08/01/89)

It would be nice to really believe that ISIS V1.2 is totally bug
free, but I'm not the naive sort... so, some of you will probably
run into ISIS problems.  The purpose of this message is to summarize
the procedure for reporting them in a way that maximizes my chances
of being able to reproduce (and hence fix) whatever you saw.

First, some comments on what sorts of bugs a system like ISIS has
had.  ISIS is rather big, as you may have noticed -- as big as a
small operating system kernel by now.  Only a very small part of this
code is concerned with the "core" of the system -- the implementation
of the protocols, etc.  Almost everything else is concerned with the
toolkit interface.

Actually, I guess that about half my bugs turn out to be bugs in the
OS or compiler.  The SUN and HP compilers have been recent sources of
problems -- both sometimes generate bad code, although the things
that cause this are not particularly easy to come up with.  The Apollo
OS is a particularly flaky version of UNIX, at least under release
10.1.  New releases of operating systems tend to cause us a lot of grief.

The other half of the problems are usually mine.

The average bug has turned out to be of the "minor oversight" or "typo"
variety, mainly in the toolkit.  For example, freeing a message twice.

By now, protos (the protocols implementation) is fairly solid.  However,
in going from V1.1 to V1.2 we made some very basic changes in the way that
the group membership change protocol works when there are a lot of
active broadcasts just as a failure or join takes place.  The changes
seem solid, but presumably the odds of a bug turning up in the protocols
process rose when we did this.

Another large change relates to letting your "main program" call ISIS without
having to sit in the "isis_mainloop".  By now, I am fairly confident that
this code is also solid -- but compared to some other parts of ISIS,
it has certainly been used for a shorter period of time.  This would
all be in the clib, and hence problems would affect application programs
directly.

A final change that could cause problems is the switch to address pointers.
Since these pointers often point into messages (and hence become invalid
when the message does) or into the "groups" data structure (and hence
become invalid when you leave the group), there could be code that used
to be sort of correct that is now definitely incorrect.  An example
would be a program that continues to use the address of a group to
which it belonged but has now left.  In V1.1 this would work; in V1.2
you would need to copy the address someplace safe -- the pointer returned
by pg_join ceases to be valid after a pg_delete or pg_leave.
This change is more likely to trigger bugs in your code than in mine.
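
To make the pitfall concrete, here is a minimal sketch in plain C.  It
does not use ISIS itself: demo_addr, demo_join, demo_leave and
demo_save_addr are hypothetical stand-ins invented for this example,
and the real ISIS address type and the real pg_join/pg_leave signatures
are not shown here.  The only point is the pattern: a pointer into a
table that the library owns stops being meaningful once you leave, so
copy the value first.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for an address; NOT the real ISIS type. */
struct demo_addr {
    int site;
    int port;
    int in_use;
};

#define DEMO_MAX_GROUPS 4

/* Plays the role of the library-owned "groups" data structure. */
static struct demo_addr demo_groups[DEMO_MAX_GROUPS];

/* Stand-in for a join: returns a pointer INTO the table, or NULL if full. */
const struct demo_addr *demo_join(int site, int port)
{
    for (size_t i = 0; i < DEMO_MAX_GROUPS; i++) {
        if (!demo_groups[i].in_use) {
            demo_groups[i].site = site;
            demo_groups[i].port = port;
            demo_groups[i].in_use = 1;
            return &demo_groups[i];
        }
    }
    return NULL;
}

/* Stand-in for a leave: the slot is wiped and may be reused later. */
void demo_leave(const struct demo_addr *gaddr)
{
    assert(gaddr != NULL);
    memset(&demo_groups[gaddr - demo_groups], 0, sizeof demo_groups[0]);
}

/* The safe pattern: take a copy by value BEFORE leaving the group. */
struct demo_addr demo_save_addr(const struct demo_addr *gaddr)
{
    assert(gaddr != NULL);
    return *gaddr;
}
```

A caller that needs the address after leaving would call
demo_save_addr(g) first and then demo_leave(g).  The saved copy keeps
the old site and port, while g now points at a wiped slot.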

Say that you think you have run into an ISIS problem.  What should
you do?

The first thing to appreciate is that most things reported to me turn
out to be errors in the program that called ISIS, not in ISIS itself.
Or, confusion about error messages (e.g. when dbx claims the stack is
overwritten, which it does for every SUN LWP program - a completely 
incorrect and misleading diagnostic).  Some people have even shown me
programs of their own in infinite loops and claimed that ISIS is
malfunctioning because their program wasn't replying to broadcasts!

On the other hand, ISIS does have bugs; in V1.2 I know of one but
have seen it too rarely to trace it (protos goes into an infinite loop
and this causes the site in question to hang).  How would one report this?
  
Basically, we debug ISIS using a combination of:

1) Scenarios that can reproduce the problem with reasonable probability,
say 1 run in 5 or 10.  Lacking this, it is purely a matter of luck if
we manage to figure out what causes a bug that has only been seen once.

2) isis logs.  See the manual section on this.  Basically, if you
send signal USR2 to any isis process (protos or application process)
it makes a nicely formatted dump file that I can look at to see if
data structures are messed up, etc.  Even when the process is in an
infinite loop!  If the problem involves protos, send logs for all the
sites, not just the one that crashed (i.e. 3.log, 4.log, 11.log...)

Try using "cmd snap" if most of the system is up but it just looks
like things are hung.

3) core images.  Usually, the stack trace from dbx or adb or cdb is
all I need, although those programs all have terrible trouble dealing
with lightweight tasks, so you may not be able to get a usable trace
even if the stack itself was not damaged.

You can always force a program to dump core by sending it a signal such
as BPT (SIGTRAP).
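
If you would rather do this from a program than from the shell, the
two signals mentioned above can be sent with the standard kill() call.
What follows is only a sketch: send_signal, request_log_dump and
force_core_dump are names made up for this example, not ISIS routines,
and I am assuming that BPT corresponds to SIGTRAP on your system.
Whether a core file actually gets written also depends on the core
size limit in effect for the target process.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <unistd.h>   /* getpid(), handy when trying this on yourself */

/*
 * Send signo to exactly one process.
 * Returns 0 on success, -1 with errno set on failure.
 */
int send_signal(pid_t pid, int signo)
{
    assert(signo >= 0);   /* a negative signal number is a caller bug */

    /* kill() treats pid <= 0 as "a process group" or "everyone";
       refuse that here so only a single process is ever targeted. */
    if (pid <= 0) {
        errno = EINVAL;
        return -1;
    }
    return kill(pid, signo);
}

/* Ask a process (protos or an application) for its log dump. */
int request_log_dump(pid_t pid)
{
    return send_signal(pid, SIGUSR2);
}

/* Ask a process to dump core (assumes BPT means SIGTRAP). */
int force_core_dump(pid_t pid)
{
    return send_signal(pid, SIGTRAP);
}
```

Passing signal number 0 to send_signal is a cheap way to check that the
pid exists and that you are allowed to signal it, without disturbing
the process.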

So, if you see a problem and can't see any simple explanation in
your own code, drop a note to me or to isis-bugs@cs.cornell.edu,
explaining how you produced the problem, showing me the relevant
logs, and giving a stack trace for any core image you ended up
with, assuming you can figure out what program generated it.

Now, if all of this leaves you nervous, let me close on a positive
note.  ISIS is extremely solid, and it usually takes something
really demanding to cause it to crash.  For example, if you bring
sites up and down fast enough (perhaps run isis with -f10 to speed
it all up) while running a high rate of broadcasts (say, a few
copies of grid) you can probably kill the system off within a
few minutes, or at least provoke some sort of a crash at one
of the sites that should have stayed up.  This is the sort of testing
we do here at Cornell, and we fix the bugs as we find them.

I guess this means that the SDI people shouldn't plan to build their
system on top of the present version of ISIS.  But, the rest of us
wouldn't actually try to put ISIS in a situation exhibiting this sort
of "violent" dynamic behavior (I hope)!

If you use ISIS in less extreme ways, you probably won't find any problems
at all -- this is what most people report, and even if not many of
you post to comp.sys.isis, a good bet is that there are between 75 and
150 sites actively using the system right now.  I know of some very
serious users.    

A year or two from now ISIS will probably be solid even in the most
extreme situations.  We have correctness proofs for the protocols, and
as for implementation bugs -- well, one can always track those down.

Ken Birman

PS: If your group would like to "purchase" a support contract, we
have a company that will contract to identify and correct ISIS problems
for a fee comparable to what software support contracts for other
large systems (database systems and so forth) cost.  Let me know
if you need details on this.