ken@gvax.cs.cornell.edu (Ken Birman) (08/01/89)
It would be nice to really believe that ISIS V1.2 is totally bug free, but I'm not the naive sort... so, some of you will probably run into ISIS problems. The purpose of this message is to summarize the procedure for reporting them in a way that maximizes my chances of being able to reproduce (and hence fix) whatever you saw.

First, some comments on what sorts of bugs a system like ISIS has had. ISIS is rather big, as you may have noticed -- as big as a small operating system kernel by now. Only a very small part of this code is concerned with the "core" of the system -- the implementation of the protocols, etc. Almost everything else is concerned with the toolkit interface.

Actually, I guess that about half of my bugs turn out to be bugs in the OS or compiler. The SUN and HP compilers have been recent sources of problems -- both sometimes generate bad code, although the things that trigger this are not particularly easy to pin down. The Apollo OS is a particularly flaky version of UNIX, at least under release 10.1. New releases of operating systems tend to cause us a lot of grief.

The other half of the problems are usually mine. The average bug has turned out to be of the "minor oversight" or "typo" variety, mainly in the toolkit: for example, freeing a message twice.

By now, protos (the protocols implementation) is fairly solid. However, in going from V1.1 to V1.2 we made some very basic changes in the way that the group membership change protocol works when there are a lot of active broadcasts just as a failure or join takes place. The changes seem solid, but presumably the odds of a bug turning up in the protocols process rose when we did this.

Another large change relates to letting your "main program" call ISIS without having to sit in the "isis_mainloop". By now, I am fairly confident that this code is also solid -- but compared to some other parts of ISIS, it has certainly been used for a shorter period of time.
This would all be in the clib, and hence problems would affect application programs directly.

A final change that could cause problems is the switch to address pointers. Since these pointers often point into messages (and hence become invalid when the message does) or into the "groups" data structure (and hence become invalid when you leave the group), there could be code that used to be sort of correct that is now definitely incorrect. An example would be a program that continues to use the address of a group to which it belonged but has now left. In V1.1 this would work; in V1.2 you would need to copy the address someplace safe -- the pointer returned by pg_join ceases to be valid after a pg_delete or pg_leave. This change is more likely to trigger bugs in your code than in mine.

Say that you think you have run into an ISIS problem. What should you do?

The first thing to appreciate is that most things reported to me turn out to be errors in the program that called ISIS, not in ISIS itself, or confusion about error messages (e.g. when dbx claims the stack is overwritten, which it does for every SUN LWP program -- a completely incorrect and misleading diagnostic). Some people have even shown me programs of their own in infinite loops and claimed that ISIS is malfunctioning because their program wasn't replying to broadcasts!

On the other hand, ISIS does have bugs; in V1.2 I know of one, but have seen it too rarely to trace it (protos goes into an infinite loop and this causes the site in question to hang). How would one report this?

Basically, we debug ISIS using a combination of:

1) Scenarios that can reproduce the problem with reasonable probability, say 1 run in 5 or 10. Lacking this, it is purely a matter of luck if we manage to figure out what causes a bug that has only been seen once.

2) isis logs. See the manual section on this.
Basically, if you send signal USR2 to any isis process (protos or application process) it makes a nicely formatted dump file that I can look at to see if data structures are messed up, etc. This works even when the process is in an infinite loop! If the problem involves protos, send logs for all the sites, not just the one that crashed (e.g. 3.log, 4.log, 11.log...). Try using "cmd snap" if most of the system is up but it just looks like things are hung.

3) Core images. Usually, the stack trace from dbx or adb or cdb is all I need, although those programs all have terrible trouble dealing with lightweight tasks, so you may not be able to get a trace even if nothing serious was wrong in the trace. You can always force a program to dump core by sending it a signal like BPT.

So, if you see a problem and can't see any simple explanation in your own code, drop a note to me or to isis-bugs@cs.cornell.edu, explaining how you produced the problem, showing me the relevant logs, and giving a stack trace for any core image you ended up with, assuming you can figure out what program generated it.

Now, if all of this leaves you nervous, let me close on a positive note. ISIS is extremely solid, and it usually takes something really demanding to cause it to crash. For example, if you bring sites up and down fast enough (perhaps run isis with -f10 to speed it all up) while running a high rate of broadcasts (say, a few copies of grid) you can probably kill the system off within a few minutes, or at least provoke some sort of a crash at one of the sites that should have stayed up. This is the sort of thing we do here at Cornell, and we fix the problems as we find them.

I guess this means that the SDI people shouldn't plan to build their system on top of the present version of ISIS. But the rest of us wouldn't actually try to put ISIS in a situation exhibiting this sort of "violent" dynamic behavior (I hope)!
If you use ISIS in less extreme ways, you probably won't find any problems at all -- this is what most people report, and even if not many of you post to comp.sys.isis, a good bet is that there are between 75 and 150 sites actively using the system right now. I know of some very serious users.

A year or two from now ISIS will probably be solid even in the most extreme situations. We have correctness proofs for the protocols, and as for implementation bugs -- well, one can always track those down.

Ken Birman

PS: If your group would like to "purchase" a support contract, we have a company that will contract to identify and correct ISIS problems for a fee comparable to what software support contracts for other large systems (database systems and so forth) cost. Let me know if you need details on this.
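
A rough sketch of the log-request step from the reporting procedure above, for anyone who would rather script it than type kill commands by hand. It assumes a POSIX system with Python available and that you already know the process ids of the ISIS processes on a site. The function name request_isis_dump and its arguments are hypothetical and not part of ISIS; the only fact taken from the post is that an ISIS process writes its dump file when it receives USR2.

```python
import os
import signal


def request_isis_dump(pids, sig=signal.SIGUSR2):
    """Send `sig` to every pid in `pids`.

    Hypothetical helper, not part of ISIS. Per the post, USR2 asks an
    ISIS process (protos or an application) to write its log dump.
    Returns the list of pids that could not be signalled, so you know
    which logs will be missing from your bug report.
    """
    missed = []
    for pid in pids:
        try:
            os.kill(pid, sig)
        except (ProcessLookupError, PermissionError):
            # The process is gone, or it belongs to another user.
            missed.append(pid)
    return missed
```

Calling request_isis_dump with the pids of protos and your application processes covers item 2 of the list. Passing sig=signal.SIGTRAP is one way to try the "signal like BPT" route to a core image, on the assumption that your system maps the BPT trap to SIGTRAP and has core dumps enabled.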