ken@gvax.cs.cornell.edu (Ken Birman) (02/08/91)
A few people have noticed that the ISIS "protos" program sometimes goes into an infinite loop on a machine where it has been running for a few days -- perhaps only if it has been running for many days. The program consumes a lot of CPU time, and other ISIS nodes consider it to be down. The node needs to be shut down with isis -Z or kill and then restarted, and in some cases you can't restart isis at other sites until this has been done (they complain that the "coordinator for the restart is not responding").

I am unable to reproduce this here at Cornell -- or at least, I don't seem to have things set up right to trigger the problem. If you have been seeing this, here's how you can help us track it down and fix it:

First, can you "trigger" this in any way? If so, I can fix it in half an hour.

Assuming not, rebuild protos with the dbx flag (cc -g on the various protos and mlib files and when you link it, i.e. OPTIM=-g in the system makefile).

The next time this version of protos goes loopy on you, do a kill -USR2 to get a protos logfile output, and then use dbx to attach to the looping image, e.g.

    dbx bin/protos 17765     (binary name, pid)
    (dbx) next               (always need to do this after attaching)
    (dbx) where
    (dbx) cont
    ^C                       (after a few seconds)
    (dbx) where
    (repeat a few times)

If you notice a pattern, use "list" to see what the source looks like in the region of the loop and, if some variable is obviously at fault, maybe print its value using "print".

    (dbx) list
    (dbx) print xyzzy

Then mail me the output (xxx.logdir/xxx.log and the trace output from dbx -- just select it from your X-window and stuff it into the email).

This should be enough to help me fix the problem. The more data you send the better -- log files from other machines might be useful too.

Ken