[comp.sys.isis] Bug in protos causes infinite loop

ken@gvax.cs.cornell.edu (Ken Birman) (02/08/91)

A few people have noticed that the ISIS "protos" program sometimes
goes into an infinite loop on a machine where it has been running for
a few days -- or perhaps only after it has been running for many days.
The program consumes a lot of CPU time and other ISIS nodes consider
it to be down.  The node needs to be shut down with isis -Z or kill
and restarted, and in some cases you can't restart isis at other
sites until this has been done (they complain that the coordinator
for the restart is not responding).

I am unable to reproduce this here at Cornell -- or at least, I don't
seem to have things set up right to trigger the problem.  If you have
been seeing this, here's how you can help us track it down and fix it:

First, can you "trigger" this in any way?  If so, tell me how, and I can
probably fix it in half an hour.

Assuming not, rebuild protos with debugging symbols for dbx (cc -g when
compiling the various protos and mlib files and again when linking,
i.e. OPTIM=-g in the system makefile).  Next time this version of protos
goes loopy on you, do a kill -USR2 on it to get a protos logfile output,
and then use dbx to attach to the looping image, i.e.
        dbx bin/protos 17765  (binary name, pid)
        (dbx) next            (always need to do this after attaching)
        (dbx) where
        (dbx) cont
        ^C                    (after a few seconds)
        (dbx) where
          (repeat a few times)
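
The kill -USR2 step above could be wrapped in a small shell function
along these lines.  This is only a sketch: the name request_protos_log
is made up for illustration, and it assumes a POSIX-style sh and that
you have permission to signal the protos process.

```shell
# Hypothetical helper (not part of ISIS): ask a running protos for a
# logfile dump before attaching dbx to it.
request_protos_log() {
    pid=$1
    # kill -0 sends no signal; it only checks that the process exists
    # and that we are allowed to signal it.
    if kill -0 "$pid" 2>/dev/null; then
        kill -USR2 "$pid"
        echo "sent USR2 to $pid"
    else
        echo "no such process: $pid" >&2
        return 1
    fi
}
# Usage, with the pid from the dbx example:  request_protos_log 17765
```
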
If you notice a pattern, use "list" to see what the source looks
like in the region of the loop and, if some variable is obviously
at fault, maybe print the value using "print".
        (dbx) list
        (dbx) print xyzzy
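
If you save each "where" output to its own file, something like the
following may help in spotting the pattern.  It is a rough sketch, not
an ISIS tool: the function name count_frames is invented here, and it
assumes each stack frame is printed on its own line in the form
"func(args), line N in "file.c"", which may not match your dbx.

```shell
# Hypothetical helper: for each function name, count how many of the
# saved "where" samples contain it.  A function that shows up in every
# sample is a good candidate for the looping code.
count_frames() {
    for f in "$@"; do
        # Keep the identifier before the first "(" on each line, and
        # count it at most once per sample file.
        sed -n 's/^[[:space:]]*\([A-Za-z_][A-Za-z0-9_]*\)(.*/\1/p' "$f" | sort -u
    done | sort | uniq -c | sort -rn
}
# Usage:  count_frames where1.txt where2.txt where3.txt
```

Each output line is a count followed by a function name, most frequent
first.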

Then mail me the output (xxx.logdir/xxx.log and the trace output from
dbx -- just select it from your X-window and stuff it into the email).
This should be enough to help me fix the problem.  The more data
you send the better -- log files from other machines might be useful too.

Ken