jmd@swbatl.UUCP (03) (02/27/90)
I have been using isis for several months without problems on sun 3/260
platforms. However, as we near implementation I have ported to the sun 4/390
servers and am getting the following startup problems, which I can't seem to
figure out. Could any of you take a look and point me in the right direction?
Below is the log file. help!

****************LOG FILE **********************
Mon Feb 26 13:32:54 1990
ISIS release V1.2, June 1989
Site <calvin> is now coming up, site-id 13, isis_dir <./13.logdir>
Detect site-failure after: 60 secs
calvin (13/128): -- panic --
isis monitoring process at this site has crashed!
PROTOCOLS PROCESS 13/128 INTERNAL DUMP REQUESTED: aborting in panic...
Memory mgt: 143 allocs, 5 frees 7536 bytes in use
Message counts: 6 allocs 3 frees (3 in use, 1408 bytes on freelist)
tasks: scheduler 39730 ctp 39730
runqueue:
Site view 0/0:
Scope <dorothy.SWBT.COM> = `00000002000000000000000000000000'
Scope <NMS> = `00000082000000000000000000000000'
Scope <wweast.SWBT.COM> = `00000004000000000000000000000000'
Scope <DOSINA> = `00003f7c000000000000000000000000'
Scope <wizard.SWBT.COM> = `00000008000000000000000000000000'
Scope <toto.SWBT.COM> = `00000010000000000000000000000000'
Scope <tinman.SWBT.COM> = `00000020000000000000000000000000'
Scope <clion.SWBT.COM> = `00000040000000000000000000000000'
Scope <glinda.SWBT.COM> = `00000080000000000000000000000000'
Scope <kansas.SWBT.COM> = `00000100000000000000000000000000'
Scope <emerald.SWBT.COM> = `00000200000000000000000000000000'
Scope <wwwest.SWBT.COM> = `00000400000000000000000000000000'
Scope <munchkin.SWBT.COM> = `00000800000000000000000000000000'
Scope <scarecrow.SWBT.COM> = `00001000000000000000000000000000'
Scope <calvin.SWBT.COM> = `00002000000000000000000000000000'
Process group views: root 5f290
Associative store: as_ndelete 0, as_nlocdelete 0
abq: max_priority = 80000000
cbcast data structures:
pbufs:
pb_itemlist:
idlists:
piggylists:
gbcast data structures:
wait1:
wait queues:
glocks:
Failure detector: current view 0/0: slist:
incarn: 127/0
failed: `00000000000000000000000000000000'
recovered: `00000000000000000000000000000000'
not coord, no fork, no fail, no prop, no oprop, not sent_oprop
Pending failures:
Pending recoveries:
Replies wanted:
View r_locks: `00000000000000000000000000000000'
View w_locks: `00000000000000000000000000000000'
View want_w_locks: `00000000000000000000000000000000'
clients:
[05] <site-monitor>: idle
Intersite:
Message tank: 0 messages, 0 bytes
*********** END OF LOG FILE ****************

thanks for your help !!
--
James M Doherty
Southwestern Bell Telephone Company
One Bell Center Suite 11-Y-03
St. Louis. MO. 63101.
UUCP: { pyramid, ihnp4, bellcore }...!swbatl!jmd
PHON: 314-235-0804 FAX: 314-235-0727
SACK :Serving A Comming KING !
--
James M Doherty - SWBT - Advanced Technology Planning
One Bell Center Room 11-Y-03 St. Louis, Mo. 63101
UUCP: { pyramid, ihnp4, bellcore }...!swbatl!jmd
PHON: 314-235-0804 FAX: 314-235-0727
ken@gvax.cs.cornell.edu (Ken Birman) (02/28/90)
In article <1200@swbatl.UUCP> jmd@swbatl.UUCP (03) writes:
>
> I have been using isis for several months without problems
>on a sun 3/260 platforms however as we near closer to implentation i have ported
>to the sun 4/390 servers and am getting the following start up probs I can't
>seem to figure out. Could any of you take a look and point me in the
>right direction. Below is the log file. help!
>
>****************LOG FILE **********************
>Mon Feb 26 13:32:54 1990
>ISIS release V1.2, June 1989
>Site <calvin> is now coming up, site-id 13, isis_dir <./13.logdir>
>Detect site-failure after: 60 secs
>calvin (13/128): -- panic --
>isis monitoring process at this site has crashed!
>... etc (remainder is not relevant to problem)

The message "monitoring process at this site has crashed" means that the
process called "isis", namely the one that starts the system up, either
panicked or died with a core dump. This happened after it started protos
(which produced this log file, and which looks healthy) and before it told
protos whether the restart was partial or total.

I have seen something like this recently from someone else, but not at
Cornell. He actually had a core image from bin/isis that showed the system
as having crashed in bcopy(), called right after a gethostbyname call. The
arguments to the bcopy were completely wrong. My impression was that SUN
might have changed the data structure returned by gethostbyname, but that
person didn't get back to me on the explanation of the crash; perhaps he
discovered an error in his /etc/hosts file that explained the problem.

It should be easy to fix this, since the bug can be localized to a single
bcopy call (assuming your problem is the same problem). Odd that it doesn't
happen at Cornell, though.

I would like to fix this, so if you can figure out what went wrong please
let me know. Or, we can track it down offline...

Ken
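For readers chasing a similar crash, here is a minimal sketch of a defensive copy out of the structure that gethostbyname returns. It is not the ISIS code: the helper names `copy_first_ipv4` and `lookup_ipv4` are hypothetical, and memcpy stands in for the bcopy call mentioned above. The idea is simply to check the pointer, the address type and the address length before copying, so that a failed lookup or an unexpected record is reported as an error instead of being copied blindly.

```c
#include <netdb.h>
#include <netinet/in.h>
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>

/* Hypothetical helper (not from the ISIS sources): copy the first IPv4
 * address out of a hostent, validating every field the copy depends on.
 * Returns 0 on success, -1 if the record cannot be used. */
int copy_first_ipv4(const struct hostent *hp, struct in_addr *out)
{
    if (hp == NULL || out == NULL)
        return -1;                      /* lookup failed or bad argument */
    if (hp->h_addrtype != AF_INET)
        return -1;                      /* not an IPv4 record */
    if (hp->h_addr_list == NULL || hp->h_addr_list[0] == NULL)
        return -1;                      /* record carries no address */
    if (hp->h_length != (int)sizeof(*out))
        return -1;                      /* length does not match in_addr */

    memcpy(out, hp->h_addr_list[0], sizeof(*out));
    return 0;
}

/* Hypothetical wrapper: resolve a name and copy its first address. */
int lookup_ipv4(const char *name, struct in_addr *out)
{
    return copy_first_ipv4(gethostbyname(name), out);
}
```

A caller would treat a -1 from `lookup_ipv4` as a configuration problem (for example a bad /etc/hosts entry) and log it, which is easier to diagnose than a core dump inside the copy.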
tc@oxtrap.aa.ox.com (Tse Chih Chao) (03/02/90)
In article <37894@cornell.UUCP> ken@gvax.cs.cornell.edu (Ken Birman) writes:
In article <1200@swbatl.UUCP> jmd@swbatl.UUCP (03) writes:
>
> I have been using isis for several months without problems
>on a sun 3/260 platforms however as we near closer to implentation i have ported
>to the sun 4/390 servers and am getting the following start up probs I can't
>seem to figure out. Could any of you take a look and point me in the
>right direction. Below is the log file. help!
>
>****************LOG FILE **********************
>Mon Feb 26 13:32:54 1990
>ISIS release V1.2, June 1989
>Site <calvin> is now coming up, site-id 13, isis_dir <./13.logdir>
>Detect site-failure after: 60 secs

This one may not be vital. It often revives itself.

>calvin (13/128): -- panic --

This is the critical one. A timestamp from isis would be very useful.
Another suggestion for the isis group is to add timestamps for the
protos core dumps in the log file. The timestamp of the file helps, but
not always, for example when there are several dumps in the file.

>isis monitoring process at this site has crashed!
>... etc (remainder is not relevant to problem)

I have been working on crash problems running isis on a DEC 3100
(Ultrix V2.2). After some research and experiments, it now points to
Ultrix UDP-related or name service problems (not an isis problem).
Isis used to crash on "not enough cores", "not getting enough bytes",
"panic". I rarely got core files, although I did get one once or twice.
What happened was that I got stray messages, and Ken told me how to
detect them. I doubt that you have the same problem, because isis runs
fine on our sun 3/50's.

If you don't have a core file, then my suggestion is to invoke isis from
the debugger so you can tell where it crashed. If that doesn't work, the
old-fashioned printf's and breakpoints should be able to help.
ken@gvax.cs.cornell.edu (Ken Birman) (03/02/90)
In article <TC.90Mar1134550@oxtrap.aa.ox.com> tc@oxtrap.aa.ox.com (Tse Chih Chao) writes:
> ...
>A suggestion for the isis group is to add the timestamps for
>the proto's core dumps in the log file. Although the timestamp of the
>file will help, but not always, if there are several dumps in the file.
>
>I have been working on crash problems for running isis on a DEC 3100
>(Ultrix V2.2). After some research/experiments, it now points
>to Ultrix UDP related or name service problems (not an isis problem)....

I should probably amplify on Tse Chih's comments. Her system has the odd
property that non-ISIS messages are sometimes delivered to ISIS UDP sockets.
This happens regardless of the port numbers, and seems to be due to a bug in
the Ultrix system or, perhaps, its implementation of something called the
TCP domain service.

Tse Chih discovered, somewhat painfully, that ISIS is not very robust
against random garbage arriving on its input channels. With her help we
have found some work-arounds for ISIS V2.0, but as she points out this one
"feature" provoked quite a range of crashes -- sometimes ISIS couldn't
allocate enough memory for the incoming "message", sometimes it couldn't
reconstruct it, etc. Usually the shutdown is fairly graceful and hence
there is no core image.

I have never seen this on a non-Ultrix system, or on Ultrix on anything
but the 3100 workstation.

Two minor details: the message "Detect site-failure after: 60 secs" is just
telling you the setting of the "-f" argument to protos, or the default value
for this parameter.

And, protos logs do include timestamps. If a message is logged after a delay
of more than 1 minute since the prior message, there will always be a line
"... time is now xx:xx:xx".

We'll give some thought to improving our logging facility, although not
in time for the V2.0 beta release.

Ken
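As an illustration of the kind of guard that keeps stray datagrams from turning into allocation failures, here is a minimal sketch. It does not reflect the real ISIS wire format or the V2.0 work-arounds, which are not described in this thread: the 8-byte header (a magic number followed by a body length, both big-endian), the constants, and the function names are all assumptions made for the example. The point is only that the declared length is checked against what was actually received, and against a sane upper bound, before any memory is allocated for the message.

```c
#include <stddef.h>
#include <stdint.h>

/* All of these are hypothetical values chosen for the sketch. */
#define SKETCH_MAGIC      0x49534953u   /* ASCII "ISIS" */
#define SKETCH_HEADER_LEN 8u            /* 4-byte magic + 4-byte body length */
#define SKETCH_MAX_BODY   8192u         /* upper bound on a plausible body */

/* Decode a 32-bit big-endian value without alignment assumptions. */
static uint32_t be32(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Returns 1 if the datagram is plausibly ours, 0 if it should be dropped.
 * `received` is the byte count reported by the receive call. */
int datagram_looks_valid(const unsigned char *buf, size_t received)
{
    uint32_t body_len;

    if (buf == NULL || received < SKETCH_HEADER_LEN)
        return 0;                       /* too short to hold a header */
    if (be32(buf) != SKETCH_MAGIC)
        return 0;                       /* not one of our messages */

    body_len = be32(buf + 4);
    if (body_len > SKETCH_MAX_BODY)
        return 0;                       /* refuse absurd allocation sizes */
    if (body_len != received - SKETCH_HEADER_LEN)
        return 0;                       /* declared and actual sizes differ */

    return 1;
}
```

A receive loop would call `datagram_looks_valid` on each datagram and silently drop (or count and log) the ones that fail, rather than trying to reconstruct a message from them.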
rich@sendai.sendai.ann-arbor.mi.us (K. Richard Magill) (03/02/90)
In article <TC.90Mar1134550@oxtrap.aa.ox.com> tc@oxtrap.aa.ox.com (Tse Chih Chao) writes:
>Isis used to crash on "not enough cores", "not getting enough
>bytes", "panic". I rarely got core files, although I did get it
>once or twice.
to be fair, everything else crashed then too. we were out of swap
space.
tc@oxtrap.aa.ox.com (Tse Chih Chao) (03/02/90)
In article <38015@cornell.UUCP> ken@gvax.cs.cornell.edu (Ken Birman) writes:
>I have never seen this on a non-Ultrix system, or on Ultrix on anything
>but the 3100 workstation.
To be more specific: This problem only occurs on Ultrix on a DECstation
3100 which runs the name server and news.
>And, protos logs do include timestamps.
>If a message is logged after a delay of more than 1 minute since the
>prior message, there will always be a line "... time is now xx:xx:xx".
>We'll give some thought to improving our logging facility, although not
>in time for the V2.0 beta release.
My mistake. Protos logs do have timestamps. I was talking about the
monitor's logs.
Tse Chih
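The protos rule quoted above (emit a "... time is now xx:xx:xx" line when more than a minute has passed since the previous log message) is simple enough to sketch, for instance as something a monitor log could borrow. This is an illustration, not the protos source: the function name `time_marker` and the 60-second constant are assumptions, and the sketch formats the time in UTC via gmtime to keep the output deterministic, whereas a real log would more likely use local time.

```c
#include <stddef.h>
#include <time.h>

#define SKETCH_GAP_SECONDS 60.0   /* assumed threshold: "more than 1 minute" */

/* Hypothetical helper: if more than SKETCH_GAP_SECONDS elapsed between the
 * previous log message (`prev`) and this one (`now`), write a marker line
 * into `buf` and return its length. Otherwise leave `buf` empty and
 * return 0. A return of 0 also covers a buffer that is too small. */
size_t time_marker(time_t prev, time_t now, char *buf, size_t buflen)
{
    struct tm *tm_now;

    if (buf == NULL || buflen == 0)
        return 0;
    buf[0] = '\0';

    if (difftime(now, prev) <= SKETCH_GAP_SECONDS)
        return 0;                 /* recent enough, no marker needed */

    tm_now = gmtime(&now);        /* UTC here; localtime() is the likely choice in practice */
    if (tm_now == NULL)
        return 0;

    return strftime(buf, buflen, "... time is now %H:%M:%S", tm_now);
}
```

A logging routine would call `time_marker` with the time of the last message before writing each new one, and print the marker first whenever the return value is nonzero.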
rich@sendai.sendai.ann-arbor.mi.us (K. Richard Magill) (03/03/90)
In article <38015@cornell.UUCP> ken@gvax.cs.cornell.edu (Ken Birman) writes:
>We'll give some thought to improving our logging facility, although
>not in time for the V2.0 beta release.
I've actually been toying with the idea of an isis based syslog
replacement. ie, I'd like my applications to do something very much
like syslog logging but without resorting to tcp communications. I'd
also like to be potentially able to log all error/status messages from
a distributed application in a common log file.
'spose this would be a useful approach? Could it be used for isis
system logging?
ken@gvax.cs.cornell.edu (Ken Birman) (03/03/90)
In article <RICH.90Mar2133250@sendai.sendai.ann-arbor.mi.us> rich@sendai.ann-arbor.mi.us writes:
>I've actually been toying with the idea of an isis based syslog...
>'spose this would be a useful approach? Could it be used for isis
>system logging?

Sounds like what I had in mind, and I guess this makes more sense than
sending messages to a system console (those of you who run ISIS at boot
time presumably know what is bothering me...)

If you do this, let us know. Maybe we can make it standard.

Ken