[comp.sys.isis] isis start up probs on SUN4

jmd@swbatl.UUCP (03) (02/27/90)

 I have been using isis for several months without problems
on Sun 3/260 platforms. However, as we near implementation I have ported
to the Sun 4/390 servers and am getting the following start-up problems that I
can't seem to figure out. Could any of you take a look and point me in the
right direction? Below is the log file. Help!

****************LOG FILE **********************
Mon Feb 26 13:32:54 1990
ISIS release V1.2, June 1989
Site <calvin> is now coming up, site-id 13, isis_dir <./13.logdir>
Detect site-failure after: 60 secs
calvin (13/128): -- panic --
isis monitoring process at this site has crashed!

PROTOCOLS PROCESS 13/128 INTERNAL DUMP REQUESTED: aborting in panic...
Memory mgt: 143 allocs, 5 frees 7536 bytes in use
Message counts: 6 allocs 3 frees (3 in use, 1408 bytes on freelist)

tasks: scheduler 39730 ctp 39730
runqueue: 
Site view 0/0:
Scope <dorothy.SWBT.COM> = `00000002000000000000000000000000'
Scope <NMS> = `00000082000000000000000000000000'
Scope <wweast.SWBT.COM> = `00000004000000000000000000000000'
Scope <DOSINA> = `00003f7c000000000000000000000000'
Scope <wizard.SWBT.COM> = `00000008000000000000000000000000'
Scope <toto.SWBT.COM> = `00000010000000000000000000000000'
Scope <tinman.SWBT.COM> = `00000020000000000000000000000000'
Scope <clion.SWBT.COM> = `00000040000000000000000000000000'
Scope <glinda.SWBT.COM> = `00000080000000000000000000000000'
Scope <kansas.SWBT.COM> = `00000100000000000000000000000000'
Scope <emerald.SWBT.COM> = `00000200000000000000000000000000'
Scope <wwwest.SWBT.COM> = `00000400000000000000000000000000'
Scope <munchkin.SWBT.COM> = `00000800000000000000000000000000'
Scope <scarecrow.SWBT.COM> = `00001000000000000000000000000000'
Scope <calvin.SWBT.COM> = `00002000000000000000000000000000'

Process group views: root 5f290

Associative store: as_ndelete 0, as_nlocdelete 0

abq:
  max_priority = 80000000
cbcast data structures:
  pbufs:
  pb_itemlist:
  idlists:
  piggylists:
gbcast data structures:
  wait1:
  wait queues:
  glocks:

Failure detector: current view 0/0:
  slist:
  incarn: 127/0
  failed: `00000000000000000000000000000000'
  recovered: `00000000000000000000000000000000'
  not coord, no fork, no fail, no prop, no oprop, not sent_oprop
  Pending failures:
  Pending recoveries:
  Replies wanted:
  View r_locks: `00000000000000000000000000000000'
  View w_locks: `00000000000000000000000000000000'
  View want_w_locks: `00000000000000000000000000000000'

clients:
[05] <site-monitor>: idle

Intersite: 
  Message tank: 0 messages, 0 bytes

*********** END OF LOG FILE ****************
thanks for your help !!

-- 
James M Doherty 
Southwestern Bell Telephone Company
One Bell Center Suite 11-Y-03 St. Louis. MO. 63101.
UUCP: { pyramid, ihnp4, bellcore }...!swbatl!jmd
PHON: 314-235-0804 FAX: 314-235-0727
SACK :Serving A	Comming	KING !

ken@gvax.cs.cornell.edu (Ken Birman) (02/28/90)

In article <1200@swbatl.UUCP> jmd@swbatl.UUCP (03) writes:
>
> I have been using isis for several months without problems
>on Sun 3/260 platforms. However, as we near implementation I have ported
>to the Sun 4/390 servers and am getting the following start-up problems that I
>can't seem to figure out. Could any of you take a look and point me in the
>right direction? Below is the log file. Help!
>
>****************LOG FILE **********************
>Mon Feb 26 13:32:54 1990
>ISIS release V1.2, June 1989
>Site <calvin> is now coming up, site-id 13, isis_dir <./13.logdir>
>Detect site-failure after: 60 secs
>calvin (13/128): -- panic --
>isis monitoring process at this site has crashed!
>... etc (remainder is not relevant to problem)

The message "monitoring process at this site has crashed" means that
the process called "isis", namely the one that starts the system up,
either panicked or died with a core dump after starting protos (which
wrote this log file; the log itself looks healthy) and before telling
protos whether the restart was partial or total.

I have seen something like this recently from someone else, but not
at Cornell.  He actually had a core image from bin/isis that showed
the system as having crashed in bcopy(), called right after a
gethostbyname() call.  The arguments to the bcopy() were completely wrong.

My impression was that SUN might have changed the data structure returned
by gethostbyname, but that person didn't get back to me on the explanation
of the crash; perhaps he discovered an error in his /etc/hosts file that
explained the problem.  It should be easy to fix this, since the bug
can be localized to a single bcopy call (assuming your problem is the same
problem).  Odd that it doesn't happen at Cornell, though.
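
[A minimal sketch of the kind of defensive check that would keep a bad
lookup result from reaching the copy.  This is not the bin/isis source,
and the thread never establishes what the real cause was; the helper
names copy_host_addr() and lookup_host_addr() are hypothetical.  It
assumes a BSD-style struct hostent with h_addr_list, and uses memcpy()
in place of bcopy().]

```c
#include <netdb.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>

/*
 * Hypothetical helper (not ISIS source): copy the first IPv4 address out
 * of a hostent, refusing anything that does not look like what we expect.
 * Returns 0 on success, -1 if the entry is unusable.
 */
int copy_host_addr(const struct hostent *hp, struct in_addr *out)
{
    if (hp == NULL)                          /* lookup failed */
        return -1;
    if (hp->h_addrtype != AF_INET)           /* not an IPv4 entry */
        return -1;
    if (hp->h_length != (int)sizeof(*out))   /* unexpected address size */
        return -1;
    if (hp->h_addr_list == NULL || hp->h_addr_list[0] == NULL)
        return -1;                           /* no addresses listed */

    /* memcpy(dst, src, n) is the ISO C counterpart of bcopy(src, dst, n). */
    memcpy(out, hp->h_addr_list[0], sizeof(*out));
    return 0;
}

/* Resolve a host name and copy its first address; -1 on any failure. */
int lookup_host_addr(const char *name, struct in_addr *out)
{
    return copy_host_addr(gethostbyname(name), out);
}
```

[With checks like these, a missing or malformed /etc/hosts entry shows up
as a -1 return that the caller can report, rather than as a crash inside
the copy.]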

I would like to fix this, so if you can figure out what went wrong please
let me know.  Or, we can track it down offline...  Ken

tc@oxtrap.aa.ox.com (Tse Chih Chao) (03/02/90)

  In article <37894@cornell.UUCP> ken@gvax.cs.cornell.edu (Ken Birman) writes:

   In article <1200@swbatl.UUCP> jmd@swbatl.UUCP (03) writes:
   >
   > I have been using isis for several months without problems
   >on Sun 3/260 platforms. However, as we near implementation I have ported
   >to the Sun 4/390 servers and am getting the following start-up problems that I
   >can't seem to figure out. Could any of you take a look and point me in the
   >right direction? Below is the log file. Help!
   >
   >****************LOG FILE **********************
   >Mon Feb 26 13:32:54 1990
   >ISIS release V1.2, June 1989
   >Site <calvin> is now coming up, site-id 13, isis_dir <./13.logdir>
   >Detect site-failure after: 60 secs

This one may not be vital.  It often  revives itself.

   >calvin (13/128): -- panic --

This is the critical one.  A timestamp from isis would be very useful.
Another suggestion for the isis group is to add timestamps for
the protos core dumps in the log file.  The timestamp of the file
helps, but not always, for example when there are several dumps in the file.

   >isis monitoring process at this site has crashed!
   >... etc (remainder is not relevant to problem)

I have been working on crash problems for running isis on a DEC 3100
(Ultrix V2.2).  After some research/experiments, it now points
to Ultrix UDP related or name service problems (not an isis problem).
Isis used to crash on "not enough cores", "not getting enough bytes",
"panic".  I rarely got core files, although I did get it once or twice.
What happened was that I got stray messages, and Ken told me how to
detect them.  I doubt that you have the same problem, because
isis runs fine on our sun 3/50's.  If you don't have a core file, then
my suggestion is to invoke isis from the debugger so you can tell where
it crashed.  If that doesn't work, old-fashioned printf's and breakpoints
should be able to help.

ken@gvax.cs.cornell.edu (Ken Birman) (03/02/90)

In article <TC.90Mar1134550@oxtrap.aa.ox.com>
  tc@oxtrap.aa.ox.com (Tse Chih Chao) writes:
> ... 
>A suggestion for the isis group is to add timestamps for
>the protos core dumps in the log file.  The timestamp of the file
>helps, but not always, for example when there are several dumps in the file.
>
>I have been working on crash problems for running isis on a DEC 3100
>(Ultrix V2.2).  After some research/experiments, it now points
>to Ultrix UDP related or name service problems (not an isis problem)....

I should probably amplify on Tse Chih's comments.  Her system has
the odd property that non-ISIS messages are sometimes delivered to
ISIS UDP sockets.  This happens regardless of the port numbers, and
seems to be due to a bug in the Ultrix system or, perhaps, its implementation
of something called the TCP domain service.

Tse Chih discovered, somewhat painfully, that ISIS is not very robust
against random garbage arriving on its input channels.  With her help we
have found some work-arounds for ISIS V2.0, but as she points out this
one "feature" provoked quite a range of crashes -- sometimes ISIS couldn't
allocate enough memory for the incoming "message", sometimes it couldn't
reconstruct it, etc.  Usually the shutdown is fairly graceful and hence
there is no core image.
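
[A minimal sketch of the general technique for surviving stray datagrams:
check a magic number and a declared length before allocating anything.
This is not the V2.0 work-around, which the thread does not describe.
The 8-byte header layout, the SKETCH_* constants and the function names
are all assumptions made for illustration, not the ISIS wire format.]

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MAGIC   0x49534953u  /* "ISIS" in ASCII; hypothetical value */
#define SKETCH_HDR_LEN 8u           /* 4-byte magic + 4-byte payload length */
#define SKETCH_MAX_LEN 8192u        /* arbitrary size cap for this sketch */

/* Read a 32-bit big-endian integer from a possibly unaligned buffer. */
static uint32_t read_be32(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/*
 * Returns 1 if the datagram plausibly belongs to us, 0 if it should be
 * dropped.  Nothing is allocated until every check has passed.
 */
int datagram_looks_valid(const unsigned char *buf, size_t len)
{
    uint32_t payload_len;

    if (buf == NULL || len < SKETCH_HDR_LEN)
        return 0;                       /* too short to hold a header */
    if (read_be32(buf) != SKETCH_MAGIC)
        return 0;                       /* not one of our messages */

    payload_len = read_be32(buf + 4);
    if (payload_len > SKETCH_MAX_LEN)
        return 0;                       /* would demand absurd memory */
    if ((size_t)payload_len != len - SKETCH_HDR_LEN)
        return 0;                       /* declared and actual sizes differ */
    return 1;
}
```

[A receiver built this way drops a foreign packet silently, which covers
both failure modes mentioned above: the oversized allocation and the
message that cannot be reconstructed.]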

I have never seen this on a non-Ultrix system, or on Ultrix on anything
but the 3100 workstation.

Two minor details: the message "Detect site-failure after: 60 secs" is
just telling you the setting of the "-f" argument to protos, or the
default value for this parameter.  And, protos logs do include timestamps.
If a message is logged after a delay of more than 1 minute since the
prior message, there will always be a line "... time is now xx:xx:xx".
We'll give some thought to improving our logging facility, although not
in time for the V2.0 beta release.
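
[The gap rule just described can be sketched as follows.  This is not the
protos logging code; struct gap_logger and gap_log() are hypothetical
names, and the caller passes the current time in explicitly so the
behaviour is easy to exercise.  The sketch does not stamp the very first
message.]

```c
#include <stdio.h>
#include <time.h>

#define GAP_SECS 60   /* "more than 1 minute", per the description above */

/* Hypothetical logger state (not the protos implementation). */
struct gap_logger {
    FILE  *out;        /* where log lines are written */
    time_t last;       /* time of the previous message */
    int    have_last;  /* 0 until the first message has been logged */
};

/*
 * Write msg to the log.  If more than GAP_SECS seconds have passed since
 * the previous message, first write a "... time is now hh:mm:ss" line.
 * Returns 1 if such a time line was written, 0 otherwise.
 */
int gap_log(struct gap_logger *lg, time_t now, const char *msg)
{
    int stamped = 0;

    if (lg->have_last && difftime(now, lg->last) > GAP_SECS) {
        char buf[16];
        struct tm *tm = localtime(&now);

        if (tm != NULL && strftime(buf, sizeof buf, "%H:%M:%S", tm) > 0) {
            fprintf(lg->out, "... time is now %s\n", buf);
            stamped = 1;
        }
    }
    fprintf(lg->out, "%s\n", msg);
    lg->last = now;
    lg->have_last = 1;
    return stamped;
}
```

[Messages that arrive in a burst share one time line, which is why a dump
logged right after the previous message carries no timestamp of its own.]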

Ken

rich@sendai.sendai.ann-arbor.mi.us (K. Richard Magill) (03/02/90)

In article <TC.90Mar1134550@oxtrap.aa.ox.com> tc@oxtrap.aa.ox.com (Tse Chih Chao) writes:

   Isis used to crash on "not enough cores", "not getting enough
   bytes", "panic".  I rarely got core files, although I did get it
   once or twice.

To be fair, everything else crashed then too.  We were out of swap
space.

tc@oxtrap.aa.ox.com (Tse Chih Chao) (03/02/90)

In article <38015@cornell.UUCP> ken@gvax.cs.cornell.edu (Ken Birman) writes:


   I have never seen this on a non-Ultrix system, or on Ultrix on anything
   but the 3100 workstation.

To be more specific: This problem only occurs on Ultrix on a DECstation
3100 which runs the name server and news.  

    And, protos logs do include timestamps.
   If a message is logged after a delay of more than 1 minute since the
   prior message, there will always be a line "... time is now xx:xx:xx".
   We'll give some thought to improving our logging facility, although not
   in time for the V2.0 beta release.

My mistake.  Protos logs do have timestamps.  I was talking about the
monitor's logs.


Tse Chih

rich@sendai.sendai.ann-arbor.mi.us (K. Richard Magill) (03/03/90)

In article <38015@cornell.UUCP> ken@gvax.cs.cornell.edu (Ken Birman) writes:

   We'll give some thought to improving our logging facility, although
   not in time for the V2.0 beta release.

I've actually been toying with the idea of an isis-based syslog
replacement, i.e., I'd like my applications to do something very much
like syslog logging but without resorting to tcp communications.  I'd
also like to be able to log all error/status messages from
a distributed application in a common log file.

'spose this would be a useful approach?  Could it be used for isis
system logging?

ken@gvax.cs.cornell.edu (Ken Birman) (03/03/90)

In article <RICH.90Mar2133250@sendai.sendai.ann-arbor.mi.us> rich@sendai.ann-arbor.mi.us writes:
>I've actually been toying with the idea of an isis based syslog...
>'spose this would be a useful approach?  Could it be used for isis
>system logging?

Sounds like what I had in mind, and I guess this makes more sense
than sending messages to a system console (those of you who run ISIS
at boot time presumably know what is bothering me...) If you do this,
let us know.  Maybe we can make it standard.

Ken