[comp.sys.isis] startup failure on SGI

pop@linus.mitre.org (Paul Perry) (11/05/90)

I compiled ISIS with BYPASS on an SGI 4D/220VGX (IRIX System V Release
3.3.1 uci4).  It compiled well (quite fast!), but failed on the
startup sequence.  I think I have all the appropriate config files
right (sites, and /etc/services) but I don't know how to interpret the
log files.  Could someone take a look?  Attached are the output of
the startup sequence, and the log file. Thanks, Paul.

---startup-sequence--------------------------------------------------------
uci4: isis &
[1] 3957
uci4: WARNING: /etc/hosts doesn't list full hostname.
Continuing using just the shortname "uci4" .
Warning: old format sites file (no scope info).  ISIS may run inefficiently
Site 14 (uci4.mitre.org): isis is restarting...
Is anyone there?
Ignoring site uci11.mitre.org
Ignoring site [more of the same deleted]
... found no operational sites, checking again just in case
Is anyone there?
Ignoring site uci11.mitre.org
Ignoring site [more of the same deleted]
site 14 (uci4.mitre.org) doing a total restart
../protos/protos <isis-protos> -d#.logdir
WARNING: /etc/hosts doesn't list full hostname.
Continuing using just the shortname "uci4" .
... still waiting for protos startup, please be patient
... still waiting for protos startup, please be patient
... still waiting for protos startup, please be patient
... still waiting for protos startup, please be patient
isis: unable to restart <protos> at this site (check 14.log or for core image)

[1]    Done                 isis
uci4:
Sun Nov  4 17:19:45 1990
ISIS release V2.1, Aug 1990
Site <uci4> is now coming up, site-id 14, isis_dir <14.logdir>
Detect site-failure after: 60 secs
uci4: uci4 (14/128): -- panic --
site-restart sequence failure

--14.log-file--------------------------------------------------------------

Sun Nov  4 17:19:45 1990
ISIS release V2.1, Aug 1990
Site <uci4> is now coming up, site-id 14, isis_dir <14.logdir>
Detect site-failure after: 60 secs
... Time is now Sun Nov  4 17:21:45 1990
uci4 (14/128): -- panic --
site-restart sequence failure

PROTOCOLS PROCESS 14/128 INTERNAL DUMP REQUESTED: aborting in panic...
Memory mgt: 32 allocs, 0 frees 4424 bytes in use
Message counts: 3 allocs 0 frees (3 in use)

tasks: scheduler 10035370 ctp 10035370
runqueue:
Site view 0/0:

Process group views: root 10039c94

Associative store: as_ndelete 0, as_nlocdelete 0

abq:
  max_priority = 80000000
cbcast data structures:
  pbufs:
  pb_itemlist:
  idlists:
  piggylists:
gbcast data structures:
  wait1:
  wait queues:
  glocks:

Failure detector: current view 0/0:
  slist:
  incarn: 127/0
  failed: `00000000000000000000000000000000'
  recovered: `00000000000000000000000000000000'
  not coord, no fork, no fail, no prop, no oprop, not sent_oprop
  Pending failures:
  Pending recoveries:
  Replies wanted:
  View r_locks: `00000000000000000000000000000000'
  View w_locks: `00000000000000000000000000000000'
  View want_w_locks: `00000000000000000000000000000000'

clients:
Active remote clients:

Intersite:
  Message tank: 0 messages, 0 bytes


uci4:
-----------------------------------------------------------------------
-- 
Paul O. Perry                                    MITRE Corporation
Phone: (617) 271-5230                            Burlington Road
ARPA: pop@mitre.org                              Bedford, MA  01730
UUCP:   ...{decvax,philabs,genrad}!linus!pop

ken@gvax.cs.cornell.edu (Ken Birman) (11/05/90)

In article <125374@linus.mitre.org> pop@linus.mitre.org (Paul Perry) writes:
>
>I compiled ISIS with BYPASS on an SGI 4D/220VGX (IRIX System V Release
>3.3.1 uci4).  It compiled well (quite fast!), but failed on the
>startup sequence.  I think I have all the appropriate config files
>right (sites, and /etc/services) but I don't know how to interpret the
>log files.  Could someone take a look?  Attached are the output of
>the startup sequence, and the log file. Thanks, Paul.


FYI, we have a mailing address "isis-bugs@cs.cornell.edu" that works better
than comp.sys.isis for this sort of problem.  

There seem to be several problems here.  First, at Cornell we have not
worked directly with ISIS on an SGI workstation.  Someone provided us with
the port and said that it works, but the evidence is that it has a problem,
as discussed below.  The problem may be configuration-specific, and it
doesn't relate to BYPASS/non-BYPASS (things didn't get far enough for that
to matter).

The problems I see, in order, are:

1) Your isis "sites" file is formatted incorrectly (it lacks the final "scope"
information field).  Apparently, you used lines like
    14: 1234,1235,1236 uci4.mitre.org
rather than
    14: 1234,1235,1236 uci4.mitre.org mitre
or even
    14: 1234,1235,1236 uci4.mitre.org mitre,sgi
(look this up if you are unclear on what I am talking about)

2) Your system naming service is not able to resolve full names in
any case.  The lines reading "Ignoring site uci11.mitre.org" appear
because gethostbyname("uci11.mitre.org") is failing on your machine.
Ask an administrator... probably something wrong with /etc/hosts.

3) The real problem, the one that caused the crash, is that "bin/isis"
is unable to exchange messages with "bin/protos".  In principle, this is
done in the following steps:
  3-a) Because UNIX_DOMAIN is enabled in pr.h and isis.h, when isis first
       runs protos, protos creates a unix-domain socket named "/tmp/Is1235"
       using the second ("tcp") entry from the sites file you made.

       Various issues involving permission to create this (or a badly chosen
       umask that disables write permission, for example) could make this
       socket inaccessible to other programs... something of this sort is
       probably responsible.

  3-b) bin/isis tries to "connect" (see connect(2)) to this socket.  The
       sequence is that it creates a socket of its own, gives it a name
       based on its own process-id (e.g. /tmp/Cl5432), and issues a connect
       system call.  This fails, which is not normal, and so you get a message
       about an unexpectedly slow protos startup.  After a few retries, 
       bin/isis gives up.

  3-c) Meanwhile, bin/protos is getting antsy -- why hasn't bin/isis connected
       in yet?  After a while it times out and does a panic-exit.  This always
       prints the sort of dump you saw.  This time, the dump is quite boring.
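
The rendezvous in steps 3-a and 3-b can be reproduced outside ISIS to see
what a failed connect looks like.  The following is a minimal sketch in
Python (ISIS itself is written in C), not the ISIS code.  It assumes a
stream socket, which the post does not state, and it works in a scratch
directory rather than /tmp so that it cannot disturb a real installation.
The function names start_listener, try_connect and demo are hypothetical;
only the "Is1235" and "Cl<pid>" naming pattern comes from the description
above.

```python
import os
import socket
import tempfile


def start_listener(path):
    """Play the protos role: create a Unix-domain socket file and listen."""
    srv = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)  # stream type is an assumption
    srv.bind(path)   # creates the socket file; raises OSError if the directory is not writable
    srv.listen(1)
    return srv


def try_connect(server_path, client_path):
    """Play the isis role: bind our own name, then connect to the server.

    Returns None on success, or the OSError raised by bind/connect.
    """
    cli = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        cli.bind(client_path)
        cli.connect(server_path)
        return None
    except OSError as err:
        return err
    finally:
        cli.close()
        if os.path.exists(client_path):
            os.unlink(client_path)   # remove our own socket file again


def demo():
    """Connect once before the listener exists and once after."""
    with tempfile.TemporaryDirectory() as scratch:
        server_path = os.path.join(scratch, "Is1235")
        client_path = os.path.join(scratch, "Cl%d" % os.getpid())
        before = try_connect(server_path, client_path)   # no listener yet
        srv = start_listener(server_path)
        try:
            after = try_connect(server_path, client_path)
        finally:
            srv.close()
        return before, after


if __name__ == "__main__":
    before, after = demo()
    print("before listener:", before)
    print("after listener: ", after)
```

The first attempt returns an error because nothing exists at the server
path, which is the situation bin/isis would be in if protos never managed
to create its socket or created it somewhere unreachable.  The printed
errno of a real failure (missing file, refused connection, permission
denied) narrows down which of the causes in 3-a applies.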

What to do about this?

1) Fix the sites and /etc/hosts files; who knows, perhaps this is related.
2) ls -l /tmp to see what the story is during the first 15-20 seconds after
   startup.
3) If all else fails, consider changing isis.h and pr.h to NOT set UNIX_DOM
   (i.e. just leave the #define UNIX_DOM 1 out) and recompile all the system
   binaries.  This has a good chance of working.
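
Checks 1 and 2 above can be scripted.  The sketch below is an illustration
in Python, not part of ISIS, and its function names are hypothetical.  The
sites-file layout is inferred only from the three example lines earlier in
this post (site id, port list, hostname, optional scope), and the "Is" and
"Cl" prefixes are taken from the socket names mentioned in step 3.

```python
import os
import socket
import stat


def sites_missing_scope(lines):
    """Return (site_id, hostname) for entries with only three fields.

    Assumed layout: "<id>: <ports> <hostname> [<scope>]".
    """
    missing = []
    for line in lines:
        fields = line.split()
        if len(fields) == 3:
            missing.append((fields[0].rstrip(":"), fields[2]))
    return missing


def unresolvable(hostnames, resolve=socket.gethostbyname):
    """Return the hostnames that the resolver cannot look up."""
    failed = []
    for name in hostnames:
        try:
            resolve(name)
        except OSError:          # socket.gaierror and socket.herror are OSError subclasses
            failed.append(name)
    return failed


def socket_entries(directory="/tmp", prefixes=("Is", "Cl")):
    """List (name, mode string) for entries whose names start with a prefix."""
    found = []
    for name in sorted(os.listdir(directory)):
        if not name.startswith(prefixes):
            continue
        try:
            mode = os.lstat(os.path.join(directory, name)).st_mode
        except FileNotFoundError:
            continue             # entry vanished between listdir and lstat
        found.append((name, stat.filemode(mode)))
    return found


if __name__ == "__main__":
    sample = ["14: 1234,1235,1236 uci4.mitre.org"]
    print(sites_missing_scope(sample))
    print(unresolvable(["uci4", "uci4.mitre.org", "uci11.mitre.org"]))
    print(socket_entries())
```

Run socket_entries a few times during the first 15-20 seconds after
starting isis.  A mode string beginning with "s" marks a socket, and the
permission bits show whether other programs can reach it.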

Since we are into a posting loop, I guess it might not hurt to post
the explanation once the thing is working...

Ken Birman