pop@linus.mitre.org (Paul Perry) (11/05/90)
I compiled ISIS with BYPASS on an SGI 4D/220VGX (IRIX System V Release
3.3.1 uci4). It compiled well (quite fast!), but failed on the
startup sequence. I think I have all the appropriate config files
right (sites, and /etc/services) but I don't know how to interpret the
log files. Could someone take a look ? Attached are the output of
the startup sequence, and the log file. Thanks, Paul.
---startup-sequence--------------------------------------------------------
uci4: isis &
[1] 3957
uci4: WARNING: /etc/hosts doesn't list full hostname.
Continuing using just the shortname "uci4" .
Warning: old format sites file (no scope info). ISIS may run inefficiently
Site 14 (uci4.mitre.org): isis is restarting...
Is anyone there?
Ignoring site uci11.mitre.org
Ignoring site [more of the same deleted]
... found no operational sites, checking again just in case
Is anyone there?
Ignoring site uci11.mitre.org
Ignoring site [more of the same deleted]
site 14 (uci4.mitre.org) doing a total restart
../protos/protos <isis-protos> -d#.logdir
WARNING: /etc/hosts doesn't list full hostname.
Continuing using just the shortname "uci4" .
... still waiting for protos startup, please be patient
... still waiting for protos startup, please be patient
... still waiting for protos startup, please be patient
... still waiting for protos startup, please be patient
isis: unable to restart <protos> at this site (check 14.log or for core image)
[1] Done isis
uci4:
Sun Nov 4 17:19:45 1990
ISIS release V2.1, Aug 1990
Site <uci4> is now coming up, site-id 14, isis_dir <14.logdir>
Detect site-failure after: 60 secs
uci4: uci4 (14/128): -- panic --
site-restart sequence failure
--14.log-file--------------------------------------------------------------
Sun Nov 4 17:19:45 1990
ISIS release V2.1, Aug 1990
Site <uci4> is now coming up, site-id 14, isis_dir <14.logdir>
Detect site-failure after: 60 secs
... Time is now Sun Nov 4 17:21:45 1990
uci4 (14/128): -- panic --
site-restart sequence failure
PROTOCOLS PROCESS 14/128 INTERNAL DUMP REQUESTED: aborting in panic...
Memory mgt: 32 allocs, 0 frees 4424 bytes in use
Message counts: 3 allocs 0 frees (3 in use)
tasks: scheduler 10035370 ctp 10035370
runqueue:
Site view 0/0:
Process group views: root 10039c94
Associative store: as_ndelete 0, as_nlocdelete 0
abq:
max_priority = 80000000
cbcast data structures:
pbufs:
pb_itemlist:
idlists:
piggylists:
gbcast data structures:
wait1:
wait queues:
glocks:
Failure detector: current view 0/0:
slist:
incarn: 127/0
failed: `00000000000000000000000000000000'
recovered: `00000000000000000000000000000000'
not coord, no fork, no fail, no prop, no oprop, not sent_oprop
Pending failures:
Pending recoveries:
Replies wanted:
View r_locks: `00000000000000000000000000000000'
View w_locks: `00000000000000000000000000000000'
View want_w_locks: `00000000000000000000000000000000'
clients:
Active remote clients:
Intersite:
Message tank: 0 messages, 0 bytes
uci4:
-----------------------------------------------------------------------
--
Paul O. Perry MITRE Corporation
Phone: (617) 271-5230 Burlington Road
ARPA: pop@mitre.org Bedford, MA 01730
UUCP: ...{decvax,philabs,genrad}!linus!pop
ken@gvax.cs.cornell.edu (Ken Birman) (11/05/90)
In article <125374@linus.mitre.org> pop@linus.mitre.org (Paul Perry) writes:
>
>I compiled ISIS with BYPASS on an SGI 4D/220VGX (IRIX System V Release
>3.3.1 uci4). It compiled well (quite fast!), but failed on the
>startup sequence. I think I have all the appropriate config files
>right (sites, and /etc/services) but I don't know how to interpret the
>log files. Could someone take a look ? Attached are the output of
>the startup sequence, and the log file. Thanks, Paul.

FYI, we have a mailing address "isis-bugs@cs.cornell.edu" that works
better than comp.sys.isis for this sort of problem.

There seem to be several problems here, the first of which is that
at Cornell we have not worked directly with ISIS on an SGI workstation.
Someone provided us with the port and said that it works, but the
evidence is that it has a problem, as discussed below. The problem may
be configuration specific, and it doesn't relate to BYPASS/non-BYPASS
(things didn't get far enough for that to matter).

The problems I see, in order, are:

1) Your isis "sites" file is formatted incorrectly (it lacks the final
   "scope" information field). Apparently, you used lines like
       14: 1234,1235,1236 uci4.mitre.org
   rather than
       14: 1234,1235,1236 uci4.mitre.org mitre
   or even
       14: 1234,1235,1236 uci4.mitre.org mitre,sgi
   (look this up if you are unclear on what I am talking about)

2) Your system naming service is not able to resolve full names in any
   case. The lines reading "Ignoring site uci11.mitre.org" are because
   gethostbyname("uci11.mitre.org") is failing on your machine. Ask an
   administrator... probably something wrong with /etc/hosts.

3) The real problem, the one that caused the crash, is that "bin/isis"
   is unable to exchange messages with "bin/protos". In principle, this
   is done in the following steps:

   3-a) because UNIX_DOMAIN is enabled in pr.h, isis.h, when isis first
        runs protos, protos creates a unix-domain socket named
        "/tmp/Is1235" using the second ("tcp") entry from the sites
        file you made. Various issues involving permission to create
        this (or a badly chosen umask that disables write permission,
        for example) could make this socket inaccessible to other
        programs... something of this sort is probably responsible.

   3-b) bin/isis tries to "connect" (see connect(2)) to this socket.
        The sequence is that it creates a socket of its own, gives it
        a name based on its own process-id (i.e. /tmp/Cl5432), and
        issues a connect system call. This fails, which is not normal,
        and so you get a message about an unexpectedly slow protos
        startup. After a few retries, bin/isis gives up.

   3-c) Meanwhile, bin/protos is getting antsy -- why hasn't bin/isis
        connected in yet? After a while it times out and does a
        panic-exit. This always prints the sort of dump you saw. This
        time, the dump is quite boring.

What to do about this?

1) Fix the sites and /etc/hosts files; who knows, perhaps this is
   related.

2) ls -l /tmp to see what the story is during the first 15-20 seconds
   after startup.

3) If all else fails, consider changing isis.h and pr.h to NOT set
   UNIX_DOM (i.e. just leave the #define UNIX_DOM 1 out) and recompile
   all the system binaries. This has a good chance of working.

Since we are into a posting loop, I guess it might not hurt to post the
explanation once the thing is working...

Ken Birman