pop@linus.mitre.org (Paul Perry) (11/05/90)
I compiled ISIS with BYPASS on an SGI 4D/220VGX (IRIX System V Release
3.3.1 uci4). It compiled well (quite fast!), but failed on the
startup sequence. I think I have all the appropriate config files
right (sites, and /etc/services) but I don't know how to interpret the
log files. Could someone take a look ? Attached are the output of
the startup sequence, and the log file.

Thanks,
Paul.

---startup-sequence--------------------------------------------------------
uci4: isis &
[1] 3957
uci4: WARNING: /etc/hosts doesn't list full hostname.
Continuing using just the shortname "uci4" .
Warning: old format sites file (no scope info). ISIS may run inefficiently
Site 14 (uci4.mitre.org): isis is restarting...
Is anyone there?
Ignoring site uci11.mitre.org
Ignoring site [more of the same deleted]
... found no operational sites, checking again just in case
Is anyone there?
Ignoring site uci11.mitre.org
Ignoring site [more of the same deleted]
site 14 (uci4.mitre.org) doing a total restart
../protos/protos <isis-protos> -d#.logdir
WARNING: /etc/hosts doesn't list full hostname.
Continuing using just the shortname "uci4" .
... still waiting for protos startup, please be patient
... still waiting for protos startup, please be patient
... still waiting for protos startup, please be patient
... still waiting for protos startup, please be patient
isis: unable to restart <protos> at this site (check 14.log or for core image)
[1] Done isis
uci4: Sun Nov 4 17:19:45 1990
ISIS release V2.1, Aug 1990
Site <uci4> is now coming up, site-id 14, isis_dir <14.logdir>
Detect site-failure after: 60 secs
uci4: uci4 (14/128): -- panic -- site-restart sequence failure

--14.log-file--------------------------------------------------------------
Sun Nov 4 17:19:45 1990
ISIS release V2.1, Aug 1990
Site <uci4> is now coming up, site-id 14, isis_dir <14.logdir>
Detect site-failure after: 60 secs
...
Time is now Sun Nov 4 17:21:45 1990
uci4 (14/128): -- panic -- site-restart sequence failure
PROTOCOLS PROCESS 14/128 INTERNAL DUMP REQUESTED: aborting in panic...
Memory mgt: 32 allocs, 0 frees 4424 bytes in use
Message counts: 3 allocs 0 frees (3 in use)
tasks: scheduler 10035370 ctp 10035370
runqueue:
Site view 0/0:
Process group views:
root 10039c94
Associative store: as_ndelete 0, as_nlocdelete 0
abq: max_priority = 80000000
cbcast data structures:
pbufs:
pb_itemlist:
idlists:
piggylists:
gbcast data structures:
wait1:
wait queues:
glocks:
Failure detector: current view 0/0:
slist:
incarn: 127/0
failed: `00000000000000000000000000000000'
recovered: `00000000000000000000000000000000'
not coord, no fork, no fail, no prop, no oprop, not sent_oprop
Pending failures:
Pending recoveries:
Replies wanted:
View r_locks: `00000000000000000000000000000000'
View w_locks: `00000000000000000000000000000000'
View want_w_locks: `00000000000000000000000000000000'
clients:
Active remote clients:
Intersite:
Message tank: 0 messages, 0 bytes
uci4:
-----------------------------------------------------------------------
--
Paul O. Perry                   MITRE Corporation
Phone: (617) 271-5230           Burlington Road
ARPA: pop@mitre.org             Bedford, MA 01730
UUCP: ...{decvax,philabs,genrad}!linus!pop
ken@gvax.cs.cornell.edu (Ken Birman) (11/05/90)
In article <125374@linus.mitre.org> pop@linus.mitre.org (Paul Perry) writes:
>
>I compiled ISIS with BYPASS on an SGI 4D/220VGX (IRIX System V Release
>3.3.1 uci4). It compiled well (quite fast!), but failed on the
>startup sequence. I think I have all the appropriate config files
>right (sites, and /etc/services) but I don't know how to interpret the
>log files. Could someone take a look ? Attached are the output of
>the startup sequence, and the log file. Thanks, Paul.

FYI, we have a mailing address "isis-bugs@cs.cornell.edu" that works
better than comp.sys.isis for this sort of problem.

There seem to be several problems here, the first being that at Cornell
we have not worked directly with ISIS on an SGI workstation. Someone
provided us with the port and said that it works, but the evidence is
that it has a problem, as discussed below. The problem may be
configuration specific, and it doesn't relate to BYPASS/non-BYPASS
(things didn't get far enough for that to matter).

The problems I see, in order, are:

1) Your isis "sites" file is formatted incorrectly (it lacks the final
   "scope" information field). Apparently, you used lines like
       14: 1234,1235,1236 uci4.mitre.org
   rather than
       14: 1234,1235,1236 uci4.mitre.org mitre
   or even
       14: 1234,1235,1236 uci4.mitre.org mitre,sgi
   (look this up if you are unclear on what I am talking about)

2) Your system naming service is not able to resolve full names in
   any case. The lines reading "Ignoring site uci11.mitre.org" appear
   because gethostbyname("uci11.mitre.org") is failing on your machine.
   Ask an administrator... probably something wrong with /etc/hosts.

3) The real problem, the one that caused the crash, is that "bin/isis"
   is unable to exchange messages with "bin/protos". In principle, this
   is done in the following steps:

   3-a) because UNIX_DOMAIN is enabled in pr.h, isis.h, when isis first
        runs protos, protos creates a unix-domain socket named
        "/tmp/Is1235" using the second ("tcp") entry from the sites file
        you made. Various issues involving permission to create this
        (or a badly chosen umask that disables write permission, for
        example) could make this socket inaccessible to other
        programs... something of this sort is probably responsible.

   3-b) bin/isis tries to "connect" (see connect(2)) to this socket.
        The sequence is that it creates a socket of its own, gives it a
        name based on its own process-id (e.g. /tmp/Cl5432), and issues
        a connect system call. This fails, which is not normal, and so
        you get a message about an unexpectedly slow protos startup.
        After a few retries, bin/isis gives up.

   3-c) Meanwhile, bin/protos is getting antsy -- why hasn't bin/isis
        connected yet? After a while it times out and does a
        panic-exit. This always prints the sort of dump you saw. This
        time, the dump is quite boring.

What to do about this?

1) Fix the sites and /etc/hosts files; who knows, perhaps this is related.

2) ls -l /tmp to see what the story is during the first 15-20 seconds
   after startup.

3) If all else fails, consider changing isis.h and pr.h to NOT set
   UNIX_DOM (i.e. just leave the #define UNIX_DOM 1 out) and recompile
   all the system binaries. This has a good chance of working.

Since we are into a posting loop, I guess it might not hurt to post the
explanation once the thing is working...

Ken Birman
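
The connect in step 3-b can also be tried by hand while protos is still starting up. What follows is a minimal sketch and not ISIS code: probe_unix_socket is a hypothetical helper, and the socket type is left as a parameter because the post does not say whether ISIS uses stream or datagram sockets for this channel. Unlike bin/isis, the probe does not bind its own end to a /tmp/Cl<pid> name first. It only attempts connect(2) on the given path and hands back the errno value, which is usually enough to tell a socket that was never created (ENOENT) from a permission or umask problem (EACCES).

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of ISIS): try to connect to the
 * unix-domain socket at `path`.
 *
 * `type` is SOCK_STREAM or SOCK_DGRAM; which one ISIS uses is an
 * assumption the caller has to supply.
 *
 * Returns 0 if connect(2) succeeded, otherwise the errno value of
 * the call that failed.
 */
int probe_unix_socket(const char *path, int type)
{
    struct sockaddr_un addr;
    int fd;
    int err = 0;

    assert(path != NULL);

    /* sun_path is a fixed-size array; refuse names that do not fit. */
    if (strlen(path) >= sizeof(addr.sun_path))
        return ENAMETOOLONG;

    fd = socket(AF_UNIX, type, 0);
    if (fd < 0)
        return errno;

    memset(&addr, 0, sizeof(addr));
    addr.sun_family = AF_UNIX;
    memcpy(addr.sun_path, path, strlen(path) + 1);

    /* Save errno before close() has a chance to overwrite it. */
    if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0)
        err = errno;

    close(fd);
    return err;
}
```

Calling probe_unix_socket("/tmp/Is1235", SOCK_STREAM) during the first 15-20 seconds after startup and passing a nonzero result to strerror(3) gives a one-line reason for the failure, which complements the ls -l /tmp check suggested above.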