[net.bugs.4bsd] Another bug in halt.c

bobvan (11/17/82)

Consider this scenario:  I run halt or shutdown -h from the terminal
at my desk.  Shutdown time arrives, the final shutdown messages go
around and my terminal stops echoing characters.  I head for the machine
room to work on the machine, finding the usual "use console to halt
system" message on the console.  I get back to my desk hours later and
find this message on my terminal screen:

	CAUTION: some process(es) wouldn't die

What happened was that some process(es) were hung (usually in an I/O
driver) and halt couldn't get them to die in the 30 seconds allowed for
this.  The message was written to my terminal 30 seconds after I left
for the machine room.  At the very least, the message should have gone
to the console rather than my terminal.  Preferably, halt should have
forked a shell for the console so that I could do a ps axl and see what
it was that refused to die.

The preferred behavior I describe above is exactly what init does if it
runs into the same snag when trying to shut the system down.  Much of
the process killing code in halt and reboot has analogous code in
init.  I think it is wrong to duplicate this code (and associated bugs)
all over the system.  The halt and reboot commands should just signal
init and let it do all the work.  I don't see any good reason why it
wasn't done this way in the first place.

As a quick fix, I've routed the message to the console.  Diff -c is below.
This diff also shows the fix for the first bug I found in halt.  Since
installing that fix, errors from fsck have dropped from an average of
2.4 errors per boot to about 0.07 errors per boot!  Have any other
Berkeley sites noticed a similar drop?

				Bob Van Valzah
				(...!decvax!ittvax!tpdcvax!bobvan)

*** halt.orig.c	Fri Oct 29 18:01:22 1982
--- halt.c	Wed Nov 17 09:08:24 1982
***************
*** 18,24
  	int howto;
  	char *ttyn = (char *)ttyname(2);
  	register i;
! 	register qflag;
  
  	howto = RB_HALT;
  	argc--, argv++;

--- 18,24 -----
  	int howto;
  	char *ttyn = (char *)ttyname(2);
  	register i;
! 	register qflag = 0;	/* RAV bug fix - have to init qflag to false */
  
  	howto = RB_HALT;
  	argc--, argv++;
***************
*** 57,63
  			exit(1);
  		}
  		if (i > 5) {
! 	fprintf(stderr, "CAUTION: some process(es) wouldn't die\n");
  			break;
  		}
  		setalarm(2 * i);

--- 57,68 -----
  			exit(1);
  		}
  		if (i > 5) {
! 			FILE *con; /* RAV important messages go to console */
! 
! 			if ((con=fopen("/dev/console", "w")) == NULL)
! 				con = stderr;
! 			fprintf(con,
! 				"CAUTION: some process(es) wouldn't die\r\n");
  			break;
  		}
  		setalarm(2 * i);

ecn-pa:bruner (11/18/82)

Our PDP-11 "init" was locally modified to perform shutdowns.  It
kills everyone off; first with a signal 15 and then (after 15
seconds) with a sequence of signal 9's.  It then prints a message
to the console, turns off accounting, does a sync() and exits.
[It has to turn off accounting to prevent the final accounting
record from being written after the last sync().]
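
The two-stage kill described above can be sketched roughly as follows.
This is not the Purdue "init" code: it is a minimal illustration in
present-day POSIX C, and the helper name terminate_then_kill and its
grace_secs parameter are made up for the example.  It also handles a
single child process, whereas a real shutdown signals every process.

```c
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from the original post): ask one child
 * process to exit with SIGTERM (signal 15), poll for up to grace_secs
 * seconds, then fall back to SIGKILL (signal 9).
 *
 * Returns 0 if the child went away during the grace period,
 *         1 if SIGKILL was needed,
 *        -1 on error.
 */
int terminate_then_kill(pid_t pid, int grace_secs)
{
    struct timespec tenth = { 0, 100000000L };  /* 0.1 second */
    int status;
    int tick;
    pid_t r;

    assert(pid > 0);

    if (kill(pid, SIGTERM) == -1)
        return -1;

    /* Poll ten times per second until the grace period runs out. */
    for (tick = 0; tick < grace_secs * 10; tick++) {
        r = waitpid(pid, &status, WNOHANG);
        if (r == pid)
            return 0;           /* child exited after the polite signal */
        if (r == -1)
            return -1;
        nanosleep(&tenth, NULL);
    }

    /* Still alive: SIGKILL cannot be caught or ignored. */
    if (kill(pid, SIGKILL) == -1)
        return -1;
    if (waitpid(pid, &status, 0) == -1)
        return -1;
    return 1;
}
```

A caller would pass 15 for grace_secs to match the interval mentioned
above.  Note that a process stuck inside a driver may not go away even
after signal 9, which is the case the "wouldn't die" message covers.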

A different signal causes "init" to initiate a reboot after
everything has been killed.  (The reboot is done by a user-mode
program via "phys"; it's dirty but it works very well.)

When any of our systems is rebooted, the "/etc/rc" file looks for
the file "/down".  If "/down" exists it is removed and the system
comes up right away.  If not, an automatic filesystem recovery
is performed.  The shutdown program (on the VAX) and "init"
(on our PDP-11's) create "/down" just before the final sync()
if and only if it could kill everything off (and hence the shutdown
was "clean").
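
A rough sketch of the "/down" handshake, for illustration only.  The
real boot-side check lives in the "/etc/rc" shell script; the C below
merely shows the logic, and the function names mark_clean_shutdown and
consume_clean_flag are invented for this example.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Shutdown side (hypothetical helper): create the marker file only if
 * every process was killed, i.e. the shutdown was "clean".
 * Returns 0 on success or when there is nothing to do, -1 on error.
 */
int mark_clean_shutdown(const char *flag_path, int all_killed)
{
    int fd;

    assert(flag_path != NULL);

    if (!all_killed)
        return 0;               /* unclean: leave no marker behind */

    fd = open(flag_path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd == -1)
        return -1;
    return close(fd);
}

/*
 * Boot side (hypothetical helper): remove the marker if it exists.
 * Returns 1 if the marker was present (come up right away),
 *         0 if it was absent (run filesystem recovery),
 *        -1 on any other error.
 */
int consume_clean_flag(const char *flag_path)
{
    assert(flag_path != NULL);

    if (unlink(flag_path) == 0)
        return 1;
    return (errno == ENOENT) ? 0 : -1;
}
```

Removing the marker at boot is what makes the scheme safe: a crash
after boot leaves no "/down" behind, so the next boot falls through to
recovery.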

This system has worked very well for us.

--John Bruner
  Purdue/EE