[comp.sys.sgi] various boot-related questions.

lamy@ai.utoronto.ca (Jean-Francois Lamy) (09/13/89)

People on the phone on the hotline are giving me blank stares :-) when
asked the following:

a) on an SGI 4D/240 using a tty as a console, how does one forcibly enter
   the PROM monitor? (this is often accomplished by BREAK on ttys).  While we
   understand this may not be desirable by default, there oughta be a way
   to make the machine pay attention without having to depress its belly
   button...

b) How does one do the equivalent of a "savecore" upon reboot (the goal is to
   be able to save memory state and peek around to see where things hang).

   Since someone will ask why on earth we'd want to do that - we
   have a machine that runs for several days and all of a sudden refuses
   to fork off new commands (i.e. after typing a command to the shell, all
   you can do is interrupt it; you can connect to the telnet daemon but
   won't get a shell; ping replies do get answered, etc.).  There is plenty
   of memory and plenty of swap space and plenty of process slots.

And while I'm at it, the whole fsck picture makes me and a bunch more people
shudder.

c) What exactly are the "minor repairs" where fsck -b will seek to reboot?
d) What exactly is checked by fsstat? Is the dirty bit just that, a bit, that
   could, say, happen to be in a bad block and be wrong?  Is fsck under efs
   guaranteed to clean the filesystem in one pass, or does it suffer from
   the berkeley heritage and therefore sometimes two or three fscks of the
   same partition are required before fsck succeeds twice in a row?
   Why did a message arrive in my mailbox this very second suggesting we
   comment out all the calls to fsstat in the boot sequence and get rid of
   the -c flags in mountall?
e) Can we get a -p flag on fsck (i.e. repair minor damage, report error
   on major damage, causing machine to stay in single user mode).  Using
   -y in the boot sequence sounds a bit trusting.

Brrrrrr.

Jean-Francois Lamy               lamy@ai.utoronto.ca, uunet!ai.utoronto.ca!lamy
AI Group, Department of Computer Science, University of Toronto, Canada M5S 1A4

jmb@patton.sgi.com (Jim Barton) (09/13/89)

In article <89Sep12.214425edt.2245@neat.cs.toronto.edu>, lamy@ai.utoronto.ca (Jean-Francois Lamy) writes:
> People on the phone on the hotline are giving me blank stares :-) when
> asked the following:
> 
> a) on an SGI 4D/240 using a tty as a console, how does one forcibly enter
>    the PROM monitor? (this is often accomplished by BREAK on ttys).  While we
>    understand this may not be desirable by default, there oughta be a way
>    to make the machine pay attention without having to depress its belly
>    button...

I'm not surprised you get an (over-the-phone) blank stare on this one.  The
PROM monitor is only active when the OS isn't.  There isn't any magic key
to press to get into it, although the magic key sequence "init 0" will 
get you there - without UNIX :-).

To get more complex, you can get into the built-in kernel debugger by
lboot'ing a system with 'idbg' INCLUDE'd.  In /usr/sysgen, edit the file
'system' to INCLUDE idbg.  Then execute /etc/init.d/autoconfig and reboot.  This will
also allow you to use the IRIX command 'idbg' to poke around kernel
data structures.  As a warning, though, don't call the Hotline if you have
problems with playing around with this - it's not in the normal operating
mode of most places.
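
The steps above can be sketched as a small shell helper. This is only an
illustration: the function name add_include is made up here, and the
"INCLUDE: name" spelling of the line is an assumption, so compare it with the
INCLUDE lines already present in your own system file before relying on it.

```shell
# add_include FILE MODULE
# Append an INCLUDE line for MODULE to FILE unless one is already there.
# A copy of the original file is kept as FILE.orig before it is changed.
# ASSUMPTION: the line is spelled "INCLUDE: module".
add_include() {
    file=$1
    module=$2
    if grep "^INCLUDE:[[:space:]]*$module\$" "$file" >/dev/null 2>&1; then
        echo "already included: $module"
    else
        cp "$file" "$file.orig" &&
        echo "INCLUDE: $module" >> "$file" &&
        echo "added: $module"
    fi
}

# On the real machine (as root) the sequence from the post would then be:
#   add_include /usr/sysgen/system idbg
#   /etc/init.d/autoconfig
#   reboot
```

Keeping the .orig copy gives you a way back if the new kernel misbehaves.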

> 
> b) How does one do the equivalent of a "savecore" upon reboot (the goal is to
>    be able to save memory state and peek around to see where things hang).
> 
>    Since someone will ask why on earth we'd want to do that - we
>    have a machine that runs for several days and all of a sudden refuses
>    to fork off new commands (i.e. after typing a command to the shell, all
>    you can do is interrupt it; you can connect to the telnet daemon but
>    won't get a shell; ping replies do get answered, etc.).  There is plenty
>    of memory and plenty of swap space and plenty of process slots.

I haven't heard of this problem running released software for the 240.  Be
sure you are running 3.1F, or preferably 3.1G.  The 3.1D release will run
a 240, but only poorly, and it has the bug you mention.  Even better, bug
your SE to get 3.2 (see below).

The system ALREADY does a 'savecore' on reboot if it crashed before.  It
doesn't work on a hang, since reset has to be pushed to get back.  Installing
the debugger as above, and then using the '^A' key on the console will drop
you into the debugger, from which you can get a stack trace and poke around
the kernel data structures.  Of course, you will recognize little, since
IRIX is a V.3 kernel with many 4.3 extensions re-written for a
multiprocessor ...

> And while I'm at it, the whole fsck picture makes me and a bunch more people
> shudder.
> 
> c) What exactly are the "minor repairs" where fsck -b will seek to reboot?
> d) What exactly is checked by fsstat? Is the dirty bit just that, a bit, that
>    could, say, happen to be in a bad block and be wrong?  Is fsck under efs
>    guaranteed to clean the filesystem in one pass, or does it suffer from
>    the berkeley heritage and therefore sometimes two or three fscks of the
>    same partition are required before fsck succeeds twice in a row?
>    Why did a message arrive in my mailbox this very second suggesting we
>    comment out all the calls to fsstat in the boot sequence and get rid of
>    the -c flags in mountall?
> e) Can we get a -p flag on fsck (i.e. repair minor damage, report error
>    on major damage, causing machine to stay in single user mode).  Using
>    -y in the boot sequence sounds a bit trusting.

The 3.2 release re-mounts the root filesystem instead of rebooting.  The
dirty bit is kept in the superblock, and is set dirty unless the filesystem
is unmounted, in which case it is set clean.  Fsck WILL clean the filesystem
in one pass.  And I'd ignore the mail message, unless you'd like even more
pain and suffering in a few days.  Finally, you are certainly welcome to
remove the '-y' option, but remember that all us UNIX gurus out here rely
on it.  After all, how many people have the sophistication to understand
when to say 'no' to fsck?  (There are maybe two or three people out here).
EFS is pretty robust, and getting better, and I haven't heard of data losses
after a crash being a big problem.
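
As a rough sketch of how a boot script might use the dirty bit, the fragment
below runs fsck only when fsstat reports a filesystem as needing a check.
It is not the shipped boot sequence. The function name check_fs and the
FSSTAT/FSCK override variables are hypothetical, and the exit codes (0 for
clean, 1 for needs checking, anything else for mounted or failed) are an
assumption taken from System V descriptions of fsstat, so verify them against
fsstat(1M) on your release.

```shell
# check_fs DEVICE
# Run fsck on DEVICE only if fsstat says the clean flag is not set.
# FSSTAT and FSCK may be overridden, which also makes the logic easy to
# try out without touching a real disk.
# ASSUMPTION: fsstat exits 0 = clean, 1 = needs fsck, other = mounted/error.
check_fs() {
    dev=$1
    ${FSSTAT:-fsstat} "$dev" >/dev/null 2>&1
    case $? in
        0)  echo "clean: $dev" ;;
        1)  echo "dirty: $dev"
            if ${FSCK:-fsck -y} "$dev"; then
                echo "repaired: $dev"
            else
                echo "fsck failed: $dev"
                return 1    # caller can decide to stay in single user mode
            fi ;;
        *)  echo "skipped: $dev" ;;
    esac
}
```

A non-zero return from check_fs is the hook for the "stay in single user
mode" behaviour asked about in (e).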

> Brrrrrr.
> 
> Jean-Francois Lamy               lamy@ai.utoronto.ca, uunet!ai.utoronto.ca!lamy
> AI Group, Department of Computer Science, University of Toronto, Canada M5S 1A4

-- Jim Barton
Silicon Graphics Computer Systems    "UNIX: Live Free Or Die!"
jmb@sgi.sgi.com, sgi!jmb@decwrl.dec.com, ...{decwrl,sun}!sgi!jmb

blbates@AERO4.LARC.NASA.GOV ("Brent L. Bates AAD/TAB MS294 x42854") (09/14/89)

    I know of one case in which I said 'no' to fsck (on our 3130).
We had a bad storager board and it said that the disk was bad and I
KNEW it wasn't bad, so I stopped fsck.  But like you said, I almost always
say 'yes'.
--

	Brent L. Bates
	NASA-Langley Research Center
	M.S. 294
	Hampton, Virginia  23665-5225
	(804) 864-2854
	E-mail: blbates@aero4.larc.nasa.gov or blbates@aero2.larc.nasa.gov

sysruth@helios.physics.utoronto.ca (Ruth Milner) (09/14/89)

In article <89Sep12.214425edt.2245@neat.cs.toronto.edu> lamy@ai.utoronto.ca (Jean-Francois Lamy) writes:
>People on the phone on the hotline are giving me blank stares :-) when
>asked the following:
>
>a) on an SGI 4D/240 using a tty as a console, how does one forcibly enter
>   the PROM monitor? (this is often accomplished by BREAK on ttys).  While we
>   understand this may not be desirable by default, there oughta be a way
>   to make the machine pay attention without having to depress its belly
>   button...
>

I called the hotline about this back in June, and was told there is no
way to do this. I suggested that there should be, but I don't know whether
my suggestion was taken seriously enough that it was submitted as a change
request.

I have also suggested that there should be a way for it to i) automatically
reboot after a crash, without depressing its belly button :-), and ii) have
it actually HALT after the shutdown sequence has completed, rather than
rebooting if you aren't there in time to press <ESC>. I often initiate
shutdowns from my desk, and then go to the machine room a minute or two
later. In the IRIS' case, it had cleverly rebooted by this time and I had to 
shut it down again. There have been some hints that new PROMs might fix at
least one of these problems, but it has been like pulling teeth to get any
concrete information.

While I can live without ii), part i) is absolutely vital. If it crashes
due to power problems or greedy people filling up swap or something, there
is no reason why it shouldn't go ahead and reboot. And in other situations,
I can get the information from /usr/adm/SYSLOG (or from the console if it
can't reboot).
-- 
 Ruth Milner          UUCP - {uunet,pyramid}!utai!helios.physics!sysruth
 Systems Manager      BITNET - sysruth@utorphys
 U. of Toronto        INTERNET - sysruth@helios.physics.toronto.edu
  Physics/Astronomy/CITA Computing Consortium

jmb@patton.sgi.com (Jim Barton) (09/14/89)

In article <1989Sep13.183918.8050@helios.physics.utoronto.ca>, sysruth@helios.physics.utoronto.ca (Ruth Milner) writes:
> I have also suggested that there should be a way for it to i) automatically
> reboot after a crash, without depressing its belly button :-), and ii) have
> it actually HALT after the shutdown sequence has completed, rather than
> rebooting if you aren't there in time to press <ESC>. I often initiate
> shutdowns from my desk, and then go to the machine room a minute or two
> later. In the IRIS' case, it had cleverly rebooted by this time and I had to 
> shut it down again. There have been some hints that new PROMs might fix at
> least one of these problems, but it has been like pulling teeth to get any
> concrete information.

The system will halt after the shutdown sequence if you shut down to init
state 0, for instance 'shutdown -i0', or 'init 0'.  This works, I've done
it a lot.  We don't autoreboot after an OS crash because there may be
important data sitting on the screen that would be lost in this case - like
the error message which tells you what happened.  There aren't any 'bugs'
to be fixed here, although we can argue about reboot after crash ...

> While I can live without ii), part i) is absolutely vital. If it crashes
> due to power problems or greedy people filling up swap or something, there
> is no reason why it shouldn't go ahead and reboot. And in other situations,
> I can get the information from /usr/adm/SYSLOG (or from the console if it
> can't reboot).

You can't get the final crash messages from SYSLOG!  The kernel just crashed,
how can it write reliably to the filesystem?  If you have the PROM environment
variable 'bootmode=c', the machine will reboot after a power fail.  And
filling up swap won't crash the machine, the OS will instead start gunning
down processes.  This would seem to answer most of your concerns.

> -- 
>  Ruth Milner          UUCP - {uunet,pyramid}!utai!helios.physics!sysruth
>  Systems Manager      BITNET - sysruth@utorphys
>  U. of Toronto        INTERNET - sysruth@helios.physics.toronto.edu
>   Physics/Astronomy/CITA Computing Consortium

-- Jim Barton
Silicon Graphics Computer Systems    "UNIX: Live Free Or Die!"
jmb@sgi.sgi.com, sgi!jmb@decwrl.dec.com, ...{decwrl,sun}!sgi!jmb

dret@dgp.toronto.edu (George Drettakis) (09/14/89)

In article <41737@sgi.sgi.com> jmb@patton.sgi.com (Jim Barton) writes:
>filling up swap won't crash the machine, the OS will instead start gunning
>down processes.  This would seem to answer most of your concerns.
>
This may be true. However, I have probably witnessed more swap space filling
than any other person around (we did it 2-3 times a day at one point),  and
what happens is that the machine goes into a very confused state, and yet again,
the belly button is the only solution (NOTHING responds, no rlogins, 
not the console). This resulted in us writing a checkpointing mechanism so we
could save the most recent data before the crash.
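
For anyone wanting something similar, here is a minimal sketch of one way to
checkpoint from a shell script. It is not the mechanism described above; the
function name checkpoint and the temporary file naming are made up for the
illustration. The idea is to write the new state to a temporary file first
and rename it into place, so that a crash in mid-write leaves the previous
checkpoint intact.

```shell
# checkpoint FILE
# Read the new state from standard input and replace FILE with it.
# The data goes to a temporary file in the same directory, is flushed with
# sync, and is then renamed over FILE, so FILE is always either the old or
# the new checkpoint, never a half-written one.
checkpoint() {
    target=$1
    tmp="$target.tmp.$$"
    cat > "$tmp" && sync && mv "$tmp" "$target"
}

# Example use inside a long-running job (results.dat is a placeholder name):
#   echo "iteration $i done" | checkpoint results.dat
```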
>
>-- Jim Barton
>Silicon Graphics Computer Systems    "UNIX: Live Free Or Die!"
>jmb@sgi.sgi.com, sgi!jmb@decwrl.dec.com, ...{decwrl,sun}!sgi!jmb
-- 
George Drettakis (416) 978 5473        Dynamic Graphics Project	
UUCP:   ..!uunet!dgp.toronto.edu!dret  Computer Systems Research Institute
Bitnet:	  dret@dgp.utoronto            University of Toronto
Internet: dret@dgp.toronto.edu         Toronto, Ontario M5S 1A4, CANADA
EAN:      dret@dgp.utoronto.cdn	       -- Live where it's never below 20 deg. C.

brendan@illyria.wpd.sgi.com (Brendan Eich) (09/15/89)

In article <41737@sgi.sgi.com>, jmb@patton.sgi.com (Jim Barton) writes:
> In article <1989Sep13.183918.8050@helios.physics.utoronto.ca>, sysruth@helios.physics.utoronto.ca (Ruth Milner) writes:
> > I have also suggested that there should be a way for it to i) automatically
> > reboot after a crash, without depressing its belly button :-), and ii) have
> > it actually HALT after the shutdown sequence has completed, rather than
> > rebooting if you aren't there in time to press <ESC>.
> 
> The system will halt after the shutdown sequence if you shut down to init
> state 0, for instance 'shutdown -i0', or 'init 0'.

We ship BSD-like reboot(1M) and halt(1M) scripts that call shutdown with
enough arguments to get it to bounce the system and take it to the ground,
respectively.
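
A rough idea of what such wrappers can look like is sketched below. These
are stand-ins, not the shipped scripts: the names my_halt, my_reboot and run
and the DRY_RUN switch are invented for the example, and the -y and -g0
arguments are the usual System V shutdown options, which you should check
against shutdown(1M) on your release. Init state 0 is the one mentioned
above; state 6 as "reboot" is the System V convention and is assumed here.

```shell
# run CMD [ARGS...]
# Execute the command, or just print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Take the system down to init state 0 (halt) without prompting or delay.
my_halt()   { run shutdown -y -g0 -i0; }

# Take the system to init state 6 (reboot) without prompting or delay.
my_reboot() { run shutdown -y -g0 -i6; }
```

Setting DRY_RUN=1 before calling either function shows the command that
would be run without actually shutting anything down.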

Brendan Eich
Silicon Graphics, Inc.
brendan@sgi.com

sysruth@helios.physics.utoronto.ca (Ruth Milner) (09/21/89)

In article <41737@sgi.sgi.com> jmb@patton.sgi.com (Jim Barton) writes:
>We don't autoreboot after an OS crash because there may be
>important data sitting on the screen that would be lost in this case - like
>the error message which tells you what happened.  

I have been using this argument to justify hardcopy consoles since the day
we set up our first UNIX system. Why do none (as far as I know, anyway) of
the UNIX-box vendors suggest or sell a hardcopy console for just this
reason? It's so much more useful than a 24-line screen.

>You can't get the final crash messages from SYSLOG!  The kernel just crashed,

From /usr/adm/oSYSLOG:

Sep 11 08:06:32 irides.physics PANIC: CPU 0: assertion failure!

Granted, this is put there by savecore, but it is still available. There was 
not much more than that on the screen. Really serious crashes will mean the 
system can't reboot anyway, and there will still be information on the screen 
about why.
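
Pulling such lines out after a reboot can be sketched with a small helper.
The function name panic_lines is made up for the example; the only thing
taken from the log above is that the line contains the word PANIC.

```shell
# panic_lines FILE...
# Print every line containing PANIC from each readable log file given.
# Unreadable or missing files are skipped silently.
panic_lines() {
    for f in "$@"; do
        if [ -r "$f" ]; then
            grep 'PANIC' "$f"
        fi
    done
    return 0
}

# Example: panic_lines /usr/adm/oSYSLOG /usr/adm/SYSLOG
```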

>filling up swap won't crash the machine, the OS will instead start gunning
>down processes.  

I've seen it with my own eyes. It can happen. I don't know whether it was
because the kernel needed to run something and couldn't get the swap, or
whether it "gunned down" a process it probably shouldn't have and got a panic, 
but it has happened to us. Further to this subject, I'd be interested to
know how it decides which processes to destroy. The biggest ones? The newest
ones? The lowest-priority ones ... ?

-- 
 Ruth Milner          UUCP - {uunet,pyramid}!utai!helios.physics!sysruth
 Systems Manager      BITNET - sysruth@utorphys
 U. of Toronto        INTERNET - sysruth@helios.physics.toronto.edu
  Physics/Astronomy/CITA Computing Consortium