lamy@ai.utoronto.ca (Jean-Francois Lamy) (09/13/89)
People on the phone on the hotline are giving me blank stares :-) when
asked the following:

a) on an SGI 4D/240 using a tty as a console, how does one forcibly enter
   the PROM monitor? (this is often accomplished by BREAK on ttys). While we
   understand this may not be desirable by default, there oughta be a way
   to make the machine pay attention without having to depress its belly
   button...

b) How does one do the equivalent of a "savecore" upon reboot (the goal is to
   be able to save memory state and peek around to see where things hang).

   Since someone will ask why on earth we'd want to do that - we
   have a machine that runs for several days and all of a sudden refuses
   to fork off new commands (i.e. after typing a command to the shell, all
   you can do is interrupt it; you can connect to the telnet daemon but
   won't get a shell; ping replies do get answered, etc.). There is plenty
   of memory and plenty of swap space and plenty of process slots.

And while I'm at it, the whole fsck picture makes me and a bunch more people
shudder.

c) What exactly are the "minor repairs" where fsck -b will seek to reboot?
d) What exactly is checked by fsstat? Is the dirty bit just that, a bit, that
   could, say happen to be in a bad block and be wrong? Is fsck under efs
   guaranteed to clean the filesystem in one pass, or does it suffer from
   the berkeley heritage and therefore sometimes two or three fscks of the
   same partition are required before fsck succeeds twice in a row?
   Why did a message arrive in my mailbox this very second suggesting we
   comment out all the calls to fsstat in the boot sequence and get rid of
   the -c flags in mountall?
e) Can we get a -p flag on fsck (i.e. repair minor damage, report error
   on major damage, causing machine to stay in single user mode). Using
   -y in the boot sequence sounds like a bit trusting.

Brrrrrr.

Jean-Francois Lamy               lamy@ai.utoronto.ca, uunet!ai.utoronto.ca!lamy
AI Group, Department of Computer Science, University of Toronto, Canada M5S 1A4
jmb@patton.sgi.com (Jim Barton) (09/13/89)
In article <89Sep12.214425edt.2245@neat.cs.toronto.edu>, lamy@ai.utoronto.ca (Jean-Francois Lamy) writes:
> People on the phone on the hotline are giving me blank stares :-) when
> asked the following:
>
> a) on an SGI 4D/240 using a tty as a console, how does one forcibly enter
>    the PROM monitor? (this is often accomplished by BREAK on ttys). While we
>    understand this may not be desirable by default, there oughta be a way
>    to make the machine pay attention without having to depress its belly
>    button...

I'm not surprised you get an (over-the-phone) blank stare on this one. The
PROM monitor is only active when the OS isn't. There isn't any magic key to
press to get into it, although the magic key sequence "init 0" will get you
there - without UNIX :-).

To get more complex, you can get into the built-in kernel debugger by
lboot'ing a system with 'idbg' INCLUDE'd. See /usr/sysgen, and edit system
to INCLUDE idbg. Then execute /etc/init.d/autoconfig and reboot. This will
also allow you to use the IRIX command 'idbg' to poke around kernel data
structures. As a warning, though, don't call the Hotline if you have problems
with playing around with this - it's not in the normal operating mode of most
places.

>
> b) How does one do the equivalent of a "savecore" upon reboot (the goal is to
>    be able to save memory state and peek around to see where things hang).
>
>    Since someone will ask why on earth we'd want to do that - we
>    have a machine that runs for several days and all of a sudden refuses
>    to fork off new commands (i.e. after typing a command to the shell, all
>    you can do is interrupt it; you can connect to the telnet daemon but
>    won't get a shell; ping replies do get answered, etc.). There is plenty
>    of memory and plenty of swap space and plenty of process slots.

I haven't heard of this problem running released software for the 240. Be
sure you are running 3.1F, or preferably 3.1G. The 3.1D release will run a
240, but only poorly, and it has the bug you mention. Even better, bug your
SE to get 3.2 (see below).

The system ALREADY does a 'savecore' on reboot if it crashed before. It
doesn't work on a hang, since reset has to be pushed to get back. Installing
the debugger as above, and then using the '^A' key on the console will drop
you into the debugger, from which you can get a stack trace and poke around
the kernel data structures. Of course, you will recognize little, since IRIX
is a V.3 kernel with many 4.3 extensions re-written for a multiprocessor ...

> And while I'm at it, the whole fsck picture makes me and a bunch more people
> shudder.
>
> c) What exactly are the "minor repairs" where fsck -b will seek to reboot?
> d) What exactly is checked by fsstat? Is the dirty bit just that, a bit, that
>    could, say happen to be in a bad block and be wrong? Is fsck under efs
>    guaranteed to clean the filesystem in one pass, or does it suffer from
>    the berkeley heritage and therefore sometimes two or three fscks of the
>    same partition are required before fsck succeeds twice in a row?
>    Why did a message arrive in my mailbox this very second suggesting we
>    comment out all the calls to fsstat in the boot sequence and get rid of
>    the -c flags in mountall?
> e) Can we get a -p flag on fsck (i.e. repair minor damage, report error
>    on major damage, causing machine to stay in single user mode). Using
>    -y in the boot sequence sounds like a bit trusting.

The 3.2 release re-mounts the root filesystem instead of rebooting. The dirty
bit is kept in the superblock, and is set dirty unless the filesystem is
unmounted, in which case it is set clean. Fsck WILL clean the filesystem in
one pass. And I'd ignore the mail message, unless you'd like even more pain
and suffering in a few days.

Finally, you are certainly welcome to remove the '-y' option, but remember
that all us UNIX gurus out here rely on it. After all, how many people have
the sophistication to understand when to say 'no' to fsck? (There are maybe
two or three people out here). EFS is pretty robust, and getting better, and
I haven't heard of data losses after a crash being a big problem.

> Brrrrrr.
>
> Jean-Francois Lamy               lamy@ai.utoronto.ca, uunet!ai.utoronto.ca!lamy
> AI Group, Department of Computer Science, University of Toronto, Canada M5S 1A4

-- Jim Barton
Silicon Graphics Computer Systems    "UNIX: Live Free Or Die!"
jmb@sgi.sgi.com, sgi!jmb@decwrl.dec.com, ...{decwrl,sun}!sgi!jmb
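
The dirty-bit behaviour described in the post above (a flag kept in the superblock, dirty while the filesystem is in use, set clean only by an unmount) can be pictured with a toy model. This is a sketch for illustration only, not EFS code: the `Superblock` class and the `mount`, `unmount` and `needs_fsck` functions are invented names, and treating "mount" as the moment the flag goes dirty is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Superblock:
    """Toy stand-in for an on-disk superblock; only the clean/dirty flag is modelled."""
    dirty: bool = False


def mount(sb: Superblock) -> None:
    # Assumption: the flag is marked dirty for as long as the filesystem is in use.
    sb.dirty = True


def unmount(sb: Superblock) -> None:
    # Only an orderly unmount sets the flag back to clean.
    sb.dirty = False


def needs_fsck(sb: Superblock) -> bool:
    # A boot-time check only has to look at the flag: dirty means the
    # filesystem was not unmounted (crash, reset button, power failure).
    return sb.dirty


clean_shutdown = Superblock()
mount(clean_shutdown)
unmount(clean_shutdown)

crashed = Superblock()
mount(crashed)  # the machine goes down here, so unmount never runs

print(needs_fsck(clean_shutdown), needs_fsck(crashed))
```

In this model the flag says nothing about what is wrong with the filesystem, only that it was not shut down cleanly, which is why a boot sequence would still hand a dirty filesystem to fsck.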
blbates@AERO4.LARC.NASA.GOV ("Brent L. Bates AAD/TAB MS294 x42854") (09/14/89)
I know of one case in which I said 'no' to fsck (on our 3130). We had a bad
storager board and it said that the disk was bad and I KNEW it wasn't bad, so
I stopped fsck. But like you said, I almost always say 'yes'.

--
Brent L. Bates
NASA-Langley Research Center
M.S. 294
Hampton, Virginia 23665-5225
(804) 864-2854
E-mail: blbates@aero4.larc.nasa.gov or blbates@aero2.larc.nasa.gov
sysruth@helios.physics.utoronto.ca (Ruth Milner) (09/14/89)
In article <89Sep12.214425edt.2245@neat.cs.toronto.edu> lamy@ai.utoronto.ca (Jean-Francois Lamy) writes:
>People on the phone on the hotline are giving me blank stares :-) when
>asked the following:
>
>a) on an SGI 4D/240 using a tty as a console, how does one forcibly enter
>   the PROM monitor? (this is often accomplished by BREAK on ttys). While we
>   understand this may not be desirable by default, there oughta be a way
>   to make the machine pay attention without having to depress its belly
>   button...
>

I called the hotline about this back in June, and was told there is no way
to do this. I suggested that there should be, but I don't know whether my
suggestion was taken seriously enough that it was submitted as a change
request.

I have also suggested that there should be a way for it to i) automatically
reboot after a crash, without depressing its belly button :-), and ii) have
it actually HALT after the shutdown sequence has completed, rather than
rebooting if you aren't there in time to press <ESC>. I often initiate
shutdowns from my desk, and then go to the machine room a minute or two
later. In the IRIS' case, it had cleverly rebooted by this time and I had to
shut it down again. There have been some hints that new PROMs might fix at
least one of these problems, but it has been like pulling teeth to get any
concrete information.

While I can live without ii), part i) is absolutely vital. If it crashes
due to power problems or greedy people filling up swap or something, there
is no reason why it shouldn't go ahead and reboot. And in other situations,
I can get the information from /usr/adm/SYSLOG (or from the console if it
can't reboot).

--
 Ruth Milner          UUCP - {uunet,pyramid}!utai!helios.physics!sysruth
 Systems Manager      BITNET - sysruth@utorphys
 U. of Toronto        INTERNET - sysruth@helios.physics.toronto.edu
  Physics/Astronomy/CITA Computing Consortium
jmb@patton.sgi.com (Jim Barton) (09/14/89)
In article <1989Sep13.183918.8050@helios.physics.utoronto.ca>, sysruth@helios.physics.utoronto.ca (Ruth Milner) writes:
> I have also suggested that there should be a way for it to i) automatically
> reboot after a crash, without depressing its belly button :-), and ii) have
> it actually HALT after the shutdown sequence has completed, rather than
> rebooting if you aren't there in time to press <ESC>. I often initiate
> shutdowns from my desk, and then go to the machine room a minute or two
> later. In the IRIS' case, it had cleverly rebooted by this time and I had to
> shut it down again. There have been some hints that new PROMs might fix at
> least one of these problems, but it has been like pulling teeth to get any
> concrete information.

The system will halt after the shutdown sequence if you shut down to init
state 0, for instance 'shutdown -i0', or 'init 0'. This works, I've done it
a lot. We don't autoreboot after an OS crash because there may be
important data sitting on the screen that would be lost in this case - like
the error message which tells you what happened. There aren't any 'bugs' to
be fixed here, although we can argue about reboot after crash ...

> While I can live without ii), part i) is absolutely vital. If it crashes
> due to power problems or greedy people filling up swap or something, there
> is no reason why it shouldn't go ahead and reboot. And in other situations,
> I can get the information from /usr/adm/SYSLOG (or from the console if it
> can't reboot).

You can't get the final crash messages from SYSLOG! The kernel just crashed,
how can it write reliably to the filesystem? If you have the PROM environment
variable 'bootmode=c', the machine will reboot after a power fail. And
filling up swap won't crash the machine, the OS will instead start gunning
down processes. This would seem to answer most of your concerns.

> --
>  Ruth Milner          UUCP - {uunet,pyramid}!utai!helios.physics!sysruth
>  Systems Manager      BITNET - sysruth@utorphys
>  U. of Toronto        INTERNET - sysruth@helios.physics.toronto.edu
>   Physics/Astronomy/CITA Computing Consortium

-- Jim Barton
Silicon Graphics Computer Systems    "UNIX: Live Free Or Die!"
jmb@sgi.sgi.com, sgi!jmb@decwrl.dec.com, ...{decwrl,sun}!sgi!jmb
dret@dgp.toronto.edu (George Drettakis) (09/14/89)
In article <41737@sgi.sgi.com> jmb@patton.sgi.com (Jim Barton) writes:
>filling up swap won't crash the machine, the OS will instead start gunning
>down processes. This would seem to answer most of your concerns.
>

This may be true. However, I have probably witnessed more swap space filling
than any other person around (we did it 2-3 times a day at one point), and
what happens is that the machine goes into a very confused state, and yet
again, the belly button is the only solution (NOTHING responds, no rlogins,
not the console). This resulted in us writing a checkpointing mechanism so we
could save the most recent data before the crash.

>
>-- Jim Barton
>Silicon Graphics Computer Systems    "UNIX: Live Free Or Die!"
>jmb@sgi.sgi.com, sgi!jmb@decwrl.dec.com, ...{decwrl,sun}!sgi!jmb

--
George Drettakis (416) 978 5473
Dynamic Graphics Project               UUCP: ..!uunet!dgp.toronto.edu!dret
Computer Systems Research Institute    Bitnet: dret@dgp.utoronto
University of Toronto                  Internet: dret@dgp.toronto.edu
Toronto, Ontario M5S 1A4, CANADA       EAN: dret@dgp.utoronto.cdn
-- Live where it's never below 20 deg. C.
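
The post above mentions a checkpointing mechanism but gives no details of it. As a rough illustration of the general idea only, here is a minimal sketch that saves program state so that an abrupt stop leaves either the previous checkpoint or the new one on disk, never a half-written file. Everything here is an assumption: the function names `save_checkpoint` and `load_checkpoint`, the JSON format, and the use of a temporary file followed by an atomic rename on a POSIX filesystem.

```python
import json
import os
import tempfile


def save_checkpoint(state: dict, path: str) -> None:
    """Write state to path via a temp file so the old checkpoint is never half-overwritten."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".ckpt-")
    try:
        with os.fdopen(fd, "w") as tmp:
            json.dump(state, tmp)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the data to disk before renaming
        os.replace(tmp_path, path)  # atomic rename over the previous checkpoint
    except BaseException:
        # Remove the partial temp file if it is still there, then re-raise.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


def load_checkpoint(path: str, default: dict) -> dict:
    """Return the last saved state, or default if no checkpoint exists yet."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return default
```

A long-running job would call `load_checkpoint` once at startup and `save_checkpoint` every so often, so that after the reset button is pressed it can resume from the most recent saved state rather than from scratch.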
brendan@illyria.wpd.sgi.com (Brendan Eich) (09/15/89)
In article <41737@sgi.sgi.com>, jmb@patton.sgi.com (Jim Barton) writes:
> In article <1989Sep13.183918.8050@helios.physics.utoronto.ca>, sysruth@helios.physics.utoronto.ca (Ruth Milner) writes:
> > I have also suggested that there should be a way for it to i) automatically
> > reboot after a crash, without depressing its belly button :-), and ii) have
> > it actually HALT after the shutdown sequence has completed, rather than
> > rebooting if you aren't there in time to press <ESC>.
>
> The system will halt after the shutdown sequence if you shut down to init
> state 0, for instance 'shutdown -i0', or 'init 0'.

We ship BSD-like reboot(1M) and halt(1M) scripts that call shutdown with
enough arguments to get it to bounce the system and take it to the ground,
respectively.

Brendan Eich
Silicon Graphics, Inc.
brendan@sgi.com
sysruth@helios.physics.utoronto.ca (Ruth Milner) (09/21/89)
In article <41737@sgi.sgi.com> jmb@patton.sgi.com (Jim Barton) writes:
>We don't autoreboot after an OS crash because there may be
>important data sitting on the screen that would be lost in this case - like
>the error message which tells you what happened.

I have been using this argument to justify hardcopy consoles since the day
we set up our first UNIX system. Why do none (as far as I know, anyway) of
the UNIX-box vendors suggest or sell a hardcopy console for just this reason?
It's so much more useful than a 24-line screen.

>You can't get the final crash messages from SYSLOG! The kernel just crashed,

From /usr/adm/oSYSLOG:

Sep 11 08:06:32 irides.physics PANIC: CPU 0: assertion failure!

Granted, this is put there by savecore, but it is still available. There was
not much more than that on the screen. Really serious crashes will mean the
system can't reboot anyway, and there will still be information on the screen
about why.

>filling up swap won't crash the machine, the OS will instead start gunning
>down processes.

I've seen it with my own eyes. It can happen. I don't know whether it was
because the kernel needed to run something and couldn't get the swap, or
whether it "gunned down" a process it probably shouldn't have and got a
panic, but it has happened to us.

Further to this subject, I'd be interested to know how it decides which
processes to destroy. The biggest ones? The newest ones? The lowest-priority
ones ... ?

--
 Ruth Milner          UUCP - {uunet,pyramid}!utai!helios.physics!sysruth
 Systems Manager      BITNET - sysruth@utorphys
 U. of Toronto        INTERNET - sysruth@helios.physics.toronto.edu
  Physics/Astronomy/CITA Computing Consortium
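
The oSYSLOG entry quoted in the post above suggests a simple way to pull crash records out of a saved log after a reboot. The sketch below assumes every panic record has the same shape as that one quoted line (timestamp, host, the literal word PANIC, then the message). The `find_panics` helper and the second sample line are made up for illustration.

```python
import re

# Matches lines shaped like the oSYSLOG entry quoted in the post:
#   "Sep 11 08:06:32 irides.physics PANIC: CPU 0: assertion failure!"
PANIC_LINE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}) (?P<host>\S+) PANIC: (?P<message>.*)$"
)


def find_panics(lines):
    """Yield (timestamp, host, message) for every PANIC line in an iterable of log lines."""
    for line in lines:
        match = PANIC_LINE.match(line.rstrip("\n"))
        if match:
            yield match.group("stamp"), match.group("host"), match.group("message")


sample = [
    "Sep 11 08:06:32 irides.physics PANIC: CPU 0: assertion failure!\n",
    "Sep 11 08:09:10 irides.physics unix: an invented, unrelated message\n",
]
panics = list(find_panics(sample))
print(panics)
```

To scan a real file, pass an open file object instead of the sample list, for example `find_panics(open("/usr/adm/oSYSLOG"))`, since the function only needs something that yields lines.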