eap@cs.bu.edu (Eric A. Pearce) (02/28/90)
We have an Iris 4D running 3.2.1 with:

2 25 MHZ IP7 Processors
FPU: MIPS R2010A/R3010 VLSI Floating Point Chip  Revision: 2.0
CPU: MIPS R2000A/R3000 Processor Chip  Revision: 2.0
Data cache size: 64 Kbytes
Instruction cache size: 64 Kbytes
Main memory size: 16 Mbytes
GT Graphics option installed
Integral Ethernet controller
Integral SCSI controller WD33C93
Disk drive: unit 2 on SCSI controller 0
Disk drive: unit 1 on SCSI controller 0
Tape drive: unit 7 on SCSI controller 0: QIC 150

It recently started behaving strangely - When trying to log in from the
console, the screen will go blank for a couple of minutes and then return
to the login prompt. The behavior seems random. The only unusual message
that I could find in SYSLOG was:

Feb 21 14:41:22 panda grcond[10521]: In limbo
Feb 21 14:42:07 panda grcond[10521]: Tried and failed 3 times to download graphics subsystem

I asked our usual service person and the SGI hotline people and nobody had
seen this message before.

Any ideas?

-e
-------------------------------------------------------------------------------
Eric Pearce                               eap@bu-pub.bu.edu
Boston University Information Technology
111 Cummington Street
Boston MA 02215
617-353-2780 voice  617-353-6260 fax
brian@cs.utah.edu (Brian Sturgill) (02/28/90)
> We have an Iris 4D running 3.2.1 with:
> It recently started behaving strangely - When trying to log in from the console,
> the screen will go blank for a couple of minutes and then return to the login
> prompt. The behavior seems random. The only unusual message that I could
> find in SYSLOG was:
>
> Feb 21 14:41:22 panda grcond[10521]: In limbo
> Feb 21 14:42:07 panda grcond[10521]: Tried and failed 3 times to download
> graphics subsystem
>
> I asked our usual service person and the SGI hotline people and nobody had
> seen this message before.
>
> Any ideas?

The main idea I get is that it is odd that SGI does not know about this
problem. ALL of our 4D/20's, and our 240GTX have this problem.

Looking at our SYSLOGs shows that this occurs 4.51 times per machine per day.
Often just before the limbo message we get:

    ... grcond[5015]: Child process /etc/gl/pandora exited with status 0

I do not know if the exact same mechanism is responsible, but we also
had the graphics servers crash so frequently (leaving a very large /core) that
I installed /core as a symlink to /dev/null.

It seems odd that it is not occurring regularly at SGI on their machines.
(Perhaps they have not upgraded to 3.2 yet?)

Brian

(Sorry if I sound grouchy, but I have been preparing a report about
our problems with SGI machines for the last 2 days, and am quite annoyed
about the number of problems we are having.)
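A rate like the "4.51 times per machine per day" quoted above could be
tallied from SYSLOG along the following lines. This is only a minimal
sketch, not the method actually used: it assumes every entry has the shape
of the two grcond lines quoted in this thread, the function names are
hypothetical, and the last two sample lines are invented for illustration.

```python
import re
from collections import Counter

# Assumption: entries look like the lines quoted in the thread, e.g.
# "Feb 21 14:42:07 panda grcond[10521]: Tried and failed 3 times ..."
LINE_RE = re.compile(
    r"^(\w{3})\s+(\d{1,2})\s+\d{2}:\d{2}:\d{2}\s+(\S+)\s+grcond\[\d+\]:\s+(.*)$"
)


def failures_per_day(lines, needle="Tried and failed"):
    """Count grcond messages containing `needle`, keyed by (host, month, day)."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m and needle in m.group(4):
            month, day, host = m.group(1), int(m.group(2)), m.group(3)
            counts[(host, month, day)] += 1
    return counts


def mean_per_machine_day(counts):
    """Average over the (host, day) pairs that logged at least one hit."""
    return sum(counts.values()) / len(counts) if counts else 0.0


# The first two lines come from the thread; the last two are made up.
sample = [
    "Feb 21 14:41:22 panda grcond[10521]: In limbo",
    "Feb 21 14:42:07 panda grcond[10521]: Tried and failed 3 times to download graphics subsystem",
    "Feb 21 18:03:10 panda grcond[10977]: Tried and failed 3 times to download graphics subsystem",
    "Feb 22 09:15:44 panda grcond[11203]: Tried and failed 3 times to download graphics subsystem",
]

counts = failures_per_day(sample)
print(counts)
print(mean_per_machine_day(counts))
```

Note that the average here only covers days on which at least one failure
was logged; quiet days would have to be counted separately to get a true
per-day rate. Passing needle="In limbo" counts the limbo messages instead.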
andru@rhialto.sgi.com (Andrew Myers) (02/28/90)
In article <1990Feb27.171242.7976@hellgate.utah.edu> brian@cs.utah.edu (Brian Sturgill) writes:
>> We have an Iris 4D running 3.2.1 with:
>> prompt. The behavior seems random. The only unusual message that I could
>> find in SYSLOG was:
>>
>> Feb 21 14:41:22 panda grcond[10521]: In limbo
>> Feb 21 14:42:07 panda grcond[10521]: Tried and failed 3 times to download
>> graphics subsystem
>>
>Often just before the limbo message we get:
>
> ... grcond[5015]: Child process /etc/gl/pandora exited with status 0

This message is perfectly normal, as is the limbo message.

If your system starts failing to download microcode, I think you can
get around it by logging in NOGRAPHICS, then logging out. This should
reset the graphics more thoroughly than pandora can. I'm not sure how
the state of the graphics gets corrupted in this way.

Andrew
ktl@wag240.caltech.edu (Kian-Tat Lim) (02/28/90)
We went around and around with SGI hotline personnel on this problem
last month. It was actually reported in this newsgroup back in December. I
finally managed to get a fix out of someone at SGI, but apparently the
hotline people have yet to hear about it.

The cause of the problem appears to be high CPU load producing
timeouts when downloading the graphics microcode into the graphics
processor. This supposedly only occurs on multiprocessor systems with at
least one processor saturated (100% usage). Apparently SGI didn't really
expect us to beat on these boxes...

The fix (which you should be able to confirm by talking to Gretchen
at the hotline or Momi, who apparently found the problem [sorry, I didn't
get last names]) is to change the variable network_processor in file
/usr/sysgen/master.d/kernel to 0 (zero) from its default value of 1 (one).
You must then run lboot (I used lboot -t -v -d) and reboot with the new
kernel. I suppose you might be able to adb the kernel if you were feeling
lucky, but neither SGI, I imagine, nor I will sanction that.

This fix has cured our problems so far, though we did have one other
suspicious hang after the fix when physical memory was very low that has
neither recurred nor been explained.

--
Kian-Tat Lim (ktl@wagvax.caltech.edu, KTL @ CITCHEM.BITNET, GEnie: K.LIM1)
brendan@illyria.wpd.sgi.com (Brendan Eich) (02/28/90)
In article <1990Feb27.171242.7976@hellgate.utah.edu>, brian@cs.utah.edu (Brian Sturgill) writes:
> > [. . .] The behavior seems random. The only unusual message that I could
> > find in SYSLOG was:
> >
> > Feb 21 14:41:22 panda grcond[10521]: In limbo
> > Feb 21 14:42:07 panda grcond[10521]: Tried and failed 3 times to download
> > graphics subsystem
> >
> > I asked our usual service person and the SGI hotline people and nobody had
> > seen this message before.
>
> The main idea I get is that it is odd that SGI does not know about this
> problem. ALL of our 4D/20's, and our 240GTX have this problem.

Do you get the "Tried and failed 3 times to download graphics subsystem"
message on all of your 4D/20's, or only some? On your 240GTX?

The reason I ask is that very different versions of the grcond program
are shipped for different models, according to their graphics hardware, and
>only the 240GTX version contains the "Tried and failed" message<. Has
someone inadvertently copied the 240GTX's /etc/gl/grcond to a 4D/20? Or does
the message you quote in fact occur only on your 240GTX?

> Looking at our SYSLOGs shows that this occurs 4.51 times per machine per day.
> Often just before the limbo message we get:
>
> ... grcond[5015]: Child process /etc/gl/pandora exited with status 0

This SYSLOG entry was intended to be informational (LOG_INFO) only, and does
not necessarily indicate a problem. Logging successful exit status does not
seem useful; perhaps this unduly alarming message should be eliminated.

> I do not know if the exact same mechanism is responsible, but we also
> had the graphics servers crash so frequently (leaving a very large /core) that
> I installed /core as a symlink to /dev/null.

The graphics server meaning /bin/news_server? Was there any SYSLOG message
from news_server (rather than from grcond) at the time of the coredump?

> It seems odd that it is not occurring regularly at SGI on their machines.
> (Perhaps they have not upgraded to 3.2 yet?)

We're running 3.2, 3.2.1, 3.2.2, and what will become 3.3 in engineering, on
hundreds of Iris 4Ds. Generally, engineers install and run a release long
before any customers see it. The only troubles I've had with news_server,
grcond, and microcode have been during development, when I used mismatched
versions.

I've heard of, but not seen, GT/GTX microcode problems that occasionally
result in SYSLOG messages and graphics crashes. I've had no such problems
with my 4D/20 in more than a year; I've been running 3.2 for about six
months.

> Brian

Brendan Eich
Silicon Graphics, Inc.
brendan@sgi.com
martinm@kmart.sgi.com (martin) (03/01/90)
In article <52878@bu.edu.bu.edu> eap@cs.bu.edu (Eric A. Pearce) writes:
>
> We have an Iris 4D running 3.2.1 with:
>
> 2 25 MHZ IP7 Processors
>
> It recently started behaving strangely - When trying to log in from the console,
> the screen will go blank for a couple of minutes and then return to the login
> prompt. The behavior seems random. The only unusual message that I could
> find in SYSLOG was:
>
> Feb 21 14:41:22 panda grcond[10521]: In limbo
> Feb 21 14:42:07 panda grcond[10521]: Tried and failed 3 times to download
> graphics subsystem

GTX's running 3.2 and 3.2.1 have this problem when there are jobs in the
background. if you have support, call the hotline. if not, a temporary
fix follows.

- edit file /usr/sysgen/master.d/kernel,
- change line that reads:
      int network_processor = 1;
  to
      int network_processor = 0;
- su to root and cd to /,
- type lboot (this will build a new kernel called unix.new in the
  current directory),
- mv unix unix.old,
- mv unix.new unix,
- sync, init 0, then restart the system.

this is in the last issue of the PipeLine.

Martin McDonald
SGI

Life is unpredictable. Eat dessert first.
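The one-line edit in the steps above can be sketched as a text
substitution. This is a hedged illustration only: it assumes the tunable
appears exactly in the form quoted in the post ("int network_processor =
1;"), the function name is hypothetical, and the sample text is an invented
stand-in rather than the real contents of /usr/sysgen/master.d/kernel. It
works on a string and never touches the file or the kernel.

```python
import re

# Assumption: the tunable sits on one line in the form quoted in the post,
# "int network_processor = 1;". Everything else is left untouched.
TUNABLE_RE = re.compile(
    r"^(\s*int\s+network_processor\s*=\s*)1(\s*;.*)$", re.MULTILINE
)


def disable_network_processor(text):
    """Return (new_text, changed) with network_processor set to 0."""
    new_text, n = TUNABLE_RE.subn(r"\g<1>0\g<2>", text)
    return new_text, n > 0


# Hypothetical excerpt; the real file contains much more than this.
sample = "/* hypothetical excerpt */\nint network_processor = 1;\n"

patched, changed = disable_network_processor(sample)
print(patched)
print(changed)
```

Running it a second time on the patched text reports no change, which
makes it easy to check whether the fix is already in place. The remaining
steps (lboot, moving unix.new into place, sync and restart) still have to
be done by hand as root.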