[comp.sys.sgi] Intermittent Login Problems

eap@cs.bu.edu (Eric A. Pearce) (02/28/90)

 We have an Iris 4D running 3.2.1 with:

 2 25 MHZ IP7 Processors
 FPU: MIPS R2010A/R3010 VLSI Floating Point Chip Revision: 2.0
 CPU: MIPS R2000A/R3000 Processor Chip Revision: 2.0
 Data cache size: 64 Kbytes
 Instruction cache size: 64 Kbytes
 Main memory size: 16 Mbytes
 GT Graphics option installed
 Integral Ethernet controller
 Integral SCSI controller WD33C93
 Disk drive: unit 2 on SCSI controller 0
 Disk drive: unit 1 on SCSI controller 0
 Tape drive: unit 7 on SCSI controller 0: QIC 150

 It recently started behaving strangely - When trying to log in from the console,
 the screen will go blank for a couple of minutes and then return to the login 
 prompt.   The behavior seems random.   The only unusual message that I could 
 find in SYSLOG was:

 Feb 21 14:41:22 panda grcond[10521]: In limbo
 Feb 21 14:42:07 panda grcond[10521]: Tried and failed 3 times to download 
 graphics subsystem

 I asked our usual service person and the SGI hotline people and nobody had 
 seen this message before.

 Any ideas?

 -e

-------------------------------------------------------------------------------
 Eric Pearce eap@bu-pub.bu.edu
 Boston University Information Technology      
 111 Cummington Street  Boston MA 02215  617-353-2780 voice  617-353-6260 fax

brian@cs.utah.edu (Brian Sturgill) (02/28/90)

>  We have an Iris 4D running 3.2.1 with:
>  It recently started behaving strangely - When trying to log in from the console,
>  the screen will go blank for a couple of minutes and then return to the login 
>  prompt.   The behavior seems random.   The only unusual message that I could 
>  find in SYSLOG was:
> 
>  Feb 21 14:41:22 panda grcond[10521]: In limbo
>  Feb 21 14:42:07 panda grcond[10521]: Tried and failed 3 times to download 
>  graphics subsystem
> 
>  I asked our usual service person and the SGI hotline people and nobody had 
>  seen this message before.
> 
>  Any ideas?

The main idea I get is that it is odd that SGI does not know about this
problem.  ALL of our 4D/20's, and our 240GTX have this problem.
Looking at our SYSLOGs shows that this occurs 4.51 times per machine per day.
Often just before the limbo message we get:

	... grcond[5015]: Child process /etc/gl/pandora exited with status 0
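
A per-machine, per-day rate like the one quoted above can be tallied from
SYSLOG text with a short script. This is only a sketch, not the method
actually used for the figure: it assumes each entry has the layout shown in
this thread (month, day, time, host, then "grcond[pid]: In limbo"), and the
name limbo_rate is made up for illustration. It averages only over host/day
pairs that logged at least one message, so quiet days are not counted.

```python
import re
from collections import Counter

# Assumed layout, taken from the entries quoted in this thread:
#   "Feb 21 14:41:22 panda grcond[10521]: In limbo"
LIMBO = re.compile(
    r"^\s*(?P<month>[A-Z][a-z]{2})\s+(?P<day>\d{1,2})\s+\d{2}:\d{2}:\d{2}\s+"
    r"(?P<host>\S+)\s+grcond\[\d+\]:\s+In limbo"
)


def limbo_rate(lines):
    """Mean number of 'In limbo' entries per (host, day) pair that has any."""
    per_host_day = Counter()
    for line in lines:
        m = LIMBO.match(line)
        if m:
            per_host_day[(m["host"], m["month"], int(m["day"]))] += 1
    if not per_host_day:
        return 0.0
    return sum(per_host_day.values()) / len(per_host_day)
```

Feeding it the lines of every machine's SYSLOG gives one combined figure;
calling it once per file gives a figure per machine.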

I do not know if the exact same mechanism is responsible, but we also
had the graphics servers crash so frequently (leaving a very large /core) that
I installed /core as a symlink to /dev/null.

It seems odd that it is not occurring regularly at SGI on their machines.
(Perhaps they have not upgraded to 3.2 yet?)

Brian

(Sorry if I sound grouchy, but I have been preparing a report about
our problems with SGI machines for the last 2 days, and am quite
annoyed about the number of problems we are having.)

andru@rhialto.sgi.com (Andrew Myers) (02/28/90)

In article <1990Feb27.171242.7976@hellgate.utah.edu> brian@cs.utah.edu (Brian Sturgill) writes:
>>  We have an Iris 4D running 3.2.1 with:
>>  prompt.   The behavior seems random.   The only unusual message that I could 
>>  find in SYSLOG was:
>> 
>>  Feb 21 14:41:22 panda grcond[10521]: In limbo
>>  Feb 21 14:42:07 panda grcond[10521]: Tried and failed 3 times to download 
>>  graphics subsystem
>> 
>Often just before the limbo message we get:
>
>	... grcond[5015]: Child process /etc/gl/pandora exited with status 0

    This message is perfectly normal, as is the limbo message.

    If your system starts failing to download microcode, I think you can
    get around it by logging in NOGRAPHICS, then logging out. This should
    reset the graphics more thoroughly than pandora can. I'm not sure how
    the state of the graphics gets corrupted in this way.

Andrew

ktl@wag240.caltech.edu (Kian-Tat Lim) (02/28/90)

	We went around and around with SGI hotline personnel on this
problem last month.  It was actually reported in this newsgroup back
in December.  I finally managed to get a fix out of someone at SGI,
but apparently the hotline people have yet to hear about it.

	The cause of the problem appears to be high CPU load producing
timeouts when downloading the graphics microcode into the graphics
processor.  This supposedly only occurs on multiprocessor systems with
at least one processor saturated (100% usage).  Apparently SGI didn't
really expect us to beat on these boxes...

	The fix (which you should be able to confirm by talking to
Gretchen at the hotline or Momi, who apparently found the problem
[sorry, I didn't get last names]) is to change the variable
network_processor in file /usr/sysgen/master.d/kernel to 0 (zero) from
its default value of 1 (one).  You must then run lboot (I used lboot
-t -v -d) and reboot with the new kernel.  I suppose you might be able
to adb the kernel if you were feeling lucky, but neither SGI, I
imagine, nor I will sanction that.

	This fix has cured our problems so far, though we did have one
other suspicious hang after the fix when physical memory was very low
that has neither recurred nor been explained.

--
Kian-Tat Lim (ktl@wagvax.caltech.edu, KTL @ CITCHEM.BITNET, GEnie: K.LIM1)

brendan@illyria.wpd.sgi.com (Brendan Eich) (02/28/90)

In article <1990Feb27.171242.7976@hellgate.utah.edu>, brian@cs.utah.edu (Brian Sturgill) writes:
> >  [. . .] The behavior seems random.   The only unusual message that I could 
> >  find in SYSLOG was:
> > 
> >  Feb 21 14:41:22 panda grcond[10521]: In limbo
> >  Feb 21 14:42:07 panda grcond[10521]: Tried and failed 3 times to download 
> >  graphics subsystem
> > 
> >  I asked our usual service person and the SGI hotline people and nobody had 
> >  seen this message before.
> 
> The main idea I get is that it is odd that SGI does not know about this
> problem.  ALL of our 4D/20's, and our 240GTX have this problem.

Do you get the "Tried and failed 3 times to download graphics subsystem"
message on all of your 4D/20's, or only some?  On your 240GTX?  I ask
because very different versions of the grcond program are shipped
for different models, according to their graphics hardware, and >only the
240GTX version contains the "Tried and failed" message<.

Has someone inadvertently copied the 240GTX's /etc/gl/grcond to a 4D/20?
Or does the message you quote in fact occur only on your 240GTX?
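
One way to answer the question above is to list which hosts actually logged
the "Tried and failed" text. The sketch below assumes the SYSLOG layout
quoted in this thread, with the host name in the fourth field and the whole
message on one line; the function name is hypothetical.

```python
import re

# Assumed layout: "Feb 21 14:42:07 panda grcond[10521]: Tried and failed ..."
FAILED = re.compile(
    r"^\s*\w{3}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2}\s+(\S+)\s+"
    r"grcond\[\d+\]:\s+Tried and failed"
)


def hosts_with_download_failure(lines):
    """Return the sorted host names whose grcond logged 'Tried and failed'."""
    return sorted({m.group(1) for m in map(FAILED.match, lines) if m})
```

If only the 240GTX shows up in the result, the 4D/20's are seeing the limbo
message alone.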

> Looking at our SYSLOGs shows that this occurs 4.51 times per machine per day.
> Often just before the limbo message we get:
> 
> 	... grcond[5015]: Child process /etc/gl/pandora exited with status 0

This SYSLOG entry was intended to be informational (LOG_INFO) only, and does
not necessarily indicate a problem.  Logging successful exit status does not
seem useful; perhaps this unduly alarming message should be eliminated.

> I do not know if the exact same mechanism is responsible, but we also
> had the graphics servers crash so frequently (leaving a very large /core) that
> I installed /core as a symlink to /dev/null.

By the graphics server, do you mean /bin/news_server?  Was there any SYSLOG message
from news_server (rather than from grcond) at the time of the coredump?

> It seems odd that it is not occurring regularly at SGI on their machines.
> (Perhaps they have not upgraded to 3.2 yet?)

We're running 3.2, 3.2.1, 3.2.2, and what will become 3.3 in engineering,
on hundreds of Iris 4Ds.  Generally, engineers install and run a release
long before any customers see it.

The only troubles I've had with news_server, grcond, and microcode have
been during development, when I used mismatched versions.  I've heard of,
but not seen, GT/GTX microcode problems that occasionally result in SYSLOG
messages and graphics crashes.  I've had no such problems with my 4D/20 in
more than a year; I've been running 3.2 for about six months.

> Brian

Brendan Eich
Silicon Graphics, Inc.
brendan@sgi.com

martinm@kmart.sgi.com (martin) (03/01/90)

In article <52878@bu.edu.bu.edu> eap@cs.bu.edu (Eric A. Pearce) writes:
>
> We have an Iris 4D running 3.2.1 with:
>
> 2 25 MHZ IP7 Processors
>
> It recently started behaving strangely - When trying to log in from the console,
> the screen will go blank for a couple of minutes and then return to the login 
> prompt.   The behavior seems random.   The only unusual message that I could 
> find in SYSLOG was:
>
> Feb 21 14:41:22 panda grcond[10521]: In limbo
> Feb 21 14:42:07 panda grcond[10521]: Tried and failed 3 times to download 
> graphics subsystem

GTX's running 3.2 and 3.2.1 have this problem when there are jobs in the 
background. If you have support, call the hotline. If not, a temporary fix
follows.

  - edit file /usr/sysgen/master.d/kernel,
    changing the line that reads:
        int network_processor = 1;
    to
        int network_processor = 0;
  - su to root and cd to /,
  - type lboot (this will build a new kernel
    called unix.new in the current directory),
  - mv unix unix.old,
  - mv unix.new unix,
  - sync, init 0, then restart the system.
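
The first step above (the one-line edit) can be sketched as a small text
transformation. This is an illustration only: it works on the contents of
the file passed in as a string, assumes the line looks exactly like the one
quoted above, and the function name is invented here. It does not run lboot
or touch the kernel; the remaining steps stay manual.

```python
import re

# Matches the declaration quoted in the post: "int network_processor = 1;"
SETTING = re.compile(
    r"^(\s*int\s+network_processor\s*=\s*)1(\s*;)", re.MULTILINE
)


def disable_network_processor(text):
    """Return (new_text, changed) with network_processor set to 0."""
    new_text, count = SETTING.subn(r"\g<1>0\g<2>", text)
    return new_text, count > 0
```

Running it on a copy of the file first, and comparing the result with the
original, is a cheap way to confirm that only the intended line changed.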

this is in the last issue of the PipeLine.

Martin McDonald
SGI
			Life is unpredictable.
			  Eat dessert first.