[comp.sys.sgi] 4D-200 series hangs frequently

jdh@pub.bu.edu (Jason Heirtzler) (06/20/91)

We have a problem with our 4D-220 and 4D-240 series machines
hanging a lot.  The symptoms vary from just the window system
locking up (you can still rlogin in) to sometimes the whole
machine will hang -- and with five 4D-200 series machines, we
probably average one machine hung each day.  Sometimes, when
the machine(s) keep running, various combinations of running
/etc/gl/restart_gl and using the window system "hot key" sequence
(F12-/-whatever) will return the console to normal.  But then,
the other times, only /etc/reboot (or pressing "reset") will do.

After numerous calls to the hotline, there's been no improvement,
and I'm sure I've personally installed every "dot dot" release
since the early 3.1 days.  Everything is running release 3.3.2
at the moment, and I'm waiting for my latest call to be returned
with another "Gee.. dunno.. have you tried 3.3.3?"

The specifics:

	2 25 MHZ IP7 Processors
	FPU: MIPS R2010A/R3010 VLSI Floating Point Chip Revision: 2.0
	CPU: MIPS R2000A/R3000 Processor Chip Revision: 2.0
	Data cache size: 64 Kbytes
	Instruction cache size: 64 Kbytes
	Main memory size: 16 Mbytes
	Integral Ethernet controller: Version 2
	GT Graphics option installed
	Integral SCSI controller 0: Version WD33C93
	Tape drive: unit 7 on SCSI controller 0: QIC 150
	Disk drive: unit 2 on SCSI controller 0
	Disk drive: unit 1 on SCSI controller 0

Does anyone else have this problem?  We're running everything
pretty generic (no custom device drivers added, for instance) so
it's hard to believe we're alone.  What's the deal?  How do you
get a very serious problem fixed?  Do you think I look forward
to release 4.0?

Jason Heirtzler
Boston University

jmb@patton.wpd.sgi.com (Jim Barton) (06/20/91)

In article <84172@bu.edu>, jdh@pub.bu.edu (Jason Heirtzler) writes:
|> We have a problem with our 4D-220 and 4D-240 series machines
|> hanging a lot.  The symptoms vary from just the window system
|> locking up (you can still rlogin in) to sometimes the whole
|> machine will hang -- and with five 4D-200 series machines, we
|> probably average one machine hung each day.  Sometimes, when
|> the machine(s) keep running, various combinations of running
|> /etc/gl/restart_gl and using the window system "hot key" sequence
|> (F12-/-whatever) will return the console to normal.  But then,
|> the other times, only /etc/reboot (or pressing "reset") will do.

|> Does anyone else have this problem?  We're running everything
|> pretty generic (no custom device drivers added, for instance) so
|> it's hard to believe we're alone.  What's the deal?  How do you
|> get a very serious problem fixed?  Do you think I look forward
|> to release 4.0?

The hot line is not much help because few problems of this type are reported,
and your complaint doesn't say nearly enough.

For instance, what application are you running? Is it a commercial package,
or is it homebrew?

GL applications access the hardware pipe directly. Therefore it is possible for
a program to send garbage down the pipe, which can crash the graphics,
sometimes causing a hang. (Sure we could fix it. Then you wouldn't be
interested in the machines for graphics, 'cause they would run like a Sun :-)

Applications that deal directly with NeWS can also cause problems. Piping
postscript into NeWS without knowing what you're up to can cause all sorts of
interesting things to happen, including hangs.

We're very committed to making our customers happy, but you have to give us
as much information as possible so we can help out.

-- Jim Barton
   Silicon Graphics Computer Systems
   jmb@sgi.com

jwag@moose.asd.sgi.com (Chris Wagner) (06/21/91)

In article <84172@bu.edu>, jdh@pub.bu.edu (Jason Heirtzler) writes:
|> We have a problem with our 4D-220 and 4D-240 series machines
|> hanging a lot.  The symptoms vary from just the window system
|> locking up (you can still rlogin in) to sometimes the whole
|> machine will hang -- and with five 4D-200 series machines, we
|> probably average one machine hung each day.  Sometimes, when
|> the machine(s) keep running, various combinations of running
|> /etc/gl/restart_gl and using the window system "hot key" sequence
|> (F12-/-whatever) will return the console to normal.  But then,
|> the other times, only /etc/reboot (or pressing "reset") will do.
|> 
|> After numerous calls to the hotline, there's been no improvement,
|> and I'm sure I've personally installed every "dot dot" release
|> since the early 3.1 days.  Everything is running release 3.3.2
|> at the moment, and I'm waiting for my latest call to be returned
|> with another "Gee.. dunno.. have you tried 3.3.3?"
|> 
-- 

The problems you describe most likely stem from a few different
issues.  When trying to improve things, it is usually important
to start by classifying the 'hangs'.

For example, if the graphics wedges, and the rest of the system (network, etc)
seems ok, then look in /usr/adm/SYSLOG for any messages from the
graphics hardware, and do some ps listings to see if there is a particular
process that is usually present, that is doing graphics...
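
One way to make that SYSLOG check repeatable is a small Python filter.
This is only a sketch: the keywords below are guesses, since the actual
wording of graphics messages in SYSLOG isn't given here, and the function
name is made up for illustration.

```python
import re
from collections import Counter

# Hypothetical keywords; adjust them to whatever your SYSLOG actually says.
KEYWORDS = ("graphics", "gl", "pipe")


def count_keyword_lines(lines, keywords=KEYWORDS):
    """Count lines that mention each keyword as a whole word, ignoring case."""
    patterns = {
        key: re.compile(r"\b" + re.escape(key) + r"\b", re.IGNORECASE)
        for key in keywords
    }
    counts = Counter()
    for line in lines:
        for key, pattern in patterns.items():
            if pattern.search(line):
                counts[key] += 1
    return counts
```

To try it, open /usr/adm/SYSLOG and pass the file object (or a list of
its lines) to count_keyword_lines().  Comparing the counts from a day
with a graphics wedge against a quiet day may show which messages matter.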


As for the entire machine hang, again, trying to classify the problems
can help to zero in on the cause.

so:
1) any nfs hard mounts???

2) any suspicious logs in SYSLOG (like disk errors??)

3) can you ping it

4) are the front panel LED digits blinking???

5) can you rsh in (not rlogin necessarily)
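
The ping/rsh questions above can be turned into a rough classifier, so
that every hang gets logged with the same label.  A minimal Python sketch
follows; the function name and the labels are invented here, and the three
inputs are simply what you observed by hand.

```python
def classify_hang(ping_ok, rsh_ok, console_ok):
    """Return a rough label for a hang from three yes/no observations."""
    if console_ok and ping_ok and rsh_ok:
        return "no hang observed"
    if not ping_ok:
        # Nothing answers on the network at all.
        return "whole machine (no network response)"
    if not rsh_ok:
        # ping is answered inside the kernel, so the kernel is alive,
        # but a new user process could not be started remotely.
        return "kernel answers ping, but user processes are stuck"
    # Network and remote shells work; only the console is wedged.
    return "window system only (remote shell still works)"
```

For example, classify_hang(True, True, False) describes the case where you
can still get in over the network but the console is locked up.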

There are also some statistics that may help, like running netstat -m
to determine network memory usage and sar to determine system load.
Sometimes these statistics can help characterize what you're
doing that's slightly different from others and therefore bringing
out some bug (software or hardware).

I would also suggest running the ecc(1) command to be sure that
your memory is ok.

Listings of the number of users (and how they are logged in: telnet,
rlogin, ftp?) are also useful.


This data should be able to help the hotline - keep bugging them!!!

(and by the way, have you tried 3.3.3? :-)

----
Chris Wagner (jwag@sgi.com)

ramani@Hookipa.Stanford.EDU (Ramani Pichumani) (06/22/91)

On a related note, we have a 4D/380 which frequently hangs up for a
period of anywhere from 30 seconds to 2 minutes.  The funny thing is
that when a process hangs up (say the shell), I can still log in from
another window and everything works fine.  The hung process eventually
comes back to life but it is very annoying to say the least.  Does
anyone know what causes this scenario?  It happens several times a
week, sometimes several times a day.  We are running IRIX 3.3.1 with
8 processors.

Thanks,
Ramani Pichumani

--
Ramani Pichumani                           Tel: (415) 725-3398 or 322-4623
Section on Medical Informatics                         Fax: (415) 725-7944
Medical School Office Bldg X-215             email: ramani@cs.stanford.edu
Stanford, CA 94305 USA                        uunet!cs.stanford.edu!ramani

moraes@cs.toronto.edu (Mark Moraes) (06/30/91)

For what it's worth, we have a 4D/240 and a 4D/340 that hang
frequently too. (frequently == every day at the worst of times, once
every four days at the best; we've been waiting for them to pass the
canonical "can it stay up and working for 30 days" test for over a
year now :-) (Took our Sun4/280 a year and a half to reach this
blissful state, so this must be something about modern OSes) Both
systems run 3.3.1, neither has a graphics console.  Both have their
console serial lines wired to a Develcon Develswitch, so we can get at
them remotely when need be.  Both act as fileservers for diskless
Sun3s, have users login from our terminal server, and from X terminals
or workstations.

The 240 runs a non-standard Ciprico disk controller and driver, so
it's possible that the problems on it are our fault.  However, the 340
is standard SGI hardware and software, just a few kernel constants
(like streams buffers) cranked up (and a couple of streams code fixes
that solved some of the more frequent hangs)

Typical hang conditions are when the system has a dozen users or so --
the 340 usually has all four processors busy with crunch jobs in the
background, the 240 usually has a couple of processors idle.  Both
systems are frequently pushed to the limit, the 340 more often than
the 240 (the 240 hangs more often, though) which must be some part of
the problem, because other less loaded 240s and 280s around here have
stayed up for months on end.  

Both systems have a lot of NFS hard mounts, including cross-mounts.
We're well aware that an NFS server going down can hang them, but
there have been many hangs that cannot be explained by this.  (We also
mount NFS directories in /nfs/machine/filesystem to try to avoid some
of the problems)

I've seen some correlation between hangs and a home directory file
system filling up.  Not too conclusive, though.  (Both machines have a
reasonable amount of swap -- 200Mb or so; we're well aware that a
process filling up swap can degrade the system impressively as it does
its best to make the process dump core:-)
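
To firm up a correlation like that, one could log file system usage at
regular intervals and compare the timestamps against the hang times.  A
minimal Python sketch, assuming a Python interpreter is available on the
machine; the function names are made up, and the path is whatever file
system you suspect.

```python
import shutil
import time


def usage_fraction(total, used):
    """Fraction of the file system in use; 0.0 if the size is unknown."""
    return used / total if total else 0.0


def sample(path):
    """Return one timestamped log line with the usage of `path`."""
    du = shutil.disk_usage(path)
    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    percent = 100 * usage_fraction(du.total, du.used)
    return "%s %s %.1f%%" % (stamp, path, percent)
```

Appending the output of sample() for the home directory file system to a
log every few minutes would show whether it was nearly full just before
each hang.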

I know we ought to have reported this in more detail before, but we've
been embarrassingly sluggish about collecting enough facts to make this
sort of report useful to the kernel folk we contact at SGI, and calls
to the hotline about this sort of problem produce, um, less than
helpful answers once we confirm that we have lots of space in /tmp, we
age logs regularly, and have lots of swap space.

	Mark.