[comp.sys.sgi] problems with automount

dixons%phvax.dnet@SMITHKLINE.COM (01/23/91)

I have been having problems with automount.  The details are a little
involved so first my question:
Has anyone been using automount with Irix 3.3.1 on multiprocessor machines?
If so I would like to hear of your experiences, successful or unsuccessful.

Now for the gory details:
I have been having problems which seem to be related to automount although
the hotline thus far claims that they haven't had any other reports of
problems.  
The first observed problem is usually with df.  It prints out the hard (fstab)
mounted disks but hangs up before printing info about automounted disks (my
first clue that something was wrong with automount).  Then gradually (over about
an hour or so) other disk related things (like ls and pwd) begin to hang.  When
a process hangs in this case, you can't interrupt it or kill it although you
can ^Z it and leave it in the background.  Eventually you can no longer log
in to the system (although compute bound jobs seem to continue to run) and
the only thing to do is push the reset button.  Trying to reboot while you can
still log in at the console doesn't seem to help, since system shutdown
hangs while unmounting disks and you still have to reset the system.
This behaviour seems to have nothing to do with what jobs are running on the
system, since it occurs whether the machines are running multiple jobs or are
idle.  Sometimes it takes a few days to happen and sometimes it happens within
5 minutes of turning on automounter.  If you nfs mount the same set of disks
and servers via fstab then the machine seems to run fine for weeks.  Change
over to automount and boom.  Until recently this had occurred only on a
240GTX.  We have been running a couple of PIs with the same or similar
map files without problems.  Recently we installed five 380 servers and within
a week of firing them up, three have hung up in a similar way.  They are all
running Irix 3.3.1.
The only common thread I can see is that the machines that fail are all
multiprocessor machines while our single processor PIs chug along fine.
There are a number of reasons why using automounter would make administration
of all of these machines a lot easier so I would like to get to the bottom
of this.  In the meantime, I guess I will have to go back to fstab mounts
of the various disks.
If anyone else is seeing similar problems, or is running automount fine
on MP machines, I would like to hear about it.
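
For anyone who wants to catch the df symptom early, here is a minimal sketch
of a mount probe. It assumes a system with Python 3, and the names
probe_path and probe_all are made up for illustration. Each check runs in a
child process with a deadline, so the probe itself does not wedge when a
mount stops answering. A child that misses the deadline is reported as hung
and is not waited on, since a process stuck this way may ignore kill.

```python
import subprocess
import sys
import time


def probe_path(path, timeout=5.0, command=None):
    """Return 'ok', 'error', or 'hung' for one path.

    The check runs in a child process so the caller never blocks on a
    wedged mount. A hung child is deliberately not waited on, since a
    process stuck in the kernel may ignore kill.
    """
    if command is None:
        # Assumption: a stat() from a separate interpreter is enough to
        # touch the mount; running df or ls on the path is an alternative.
        command = [sys.executable, "-c",
                   "import os, sys; os.stat(sys.argv[1])", path]
    child = subprocess.Popen(command,
                             stdout=subprocess.DEVNULL,
                             stderr=subprocess.DEVNULL)
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = child.poll()
        if status is not None:
            return "ok" if status == 0 else "error"
        time.sleep(0.05)
    child.kill()  # best effort; may have no effect on a wedged process
    return "hung"


def probe_all(paths, timeout=5.0):
    """Probe each path in order and return a {path: state} dict."""
    return {path: probe_path(path, timeout) for path in paths}
```

Calling probe_all on the list of automounted directories from cron, and
logging the first "hung" result, would at least timestamp when the trouble
starts. The list of paths is site-specific and not shown here.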

Scott Dixon (dixons@smithkline.com)

slevy@poincare.geom.umn.edu (Stuart Levy) (01/24/91)

In article <9101231344.AA15399@smithkline.com> dixons%phvax.dnet@SMITHKLINE.COM writes:
>Has anyone been using automount with Irix 3.3.1 on multiprocessor machines?
>...
>First observed problems usually are with df...
>Then gradually (over about an hour or so) other disk related things
>(like ls and pwd) begin to hang.  When a process hangs in this case, you can't
>interrupt it or kill it although you can ^Z it and leave it in the background.
>Eventually you can no longer log in to the system (although compute bound jobs
>seem to continue to run) and the only thing to do is push the reset button....
>If anyone else is seeing similar problems, or is running automount fine
>on MP machines, I would like to hear about it.
>
>Scott Dixon (dixons@smithkline.com)

We're running 3.3.1 on an MP machine using NFS with *NO* automounting or
lockd/statd, but we do have occasional problems similar to yours, I think.

Some disk-related things will hang while others keep working for a while,
as gradually more and more shell windows wedge; existing programs (clock, NeWS
interaction) seem to keep running; it's impossible to log in (network daemons
respond but logins hang before reaching a shell prompt); inetd-started daemons
still answer for a while, then TCP connections cease to open (maybe when the
listen() queues fill?).  No "NFS server not responding" messages appear.
I keep thinking there's some important inode ("/"?) getting locked,
but it's hard to tell.  It happens fairly quickly -- "a while" on our system
tends to be ~5 minutes rather than an hour.
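
To watch from another machine for the point where TCP connections stop
opening, something like the following rough sketch could be used. It assumes
Python 3 on the observing host; probe_port is a made-up name, and the host
and port to check are up to you. A "timeout" result (as opposed to
"refused") would be consistent with a full listen() queue, but does not
prove it.

```python
import socket


def probe_port(host, port, timeout=3.0):
    """Try one TCP connection and classify the outcome.

    Returns 'open' if the connection was established, 'refused' if the
    peer rejected it, 'timeout' if nothing answered within the deadline,
    and 'error' for any other socket failure.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        return "refused"
    except socket.timeout:
        return "timeout"
    except OSError:
        return "error"
```

Run periodically against a port that inetd serves on the suspect machine,
this gives a record of when the state changes from "open" to something else.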

SGI support was sympathetic, but it was hard to pin anything down.
Since talking to them we've started running a network daemon that lets you
cause a panic remotely, i.e. get a crash dump.  (Anyone wanting this daemon,
let me know.)  This syndrome has recurred once since then; I haven't yet shown
SGI the resulting dump, and can't see how to get much out of it with dbx -k.

   Stuart Levy, Geometry Group, University of Minnesota
   slevy@geom.umn.edu