[comp.unix.sysv386] SCO NFS dies when heavily used

mrm@nss1.simpact.com (Michael R. Miller) (06/07/91)

We are having a problem with our SCO NFS package.  It seems that when
we start doing large amounts of NFS work, both NFS and TCP/IP simply
die.  There are streams resources available when this happens.

The SCO OS/NFS is exporting its directory.  A SUN OS/NFS is importing
the directory.  Large numbers of reads and writes are going back and
forth for some time -- sometimes just a few minutes to an hour, other
times a couple of days -- and then the software decides to roll over
and play dead.  We need to reboot the machine to breathe life into its
networking support.

The SUN's NFS continues to operate although that window is "dead"
with the program running in the window waiting for a never-to-be-answered
NFS request.  We have determined that the SUN isn't at fault by successfully
reading and writing another NFS mounted directory exported by another SUN.
The SUN is an OS 4.1 product.

The SCO is UNIX 3.2.  It's running on an AST Premium 33MHz 8Meg computer.
The hard disk has plenty of space left on it when it dies.  Our network
card is a WD8003.

Please email responses to me.  I will summarize what I receive, plus a
description of how we resolved the problem, once we have resolved it.
Thanks in advance for the help.

Michael R. Miller
Simpact Associates, Inc.

jim@tiamat.fsc.com ( IT Manager) (06/07/91)

In article <1991Jun06.171047.15327@nss1.com>, mrm@nss1.simpact.com (Michael R. Miller) writes:
> We are having a problem with our SCO NFS package.  It seems that when
> we start doing large amounts of NFS work, both NFS and TCP/IP simply
> die.  There are streams resources available when this happens.

This reminds me of a question I've been meaning to ask.  Under Xenix, there
is a program called "sw" which does a really nice job of reporting, in
real-time, the Streams resources in use.  Is there an equivalent function
under SCO Unix?  We have all the parameters really high on one system, since
it is heavily used, but my guess is that some of them are too high, and that
there are a lot of unused resources that could be returned to user space.

Any ideas?
------------- 
James B. O'Connor			jim@tiamat.fsc.com
Ahlstrom Filtration, Inc.		615/821-4022 x. 651

larry@nstar.rn.com (Larry Snyder) (06/08/91)

jim@tiamat.fsc.com ( IT Manager) writes:

>This reminds me of a question I've been meaning to ask.  Under Xenix, there
>is a program called "sw" which does a really nice job of reporting, in
>real-time, the Streams resources in use.  Is there an equivalent function
>under SCO Unix?  We have all the parameters really high on one system, since
>it is heavily used, but my guess is that some of them are too high, and that
>there are a lot of unused resources that could be returned to user space.

Have you tried netstat -m?


-- 
      Larry Snyder, NSTAR Public Access Unix 219-289-0287/317-251-7391
                         HST/PEP/V.32/v.32bis/v.42bis 
                        regional UUCP mapping coordinator 
               {larry@nstar.rn.com, ..!uunet!nstar.rn.com!larry}

jtsillas@sprite.ma30.bull.com (James Tsillas) (06/08/91)

The only way I've managed to get this info is by running 'crash' and
entering 'strstat'.

-Jim.
--
 == James Tsillas                    Bull HN Information Systems Inc. ==
 == (508) 294-2937                   300 Concord Road   826A          ==
 == jtsillas@bubba.ma30.bull.com     Billerica, MA 01821              ==
 ==                                                                   ==
 == The opinions expressed above are solely my own and do not reflect ==
 == those of my employer.                                             ==
		    -== no solicitations please ==-

vlr@litwin.com (Vic Rice) (06/10/91)

In <1991Jun07.210220.5073@nstar.rn.com> larry@nstar.rn.com (Larry Snyder) writes:

>jim@tiamat.fsc.com ( IT Manager) writes:

>>This reminds me of a question I've been meaning to ask.  Under Xenix, there
>>is a program called "sw" which does a really nice job of reporting, in
>>real-time, the Streams resources in use.  Is there an equivalent function
>>under SCO Unix?  We have all the parameters really high on one system, since
>>it is heavily used, but my guess is that some of them are too high, and that
>>there are a lot of unused resources that could be returned to user space.

>have you tried netstat -m ?


This yields the following on SCO ODT 1.1:

# netstat -m
netstat: Memory information not currently supported
-- 
Dr. Victor L. Rice
Litwin Process Automation

wes@harem.clydeunix.com (Barnacle Wes) (06/15/91)

In article <1991Jun06.171047.15327@nss1.com>, mrm@nss1.simpact.com (Michael R. Miller) writes:
> The SCO OS/NFS is exporting its directory.  A SUN OS/NFS is importing
> the directory.  Large numbers of reads and writes are going back and
> forth for some time -- sometimes just a few minutes to an hour, other
> times a couple of days -- and then the software decides to roll over
> and play dead.  We need to reboot the machine to breathe life into its
> networking support.
> 
> The SUN's NFS continues to operate although that window is "dead"
> with the program running in the window waiting for a never-to-be-answered
> NFS request.  We have determined that the SUN isn't at fault by successfully
> reading and writing another NFS mounted directory exported by another SUN.
> The SUN is an OS 4.1 product.

This doesn't necessarily mean that the Sun NFS is correct, or bug-free,
but just that Sun NFS has a bug-set that is compatible with (surprise!)
Sun NFS.  If you have another SCO system, try doing the same test with
an SCO client & server.  This may help to narrow the possibilities.

Also, when you encounter this problem, does the entire network on the
SCO box die, or just NFS?  In other words, do telnet, ping, finger, etc.
still work?  If so, it may just be a problem with SCO-NFS.  If it is
crashing the entire network, including inetd, the problem may be in
your TCP/IP software rather than the NFS server.  Does nfsstat show any
problems before or after the crash, such as lots of rpc badcalls?

Good luck bug-hunting.

	Wes Peters
-- 
#include <std/disclaimer.h>                               The worst day sailing
My opinions, your screen.                                   is much better than
Raxco had nothing to do with this!                        the best day at work.
     Wes Peters:  wes@harem.clydeunix.com   ...!sun!unislc!harem!wes

larryp@sco.COM (Larry Philps) (06/17/91)

In <342@harem.clydeunix.com> wes@harem.clydeunix.com (Barnacle Wes) writes:

> In article <1991Jun06.171047.15327@nss1.com>, mrm@nss1.simpact.com (Michael R. Miller) writes:
> > The SCO OS/NFS is exporting its directory.  A SUN OS/NFS is importing
> > the directory.  Large numbers of reads and writes are going back and
> > forth for some time -- sometimes just a few minutes to an hour, other
> > times a couple of days -- and then the software decides to roll over
> > and play dead.  We need to reboot the machine to breathe life into its
> > networking support.
> > 
> > The SUN's NFS continues to operate although that window is "dead"
> > with the program running in the window waiting for a never-to-be-answered
> > NFS request.  We have determined that the SUN isn't at fault by successfully
> > reading and writing another NFS mounted directory exported by another SUN.
> > The SUN is an OS 4.1 product.
> 
> This doesn't necessarily mean that the Sun NFS is correct, or bug-free,
> but just that Sun NFS has a bug-set that is compatible with (surprise!)
> Sun NFS.  If you have another SCO system, try doing the same test with
> an SCO client & server.  This may help to narrow the possibilities.
> 
> Also, when you encounter this problem, does the entire network on the
> SCO box die, or just NFS?  In other words, do telnet, ping, finger, etc.
> still work?  If so, it may just be a problem with SCO-NFS.  If it is
> crashing the entire network, including inetd, the problem may be in
> your TCP/IP software rather than the NFS server.  Does nfsstat show any
> problems before or after the crash, such as lots of rpc badcalls?
>
> Good luck bug-hunting.

I sent mail to Michael Miller regarding this problem, but since the
question has now resurfaced a week later, I figured I should let
everybody in on the scoop.

This *bug* has already been found and fixed.  Please note that the
problem is in the WD8003 driver, not NFS.

It turns out that in certain circumstances (transmitting while under
extremely heavy receive loads), the WD 8003 card can drop a transmit
interrupt.  The driver did not check for, and thus did not recover from
this situation.  This will produce exactly the symptoms Michael is
seeing.
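
For anyone curious what "check for and recover from" a dropped transmit
interrupt can look like, here is a minimal sketch of a transmit watchdog.
It is not the actual SCO driver code: the structure, the function names
and the timeout value are all made up for illustration, and a real driver
would reset the transmitter and restart its output queue where the
comment says so.

```c
#include <stdbool.h>

/* Hypothetical driver state; names are illustrative, not from the SCO driver. */
struct tx_state {
    bool busy;              /* a transmit was started and has not completed */
    unsigned long started;  /* tick count when that transmit was started */
    unsigned long resets;   /* how many times the watchdog had to recover */
};

#define TX_TIMEOUT_TICKS 100UL  /* assumed timeout, chosen arbitrarily */

/* Called when a packet is handed to the card. */
static void tx_start(struct tx_state *tx, unsigned long now)
{
    tx->busy = true;
    tx->started = now;
}

/* Called from the transmit-complete interrupt handler. */
static void tx_complete(struct tx_state *tx)
{
    tx->busy = false;
}

/*
 * Called periodically from a timer.  If a transmit has been outstanding
 * for too long, assume the interrupt was lost and recover.
 * Returns true when a recovery was performed.
 */
static bool tx_watchdog(struct tx_state *tx, unsigned long now)
{
    if (tx->busy && now - tx->started >= TX_TIMEOUT_TICKS) {
        tx->busy = false;   /* a real driver would reset the transmitter here */
        tx->resets++;
        return true;
    }
    return false;
}
```

Without something like tx_watchdog(), a single lost interrupt leaves busy
set forever and nothing is ever transmitted again, which matches the
"networking is dead until reboot" symptom.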

We also found that under even heavier loads, the entire system could
hang.  This turned out to be the result of the NIC chip on the board
putting a bogus value into the next packet pointer register.  If this
bogus value was 0, the driver would loop forever at spl5.
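
A rough sketch of the kind of sanity check that guards against a bogus
next packet pointer follows.  Again, this is not the shipped fix.  It
assumes a receive ring of pages between a start and a stop page, with
each packet recording the page of the next one; the names, the ring
bounds and the next_of[] table standing in for the on-card packet
headers are all hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed receive ring bounds, in pages; the values used are illustrative. */
struct rx_ring {
    uint8_t page_start;  /* first page of the ring */
    uint8_t page_stop;   /* one past the last page of the ring */
};

/* A next-packet pointer is only believable if it lies inside the ring. */
static bool next_page_valid(const struct rx_ring *r, uint8_t next)
{
    return next >= r->page_start && next < r->page_stop;
}

/*
 * Walk the ring from cur until write_ptr is reached.  next_of[] has 256
 * entries and simulates the next-packet field of each packet header.
 * Returns the number of packets consumed, or -1 if a bogus pointer or a
 * cycle is detected, in which case the caller should reset the card
 * instead of continuing to loop.
 */
static int rx_drain(const struct rx_ring *r, const uint8_t *next_of,
                    uint8_t cur, uint8_t write_ptr)
{
    int budget = r->page_stop - r->page_start;  /* at most one packet per page */
    int count = 0;

    while (cur != write_ptr) {
        uint8_t next = next_of[cur];
        if (!next_page_valid(r, next) || budget-- <= 0)
            return -1;
        count++;
        cur = next;
    }
    return count;
}
```

The two checks do different jobs: the range check rejects a pointer such
as 0 that lies outside the ring, and the budget bounds the loop even if
the pointers are in range but never reach write_ptr.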

Both bugs have been fixed in the current driver, which is now shipping
as part of the LLI Drivers EFS.  You can get this from support for a
fee of approx $50 (I think), or uucp download it for free from sosco or
ftp it for free from sco-archive on uunet.

---
Larry Philps,	 SCO Canada, Inc.
Postman:  130 Bloor St. West, 10th floor, Toronto, Ontario.  M5S 1N5
InterNet: larryp@sco.COM  or larryp%scocan@uunet.uu.net
UUCP:	  {uunet,utcsri,sco}!scocan!larryp
Phone:	  (416) 922-1937