[comp.sys.sun] Sun-4 severe NFS problem

dan@watson.bbn.com (Dan Franklin) (03/31/89)

We're having severe NFS problems involving our (only) Sun 4/110, running
SunOS 4.0.1. The symptom is that a process attempting to copy (cp) a
"large" file (greater than 2k bytes or so) between this machine and any of
several others, including a Sun-3/160 (SunOS 3.4), a MicroVAX (Ultrix
2.3), and our diskless Sun-3/50 machines (SunOS 3.4), will almost always
hang.

We've seen the problem while copying:

1) from a Sun-4 directory to one on the Sun-3/160, while on the Sun-4,
2) from a MicroVAX directory to one on the Sun-4, while on the Sun-4,
3) from a Sun-4 directory to one on the Sun-3/160, while on a Sun-3/50.

We can copy tiny files without any problem.  But we get long delays when
copying larger files, ranging up to a delay of infinity :-)  We ran
experiments mostly copying between Suns (cases 1 and 3).  The definition of
"large" is not constant, but the threshold seems to be between 1k and 2k
bytes.  With files larger than that, the cp occasionally completes after a
long delay, but usually it hangs until it is interrupted.  Generally we get
an accompanying "NFS server <hostname> not responding still trying" error
message.

The trace command reveals that the cp hangs in a variety of places: doing
a "stat" on the destination directory, or writing to the destination file,
or closing it--but always an operation involving the destination.

While a cp is hung, all of the machines involved in the cp operation
continue to respond to other commands, including other NFS commands.
However, on the initiating machine, you cannot access the directory
containing the file being cp'd. For example, in case 1, an "ls", on the
Sun-4, of the remote directory containing the file being copied will also
hang.  But you can look at that file on the serving machine, as well as on
other machines besides the Sun-4 that have that file mounted.

Other network services, including FTP and rlogin, work perfectly.

These symptoms seem to be quite different from those discussed in other
Sun-4 hanging situations.  No nfsd ever ends up in a permanent "D" wait
state on any of the machines, including the Sun-4.  Unrelated NFS
activities on the two machines in question work fine.

Our problem sounded a little like the interrupt priority bug discussed by
Charles Hedrick recently, so I tried raising the priority of splnet() to 2
and then to 3 by patching the kernel according to his instructions.  It
didn't help.

Naturally, we've called the Sun Hotline.  They said they'd call back in a
few hours; so far it's been two days with no response.

This situation renders our brand new Sun-4 completely useless for the
reason we bought it.  We desperately need to get it to work.  Any
suggestions, hints, things to try, wild guesses, etc. will be gratefully
received.

	Dan Franklin
	dfranklin@bbn.com or dan@bbn.com

hedrick@geneva.rutgers.edu (Charles Hedrick) (04/25/89)

Yup, your problem does sound like ours.  However the patch to redefine
splnet turned out not to be the solution.  (Harmless, but not the
solution.)  As far as we know, the problem happens only on Sun 4's, and
only on machines with more than one Ethernet.  The problem is that the
second Ethernet interrupts at a different level than the first.  If you're
unlucky, one Ethernet can interrupt the other, and the queue management
can get messed up.  The fix we used requires source, though I think I
could fix it on a non-source machine if I had to.  I verified with a
friend inside Sun that the problem is known to Sun and has been fixed.
Ask the Hotline people for the new if_subr.o.  If you can't get the
Hotline to answer, I guess I could set things up to let you FTP if_subr.o
from us.  If you only have one Ethernet (or Ethernet-like device: FDDI,
802.X network interface, etc.), then this is probably a different problem.

dan@bbn.com (Dan Franklin) (04/25/89)

Thanks to everyone who responded.  After I sent the original message (but
before it appeared in Sun-Spots) someone here at BBN on our local Sun
mailing list suggested adjusting the buffer sizes (thanks Matt!), and that
did the trick.  Surprisingly, lowering wsize to 1024 wasn't enough; but
512 was.  So we now have the world's slowest Sun-4/110 (for I/O), but at
least it's usable.  Only wsize, not rsize, needed to be adjusted.  The
Sun-4 never had any problem receiving data.  I don't consider this to be
the conclusion of this episode--I would still like some way to achieve
reasonable NFS speeds--but it helps.
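
A mount entry along the following lines would apply this workaround.  It is
only a sketch: the server name, the paths, and every option other than
wsize=512 are hypothetical placeholders, not the configuration actually used
here.

```
# Hypothetical server and mount point; only wsize=512 comes from the fix above.
# rsize is left at its default because only writes needed the smaller size.
sun3server:/usr/proj /usr/proj nfs rw,bg,hard,wsize=512 0 0
```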

I have noticed one other anomaly: when doing a cp -r on the Sun-4 copying
files onto our MicroVAX, it often gets a "Permission denied" error in the
middle of copying a series of files to the same directory.  It never fails
on the same file twice, and it always succeeds when I run it again.  The
file in question does get created, but is zero-length.  All the files
are mode 444, so I could understand if it never worked at all (since NFS
is stateless), but to have it fail occasionally puzzles me.  But this is
only a minor annoyance right now.

	Dan Franklin

weltyc@fs3.cs.rpi.edu (Christopher A. Welty) (05/03/89)

dan@watson.bbn.com (Dan Franklin) writes:
>We're having severe NFS problems involving our (only) Sun 4/110, running
>SunOS 4.0.1. The symptom is that a process attempting to copy (cp) a
>"large" file (greater than 2k bytes or so) between this machine and any of
>several others, including a Sun-3/160 (SunOS 3.4), a MicroVAX (Ultrix
>2.3), and our diskless Sun-3/50 machines (SunOS 3.4), will almost always
>hang.

We used to have this problem and learned:

Whenever you mount an NFS filesystem to/from a machine that is vastly
different in speed (for us this is sun2<->sun3 and sun3<->sun4 and
sun2<->sun4) you should use the following options to mount:

slfs1.cs.rpi.edu:/us1 /fs1/us1 nfs bg,rw,soft,rsize=2048,wsize=2048,timeo=100 1 5

(This is a line from our fstab.)  You should also never mount NFS
partitions at the root level.

Christopher Welty  ---  Asst. Director, RPI CS Labs
weltyc@cs.rpi.edu             ...!njin!nyser!weltyc

lmb@vicom.com (Larry Blair) (05/11/89)

weltyc@fs3.cs.rpi.edu (Christopher A. Welty) writes:
=X-Sun-Spots-Digest: Volume 7, Issue 262, message 5 of 14
=Whenever you mount an NFS filesystem to/from a machine that is vastly
=different in speed (for us this is sun2<->sun3 and sun3<->sun4 and
=sun2<->sun4) you should use the following options to mount:
=
=slfs1.cs.rpi.edu:/us1 /fs1/us1 nfs bg,rw,soft,rsize=2048,wsize=2048,timeo=100 1 5
                                       ^^^^^^^
This is a situation guaranteed to cause disaster.  We've found that writing
to a soft mount can result in lost data.
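
One possible alternative, sketched here as an assumption and not as a tested
configuration, is the same fstab line with "soft" replaced by "hard,intr".
A hard mount keeps retrying a write until the server answers, and intr
should still let a stuck process be interrupted from the keyboard.

```
# Same entry as quoted above, with soft swapped for hard,intr (untested sketch).
slfs1.cs.rpi.edu:/us1 /fs1/us1 nfs bg,rw,hard,intr,rsize=2048,wsize=2048,timeo=100 1 5
```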
-- 
Larry Blair   ames!vsi1!lmb   lmb@vicom.com