dan@watson.bbn.com (Dan Franklin) (03/31/89)
We're having severe NFS problems involving our (only) Sun 4/110, running SunOS 4.0.1. The symptom is that a process attempting to copy (cp) a "large" file (greater than 2k bytes or so) between this machine and any of several others, including a Sun-3/160 (SunOS 3.4), a MicroVAX (Ultrix 2.3), and our diskless Sun-3/50 machines (SunOS 3.4), will almost always hang.

We've seen the problem while copying:
1) from a Sun-4 directory to one on the Sun-3/160, while on the Sun-4,
2) from a Vax directory to one on the Sun-4, while on the Sun-4,
3) from a Sun-4 directory to one on the Sun-3/160, while on a Sun-3/50.

We can copy tiny files without any problem, but with larger files we get long delays, ranging up to a delay of infinity :-) We ran experiments mostly copying between Suns (cases 1 and 3). The definition of "large" is not constant, but the threshold seems to be between 1k and 2k bytes. With files larger than that, the cp hangs; occasionally it returns on its own, but usually it does not return until it is interrupted. We generally get an accompanying "NFS server <hostname> not responding still trying" error message.

The trace command reveals that the cp hangs in a variety of places: doing a "stat" on the destination directory, or writing to the destination file, or closing it--but always in an operation involving the destination.

While a cp is hung, all of the machines involved in the cp operation continue to respond to other commands, including other NFS commands. However, on the initiating machine, you cannot access the directory containing the file being cp'd. For example, in case 1, an "ls", on the Sun-4, of the remote directory containing the file being copied will also hang. But you can look at that file on the serving machine, as well as on other machines besides the Sun-4 that have that file mounted. Other network services, including FTP and rlogin, work perfectly.
These symptoms seem to be quite different from those discussed in other Sun-4 hanging situations. No nfsd ever ends up in a permanent "D" wait state on any of the machines, including the Sun-4. Unrelated NFS activities on the two machines in question work fine.

Our problem sounded a little like the interrupt priority bug discussed by Charles Hedrick recently, so I tried raising the priority of splnet() to 2 and then to 3 by patching the kernel according to his instructions. It didn't help.

Naturally, we've called the Sun Hotline. They said they'd call back in a few hours; so far it's been two days with no response.

This situation renders our brand new Sun-4 completely useless for the purpose we bought it for. We desperately need to get it to work. Any suggestions, hints, things to try, wild guesses, etc. will be gratefully received.

Dan Franklin
dfranklin@bbn.com or dan@bbn.com
hedrick@geneva.rutgers.edu (Charles Hedrick) (04/25/89)
Yup, your problem does sound like ours. However, the patch to redefine splnet turned out not to be the solution. (Harmless, but not the solution.)

As far as we know, the problem happens only on Sun 4's, and only on machines with more than one Ethernet. The problem is that the second Ethernet interrupts at a different level than the first. If you're unlucky, one Ethernet can interrupt the other, and the queue management can get messed up.

The fix we used requires source, though I think I could fix it on a non-source machine if I had to. I verified with a friend inside Sun that the problem is known to Sun and has been fixed. Ask the Hotline people for the new if_subr.o. If you can't get the Hotline to answer, I guess I could set things up to let you FTP if_subr.o from us.

If you only have one Ethernet (or Ethernet-like device: FDDI, 802.X network interface, etc.), then this is probably a different problem.
dan@bbn.com (Dan Franklin) (04/25/89)
Thanks to everyone who responded. After I sent the original message (but before it appeared in Sun-Spots) someone here at BBN on our local Sun mailing list suggested adjusting the buffer sizes (thanks Matt!), and that did the trick. Surprisingly, lowering wsize to 1024 wasn't enough, but 512 was. So we now have the world's slowest Sun-4/110 (for I/O), but at least it's usable.

Only wsize, not rsize, needed to be adjusted. The Sun-4 never had any problem receiving data. I don't consider this to be the conclusion of this episode--I would still like some way to achieve reasonable NFS speeds--but it helps.

I have noticed one other anomaly: when doing a cp -r on the Sun-4 copying files onto our MicroVAX, it often gets a "Permission denied" error in the middle of copying a series of files to the same directory. It never fails on the same file twice, and it always succeeds when I run it again. The file in question does get created, but is zero-length. All the files are mode 444, so I could understand if it never worked at all (since NFS is stateless), but to have it fail only occasionally puzzles me. But this is only a minor annoyance right now.

Dan Franklin
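The I/O cost of the wsize workaround described above can be estimated with simple arithmetic: each write request carries at most wsize bytes, so a smaller wsize means proportionally more requests per file. The sketch below is only an illustration. The function name `write_rpcs`, the 1 MB example file, and the 8192-byte figure used as a "typical default" transfer size are assumptions, not values taken from the thread.

```python
# Assumption: 8192 bytes is used as a typical default NFS transfer size.
# The thread itself only mentions wsize values of 2048, 1024 and 512.
ASSUMED_DEFAULT_WSIZE = 8192


def write_rpcs(file_size: int, wsize: int) -> int:
    """Return how many write requests are needed to send file_size bytes
    when each request carries at most wsize bytes (ceiling division)."""
    if file_size < 0:
        raise ValueError("file_size must not be negative")
    if wsize <= 0:
        raise ValueError("wsize must be positive")
    return (file_size + wsize - 1) // wsize


if __name__ == "__main__":
    one_megabyte = 1024 * 1024  # hypothetical example file size
    for wsize in (ASSUMED_DEFAULT_WSIZE, 2048, 1024, 512):
        count = write_rpcs(one_megabyte, wsize)
        print(f"wsize={wsize:5d}: {count:5d} write requests")
```

Under these assumptions, a 1 MB file needs 2048 write requests at wsize=512 versus 128 at 8192, which gives a rough sense of why the workaround is usable but slow. It says nothing about why the larger writes hang in the first place.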
weltyc@fs3.cs.rpi.edu (Christopher A. Welty) (05/03/89)
dan@watson.bbn.com (Dan Franklin) writes:
>We're having severe NFS problems involving our (only) Sun 4/110, running
>SunOS 4.0.1. The symptom is that a process attempting to copy (cp) a
>"large" file (greater than 2k bytes or so) between this machine and any of
>several others, including a Sun-3/160 (SunOS 3.4), a MicroVAX (Ultrix
>2.3), and our diskless Sun-3/50 machines (SunOS 3.4), will almost always
>hang.

We used to have this problem and learned: Whenever you mount an NFS filesystem to/from a machine that is vastly different in speed (for us this is sun2<->sun3 and sun3<->sun4 and sun2<->sun4) you should use the following options to mount:

slfs1.cs.rpi.edu:/us1 /fs1/us1 nfs bg,rw,soft,rsize=2048,wsize=2048,timeo=100 1 5

(this is a line from our fstab). You should also never mount nfs partitions at the root level.

Christopher Welty --- Asst. Director, RPI CS Labs
weltyc@cs.rpi.edu ...!njin!nyser!weltyc
lmb@vicom.com (Larry Blair) (05/11/89)
weltyc@fs3.cs.rpi.edu (Christopher A. Welty) writes:
=X-Sun-Spots-Digest: Volume 7, Issue 262, message 5 of 14
=Whenever you mount an NFS filesystem to/from a machine that is vastly
=different in speed (for us this is sun2<->sun3 and sun3<->sun4 and
=sun2<->sun4) you should use the following options to mount:
=
=slfs1.cs.rpi.edu:/us1 /fs1/us1 nfs bg,rw,soft,rsize=2048,wsize=2048,timeo=100 1 5
                                       ^^^^^^^
This is a situation guaranteed to cause disaster. We've found that writing
to a soft mount can result in lost data.
--
Larry Blair ames!vsi1!lmb lmb@vicom.com
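The concern about writable soft mounts can be checked mechanically against an fstab entry. The sketch below is a minimal illustration, assuming the six-field layout of the line quoted in the thread (filesystem, mount point, type, options, dump frequency, pass number). The function names `parse_fstab_line` and `soft_write_warnings` are hypothetical, and the check only encodes the rule of thumb stated above; it is not any vendor's tool.

```python
def parse_fstab_line(line: str) -> dict:
    """Split one fstab entry into its six whitespace-separated fields.

    Options such as "wsize=2048" become {"wsize": "2048"};
    bare flags such as "soft" become {"soft": True}.
    """
    fields = line.split()
    if len(fields) != 6:
        raise ValueError(f"expected 6 fields, got {len(fields)}")
    fsname, mount_point, fstype, raw_opts, freq, passno = fields
    options = {}
    for opt in raw_opts.split(","):
        key, sep, value = opt.partition("=")
        options[key] = value if sep else True
    return {
        "fsname": fsname,
        "mount_point": mount_point,
        "type": fstype,
        "options": options,
        "freq": int(freq),
        "passno": int(passno),
    }


def soft_write_warnings(entry: dict) -> list:
    """Flag NFS entries that are soft-mounted and not read-only."""
    opts = entry["options"]
    warnings = []
    if entry["type"] == "nfs" and "soft" in opts and "ro" not in opts:
        warnings.append(
            f"{entry['mount_point']}: writable soft mount; "
            "a write that times out may be lost"
        )
    return warnings


if __name__ == "__main__":
    # The fstab line quoted in the thread.
    line = ("slfs1.cs.rpi.edu:/us1 /fs1/us1 nfs "
            "bg,rw,soft,rsize=2048,wsize=2048,timeo=100 1 5")
    for warning in soft_write_warnings(parse_fstab_line(line)):
        print(warning)
```

Run against the quoted line, this reports one warning for /fs1/us1. Replacing `soft` with `hard` in the options, or mounting read-only, silences it.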