iglesias@orion.cf.uci.edu (Mike Iglesias) (09/28/89)
We're using a 4-processor Sequent Symmetry system as an NFS server for various systems (Sun 3s, Sun 4s, Sun 386s, etc.) around campus. Our campus network is divided into subnets, with each subnet served by a cisco router and the routers connected by a fiber-optic backbone.

On occasion (more often than we'd like), the client systems will start printing messages about the NFS server not responding. This goes on for about 45 seconds or so, and then the client can talk to the server again. We don't see this on Suns that are on the same subnet as the Sequent server, which could indicate it's a network problem.

The problem is that we don't really know how to figure out whether it's a network problem or a problem with the Sequent or Sequent's NFS product. We've tried playing with the NFS timeout and retransmission values, but we haven't been very successful in curing the problem without using somewhat high values (timeouts of 2 seconds and retransmissions set to 9).

If anyone has any experience with Sequent's NFS product they'd like to relate, or advice on how we can figure out where the problem is, please let me know.

Thanks in advance,

Mike Iglesias
University of California, Irvine
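[For reference, the timeout and retransmission values above translate into NFS mount options roughly as follows. This is a sketch in /etc/fstab syntax; the hostname and path are made up, and note that on most 4.xBSD-derived systems timeo is given in tenths of a second:]

```
# /etc/fstab entry (illustrative): 2-second timeout (timeo is in
# tenths of a second) and 9 retransmissions before the client prints
# "NFS server not responding".
sequent:/export/home  /home  nfs  rw,timeo=20,retrans=9  0  0
```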
csg@pyramid.pyramid.com (Carl S. Gutekunst) (09/28/89)
In article <3022@orion.cf.uci.edu> iglesias@orion.oac.uci.edu (Mike Iglesias) writes:
>On occasion (more often than we'd like), the client systems will start
>printing messages about the NFS server not responding. This goes on
>for about 45 seconds or so, and then the client can talk to the server
>again. We don't see this on Suns that are on the same subnet as the
>Sequent server, which could indicate it's a network problem.

That's the key that tells you it's a routing problem. You can easily verify this by attempting an rlogin at the same time NFS is having trouble. If the rlogin just sits there, then routing is the problem. Whether it's something you can easily fix is less clear.

Older versions of routed, the route daemon, are known for going catatonic, spitting up, and other rude behavior that can result in networks temporarily being separated from each other; and last I knew, Dynix was still using the 4.2BSD routed. Also SunOS older than, oh, 3.4 or so.

Figuring out who dropped what involves using netstat -r to determine which routes are vanishing at which nodes. If you can find a TCP/IP wizard to watch the netstat output while you are having troubles, you may be able to find a single node that is eating routes because of some trivial configuration problem. More often than not, it will be the gateway machine itself that is having problems.

The draconian solution is to use static routing instead of dynamic. You then use the route(8) command to add the routes by hand in /etc/rc.boot, and do *not* start routed(8). It actually works very well if your network is small and you don't change the topology very often. If it's the gateway that is dropping routes, then you use static routing on the gateway, and dynamic on all other machines.

<csg>
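[A static-routing setup along the lines Carl describes might look like this in /etc/rc.boot. This is a sketch: the network numbers and gateway address are invented, and route(8) argument syntax varies slightly between BSD derivatives:]

```
# /etc/rc.boot fragment (illustrative addresses): install fixed routes
# at boot so the machine never depends on routed's dynamic updates.
/etc/route add default 128.200.1.1 1            # default via the cisco on our subnet
/etc/route add net 128.200.2.0 128.200.1.1 1    # another campus subnet, one hop away
# When using static routes, do NOT start routed(8) later in the rc files.
```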
eckert@immd4.informatik.uni-erlangen.de (Toerless Eckert) (09/28/89)
From article <85680@pyramid.pyramid.com>, by csg@pyramid.pyramid.com (Carl S. Gutekunst):
> In article <3022@orion.cf.uci.edu> iglesias@orion.oac.uci.edu (Mike Iglesias) writes:
>>On occasion (more often than we'd like), the client systems will start
>>printing messages about the NFS server not responding. This goes on
>>for about 45 seconds or so, and then the client can talk to the server
>>again. We don't see this on Suns that are on the same subnet as the
>>Sequent server, which could indicate it's a network problem.
>
> That's the key that tells you it's a routing problem. You can easily verify
> this by attempting an rlogin at the same time as NFS is having trouble. If
> the rlogin just sits there, then routing is the problem.

I don't think this is a routing problem, given that no other machines experience it. Also, if the UDP packets from the NFS client cannot find their way to the NFS server, the error message is usually:

  "NFS <op> failed for server <server>: RPC: Unable to send"

whereas the error message for a non-responding NFS server that is reachable through the network is usually:

  "NFS server <server> not responding, [still trying]"

This was the error that Mike Iglesias described. We are experiencing the same problems with our S81, and the clients are on a directly connected network! We have been tracing this for a long time, and in fact our Sequent is losing packets. Sequent has acknowledged the problem; they say it is a hardware problem in the Ethernet interface on the SCED board.

One remarkable effect is that the Sequent stops serving any NFS requests for 30 to 45 seconds from time to time. This follows directly from the packet losses of the Ethernet interface. The overall effect is that we cannot get the expected NFS performance from the Sequent (we are using it as an NFS server for both swapping and ordinary filesystem access for several Sun clients).

Help?
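[One way to watch for the kind of interface packet loss described above is to check the error counters in netstat -i. A small sketch, assuming the usual 4.xBSD column layout of Name/Mtu/Net/Ipkts/Ierrs/Opkts/Oerrs/Coll — check your system's header line before trusting the field positions:]

```shell
# Flag any interface whose input (Ierrs) or output (Oerrs) error
# counters are nonzero; run periodically while the NFS hangs occur.
netstat -i | awk 'NR > 1 && ($5 > 0 || $7 > 0) { print $1, "Ierrs=" $5, "Oerrs=" $7 }'
```

Steadily rising Ierrs on the server's interface while clients report "not responding" would point at the interface hardware rather than at routing.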
I don't know yet; maybe we will try the VME-bus-based Ethernet controller.

Toerless Eckert
X.400:  <S=eckert;OU=informatik;P=uni-erlangen;A=dbp;C=de>
RFC822: eckert@informatik.uni-erlangen.de
UUCP:   {pyramid,unido}!fauern!eckert
BITNET: tte@derrze0
jtkohl@quicksilver.MIT.EDU (John T Kohl) (09/28/89)
In article <3022@orion.cf.uci.edu> iglesias@orion.cf.uci.edu (Mike Iglesias) writes:
> On occasion (more often than we'd like), the client systems will start
> printing messages about the NFS server not responding. This goes on
> for about 45 seconds or so, and then the client can talk to the server
> again.
One thing to look at is your NFS read/write block size. We found here
at MIT that our network router software/hardware had some problems with
the 8k transfers normally used (which must be fragmented, and are usually
sent out back-to-back at maximum Ethernet packet size). Although the
network group here has finally fixed the software, Athena has for a long
time used 1k read/write packets to avoid exercising this bug.
I have seen this problem both with MIT's custom C-gateway code and with
a vendor's stock hardware/software solution with very similar topology
(lots of IP subnets on ethernet, connected to routers which sit on a
central fiber-optic campus spine).
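[Shrinking the NFS transfer size as John describes is done with the rsize/wsize mount options. A sketch in /etc/fstab syntax; the hostname and path are illustrative:]

```
# /etc/fstab entry (illustrative): limit NFS reads and writes to 1 KB
# so each transfer fits in a single Ethernet frame and needs no IP
# fragmentation through the routers.
server:/export/athena  /mnt/athena  nfs  rw,rsize=1024,wsize=1024  0  0
```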
--
John Kohl <jtkohl@ATHENA.MIT.EDU> or <jtkohl@Kolvir.Brookline.MA.US>
Digital Equipment Corporation/Project Athena
(The above opinions are MINE. Don't put my words in somebody else's mouth!)