[comp.sys.sequent] Problems using Sequent Symmetry as NFS server?

iglesias@orion.cf.uci.edu (Mike Iglesias) (09/28/89)

We're using a 4-processor Sequent Symmetry system as an NFS server for
various systems (Sun 3s, Sun 4s, Sun 386s, etc) around campus.  Our
campus network has different areas divided into subnets with each
subnet served by a cisco router, and the routers connected by a fiber
optic backbone. 

On occasion (more often than we'd like), the client systems will start
printing messages about the NFS server not responding.  This goes on
for about 45 seconds or so, and then the client can talk to the server
again.  We don't see this on Suns that are on the same subnet as the
Sequent server, which could indicate it's a network problem.  The
problem is that we don't really know how to go about figuring out
whether it's a network problem or a problem with the Sequent or
Sequent's NFS product.  We've tried playing with the NFS timeout and
retransmission values, but we haven't been very successful in curing the
problem without using somewhat high values (timeouts of 2 seconds and
retransmissions set to 9).
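[For concreteness, those values correspond to a mount invocation along
these lines; this is only a sketch, the server and path names are
hypothetical, and note that on SunOS-style mount(8) timeo is given in
tenths of a second:]

```shell
# timeo=20 is a 2-second initial timeout (tenths of a second);
# retrans=9 retries nine times before reporting "not responding".
# "sequent:/export/home" and "/home" are hypothetical names.
mount -o rw,timeo=20,retrans=9 sequent:/export/home /home
```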

If anyone has any experience with Sequent's NFS product they'd like to
relate or on how we can figure out where the problem is, please let me
know. 


Thanks in advance,

Mike Iglesias
University of California, Irvine

csg@pyramid.pyramid.com (Carl S. Gutekunst) (09/28/89)

In article <3022@orion.cf.uci.edu> iglesias@orion.oac.uci.edu (Mike Iglesias) writes:
>On occasion (more often than we'd like), the client systems will start
>printing messages about the NFS server not responding.  This goes on
>for about 45 seconds or so, and then the client can talk to the server
>again.  We don't see this on Suns that are on the same subnet as the
>Sequent server, which could indicate it's a network problem.

That's the key that tells you it's a routing problem. You can easily verify
this by attempting an rlogin at the same time as NFS is having trouble. If
the rlogin just sits there, then routing is the problem.
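[A quick way to run that test while the hang is in progress; the host
name is hypothetical:]

```shell
# While a client is printing "NFS server not responding", try from
# that same client (replace "sequent" with your server's name):
rlogin sequent    # if this hangs too, packets aren't getting through
ping sequent      # no replies points to the same conclusion
```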

Whether it's something you can easily fix is less clear. Older versions of
routed, the route daemon, are known for going catatonic, spitting up, and
other rude behavior that can result in networks temporarily being separated
from each other; and last I knew, Dynix was still using the 4.2BSD routed.
So was SunOS older than, oh, 3.4 or so.

Figuring out who dropped what involves using netstat -r to determine which
routes are vanishing at which nodes. If you can find a TCP/IP wizard to watch
the netstat output while you are having troubles, you may be able to find a
single node that is eating routes because of some trivial configuration
problem. More often than not, it will be the gateway machine itself that is
having problems.
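[If no wizard is handy, a shell loop that snapshots the routing table
can stand in for one; the interval and file names here are arbitrary:]

```shell
# Snapshot the kernel routing table every 10 seconds and print a
# timestamped diff whenever it changes; run this on each suspect
# gateway while the NFS hangs are occurring.
netstat -rn > /tmp/routes.prev
while sleep 10; do
    netstat -rn > /tmp/routes.now
    cmp -s /tmp/routes.prev /tmp/routes.now || {
        date
        diff /tmp/routes.prev /tmp/routes.now
    }
    mv /tmp/routes.now /tmp/routes.prev
done
```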

The draconian solution is to use static routing instead of dynamic. You then
use the route(8) command to add the routes by hand in /etc/rc.boot, and do
*not* start routed(8). It actually works very well if your network is small
and you don't change the topology very often. If it's the gateway that is
dropping routes, then you use static routing on the gateway, and dynamic on
all other machines.
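[In /etc/rc.boot that might look something like this; the network
numbers and gateway address are made-up examples, and the syntax is the
old BSD route(8) form of destination, gateway, metric:]

```shell
# Static routes added at boot time; all addresses are hypothetical.
# 4.2/4.3BSD route(8) syntax: route add destination gateway metric
/etc/route add 128.200.8.0 128.200.1.1 1
/etc/route add 128.200.9.0 128.200.1.1 1
# ...and do NOT start /etc/routed afterwards.
```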

<csg>

eckert@immd4.informatik.uni-erlangen.de (Toerless Eckert) (09/28/89)

From article <85680@pyramid.pyramid.com>, by csg@pyramid.pyramid.com (Carl S. Gutekunst):
> In article <3022@orion.cf.uci.edu> iglesias@orion.oac.uci.edu (Mike Iglesias) writes:
>>On occasion (more often than we'd like), the client systems will start
>>printing messages about the NFS server not responding.  This goes on
>>for about 45 seconds or so, and then the client can talk to the server
>>again.  We don't see this on Suns that are on the same subnet as the
>>Sequent server, which could indicate it's a network problem.
> 
> That's the key that tells you it's a routing problem. You can easily verify
> this by attempting an rlogin at the same time as NFS is having trouble. If
> the rlogin just sits there, then routing is the problem.

I don't think that the problem is a routing problem, given that
no other machines experience this. Also, if the UDP packets from
the NFS client cannot find their way to the NFS server, the
error message is usually:

"NFS <op> failed for server <server>: RPC: Unable to send"

whereas the error message for a non-responding NFS server that is
reachable through the network is usually:

"NFS server <server> not responding, [still trying]"

This was the error that Mike Iglesias described. We are experiencing
the same problems with our S81, and the clients are on a directly
connected network! We have traced this for a long time, and in fact,
our Sequent is losing packets.

This problem has been acknowledged by Sequent; they say that it
is a hardware problem in the Ethernet interface on the SCED.

One remarkable effect is that the Sequent stops serving any NFS requests
for 30 to 45 seconds from time to time. This follows directly
from the packet losses on the Ethernet interface.
The overall effect is that we cannot get the expected NFS performance
from the Sequent (we are using it as an NFS server for both swapping
and normal filesystem access for several Sun clients).

Help? I don't know yet; maybe we will try the VME-bus-based
Ethernet controller.


Toerless Eckert X.400: <S=eckert;OU=informatik;P=uni-erlangen;A=dbp;C=de>
		RFC822: eckert@informatik.uni-erlangen.de
		UUCP:   {pyramid,unido}!fauern!eckert BITNET: tte@derrze0

jtkohl@quicksilver.MIT.EDU (John T Kohl) (09/28/89)

In article <3022@orion.cf.uci.edu> iglesias@orion.cf.uci.edu (Mike Iglesias) writes:

   On occasion (more often than we'd like), the client systems will start
   printing messages about the NFS server not responding.  This goes on
   for about 45 seconds or so, and then the client can talk to the server
   again.  

One thing to look at is your NFS read/write block size.  We found here
at MIT that our network router software/hardware had some problems with
the 8k packets normally used (which must be fragmented, and are usually
sent out back-to-back at maximum Ethernet packet size).  Although the
network group here has finally fixed the software, Athena has for a long
time used 1k read/write packets to avoid exercising this bug.
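[The 1k workaround is just a pair of mount options; a sketch, with a
hypothetical server name and export path:]

```shell
# Limit NFS reads and writes to 1 KB so each transfer fits in a
# single Ethernet frame and the router never has to cope with
# back-to-back IP fragments of an 8 KB RPC.
mount -o rw,rsize=1024,wsize=1024 server:/export/athena /mnt
```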

I have seen this problem both with MIT's custom C-gateway code and with
a vendor's stock hardware/software solution with very similar topology
(lots of IP subnets on ethernet, connected to routers which sit on a
central fiber-optic campus spine).

--
John Kohl <jtkohl@ATHENA.MIT.EDU> or <jtkohl@Kolvir.Brookline.MA.US>
Digital Equipment Corporation/Project Athena
(The above opinions are MINE.  Don't put my words in somebody else's mouth!)