[comp.sys.sun] Networking problems in unusual configuration

reg@lti.com (Rick Genter x18) (04/22/89)

I have a bizarre situation.  We have a Sun-3/180 server hooked up to an
Isolan multiport repeater.  We have four network legs connected to the
repeater (including the leg containing the server).  We have ~20 diskless
and dataless Suns hooked up to the network; some are on the same leg as
the server, while some are on the other legs.

The Suns on the server leg work fine.  The Suns on the other legs work
marginally, requiring us to cut the buffer sizes for NFS reads and writes
to 1024.  Anything larger and we get NFS Server not responding/still
trying.
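
For reference, the NFS transfer size determines how many Ethernet frames
each request or reply occupies once IP fragments it. The sketch below is a
rough estimate, not a measurement from this network: it assumes NFS over
UDP, a 1500-byte Ethernet MTU, a 20-byte IP header, and a guessed 128 bytes
of RPC/NFS header overhead (RPC_OVERHEAD is an illustrative figure; the
real value varies per call).

```python
import math

ETHERNET_MTU = 1500   # bytes of IP datagram carried per Ethernet frame
IP_HEADER = 20        # IPv4 header without options
UDP_HEADER = 8
RPC_OVERHEAD = 128    # ASSUMPTION: approximate RPC + NFS header size


def frames_per_transfer(xfer_size, rpc_overhead=RPC_OVERHEAD):
    """Estimate the Ethernet frames needed for one NFS transfer of xfer_size bytes."""
    udp_datagram = UDP_HEADER + rpc_overhead + xfer_size
    # Every fragment except the last must carry a multiple of 8 bytes.
    per_fragment = (ETHERNET_MTU - IP_HEADER) // 8 * 8
    return math.ceil(udp_datagram / per_fragment)


if __name__ == "__main__":
    for size in (1024, 2048, 4096, 8192):
        print(size, frames_per_transfer(size))
```

Under these assumptions a 1024-byte transfer fits in a single frame, while
an 8192-byte transfer is split into six.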

For a long time we suspected the repeater, but then we got an Ethernet
test box (handy dandy little piece of equipment) and blasted packets
across the repeater in great volumes during times of great load, and never
lost a single packet.  What's even more bizarre is that the Suns on the
other legs go through tftpboot and arp and everything just fine, until
they start to "heavily" use NFS.  Thus I suspect there is some parameter
(other than rsize/wsize) in NFS that needs tweaking.

The rsize/wsize tweak is not an acceptable option; the non-server-leg Suns
run *SIGNIFICANTLY* slower than the server-leg Suns, to the point of
almost being unusable.  I tried fiddling with timeo, but with 8K buffers
even a timeo of 50 (5.0 seconds) doesn't work.

We have a variation of the problem which is that some machines on the
non-server leg will work with rsize/wsize=1024 when configured as dataless
nodes, but if you try to make them diskless nodes it doesn't matter what
size you give, they hang during /etc/rc.  They always hang executing "ps
-U"; if you comment that line out, they hang at the next command in rc
that does "significant" I/O.  (ps -U builds the file /etc/psdatabase --
it's not clear what ps does with this file, but it doesn't seem to work
without it.)  "Hanging" means "NFS Server not responding/still trying".

We've been around and around with Sun on this; I have yet to talk to
someone who struck me as halfway competent.  I've been dealing with Suns
since '82 and am no novice at this stuff, but I can't seem to convince Sun
of that (I get bullsh*t suggestions like "are your cables making good
contact").

Has anyone out there in 'Spots-Land seen anything like this?  Frankly,
we're stumped, and we're tired of dealing with the incompetent bozos at
Sun.  Thanks for hearing my gripe.

					- reg
--
Rick Genter					...!{buita,bbn}!lti!reg
Language Technology, Inc.			reg%lti.uucp@bu-it.bu.edu
27 Congress St., Salem, MA 01970		(508) 741-1507

[[ This was added later:  --wnl ]]

A followup to my last message: all machines are Sun-3s, and all are running
4.0.1 with tty patch and itrunc patch.
					-reg

sater@cs.vu.nl (Hans van Staveren) (05/05/89)

We had exactly the same problem, I think. If it only happens to 3/60's and
not to 3/50's it must be the same. Our ISOLAN repeaters were swapped to
get it fixed. It seems that there is a crystal in early models of it that
is just a teeny bit out of spec. The Sun 3/60 seems to be marginal but
just in spec, so the combination fails. Every large packet sent from a
3/60 through a repeater has a pretty high chance of being munged.

Since an 8K NFS write is 6 Ethernet packets, and all six are retransmitted
when any one of them fails, the chance of all six getting through gets
close to zero, giving the effect you describe.
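
The arithmetic behind that can be sketched as follows. This is only an
illustration: it assumes each packet is damaged independently of the
others, and the per-packet survival rate of 0.5 is a made-up number, since
no measured loss rate is given here. The function names are hypothetical.

```python
import math


def p_transfer_ok(p_packet_ok, packets):
    """Chance that every packet of one transfer arrives intact.

    ASSUMPTION: packets are damaged independently of one another.
    """
    return p_packet_ok ** packets


def expected_attempts(p_packet_ok, packets):
    """Average number of whole-transfer attempts until one succeeds."""
    p = p_transfer_ok(p_packet_ok, packets)
    return math.inf if p == 0 else 1.0 / p


if __name__ == "__main__":
    p_packet_ok = 0.5   # illustrative only, not a measured value
    for packets in (1, 6):
        print(packets,
              p_transfer_ok(p_packet_ok, packets),
              expected_attempts(p_packet_ok, packets))
```

With that illustrative rate, a one-packet transfer succeeds half the time,
while a six-packet transfer succeeds about 1.6% of the time and needs 64
attempts on average, which would look like a hung server to the client.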

Get your repeaters replaced.

	Hans van Staveren
	Vrije Universiteit
	Amsterdam, Holland

joe@uunet.uu.net (Joe Michel-Angelo) (05/05/89)

| We've been around and around with Sun on this; I have yet to talk to
| someone who struck me as halfway competent.  I've been dealing with Suns
| since '82 and am no novice at this stuff, but I can't seem to convince Sun
| of that (I get bullsh*t suggestions like "are your cables making good
| contact").

but are your cables making good contact? whenever nd or nfs fails but the
network looks good to test hardware, it's because of the following:

	- thin/thick net segment bad
	- bad bnc, barrel, or t connector
	- bad vamp connection

	- bad drop cable
	- bad drop cable connector/connection
	- drop cable too long ("they" say 50 feet is max; i say it's 15 or 25 feet)

	- giant network!  (the length of every segment adds to the total length)

	- too many neighboring repeaters (you should never have more than 2 repeaters
				next to each other)

in bizarre cases:

	- isolan repeater power source in 220 and not 110 volts position
		(repeater works but only for small packets)
		(hey-- don't try this ....)

	- ethernet segment ground problem ... ie: grounded to an
		isolated ground ups and not mother earth.

	- xcvrs set with ENCODER on ... ENCODER option should be disabled.


-- 
"The Network         Joe Angelo,  VP/Technical Support - Support Group Division
 Administrator       Teknekron Software Systems, Palo Alto, CA     415-325-1025
 Is the Computer"    
                     joe@tss.com - uunet!tekbspa!joe - tekbspa!joe@uunet.uu.net