[comp.dcom.lans] SUN 3.4 problems

jmr@philabs.Philips.Com (Joanne Mannarino) (08/26/87)

In trying to upgrade our SUN 3/180 fileserver (named condor) to SUN UNIX
version 3.4 along with 11 diskless clients, I ran into some problems.  The
upgrade procedure on condor went fine.  I reconfigured the kernel for 3.4,
rebooted condor and still no problem.  Then I tried booting up all of the
diskless clients (one at a time) and then the headaches began.  

The booting process began with "requesting internet address" with the host
responding with the correct information (thus there is communication via our
Ethernet).  The problem began when the booting process got to the point for:

	starting rpc and net services: portmap router biod

The boot process then halts with the following error messages:

	server not responding
	RPC: program not registered
	mount retrying
		/usr
		/usr/condor

The client remains at this point until you either manually abort or power
down the unit.  Meanwhile, any active workstation on the network (i.e., SUNs
either connected to our other fileserver, which still runs 3.2, or
diskful SUNs sitting on the net) displays a screenful of "ie0: no carrier"
and "Ethernet jammed" error messages.

I contacted SUN support immediately.  After tests to see whether all of
the daemons that should be running were running, SUN concluded that
the problem is somewhere within our Ethernet structure.  SUN said
that 3.4 includes major changes in the Ethernet drivers that don't correct
for possible problems in the network.  At this point SUN support referred me
to someone in their Data Communications support department.  After running
some net stats and sending them the data, I was told "your network looks ok".
BUT still we are having problems.

We've tried some different things to see if we could isolate the problem
(actually this was done before the fileserver upgrade, but we wrote it off as
being a network problem isolated to a particular laboratory).  We tried
running a diskful 3/160 as a server for a diskless 3/160 both running 3.4 and
we ran across the same problems.  It was suggested that we take both units
off of our main net and hook them up directly to their own mini net.  When
this was done, the problem went away, i.e., the client came up running 3.4.

We have also tried changing the /etc/fstab on a client and "backgrounding"
the mount process.  This results in the client coming up in single user
mode.  Then after trying to manually mount a filesystem, I get the above
errors of "server not responding" and "mount retrying".

As an interim solution, we have kept the 3.4 enhancements (I didn't back out
of the upgrade) and are running a 3.2 kernel.  Everything seems fine, but this
still doesn't solve our problem.  

Some SUN reps claim that the problem is definitely with our network, others
say it's in the 3.4 software.  Anyone else experienced these symptoms when
upgrading to 3.4?  Any suggestions on what we should do now?

thanks in advance,
Joanne Mannarino


-- 
joanne mannarino				   seismo!philabs!jmr   
philips laboratories 				           or
(914)945-6008					 jmr@philabs.philips.com

earle@jplopto.uucp (Greg Earle) (08/27/87)

Some of the files that were supposed to be on the 3.4 upgrade tapes didn't
make it.  There is a tar file of fixes, available via anonymous FTP from
host sesun.JPL.NASA.GOV [128.149.4.18], in pub/3.4-fix.tar.  Here is the
README file that accompanies it (3.4-fix.README) :

--------------------------------------------------------------------------

3.4-fix.tar:

This distribution contains a number of files that were inadvertently
omitted from the SunOS 3.4 distribution tape.  All the binaries
(except in.rwhod) are ones received directly from Sun and will be
incorporated in the next release.  This should be considered as a
MANDATORY patch to SunOS 3.4 systems.  The problems that appear are
related to RPC broadcasts.  In particular, ypbind will fail if 
the machine is on a subnet.

	David Robinson
	MS 168-522
	JPL
	4800 Oak Grove Drive, Pasadena CA 91109
	(818) 354-3595 (Office)

Contents:
./etc/ypbind
./etc/umount
./etc/in.routed
./usr/ucb/rup
./usr/ucb/rusers
./usr/etc/in.rwhod

Install by:
# /etc/halt
> b vmunix -s
# mount /usr
# cd /
# tar xvpf 3.4-fix.tar
# /etc/reboot

----------------------------------------------------------------------

David Robinson of Caltech discovered the problem and, in talking with Sun,
learned that the corrected versions accidentally didn't get onto the
3.4 upgrade tapes.

Since the network was OK before, it's doubtful that it is bad; however the
`ie0: no carrier' messages *are* puzzling since they normally indicate a
tap fallen off, or a bad solder connection in a cable connector (if not
using vampire taps), etc.

Are you running subnets?

	- Greg Earle
	  Currently moonlighting for Sun Consulting

	Greg Earle		earle@jplopto.JPL.NASA.GOV
	Sun Consulting		earle%jplopto@jpl-elroy.ARPA	[aka:]
	(Freelance -		earle%jplopto@elroy.JPL.NASA.GOV
	    write me)		...!cit-vax!elroy!smeagol!jplopto!earle

earle@jplopto.uucp (Greg Earle) (08/27/87)

Joanne,
	Are you running YP?  Your boots are dying when the client tries to do NFS
mounts; for this to work there has to be an entry for the rpc.mountd
daemon in /etc/servers, so the program can be registered.  Trouble is,
when you install diskless clients, either /etc/servers or /etc/services
(or both) *do not get installed* on diskless clients.  If the missing file
is /etc/servers and (I think) you do not run Yellow Pages, then the
portmapper will not be able to get the entry for the server, and it will
emit the messages you describe.
	I suggest you halt all your diskless clients, retry the 3.4 kernel on
`condor', then successively mount /dev/ndl[0-9?] onto /mnt, and do a
cp /etc/servers /mnt/etc/servers (and do /etc/services just to make sure).
Then unmount /mnt.  After you're all done, reboot the server with the 3.4
kernel, then try the client reboots again.  See what happens.

Just a thought,

	- Greg


hedrick@topaz.rutgers.edu (Charles Hedrick) (08/27/87)

I claim no expertise in SunOS 3.4.  We are using 3.2 with
locally-added networking enhancements that put it somewhere between
3.3 and 3.4 in terms of functionality.  However from your results, it
sounds like Sun's diagnosis is right.  The fact that your hosts all
get "ie0: no carrier" or "Ethernet jammed" strongly indicates a
broadcast storm.  The fact that things work when you use a separate
Ethernet suggests that there is no error in your software or setup.
However it's not quite right to say that the problem is with your
"network".  The problem is not with the network itself, but with the
hosts on that network.  If all of the hosts on it are Suns, then Sun
can't entirely avoid blame.  3.4 is based on 4.3BSD's version of IP.
3.2 is based on 4.2BSD's version of IP.  Between 4.2 and 4.3, the
broadcast address was changed.  (The people who changed the standard
should be shot.  The amount of damage done to networks and the
reputation of IP due to inconsistent broadcast addresses is enormous.
By the way, this is not Berkeley's fault.  The standard actually
changed.)  Unfortunately, there are various bugs in 4.2 (and
presumably Sun 3.2), such that any disagreement over the broadcast
address can cause such a flurry of ICMP unreachables and ARP's that
the network becomes unusable.  The solution is going to depend upon
the particular set of machines on your network.  You have two choices:
find some broadcast address on which everyone can agree, or split the
network.  4.3-based systems allow you to set the broadcast address.
So do some 4.2-based systems that contain "4.3 enhancements".  This
includes Ultrix and Pyramid.  Unmodified 4.2 systems use net.0 as the
broadcast address.  E.g.  if your network number is 128.6, your
broadcast address is 128.6.0.0.  The new standard allows either
128.6.255.255 or 255.255.255.255.  If you are using subnets, things
get more complex.  4.2 didn't support subnets, but if you patched your
4.2 to do so, you will probably have ended up with a broadcast address
of net.subnet.0.  E.g. for us a typical one would be 128.6.4.0.  The
new standard, and 4.3, say that the correct broadcast address for a
subnetted network is 128.6.4.255.
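
To make the arithmetic above concrete, here is a minimal Python sketch (not
part of the original post) that derives both candidate broadcast addresses
from a host address and a netmask.  The function name and the sample host
128.6.4.7 are made up for illustration; only the 128.6 network numbers come
from the text above.

```python
import ipaddress


def broadcast_candidates(host: str, mask: str) -> dict:
    """Return the old-style and new-style broadcast addresses for a host.

    "old" has the host part all zeros (the 4.2BSD convention described above);
    "new" has the host part all ones (the 4.3BSD convention).
    """
    # strict=False lets us pass a host address rather than a network address.
    net = ipaddress.IPv4Network(f"{host}/{mask}", strict=False)
    return {
        "old": str(net.network_address),
        "new": str(net.broadcast_address),
    }


# Hypothetical host on network 128.6, first without and then with subnetting.
print(broadcast_candidates("128.6.4.7", "255.255.0.0"))
print(broadcast_candidates("128.6.4.7", "255.255.255.0"))
```

With the class B mask this gives 128.6.0.0 and 128.6.255.255; with an 8-bit
subnet mask it gives 128.6.4.0 and 128.6.4.255, matching the examples above.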

One approach would be to tell your 4.3-based systems (i.e. your 
Sun 3.4 systems) to use the old broadcast address.  There should be
an option to ifconfig to do this.  What bothers me is that this
option may not take effect during the early stages of booting.
However the simplest thing to try would be to change the ifconfig
commands, normally present in /etc/rc or /etc/rc.boot, to contain
the appropriate option.  Assuming you don't use subnets, this would
be something like
  ifconfig ie0 `/bin/hostname` up -trailers broadcast 128.6.0.0
Everything up to "broadcast" should be whatever your ifconfig command
is now.  It may be that the option is -broadcast.  You should use your
own net number in place of 128.6.0.0.  You must make this change to
/etc/rc.boot for every individual client partition.  This means you'll
have to bring up the clients one by one single-user or just mount the
partitions on the server, using /dev/ndlx (making sure that the
clients are not running at the time).  You might try this for a few
clients to see whether it fixes your problem, before doing it on
all of them.
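
As a rough illustration of the edit described above, the following Python
sketch appends a broadcast option to every ifconfig line in the text of an
rc file.  The helper name is hypothetical, the sample line is the one quoted
above, and whether your ifconfig spells the option "broadcast" or
"-broadcast" is an assumption you would need to check locally.

```python
def add_broadcast(rc_text: str, address: str, option: str = "broadcast") -> str:
    """Append '<option> <address>' to each ifconfig line that lacks the option."""
    out = []
    for line in rc_text.splitlines():
        words = line.split()
        # Match both "ifconfig ..." and a full path such as "/etc/ifconfig ...".
        if words and words[0].endswith("ifconfig") and option not in words:
            line = f"{line.rstrip()} {option} {address}"
        out.append(line)
    return "\n".join(out) + "\n"


# Sample rc.boot fragment; the second line is only there to show that
# non-ifconfig lines pass through untouched.
sample = "ifconfig ie0 `/bin/hostname` up -trailers\ndate\n"
print(add_broadcast(sample, "128.6.0.0"), end="")
```

Running the helper a second time on its own output leaves the text unchanged,
so it should be safe to apply to a client partition that was already edited.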

In retrospect, Sun would probably have been better off distributing
3.4 with the old broadcast address as a default.  Once everyone had
upgraded to 3.4, the next release could safely move to the new
address, since 3.4 should (if it is properly implemented) accept
either.  At the very least the setup program should provide this as an
option.  (Of course I haven't seen 3.4 yet -- maybe it does.)

Other approaches to this problem are to fix all your existing systems
to accept the new address (which may be the best solution if you
have source to them -- we can give you the changes), or to put a
gateway between your 3.4 systems and everything else.  If you don't
have any other kind of gateway, you could add a second Ethernet board
to one of your servers and use it as a gateway.

Finally, if all of your systems are Suns, the simplest thing to do is
simply to upgrade them all at once.  Bring them all down, and then
bring them up one by one on 3.4.

david@elroy.Jpl.Nasa.Gov (David Robinson) (08/28/87)

It has been noted before by someone that SunOS 3.4 does
not check whether the value returned for an ICMP mask
request is valid.  Supposedly Wollongong Win 3.0 for
VMS returns a subnet mask of 0x0000FFFF, which then
causes the Suns to go into a broadcast storm.  You then must
disconnect the VMS Vaxen and reboot all Suns.
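
For illustration only, here is a small Python sketch of the kind of sanity
check a host could apply to a mask it receives.  It assumes that only
contiguous masks (a run of 1 bits followed by 0 bits) are acceptable, which
is a simplifying assumption of this sketch and not a description of what
SunOS does or should do.  The function name is made up.

```python
def is_valid_netmask(mask: int) -> bool:
    """True if the 32-bit mask is a non-empty run of 1 bits followed only by 0 bits."""
    if not 0 < mask <= 0xFFFFFFFF:
        return False
    host_bits = ~mask & 0xFFFFFFFF
    # For a contiguous mask, host_bits has the form 2**k - 1, so adding 1
    # gives a power of two that shares no bits with host_bits.
    return (host_bits & (host_bits + 1)) == 0


print(is_valid_netmask(0xFFFF0000))  # an ordinary class B mask
print(is_valid_netmask(0x0000FFFF))  # the byte-swapped value mentioned above
```

Under that assumption 0xFFFF0000 is accepted and 0x0000FFFF is rejected.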

-- 
	David Robinson		elroy!david@csvax.caltech.edu     ARPA
				david@elroy.jpl.nasa.gov (new)
				seismo!cit-vax!elroy!david UUCP
Disclaimer: No one listens to me anyway!

cyrus@hi.UUCP (Tait Cyrus) (08/29/87)

In article <1615@briar.Philips.Com> jmr@philabs.Philips.Com (Joanne Mannarino) writes:
>
>
>In trying to upgrade our SUN 3/180 fileserver (named condor) to SUN UNIX
>version 3.4 along with 11 diskless clients, I ran into some problems.  The
> ...
>Ethernet).  The problem began when the booting process got to the point for:
>
>	starting rpc and net services: portmap router biod
>
>The boot process then halts with the following error messages:
>
>	server not responding
>	RPC: program not registered
>	mount retrying

Something that ?might? be causing some problems is that when a SUN client
boots, it determines its netmask from an ICMP (RFC 950) 'netmask request'.

At one point, here at the University of New Mexico, one of our SUNs
was configured with the WRONG netmask.  When we tried to boot our
diskless clients, they got the wrong netmask and were unable to talk
with the server.  Rebooting the server did not fix the problem because
it got the netmask from the partially booted clients.  We ended up
halting ALL of our SUNs, booting the servers with the correct netmask,
and THEN booting our clients.
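
Here is a minimal Python sketch of one way a wrong netmask can cut a client
off from its server: the client decides from its own address and mask
whether the server is on the local wire.  The addresses, masks, and function
name are all hypothetical; they are not the values from our network.

```python
import ipaddress


def same_network(local: str, remote: str, mask: str) -> bool:
    """Would a host at `local` using `mask` treat `remote` as directly reachable?"""
    net = ipaddress.IPv4Network(f"{local}/{mask}", strict=False)
    return ipaddress.IPv4Address(remote) in net


client, server = "172.16.4.7", "172.16.5.1"       # hypothetical addresses
print(same_network(client, server, "255.255.0.0"))    # intended mask
print(same_network(client, server, "255.255.255.0"))  # wrong mask handed out
```

With the intended mask the client sees the server as local; with the wrong
one it does not, and would try to reach it some other way.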

I don't know if this is your problem or not, but you might look into it.

-- 
    @__________@    W. Tait Cyrus   (505) 277-0806
   /|         /|    University of New Mexico
  / |        / |    Dept of EECE - Hypercube Project
 @__|_______@  |    Albuquerque, New Mexico 87131
 |  |       |  |
 |  |  hc   |  |    e-mail:
 |  @.......|..@       cyrus@hc.dspo.gov or
 | /        | /        seismo!unmvax!hi!cyrus
 @/_________@/