jmr@philabs.Philips.Com (Joanne Mannarino) (08/26/87)
In trying to upgrade our SUN 3/180 fileserver (named condor) to SUN UNIX
version 3.4 along with 11 diskless clients, I ran into some problems. The
upgrade procedure on condor went fine. I reconfigured the kernel for 3.4,
rebooted condor and still no problem. Then I tried booting up all of the
diskless clients (one at a time) and then the headaches began.

The booting process began with "requesting internet address" with the host
responding with the correct information (thus there is communication via our
Ethernet). The problem began when the booting process got to the point for:

    starting rpc and net services: portmap router biod

The boot process then halts with the following error messages:

    server not responding
    RPC: program not registered
    mount retrying
    /usr
    /usr/condor

This will remain at this point until you either manually abort or power down
the unit. At this point, any active workstation on the network (i.e., SUNs
either connected to our other fileserver (which still runs 3.2) or diskful
SUNs sitting on the net) displays a screenful of "ie0: no carrier" and
"Ethernet jammed" error messages.

I contacted SUN support immediately and after running tests to see if all of
the daemons that should be running were running, the conclusion was made by
SUN that the problem is somewhere within our Ethernet structure. SUN said
that 3.4 includes major changes in the Ethernet drivers that don't correct
for possible problems in the network. At this point SUN support referred me
to someone in their Data Communications support department. After running
some net stats and sending them the data, I was told "your network looks ok".
BUT still we are having problems.

We've tried some different things to see if we could isolate the problem
(actually this was done before the fileserver upgrade, but we wrote it off
as being a network problem isolated to a particular laboratory).

We tried running a diskful 3/160 as a server for a diskless 3/160, both
running 3.4, and we ran across the same problems. It was suggested that we
take both units off of our main net and hook them up directly to their own
mini net. When this was done, the problem went away, i.e., the client came
up running 3.4.

We have also tried changing the /etc/fstab on a client and "backgrounding"
the mount process. This results in the client coming up in single user mode.
Then after trying to manually mount a filesystem, I get the above errors of
"server not responding" and "mount retrying".

As an interim solution, we have kept the 3.4 enhancements (I didn't back out
of the upgrade) and are running a 3.2 kernel. Everything seems fine, but
this still doesn't solve our problem. Some SUN reps claim that the problem
is definitely with our network, others say it's in the 3.4 software.

Anyone else experienced these symptoms when upgrading to 3.4? Any
suggestions on what we should do now?

thanks in advance,
Joanne Mannarino
-- 
joanne mannarino            seismo!philabs!jmr
philips laboratories             or
(914)945-6008               jmr@philabs.philips.com
earle@jplopto.uucp (Greg Earle) (08/27/87)
Some of the files that were supposed to be on the 3.4 upgrade tapes didn't
make it. There is a tar file of fixes, available via anonymous FTP from
host sesun.JPL.NASA.GOV [128.149.4.18], in pub/3.4-fix.tar. Here is the
README file that accompanies it (3.4-fix.README) :
--------------------------------------------------------------------------
3.4-fix.tar:
This distribution contains a number of files that were inadvertently
omitted from the SunOS 3.4 distribution tape. All the binaries
(except in.rwhod) are ones received directly from Sun and will be
incorporated in the next release. This should be considered as a
MANDATORY patch to SunOS 3.4 systems. The problems that appear are
related to RPC broadcasts. In particular, ypbind will fail if
the machine is on a subnet.
David Robinson
MS 168-522
JPL
4800 Oak Grove Drive, Pasadena CA 91109
(818) 354-3595 (Office)
Contents:
./etc/ypbind
./etc/umount
./etc/in.routed
./usr/ucb/rup
./usr/ucb/rusers
./usr/etc/in.rwhod
Install by:
# /etc/halt
> b vmunix -s
# mount /usr
# cd /
# tar xvpf 3.4-fix.tar
# /etc/reboot
----------------------------------------------------------------------
David Robinson of Caltech discovered the problem and, in talking with Sun,
learned that the corrected versions accidentally didn't get onto the 3.4
upgrade tapes.
Since the network was OK before, it's doubtful that it is bad; however the
`ie0: no carrier' messages *are* puzzling since they normally indicate a
tap fallen off, or a bad solder connection in a cable connector (if not
using vampire taps), etc.
Are you running subnets?
- Greg Earle
Currently moonlighting for Sun Consulting
Greg Earle earle@jplopto.JPL.NASA.GOV
Sun Consulting earle%jplopto@jpl-elroy.ARPA [aka:]
(Freelance - earle%jplopto@elroy.JPL.NASA.GOV
write me) ...!cit-vax!elroy!smeagol!jplopto!earle
earle@jplopto.uucp (Greg Earle) (08/27/87)
Joanne,

Are you running YP? Your boots are dying when it tries to do NFS mounts; in
order to do this there has to be an entry for the rpc.mountd daemon in
/etc/servers, so the program can be registered. Trouble is, when you install
diskless clients, either /etc/servers or /etc/services (or both) *do not get
installed* on diskless clients. If it is /etc/servers and (I think) you do
not run Yellow Pages, then the portmapper will not be able to get the entry
for the server, and it will emit the messages you describe.

I suggest you halt all your diskless clients, retry the 3.4 kernel on
`condor', then successively mount /dev/ndl[0-9?] onto /mnt, and do a

    cp /etc/servers /mnt/etc/servers

(and do /etc/services just to make sure). Then unmount /mnt. After you're
all done, reboot the server with the 3.4 kernel, then try the client reboots
again. See what happens.

Just a thought,
- Greg
Greg Earle earle@jplopto.JPL.NASA.GOV
Sun Consulting earle%jplopto@jpl-elroy.ARPA [aka:]
(Freelance - earle%jplopto@elroy.JPL.NASA.GOV
write me) ...!cit-vax!elroy!smeagol!jplopto!earle
hedrick@topaz.rutgers.edu (Charles Hedrick) (08/27/87)
I claim no expertise in SunOS 3.4. We are using 3.2 with locally-added
networking enhancements that put it somewhere between 3.3 and 3.4 in terms
of functionality. However from your results, it sounds like Sun's diagnosis
is right. The fact that your hosts all get "ie0: no carrier" or "Ethernet
jammed" strongly indicates a broadcast storm. The fact that things work when
you use a separate Ethernet suggests that there is no error in your software
or setup. However it's not quite right to say that the problem is with your
"network". The problem is not with the network itself, but with the hosts on
that network. If all of the hosts on it are Suns, then Sun can't entirely
avoid blame.

3.4 is based on 4.3BSD's version of IP. 3.2 is based on 4.2BSD's version of
IP. Between 4.2 and 4.3, the broadcast address was changed. (The people who
changed the standard should be shot. The amount of damage done to networks
and the reputation of IP due to inconsistent broadcast addresses is
enormous. By the way, this is not Berkeley's fault. The standard actually
changed.) Unfortunately, there are various bugs in 4.2 (and presumably Sun
3.2), such that any disagreement over the broadcast address can cause such a
flurry of ICMP unreachables and ARPs that the network becomes unusable.

The solution is going to depend upon the particular set of machines on your
network. You have two choices: find some broadcast address on which everyone
can agree, or split the network. 4.3-based systems allow you to set the
broadcast address. So do some 4.2-based systems that contain "4.3
enhancements". This includes Ultrix and Pyramid.

Unmodified 4.2 systems use net.0 as the broadcast address. E.g. if your
network number is 128.6, your broadcast address is 128.6.0.0. The new
standard allows either 128.6.255.255 or 255.255.255.255. If you are using
subnets, things get more complex. 4.2 didn't support subnets, but if you
patched your 4.2 to do so, you will probably have ended up with a broadcast
address of net.subnet.0. E.g. for us a typical one would be 128.6.4.0. The
new standard, and 4.3, say that the correct broadcast address for a
subnetted network is 128.6.4.255.

One approach would be to tell your 4.3-based systems (i.e. your Sun 3.4
systems) to use the old broadcast address. There should be an option to
ifconfig to do this. What bothers me is that this option may not take effect
during the early stages of booting. However the simplest thing to try would
be to change the ifconfig commands, normally present in /etc/rc or
/etc/rc.boot, to contain the appropriate option. Assuming you don't use
subnets, this would be something like

    ifconfig ie0 `/bin/hostname` up -trailers broadcast 128.6.0.0

Everything up to "broadcast" should be whatever your ifconfig command is
now. It may be that the option is -broadcast. You should use your own net
number in place of 128.6.0.0. You must make this change to /etc/rc.boot for
every individual client partition. This means you'll have to bring up the
clients one by one single-user or just mount the partitions on the server,
using /dev/ndlx (making sure that the clients are not running at the time).
You might try this for a few clients to see whether it fixes your problem,
before doing it on all of them.

In retrospect, Sun would probably have been better off distributing 3.4 with
the old broadcast address as a default. Once everyone had upgraded to 3.4,
the next release could safely move to the new address, since 3.4 should (if
it is properly implemented) accept either. At the very least the setup
program should provide this as an option. (Of course I haven't seen 3.4 yet
-- maybe it does.)

Other approaches to this problem are to fix all your existing systems to
accept the new address (which may be the best solution if you have source to
them -- we can give you the changes), or to put a gateway between your 3.4
systems and everything else. If you don't have any other kind of gateway,
you could add a second Ethernet board to one of your servers and use it as a
gateway.

Finally, if all of your systems are Suns, the simplest thing to do is simply
to upgrade them all at once. Bring them all down, and then bring them up one
by one on 3.4.
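
[A small sketch of the broadcast-address arithmetic described in the post above, for anyone who wants to work out the old and new addresses for their own net. It is not from the original poster; the function name and dictionary keys are made up for illustration, and it uses Python's standard ipaddress module only as a calculator. The example networks are the 128.6 ones used in the post.]

```python
import ipaddress


def broadcast_candidates(network: str) -> dict:
    """Return the broadcast addresses the post describes for one network.

    `network` is "address/netmask", e.g. "128.6.0.0/255.255.0.0".
    The key names are illustrative, not taken from any Sun or BSD tool.
    """
    net = ipaddress.ip_network(network, strict=True)
    return {
        # Unmodified 4.2BSD convention: host part all zeros (net.0).
        "old_style": str(net.network_address),
        # Newer convention (4.3BSD): host part all ones.
        "new_style": str(net.broadcast_address),
        # The newer standard also allows the all-ones address.
        "limited": "255.255.255.255",
    }


if __name__ == "__main__":
    # Unsubnetted class B network from the post.
    print(broadcast_candidates("128.6.0.0/255.255.0.0"))
    # Subnetted case from the post (one subnet of 128.6).
    print(broadcast_candidates("128.6.4.0/255.255.255.0"))
```

[Whichever address you settle on, the point of the post stands: every host on the wire has to agree on it.]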
david@elroy.Jpl.Nasa.Gov (David Robinson) (08/28/87)
It has been noted before by someone that SunOS 3.4 does not check to see if
the value returned for the ICMP mask request is valid. Supposedly Wollongong
Win 3.0 for VMS returns a subnet mask of 0x0000FFFF, which then causes the
Suns to go into a broadcast storm. You then must disconnect the VMS Vaxen
and reboot all Suns.
-- 
David Robinson      elroy!david@csvax.caltech.edu   ARPA
                    david@elroy.jpl.nasa.gov        (new)
                    seismo!cit-vax!elroy!david      UUCP
Disclaimer: No one listens to me anyway!
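
[To illustrate the kind of sanity check the post says is missing, here is a minimal sketch, not Sun's code. It assumes the conventional rule that a netmask is a run of one bits followed by a run of zero bits, and it treats the mask as a 32-bit integer in network byte order. The function name is hypothetical.]

```python
def is_contiguous_netmask(mask: int) -> bool:
    """True if `mask` is a non-zero 32-bit value of the form 1...10...0."""
    if not 0 < mask <= 0xFFFFFFFF:
        return False
    # Inverting a well-formed mask gives 0...01...1; adding 1 to that
    # yields a single set bit (or 0), so the AND must come out zero.
    inverted = ~mask & 0xFFFFFFFF
    return (inverted & (inverted + 1)) == 0


if __name__ == "__main__":
    for mask in (0xFFFF0000, 0xFFFFFF00, 0x0000FFFF):
        verdict = "ok" if is_contiguous_netmask(mask) else "rejected"
        print(f"{mask:#010x}: {verdict}")
```

[Under that assumption the 0x0000FFFF value mentioned above would be rejected, while 0xFFFF0000 and 0xFFFFFF00 would pass.]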
cyrus@hi.UUCP (Tait Cyrus) (08/29/87)
In article <1615@briar.Philips.Com> jmr@philabs.Philips.Com (Joanne Mannarino) writes:
>In trying to upgrade our SUN 3/180 fileserver (named condor) to SUN UNIX
>version 3.4 along with 11 diskless clients, I ran into some problems. The
> ...
>Ethernet). The problem began when the booting process got to the point for:
>
>    starting rpc and net services: portmap router biod
>
>The boot process then halts with the following error messages:
>
>    server not responding
>    RPC: program not registered
>    mount retrying

Something that ?might? be causing some problems is that when a SUN client
boots, it determines its netmask from an ICMP (rfc 950) 'netmask request'.
At one point, here at the University of New Mexico, one of our SUNs was
configured with the WRONG netmask. When we tried to boot our diskless
clients, they got the wrong netmask and were unable to talk with the server.
Rebooting the server did not fix the problem because it got the netmask from
the partially booted clients. We ended up halting ALL of our SUNs, booting
the servers with the correct netmask, and THEN booting our clients.

I don't know if this is your problem or not, but you might look into it.
-- 
    @__________@    W. Tait Cyrus   (505) 277-0806
   /|         /|    University of New Mexico
  / |        / |    Dept of EECE - Hypercube Project
 @__|_______@  |    Albuquerque, New Mexico 87131
 |  |       |  |
 |  |  hc   |  |    e-mail:
 |  @.......|..@       cyrus@hc.dspo.gov or
 | /        | /        seismo!unmvax!hi!cyrus
 @/_________@/
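
[A rough sketch of why a wrong netmask can leave a client unable to talk to its server: the client decides which hosts are on its own wire by masking addresses, so a bad mask can make the server look like it is on some other network. The addresses and masks below are invented for illustration (the post does not say what the wrong mask was), and the function name is hypothetical.]

```python
import ipaddress


def considers_local(own_ip: str, own_mask: str, other_ip: str) -> bool:
    """True if a host at own_ip/own_mask would treat other_ip as on its own net."""
    iface = ipaddress.ip_interface(f"{own_ip}/{own_mask}")
    return ipaddress.ip_address(other_ip) in iface.network


if __name__ == "__main__":
    client, server = "128.6.4.20", "128.6.4.1"  # made-up hosts
    # With the intended mask, the client sees the server as local.
    print(considers_local(client, "255.255.255.0", server))
    # With a wrong (too narrow) mask, the server appears to be elsewhere.
    print(considers_local(client, "255.255.255.240", server))
```

[A wrong mask also changes the broadcast address the host computes for itself, which ties this back to the broadcast disagreement discussed earlier in the thread.]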