[comp.protocols.nfs] Weird bug in PC-NFS?

gnb@bby.oz (Gregory N. Bond) (09/26/89)

Here is what seems to be a bug in PC-NFS V3.0, in either the
implementation or the protocol.

Background: 

We have 2 yp servers (melba and sid), with melba as the master.  Melba
(a Sun 3/260 running SunOS 3.5Export) has 2 ethernet cards: ie0 is on
the main Sun net, and ie1 is on a net with about 6 PCs running PC-NFS
V3.0.  Sid is on the main net (ie0).  All PCs use RARP and YP, with only
the YP domain name being configured on the PC (i.e. everything else is
determined dynamically).  Both nets are class C nets, no subnets.

This morning, the main net was ruptured by workmen, and the portmapper
died on melba in the ensuing net.chaos.  We shut melba down to single
user and rebooted.

Problem: 

When the main server was back up, we rebooted the PCs.  All of them
got as far as "NET START RDR" and hung.  Much probing with etherfind
etc. found that all machines were sending UDP packets to melba-gw (the
ie1 interface of melba), port 1027, at about 1 per second.  The sequence
was 4 packets to port 1027, then 1 to port 5379, which was ypbind on
melba.  Port 1027 was NOT registered in the melba portmapper, nor was
it in /etc/services or /etc/servers.

Cause:

During the net.chaos, melba had bound its yp program to sid, even
though melba was the YP master.  In fact, port 1027 was the actual
ypserv port on sid (as shown by rpcinfo -p).  The fix was
simple - use /usr/etc/ypset to bind melba to melba.  The PCs then
continued the boot sequence without any further intervention.

Analysis:

This is my tentative analysis of what happens: During the boot
sequence, the PC broadcasts a request to find a yp server.  The ypbind
on melba replies, saying that the server is on port x, but doesn't
mention that it is on host sid.  The PC then assumes that this means
port x on melba and hangs waiting on YP service.
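
The suspected failure can be sketched as a toy model in Python.  This
is only an illustration of the hypothesis above, not PC-NFS or SunOS
code: the names Binding, YPBIND_STATE, ask_ypbind, target_buggy and
target_correct are all made up for the example, and the model does not
say whether the host is missing from the reply or is present and
ignored by the PC, since the effect is the same either way.

```python
from typing import NamedTuple, Tuple


class Binding(NamedTuple):
    host: str  # host actually running ypserv
    port: int  # UDP port of ypserv on that host


# Assumed state during the outage: the ypbind reachable as melba-gw
# holds a binding to ypserv on sid, port 1027 (values from the post).
YPBIND_STATE = {"melba-gw": Binding(host="sid", port=1027)}


def ask_ypbind(replier: str) -> Binding:
    """Return the binding currently held by the ypbind on `replier`."""
    return YPBIND_STATE[replier]


def target_buggy(replier: str) -> Tuple[str, int]:
    """Suspected behaviour: use the port, assume the replier is the server."""
    binding = ask_ypbind(replier)
    return (replier, binding.port)


def target_correct(replier: str) -> Tuple[str, int]:
    """Expected behaviour: send YP requests to the bound host and port."""
    binding = ask_ypbind(replier)
    return (binding.host, binding.port)


print(target_buggy("melba-gw"))    # ('melba-gw', 1027)
print(target_correct("melba-gw"))  # ('sid', 1027)
```

In this model the buggy client ends up sending to melba-gw port 1027,
where nothing is listening, which matches the retries seen on the wire.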

Here is the output of etherfind during a boot attempt:
                                           icmp type
 lnth proto         source     destination  src port   dst port
   64 icmp            rsm3      0xffffffff      17
   64 icmp            rsm3      0xffffffff      17
   64  arp            rsm3       broadcast 
  100  udp            rsm3        melba-gw       1500     sunrpc
   64  udp            rsm3       broadcast    discard    discard
   92  udp            rsm3        melba-gw       1500       5371
  100  udp            rsm3        melba-gw       1501     sunrpc
   92  udp            rsm3        melba-gw       1501       5379
  124  udp            rsm3        melba-gw       1501       1027
  124  udp            rsm3        melba-gw       1501       1027
  124  udp            rsm3        melba-gw       1501       1027
  124  udp            rsm3        melba-gw       1501       1027
	[ Last 5 lines repeated indefinitely, 1 per sec (approx) ]
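
For anyone wanting to summarise a longer capture, the dump above can
be tallied by destination port with a few lines of Python.  This is a
minimal sketch that assumes the whitespace-separated column layout
shown above; the function name udp_dst_ports is made up for the
example, and the DUMP text is simply the rows from this post.

```python
from collections import Counter

# Rows copied from the etherfind output in this post.
DUMP = """\
   64 icmp            rsm3      0xffffffff      17
   64 icmp            rsm3      0xffffffff      17
   64  arp            rsm3       broadcast
  100  udp            rsm3        melba-gw       1500     sunrpc
   64  udp            rsm3       broadcast    discard    discard
   92  udp            rsm3        melba-gw       1500       5371
  100  udp            rsm3        melba-gw       1501     sunrpc
   92  udp            rsm3        melba-gw       1501       5379
  124  udp            rsm3        melba-gw       1501       1027
  124  udp            rsm3        melba-gw       1501       1027
  124  udp            rsm3        melba-gw       1501       1027
  124  udp            rsm3        melba-gw       1501       1027
"""


def udp_dst_ports(dump: str, dest: str = "melba-gw") -> Counter:
    """Count UDP packets sent to `dest`, keyed by destination port."""
    counts = Counter()
    for line in dump.splitlines():
        fields = line.split()
        # UDP rows have six columns:
        # length, proto, source, destination, src port, dst port
        if len(fields) == 6 and fields[1] == "udp" and fields[3] == dest:
            counts[fields[5]] += 1
    return counts


counts = udp_dst_ports(DUMP)
for port, n in counts.most_common():
    print(port, n)
```

On the rows shown here, port 1027 accounts for four of the eight UDP
packets sent to melba-gw.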

(I have a fuller dump if people are interested).

Can anyone confirm my analysis?  Is the bug on the SunOS side or the
PC side?

Greg.
--
Gregory Bond, Burdett Buckeridge & Young Ltd, Melbourne, Australia
Internet: gnb@melba.bby.oz.au    non-MX: gnb%melba.bby.oz@uunet.uu.net
Uucp: {uunet,pyramid,ubc-cs,ukc,mcvax,prlb2,nttlab...}!munnari!melba.bby.oz!gnb