[comp.sys.sun] NULs in NFS files

mp@allegra.att.com (12/30/88)

We've encountered a rather frustrating problem with NFS files (under SunOS
4.0 and 4.0.1 on Sun-3's) getting corrupted, usually by having short
sequences of NULs appear in them.

So far the only corrupted files we've found are some of the small .o files
generated when a new kernel is made on a diskless client.  The new kernel
is compiled in /var/sys (an NFS filesystem; if /var is mounted on a small
local disk, there's no problem).  /var/sys consists, where possible, of
symbolic links to the original files in /sys.  The kernel I'm comparing
things against is called CLIENT, which is built from the GENERIC config
file with one difference: vmunix is specified as having its default root
and swap on type nfs.  Sometimes a .o file (usually ioconf.o) will have no
namelist, but what usually happens is that 3 or 4 of the files will each
have a streak of a few NULs and the resulting kernel won't behave right.
Some files have problems much more frequently than others: these are
stubs.o, sc_conf.o, in_proto.o, and mcp_conf.o.  Here are some sample
differences:

diff between CLIENT/stubs.o and OMEGA/stubs.o
text	data	bss	dec	hex
32	72	0	104	68		stubs.o
? map
b1 = 0x0	 e1 = 0x20	  f1 = 0x20	   `stubs.o'
b2 = 0x0	 e2 = 0x68	  f2 = 0x20	   `stubs.o'
cmp -l gives
   129 156   0
   130 151   0
   131 164   0
od | diff gives
*** CLIENT	Tue Nov	 8 10:13:44 1988
--- OMEGA	Tue Nov	 8 10:13:44 1988
***************
  0000160  000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000
! 0000200  156 151 164 000 000 000 000 000 000 000 000 042 000 000 006 100
  0000220  000 000 000 004 006 000 000 000 000 000 000 040 000 000 000 014
---
  0000160  000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000
! 0000200  000 000 000 000 000 000 000 000 000 000 000 042 000 000 006 100
  0000220  000 000 000 004 006 000 000 000 000 000 000 040 000 000 000 014
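
A streak like the one at bytes 129-131 can be located without reading od
output by eye.  The following is only a minimal sketch: nul_streaks is a
hypothetical helper, not a Sun tool, and it assumes both copies of the file
fit in memory.  It uses the same 1-based byte numbering as cmp -l.

```python
def nul_streaks(good: bytes, bad: bytes):
    """Return (first_byte, length) runs where `bad` holds NUL but `good` does not.

    Byte numbers are 1-based, matching `cmp -l`.
    """
    streaks = []
    start = None  # 0-based index where the current streak began, if any
    for i, (g, b) in enumerate(zip(good, bad)):
        hit = b == 0 and g != 0
        if hit and start is None:
            start = i
        elif not hit and start is not None:
            streaks.append((start + 1, i - start))
            start = None
    if start is not None:
        # The streak runs to the end of the shorter file.
        streaks.append((start + 1, min(len(good), len(bad)) - start))
    return streaks


# Stand-in data modelled on the stubs.o dump: octal 156 151 164 ("nit")
# at bytes 129-131 of one copy, NULs in the other.
good = bytes(128) + bytes([0o156, 0o151, 0o164]) + bytes(13)
bad = bytes(len(good))
print(nul_streaks(good, bad))
```

On real files the two arguments would be the contents of the two copies of
the object file, read in binary mode.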


diff between CLIENT/ioconf.o and OMEGA/ioconf.o
text	data	bss	dec	hex
0	8400	0	8400	20d0	ioconf.o
? map
b1 = 0x0	 e1 = 0x0	  f1 = 0x20	   `ioconf.o'
b2 = 0x0	 e2 = 0x20d0	  f2 = 0x20	   `ioconf.o'
cmp -l gives
    19   6   0
    20  60   0
    31   6  11
    32 100 140
od | diff gives
*** /tmp/CLIENT	Tue Dec 20 13:00:28 1988
--- /tmp/OMEGA	Tue Dec 20 13:00:19 1988
***************
  0000000  000 002 001 007 000 000 000 000 000 000 040 320 000 000 000 000
! 0000020  000 000 006 060 000 000 000 000 000 000 000 000 000 000 006 100
  0000040  000 000 000 000 000 000 000 000 000 000 000 104 000 000 000 000
---
  0000000  000 002 001 007 000 000 000 000 000 000 040 320 000 000 000 000
! 0000020  000 000 000 000 000 000 000 000 000 000 000 000 000 000 011 140
  0000040  000 000 000 000 000 000 000 000 000 000 000 104 000 000 000 000
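
The first 32 bytes of each dump can be read as an a.out header to see which
fields were hit.  This is a sketch under an assumption: it takes the header
to be the classic big-endian a.out "struct exec" (one flags/toolversion
byte, one machine-type byte, a 16-bit magic number, then seven 32-bit
words).  The helper names parse_od_line and decode_exec are made up for
illustration.

```python
import struct

# Assumed field order of the seven 32-bit words in the a.out exec header.
FIELDS = ("a_text", "a_data", "a_bss", "a_syms",
          "a_entry", "a_trsize", "a_drsize")


def parse_od_line(line):
    """Turn one od line (offset, then octal byte values) into bytes."""
    return bytes(int(tok, 8) for tok in line.split()[1:])


def decode_exec(header):
    """Decode a 32-byte header, assuming the layout described above."""
    _flags, machtype, magic, *words = struct.unpack(">BBH7I", header[:32])
    info = {"machtype": machtype, "magic": oct(magic)}
    info.update(zip(FIELDS, words))
    return info


# The first two od lines of each ioconf.o dump, copied from the diff above.
CLIENT_DUMP = [
    "0000000  000 002 001 007 000 000 000 000 000 000 040 320 000 000 000 000",
    "0000020  000 000 006 060 000 000 000 000 000 000 000 000 000 000 006 100",
]
OMEGA_DUMP = [
    "0000000  000 002 001 007 000 000 000 000 000 000 040 320 000 000 000 000",
    "0000020  000 000 000 000 000 000 000 000 000 000 000 000 000 000 011 140",
]

client = decode_exec(b"".join(parse_od_line(l) for l in CLIENT_DUMP))
omega = decode_exec(b"".join(parse_od_line(l) for l in OMEGA_DUMP))

# Report only the header fields that differ between the two copies.
for name in ("machtype", "magic") + FIELDS:
    if client[name] != omega[name]:
        print(name, client[name], "->", omega[name])
```

Under that assumed layout, a_data decodes to 8400 in both copies, matching
the size output above, while a_syms is 1584 in CLIENT and 0 in OMEGA.  A
zero symbol-table size would be consistent with the "no namelist" symptom.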

Environment: Server is Sun-3/280 with xy451, 2 supereagles.  Client is a
diskless 3/260.  Server has about 3 clients, but problem occurs even when
the other clients are idle.  Both client and server are on DELNI's.  Both
are running SunOS 4.0; bug occurs whether running the out-of-the-box 4.0
kernel, one compiled and linked using the GENERIC config file, or one
containing 4.0.1 fixes related to nfs problems (nfs_vnodeops.o,
nfs_client.o, vm_hat.o, and kudp_fastsend.o with subsequent enabling of
udpcksum).  [Of course, these kernels are being compiled on the server,
not on the client!]  Problem occurs even if a different 3/260 is used.
Problem occurs even if 2 xy451's are used in the server (we and Sun
initially thought it might be the old
xylogics-controller-can't-handle-2-disks bug, especially since I've
arranged the server's filesystems so that /usr is on a disk different from
the clients' root and swap, which hopefully keeps both disks' arms going
simultaneously).  There are no error messages on the consoles.  There is
plenty of free disk space in the client root partition (about 20MB).  The
client mounts its NFS partitions using whatever defaults Sun provides -
the options in the fstab are "rw" for / and "ro" for /usr.

By the way, here's a separate problem that I ran into when investigating
the above problem: when the additional xy451 controller was added I
thought I'd do the clients a favor and not make them reboot.  So rather
than xy1 becoming xy3 (because it would be drive 1 on controller 1) and
invalidating their mounted NFS filesystems, I made a kernel that had xy1
be xyc1 drive 1, and commented out the lines for xy2 and xy3.  Was I
sorry!  The nightly "find" that searches for core files crashed the server
each night!  It seems that just statting /export/root/.../dev/xy2a causes
a kernel mode bus error near specvp().

	Mark Plotnick
	allegra!mp