mp@allegra.att.com (12/30/88)
We've encountered a rather frustrating problem with NFS files (under 4.0 and 4.0.1 on Sun-3's) getting corrupted, usually by having short sequences of NULs appear in them. So far the only corrupted files we've found are some of the small .o files generated when a new kernel is made on a diskless client. The new kernel is compiled in /var/sys (an NFS filesystem; if /var is mounted on a small local disk, there's no problem). /var/sys is comprised of (when possible) symbolic links to the original files in /sys.

The kernel I'm comparing things against is called CLIENT, which is a result of a GENERIC config file with one difference: vmunix is specified as having its default root and swap on type nfs.

Sometimes a .o file (usually ioconf.o) will have no namelist, but what usually happens is that 3 or 4 of the files will each have a streak of a few NULs and the resulting kernel won't behave right. Some files have problems much more frequently than others: these are stubs.o, sc_conf.o, in_proto.o, and mcp_conf.o. Here are some sample differences:

diff between CLIENT/stubs.o and OMEGA/stubs.o

    text    data    bss     dec     hex
    32      72      0       104     68      stubs.o

    ? map
    b1 = 0x0    e1 = 0x20    f1 = 0x20    `stubs.o'
    b2 = 0x0    e2 = 0x68    f2 = 0x20    `stubs.o'

cmp -l gives

    129 156   0
    130 151   0
    131 164   0

od | diff gives

    *** CLIENT  Tue Nov  8 10:13:44 1988
    --- OMEGA   Tue Nov  8 10:13:44 1988
    ***************
      0000160 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000
    ! 0000200 156 151 164 000 000 000 000 000 000 000 000 042 000 000 006 100
      0000220 000 000 000 004 006 000 000 000 000 000 000 040 000 000 000 014
    ---
      0000160 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000
    ! 0000200 000 000 000 000 000 000 000 000 000 000 000 042 000 000 006 100
      0000220 000 000 000 004 006 000 000 000 000 000 000 040 000 000 000 014

diff between CLIENT/ioconf.o and OMEGA/ioconf.o

    text    data    bss     dec     hex
    0       8400    0       8400    20d0    ioconf.o

    ? map
    b1 = 0x0    e1 = 0x0       f1 = 0x20    `ioconf.o'
    b2 = 0x0    e2 = 0x20d0    f2 = 0x20    `ioconf.o'

cmp -l gives

     19   6   0
     20  60   0
     31   6  11
     32 100 140

od | diff gives

    *** /tmp/CLIENT     Tue Dec 20 13:00:28 1988
    --- /tmp/OMEGA      Tue Dec 20 13:00:19 1988
    ***************
      0000000 000 002 001 007 000 000 000 000 000 000 040 320 000 000 000 000
    ! 0000020 000 000 006 060 000 000 000 000 000 000 000 000 000 000 006 100
      0000040 000 000 000 000 000 000 000 000 000 000 000 104 000 000 000 000
    ---
      0000000 000 002 001 007 000 000 000 000 000 000 040 320 000 000 000 000
    ! 0000020 000 000 000 000 000 000 000 000 000 000 000 000 000 000 011 140
      0000040 000 000 000 000 000 000 000 000 000 000 000 104 000 000 000 000

Environment:

Server is Sun-3/280 with xy451, 2 supereagles. Client is a diskless 3/260. Server has about 3 clients, but problem occurs even when the other clients are idle. Both client and server are on DELNI's.

Both are running SunOS 4.0; bug occurs whether running the out-of-the-box 4.0 kernel, one compiled and linked using the GENERIC config file, or one containing 4.0.1 fixes related to nfs problems (nfs_vnodeops.o, nfs_client.o, vm_hat.o, and kudp_fastsend.o with subsequent enabling of udpcksum). [Of course, these kernels are being compiled on the server, not on the client!]

Problem occurs even if a different 3/260 is used. Problem occurs even if 2 xy451's are used in the server (we and Sun initially thought it might be the old xylogics-controller-can't-handle-2-disks bug, especially since I've arranged the server's filesystems so that /usr is on a disk different from the clients' root and swap, which hopefully keeps both disks' arms going simultaneously.)

There are no error messages on the consoles. There is plenty of free disk space in the client root partition (about 20MB). The client mounts its NFS partitions using whatever defaults Sun provides - the options in the fstab are "rw" for / and "ro" for /usr.
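For anyone wanting to repeat the cmp -l comparisons above across a whole build directory, here is a minimal sketch in Python. It is not part of the original report, and the names diff_runs, describe, and compare_files are hypothetical, chosen only for illustration. It groups differing bytes into runs, prints the 1-based byte number that cmp -l would show together with the octal offset that od would show, and flags runs where the suspect file holds only NULs. It compares only the common prefix of the two files and notes any length mismatch separately.

```python
def diff_runs(good, suspect):
    """Return (offset, good_bytes, suspect_bytes) for each maximal run of
    differing bytes. Offsets are 0-based; only the common prefix is compared."""
    runs = []
    start = None
    n = min(len(good), len(suspect))
    for i in range(n):
        if good[i] != suspect[i]:
            if start is None:
                start = i          # a new run of differences begins here
        elif start is not None:
            runs.append((start, good[start:i], suspect[start:i]))
            start = None
    if start is not None:          # run extends to the end of the prefix
        runs.append((start, good[start:n], suspect[start:n]))
    return runs


def describe(run):
    """Format one run using cmp's 1-based byte number and od's octal offset."""
    offset, good, bad = run
    kind = "NULs in suspect" if not any(bad) else "differs"
    return "cmp byte %d (od offset %07o): %d byte(s), %s" % (
        offset + 1, offset, len(good), kind)


def compare_files(good_path, suspect_path):
    """Compare a known-good file with a suspect copy; return report lines."""
    with open(good_path, "rb") as f:
        good = f.read()
    with open(suspect_path, "rb") as f:
        suspect = f.read()
    lines = [describe(r) for r in diff_runs(good, suspect)]
    if len(good) != len(suspect):
        lines.append("lengths differ: %d vs %d bytes" % (len(good), len(suspect)))
    return lines
```

A call such as compare_files("CLIENT/stubs.o", "OMEGA/stubs.o") would return one line per run. Fed the stubs.o bytes shown above, the single run is reported at cmp byte 129, od offset 0000200, three bytes long, all NULs in the suspect copy.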
By the way, here's a separate problem that I ran into when investigating the above problem: when the additional xy451 controller was added I thought I'd do the clients a favor and not make them reboot. So rather than xy1 becoming xy3 (because it would be drive 1 on controller 1) and invalidating their mounted NFS filesystems, I made a kernel that had xy1 be xyc1 drive 1, and commented out the lines for xy2 and xy3. Was I sorry! The nightly "find" that searches for core files crashed the server each night! It seems that just statting /export/root/.../dev/xy2a causes a kernel mode bus error near specvp().

Mark Plotnick
allegra!mp