stern@polygen.UUCP (Hal Stern) (10/09/87)
we have a heterogeneous network of Suns and VaxStations, all with NFS. the interesting part of our network is:

	machine polygen: Sun 3/160, two eagles, running Sun 3.2
	machine figeac:  Vaxstation II GPX, running Ultrix 2.0-1

polygen exports /usr/polygen, and figeac exports /usr/figeac; each mounts the other's disks. figeac's /etc/fstab entry for polygen's disk looks like:

	/usr/polygen@polygen:/usr/polygen:rw:0:0:nfs:bg,soft:

/usr/polygen is the "g" partition of a fujitsu eagle. figeac has three RD-53s.

our local source code control system keeps all sources for a product in a single directory tree, using subdirectories for each specific machine on which it gets compiled. in the case of product XX, the source code lives on the Sun disk, in /usr/polygen/XX.

here's the problem: when the Vaxstation figeac compiles the source for XX, using the files mounted via NFS, various object files get mangled. the errors we usually see are from ar:

	"mangled string table"
	"foo.o: bad format"

i have run nm on the corrupted object files, and have found that many of them have text and bss symbols with no names. the problem is very hard to reproduce, although it appears to happen when we combine many large .o files into a single .a archive. the individual objects appear to compile OK, but when the archiver runs it complains of corrupted files. it appears that both cc and ar are at fault; i have stopped the compilation and inspected .o files and found that they had already been corrupted.

we have temporarily solved the problem by separating the source trees and running the compile locally (it then completes without problem).

does anyone else have this problem? could this be related to an NFS timeout problem? the compile completes OK on local disks but fails on NFS-mounted disks, which would lead me to think it is NFS-related. replies by e-mail will be summarized, with results. thanks.

--hal stern
  polygen corporation
  200 fifth avenue
  waltham, ma 02254
  (617) 890-2888
  {bu-cs, princeton}!polygen!stern
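p.s. a quick way to see the damage is to build the same .o locally and compare it byte-for-byte against the NFS-built copy. cmp(1) does the job; the sketch below is just the same idea spelled out (the program and file names are only for illustration):

	/*
	 * cmpfiles -- minimal sketch of a byte-for-byte file compare,
	 * for checking an NFS-built object against a locally built one.
	 */
	#include <stdio.h>

	int main(int argc, char **argv)
	{
	    FILE *a, *b;
	    int ca, cb;
	    long off = 0;

	    if (argc != 3) {
	        fprintf(stderr, "usage: cmpfiles file1 file2\n");
	        return 2;
	    }
	    if ((a = fopen(argv[1], "rb")) == NULL ||
	        (b = fopen(argv[2], "rb")) == NULL) {
	        perror("fopen");
	        return 2;
	    }
	    for (;;) {
	        ca = getc(a);
	        cb = getc(b);
	        if (ca != cb) {          /* includes one file ending early */
	            printf("files differ at byte %ld\n", off);
	            return 1;
	        }
	        if (ca == EOF) {         /* both ended together: identical */
	            printf("files are identical\n");
	            return 0;
	        }
	        off++;
	    }
	}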
hedrick@topaz.rutgers.edu (Charles Hedrick) (10/14/87)
> does anyone else have this problem? could this be related to an NFS
> timeout problem? the compile completes OK on local disks but fails
> on NFS-mounted disks, which would lead me to think it is NFS-related.

We have only used NFS on Suns and Pyramids so far, so I can't give experiences on Ultrix. But I can comment on whether your problem could be caused by timeouts.

Assuming that NFS has been properly implemented, you can control the result of a timeout by whether the remote filesystem is mounted hard or soft. If it is mounted hard, a timeout simply causes the system to reset some parameters and try again. The program will not proceed until the data transfer has succeeded. So with hard mounts, it should (aside from bugs in NFS) be impossible for network problems to lead to corrupted data.

With soft mounts, at some point NFS will give up and return an error to the program. If all programs were written ideally, the operation you were attempting would print some error message and terminate abnormally. Unfortunately, as we all know, there are Unix programs that do not bother to check for error returns from read and write. So it is quite possible that a program could proceed as if the write had succeeded, and you could end up with corrupted data. (A sketch of what such checking looks like appears at the end of this message.) For this reason, all NFS documentation that I have seen cautions against the use of soft mounts.

In the most recent Sun implementation there is a compromise, the "intr" option. This allows you to ^C out of a failing operation (eventually), but otherwise acts like a normal hard mount (assuming you use "hard" and "intr").

The biggest problem with NFS is that there is no really nice way to make the "right" thing happen all the time. With the intr option, if a server goes down, you can eventually get out of failing programs. But it may take several ^C's and a fair amount of waiting. "df" is particularly irritating. What you'd like is that when a system went down, anything trying to use that file system would magically get aborted in some unambiguous way, but that for transient failures, things would retry until they succeed. With the existing Unix and its utilities, this may be hard to do.

There is one other way to get corrupted data from NFS: through undetected network problems. For performance reasons Sun disables the normal UDP checksumming for NFS packets. (One presumes that DEC has not changed this in Ultrix, though of course they could have.) They depend entirely upon the Ethernet packet checksums. This should be OK. But we once had a bad board in a gateway cause data going through that gateway to be corrupted. These errors were not detected, and we ended up with bad files. This certainly sounds scary, but in fact bad hardware can always corrupt your data. If the same error that happened in the gateway had happened to one of the end systems, then no checksumming would have been able to detect the problem.
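To make the soft-mount failure mode concrete, here is a minimal sketch (my own illustration, not code from any particular utility) of a write loop that does check its return values. On a soft mount a timed-out operation shows up as an error return from write(); a program that ignores it simply leaves a short or garbled file behind:

	/*
	 * writeall -- sketch of a write loop that checks for errors and
	 * short writes.  On a soft NFS mount a timeout surfaces here as
	 * write() failing (e.g. with ETIMEDOUT); code that ignores the
	 * return value never notices.
	 */
	#include <sys/types.h>
	#include <unistd.h>
	#include <errno.h>

	int writeall(int fd, const char *buf, size_t len)
	{
	    ssize_t n;

	    while (len > 0) {
	        n = write(fd, buf, len);
	        if (n < 0) {
	            if (errno == EINTR)   /* interrupted; just retry */
	                continue;
	            return -1;            /* real error: caller must handle */
	        }
	        buf += n;                 /* short write: advance and retry */
	        len -= n;
	    }
	    return 0;
	}

Checking the return value of close() (and fsync(), where available) matters for the same reason; with NFS a write error can be reported as late as close().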