[comp.unix.ultrix] cc, ld, ar over NFS

stern@polygen.UUCP (Hal Stern) (10/09/87)

we have a heterogeneous network of Suns and VaxStations, all with
NFS.  the interesting part of our network is:
	machine polygen, Sun 3/160, two eagles running Sun 3.2
	machine figeac, Vaxstation II GPX, running Ultrix 2.0-1

polygen exports /usr/polygen, figeac exports /usr/figeac.  both
mount the other's disks.  figeac's /etc/fstab entry for polygen's
disk looks like:
	/usr/polygen@polygen:/usr/polygen:rw:0:0:nfs:bg,soft:
/usr/polygen is the "g" partition of a fujitsu eagle.  figeac has 
three RD-53s.
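
for comparison, polygen's side of the cross-mount is an ordinary
SunOS-style fstab line, roughly like the one below (the options shown
are only a guess for illustration; figeac's entry above is the exact
one):
	figeac:/usr/figeac /usr/figeac nfs rw,bg,soft 0 0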

our local source code control system keeps all sources for a product
in a single directory tree, using subdirectories for each specific
machine on which it gets compiled.  in the case of product XX, the
source code lives on the Sun disk, in /usr/polygen/XX.  

here's the problem: when the Vaxstation figeac begins to compile
the source for XX, using the files mounted via NFS, various object
files get mangled.  the errors we usually see are from ar:
	"mangled string table"
	"foo.o: bad format"

i have run nm on the corrupted object files, and have found that
many of them have text and bss symbols with no names.  it is
very hard to reproduce this problem, although it appears to happen
when we combine many large .o files into a single .a archive.
the individual objects appear to compile OK, but when the archiver
runs it complains of corrupted files.  it appears that both cc and
ar are at fault; i have stopped the compilation and inspected .o 
files to find that they have been corrupted.
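
one rough way to narrow down where the damage happens (the path and
file name here are only illustrative): right after cc writes an
object, checksum it from both ends and see whether the client's view
of the file matches what the server has on disk:
	on figeac (the NFS client):   sum /usr/polygen/XX/foo.o
	on polygen (the NFS server):  sum /usr/polygen/XX/foo.o
if the two sums disagree, the corruption is happening somewhere in the
NFS write path rather than in cc or ar themselves.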

we have temporarily solved the problem by separating the source trees
and running the compile locally (and it completes without problem).  

does anyone else have this problem?  could this be related to an NFS 
timeout problem?  the compile completes OK on local disks but fails 
on NFS-mounted disks, which would lead me to think it is NFS-related.

replies by e-mail will be summarized, with results.  thanks.

--hal stern
  polygen corporation
  200 fifth avenue
  waltham, ma  02254
  (617) 890-2888

  {bu-cs, princeton}!polygen!stern

hedrick@topaz.rutgers.edu (Charles Hedrick) (10/14/87)

> does anyone else have this problem?  could this be related to an NFS 
> timeout problem?  the compile completes OK on local disks but fails 
> on NFS-mounted disks, which would lead me to think it is NFS-related.

We have only used NFS on Suns and Pyramids so far, so I can't speak
from experience on Ultrix.  But I can comment on whether your problem
could be caused by timeouts.  Assuming that NFS has been properly
implemented, you can control the results of a timeout by whether the
remote filesystem is mounted hard or soft.  If it is mounted hard, a
timeout simply causes the system to reset some parameters and try
again.  The program will not proceed until the data transfer has
succeeded.  So with hard mounts, it should (aside from bugs in NFS) be
impossible for network problems to lead to corrupted data.  With soft
mounts, at some point NFS will give up and return an error to the
program.  If all programs were written ideally, the operation you were
attempting would print some error message and terminate abnormally.
Unfortunately, as we all know, there are Unix programs that do not
bother to check for error returns from read and write.  So it is quite
possible that a program could proceed as if the write had succeeded,
and you could end up with corrupted data.  For this reason, all NFS
documentation that I have seen cautions against the use of soft
mounts.  In the most recent Sun implementation there is a compromise,
the "intr" option.  This allows you to ^C out of a failing operation
(eventually), but otherwise acts like a normal hard mount (assuming
you use "hard" and "intr").  

The biggest problem with NFS is that there is no really nice way to
make the "right" thing happen all the time.  With the intr option, if a
server goes down, you can eventually get out of failing programs.  But
it may take several ^C's and a fair amount of waiting.  "df" is
particularly irritating.  What you'd like is that when a server went
down, anything trying to use that file system would somehow get
aborted in some unambiguous way, but that transient failures would
simply be retried until they succeed.  With the existing
Unix and its utilities, this may be hard to do.

There is one other way to get corrupted data from NFS: through
undetected network problems.  For performance reasons Sun disables the
normal UDP checksumming for NFS packets.  (One presumes that DEC has
not changed this in Ultrix, though of course they could have.)  They
depend entirely upon the Ethernet packet checksums.  This should be
OK.  But we once had a bad board in a gateway cause data going through
that gateway to be corrupted.  These errors were not detected, and we
ended up with bad files.  This certainly sounds scary, but in fact bad
hardware can always corrupt your data.  If the same error that
happened in the gateway had happened to one of the end systems, then
no checksumming would have been able to detect the problem.