[comp.sys.sgi] nfs failure between Gould NP1 and Personal Iris

mccalpin@masig3.ocean.fsu.edu (John D. McCalpin) (11/27/89)

In article <MIKE.89Nov27094420@cfdl.larc.nasa.gov>
mike@cfdl.larc.nasa.gov (Mike Walker) writes:
>I am having a strange error occur on a NFS mounted partition on our PI.
>First a little info about the machines involved:
	[ ... details deleted ... ]
>Symptoms:
>  - ls works everywhere
>  - cat, grep, etc. (normal file access) works everywhere
>  - echo * fails only on the Gould file-system (error: ``no match'')
>  - find fails only on the Gould file-system
>    (error: ``getwd: read error in ..'')
>  - None of these problems show up on the Sun or the Gould using local,
>    Sun NFS, or Irix NFS file-systems.
>Thanks for any help,
>Mike
>--
>Mike Walker   AS&M Inc/NASA LaRC   (804) 864-2305 

I have had very similar problems trying to get NFS to work between our
PI and our NeXT (which we bought as a cheap file server). NFS works
pretty well between our 3030's and our NeXT, though often when the
NeXT tries to write a file on the IRIS's disk, it ends up with
user-id=-1 and group-id=-1.  I would understand this if a setuid
program were trying to write the file, but it happens when almost any
program writes.... Examples are `cp' and `emacs', both owned by root,
but neither with the setuid or setgid bit set....
--
John D. McCalpin - mccalpin@masig1.ocean.fsu.edu
		   mccalpin@scri1.scri.fsu.edu
		   mccalpin@delocn.udel.edu

mike@cfdl.larc.nasa.gov (Mike Walker) (11/27/89)

I am having a strange error occur on a NFS mounted partition on our PI.
First a little info about the machines involved:

  1) Personal Iris 4D-20 w/ Irix 3.2
  2) Gould NP1 w/ UTX/32 3.1 (BSD w/ SVR3 extensions)
  3) Sun 3/280 w/ Sun Unix 3.4

I have one file system from machines 2 and 3 above mounted on the PI.
Everything seems to work fine with the Sun based fs, but certain
operations fail on the fs mounted off of the Gould. Symptoms:

  - ls works everywhere
  - cat, grep, etc. (normal file access) works everywhere
  - echo * fails only on the Gould file-system (error: ``no match'')
  - find fails only on the Gould file-system
    (error: ``getwd: read error in ..'')
  - None of these problems show up on the Sun or the Gould using local,
    Sun NFS, or Irix NFS file-systems.

I noticed the problem when I tried running the X11 lndir.sh script on
the Iris to set up links to the X distribution mounted from my Gould.
Does anyone have any clues as to what could be wrong here? As I said,
many things work fine (after running lndir on the Gould, I was able to
compile the X library on the Iris without any [NFS related] problems).

Thanks for any help,
Mike
--
Mike Walker   AS&M Inc/NASA LaRC   (804) 864-2305 

brendan@illyria.wpd.sgi.com (Brendan Eich) (12/25/89)

In article <MIKE.89Nov27094420@cfdl.larc.nasa.gov>, mike@cfdl.larc.nasa.gov (Mike Walker) writes:
> I am having a strange error occur on a NFS mounted partition on our PI.
> First a little info about the machines involved:
> 
>   1) Personal Iris 4D-20 w/ Irix 3.2
>   2) Gould NP1 w/ UTX/32 3.1 (BSD w/ SVR3 extensions)
>   3) Sun 3/280 w/ Sun Unix 3.4
> 
> I have one file system from machines 2 and 3 above mounted on the PI.
> Everything seems to work fine with the Sun based fs, but certain
> operations fail on the fs mounted off of the Gould. Symptoms:
> 
>   - ls works everywhere
>   - cat, grep, etc. (normal file access) works everywhere
>   - echo * fails only on the Gould file-system (error: ``no match'')
>   - find fails only on the Gould file-system
>     (error: ``getwd: read error in ..'')
>   - None of these problems show up on the Sun or the Gould using local,
>     Sun NFS, or Irix NFS file-systems.

Mike informed me via private communication that only the C-shell's echo
failed to match * against visible filenames; 'echo *' in the Bourne shell
worked as expected.  This clue, plus Ethernet packet traces captured by
Mike (thanks!), exposed a server bug seen at previous Connectathons (a
Connectathon is an annual NFS interoperation conference thrown by Sun,
attended by most NFS vendors).

Clients may call the NFS readdir remote procedure with an arbitrary byte
count indicating the number of bytes allocated for filesystem-independent
directory entries.  The reference NFS server code uses this byte count to
allocate space for server-dependent directory entries, and calls the local
filesystem to read the directory.  Older reference NFS ports contained
BSD Fast File System (FFS) readdir code that failed with EINVAL if the
requested byte count was less than, or not a multiple of, DIRBLKSIZ.

DIRBLKSIZ is typically 512.  SGI's C-shell, and several other BSD-derived
programs that SGI ships, use a byte count of 512 when they call the BSD
version of readdir(3B).  If the directory is remote, and if its NFS server
is based on an older NFS reference port and has a DIRBLKSIZ of, say, 1024,
the server will reject the client's readdir call with a status code equal
to EINVAL (22).  This is exactly what Mike's Gould server does, so it is
likely that Gould has defined their DIRBLKSIZ to be 1024 (perhaps because
their disks use 1024-byte sectors).
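
The byte-count check described in the two paragraphs above can be sketched
in a few lines of C.  This is only an illustration, not the reference-port
source: the function name old_readdir_count_check is hypothetical, the
server's DIRBLKSIZ is passed in as a parameter so both cases can be tried,
and "not congruent with" is read here as "not a multiple of".

```c
#include <errno.h>
#include <stddef.h>

/*
 * Hypothetical stand-in for the check in older FFS readdir code.
 * Returns 0 if the requested byte count is acceptable, or EINVAL if it
 * is smaller than the directory block size or not a multiple of it.
 */
static int old_readdir_count_check(size_t count, size_t dirblksiz)
{
    if (count < dirblksiz || count % dirblksiz != 0)
        return EINVAL;
    return 0;
}
```

Under these assumptions, a 512-byte request against a server whose
DIRBLKSIZ is 1024 is rejected, while the same request against a server
with a DIRBLKSIZ of 512 is accepted, as is a 4096-byte request against
either.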

Our C-shell, a straight port of 4.3BSD csh, doesn't check for readdir
errors, so the EINVAL causes 'echo *' to silently complete, apparently
successfully, but with "No match".  The Bourne shell uses the AT&T-based
readdir(3C) routine, which asks for 4096 bytes worth of directory entries,
thus avoiding the bug.
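
For comparison, here is a minimal sketch of a directory scan that does
distinguish a readdir error from the end of the directory.  It uses the
POSIX opendir/readdir/closedir interface rather than the 4.3BSD csh code,
and the helper name count_dir_entries is hypothetical.

```c
#include <dirent.h>
#include <errno.h>
#include <stddef.h>

/*
 * Hypothetical helper: count the entries in `path`.
 * Returns 0 and stores the count in *count_out on success; otherwise
 * returns the errno value from opendir() or readdir().
 */
static int count_dir_entries(const char *path, long *count_out)
{
    DIR *dir = opendir(path);
    if (dir == NULL)
        return errno;

    long count = 0;
    int err = 0;
    for (;;) {
        /* readdir() returns NULL both at end of directory and on error,
         * so clear errno first and inspect it afterwards. */
        errno = 0;
        struct dirent *entry = readdir(dir);
        if (entry == NULL) {
            err = errno;    /* 0 means a clean end of directory */
            break;
        }
        count++;
    }

    closedir(dir);
    if (err == 0)
        *count_out = count;
    return err;
}
```

A caller that sees a nonzero return can report the failure instead of
treating the directory as empty, which is the distinction the csh glob
code loses.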

Note that the NFS protocol doesn't define EINVAL as a well-known status
code.  However, the protocol's status codes are defined by enumerating
certain 4.2BSD/SunOS intro(2) error numbers, and none of the Sun NFS
implementations that I've seen check for error numbers outside that
enumeration before sending them.  Almost any server error code can
therefore leak through the protocol.  Our NFS maps unspecified error
numbers such as EINVAL onto the NFSERR_IO status code.  Gould's NFS does
not.
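
The mapping described above might look roughly like the following.  This
is a sketch, not SGI's code: the function name nfs_status_from_errno is
hypothetical, the status values are those listed in the NFS version 2 RFC,
and the argument is assumed to be a 4.2BSD-style error number (so EINVAL
is 22).

```c
/* NFS version 2 status codes; the values mirror 4.2BSD error numbers. */
enum nfsstat {
    NFS_OK = 0,
    NFSERR_PERM = 1,
    NFSERR_NOENT = 2,
    NFSERR_IO = 5,
    NFSERR_NXIO = 6,
    NFSERR_ACCES = 13,
    NFSERR_EXIST = 17,
    NFSERR_NODEV = 19,
    NFSERR_NOTDIR = 20,
    NFSERR_ISDIR = 21,
    NFSERR_FBIG = 27,
    NFSERR_NOSPC = 28,
    NFSERR_ROFS = 30,
    NFSERR_NAMETOOLONG = 63,
    NFSERR_NOTEMPTY = 66,
    NFSERR_DQUOT = 69,
    NFSERR_STALE = 70,
    NFSERR_WFLUSH = 99
};

/*
 * Hypothetical helper: values that are already members of the status
 * enumeration pass through unchanged; anything else (EINVAL = 22, for
 * example) is reported to the client as NFSERR_IO.
 */
static enum nfsstat nfs_status_from_errno(int server_errno)
{
    switch (server_errno) {
    case NFS_OK:
    case NFSERR_PERM:
    case NFSERR_NOENT:
    case NFSERR_IO:
    case NFSERR_NXIO:
    case NFSERR_ACCES:
    case NFSERR_EXIST:
    case NFSERR_NODEV:
    case NFSERR_NOTDIR:
    case NFSERR_ISDIR:
    case NFSERR_FBIG:
    case NFSERR_NOSPC:
    case NFSERR_ROFS:
    case NFSERR_NAMETOOLONG:
    case NFSERR_NOTEMPTY:
    case NFSERR_DQUOT:
    case NFSERR_STALE:
    case NFSERR_WFLUSH:
        return (enum nfsstat)server_errno;
    default:
        return NFSERR_IO;
    }
}
```

With a filter like this on the server side, the client would have seen an
I/O error for the rejected readdir rather than a raw status of 22.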

NFS implementors have always relied on the Sun reference ports of NFS
to 4.3BSD for standardization, lacking a complete spec (the NFS version 2
protocol has an RFC, but it doesn't place any restrictions on readdir's
byte count argument; it doesn't even distinguish between client and server
uses of this number).  The latest reference port (NFSSRC4.0.x) that Sun
has shipped to licensed NFS vendors has fixed BSD FFS readdir to accept
any byte count.  Perhaps Gould has, or will soon have, a version of NFS
based on this release.

Brendan Eich
Silicon Graphics, Inc.
brendan@sgi.com