[comp.unix.wizards] [Jason Venner: nfs that follows through mounted file systems]

merritt@BRL.MIL (Don Merritt) (02/15/89)
>I remember seeing a note from someone that said they had hacked NFS
>so that client nfs requests followed through mounted file systems.
>[ie: just have to mount the root on your file server to have all
>the filesystems available].
>
>Can anyone tell me how...
>
>JASON


That work was done here at BRL by Doug Kingston. Here is Doug's description
of how to do it.

=============================================================================

Date:    Fri, 5 Dec 86 22:12:03 EST
From:    NFS Functionality Enhancement Committee <dpk@brl.arpa>
To:      Sun-Spots@rice.edu, unix-wizards@brl.arpa
Subject: Updated NFS Change to merge filesystems

(This is an updated version of my previous letter.  We think it's finished.)

We are just beginning to use NFS around BRL and I have been amazed
at how little thought seems to have been put into using NFS in
a large collection of large hosts.  Many of our machines have 8 to 16
disk segments mounted, and almost as many physical disks, so there
is little that can be done to lower the number of mounted partitions.
We wish to make every file system available from every system (or a
close approximation of this).  If we were to use the normal SUN NFS
implementation, we would have mount tables with 100 to 200 mounted
filesystems.  This is a nightmare.  I like to sleep, so I have made
the following change to nfs/nfs_server.c and nfs/nfs_vnodeops.c,
both part of the NFS related kernel source.

The effect of this change is to make a tree of mounted local file
systems appear as a single homogeneous file system to remote
systems that mount the root of such a tree.

Mount points are invisibly followed as long as they go to a file
system of the same type (which in this case is local).  The restriction
on the same type of file system is necessary to prevent file system
loops.  When/If more local file system types are supported, the "if"
below would have to be made smarter.  The statfs operation is
somewhat meaningless with this change since it will only return
the stats for the file system you mounted and not any file systems
under it.
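
As a rough illustration of the rule above, here is a minimal user-space
sketch of following mount points only while the file system type stays
the same.  The toy_vnode and toy_vfs structures and the follow_mounts
function are hypothetical stand-ins, not the kernel's struct vnode and
struct vfs.  The integer type field stands in for the vfs_op comparison,
and reference counting and unmount locking are left out entirely.

```c
#include <stddef.h>

/* Toy stand-ins for the kernel structures; every name here is hypothetical. */
struct toy_vfs;

struct toy_vnode {
    struct toy_vfs *vfsp;          /* file system this vnode lives in */
    struct toy_vfs *mounted_here;  /* file system mounted on this vnode, or NULL */
};

struct toy_vfs {
    int type;                      /* stands in for comparing vfs_op pointers */
    struct toy_vnode *root;        /* root vnode of this file system */
};

/*
 * Follow mount points downward for as long as the mounted file system
 * has the same type as the file system holding the covered vnode.
 * The walk stops at the first mount of a different type, so a mount of
 * another kind (for example a remote one) is never crossed.
 */
struct toy_vnode *follow_mounts(struct toy_vnode *vp)
{
    while (vp->mounted_here != NULL &&
           vp->mounted_here->type == vp->vfsp->type) {
        vp = vp->mounted_here->root;
    }
    return vp;
}
```

A lookup that lands on a covered directory would hand back the result of
follow_mounts() instead of the covered vnode itself, which is what makes
the mount point invisible to the client.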

The change to nfs_vnodeops.c is to improve the information content
of the faked-up dev entry in a stat structure of a remote file.
The key problem is that the dev entry is still a short, making it
very hard to make useful dev entries for remote files.  My ad hoc
scheme allows for up to 31 remote mounts (hosts) before things fall
apart.  st_dev should really be at least a long.  Ideally it would
be an object containing an fsid and a machineid.  Maybe on the next
version...
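
To make the arithmetic behind the limit of 31 concrete, here is a small
stand-alone sketch of the packing.  The names brl_fsid and brl_fsid_fits
are hypothetical, and uint16_t is an assumption standing in for the
16-bit short dev; the real change simply masks and ORs va_fsid in place.

```c
#include <stdint.h>

#define FSID_LOW_BITS 11      /* bits kept from the server's own fsid */
#define FSID_LOW_MASK 0x7ff   /* mask covering those low 11 bits */

/*
 * Pack a 16-bit dev value: the low 11 bits come from the fsid reported
 * by the server, and the top 5 bits hold the client's mount number
 * plus one.  (Hypothetical helper; uint16_t is an assumed dev width.)
 */
uint16_t brl_fsid(uint16_t server_fsid, unsigned mntno)
{
    return (uint16_t)((server_fsid & FSID_LOW_MASK) |
                      ((mntno + 1u) << FSID_LOW_BITS));
}

/*
 * Five bits can hold mntno + 1 in the range 1..31, so mount numbers
 * 0..30 fit, which is 31 mounts in all.
 */
int brl_fsid_fits(unsigned mntno)
{
    return mntno + 1u <= 0x1fu;
}
```

With this packing, mount number 31 would need a sixth bit and is lost
when the value is truncated to 16 bits.  Keeping the low 11 bits of the
server's fsid is presumably what lets files from different file systems
under one mount show different dev values.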

The end result of all this is that you can now make all the file
systems on a server system available by simply mounting the root file
system (strictly, the root directory), e.g.
mount -t nfs -o bg,soft host:/ /n/host.

We have chosen to create a directory /n and to make a directory in
it for each system we wish to make available.  We then mount the
root of each system as /n/hostA, /n/hostB, ...

It is quite possible some of you may be able to suggest some
improvements to this implementation, such as ways to make it conditional
or to better handle the statfs data.  For us, this change alone is
a big step forward in making NFS usable in a large cluster of
independent super-mini computers (Vaxen, Goulds, Alliants) as well
as workstations (Irises, Suns).

Comments welcome.

	-Doug-

Encl.  Diff of /sys/nfs/nfs_server.c and nfs_vnodeops.c.   Line numbers are
	from the Gould version of the SUN 3.0 sources; your numbers may vary.

*** /tmp/,RCSt1000202	Mon Jan 26 23:30:03 1987
--- nfs_server.c	Mon Jan 26 23:03:44 1987
***************
*** 282,288 ****
--- 282,306 ----
  		return;
  	}
  
+ #ifdef BRL
  	/*
+ 	 * Handle ".." special case.
+ 	 *    If this vnode is the root of a mounted
+ 	 *    file system, then replace it with the
+ 	 *    vnode which was mounted on so we take the
+ 	 *    .. in the other file system.
+ 	 */
+ 	if (da->da_name[0]=='.' && da->da_name[1]=='.' && da->da_name[2]==0) {
+ 		while (dvp->v_flag & VROOT) {
+ 			vp = dvp->v_vfsp->vfs_vnodecovered;
+ 			VN_HOLD(vp);
+ 			VN_RELE(dvp);
+ 			dvp = vp;
+ 		}
+ 	}
+ #endif BRL
+ 
+ 	/*
  	 * do lookup.
  	 */
  	error = VOP_LOOKUP(dvp, da->da_name, &vp, u.u_cred);
***************
*** 289,294 ****
--- 307,345 ----
  	if (error) {
  		vp = (struct vnode *)0;
  	} else {
+ #ifdef BRL
+ 		register struct vfs *vfsp;
+ 		struct vnode *tvp;
+ 
+ 	        /*
+ 		 * The following allows the exporting of contiguous
+ 		 * collections of local file systems.  -DPK-
+ 		 *
+                  * If this vnode is mounted on, and the mounted VFS
+ 		 * is the same as the current one (local), then we
+ 		 * transparently indirect to the vnode which
+ 		 * is the root of the mounted file system.
+ 		 * Before we do this we must check that an unmount is not
+ 		 * in progress on this vnode. This maintains the fs status
+ 		 * quo while a possibly lengthy unmount is going on.
+ 		 */
+ mloop:
+ 		while ((vfsp = vp->v_vfsmountedhere) &&
+ 			vfsp->vfs_op == vp->v_vfsp->vfs_op) {
+ 			while (vfsp->vfs_flag & VFS_MLOCK) {
+ 				vfsp->vfs_flag |= VFS_MWAIT;
+ 				sleep((caddr_t)vfsp, PVFS);
+ 				goto mloop;
+ 			}
+ 			error = VFS_ROOT(vp->v_vfsmountedhere, &tvp);
+ 			VN_RELE(vp);
+ 			if (error) {
+ 				vp = (struct vnode *)0;
+ 				goto bad;
+ 			}
+ 			vp = tvp;
+ 		}
+ #endif BRL
  		error = VOP_GETATTR(vp, &va, u.u_cred);
  		if (!error) {
  			vattr_to_nattr(&va, &dr->dr_attr);
***************
*** 295,305 ****
  			error = makefh(&dr->dr_fhandle, vp);
  		}
  	}
  	dr->dr_status = puterrno(error);
! 	if (vp) {
  		VN_RELE(vp);
! 	}
! 	VN_RELE(dvp);
  #ifdef NFSDEBUG
  	dprint(nfsdebug, 5, "rfs_lookup: returning %d\n", error);
  #endif
--- 346,357 ----
  			error = makefh(&dr->dr_fhandle, vp);
  		}
  	}
+ bad:
  	dr->dr_status = puterrno(error);
! 	if (vp)
  		VN_RELE(vp);
! 	if (dvp)
! 		VN_RELE(dvp);
  #ifdef NFSDEBUG
  	dprint(nfsdebug, 5, "rfs_lookup: returning %d\n", error);
  #endif
*** /tmp/,RCSt1000210	Mon Jan 26 23:30:18 1987
--- nfs_vnodeops.c	Fri Jan  2 17:51:54 1987
***************
*** 579,585 ****
--- 578,590 ----
  		 */
  		rp = vtor(vp);
  		nattr_to_vattr(&rp->r_nfsattr, vap);
+ #ifdef BRL
+ 		/* a better better kludge ??? */
+ 		vap->va_fsid &= 0x7ff;
+ 		vap->va_fsid |= ((vtomi(vp)->mi_mntno+1)<<11);
+ #else
  		vap->va_fsid = 0xff00 | vtomi(vp)->mi_mntno;
+ #endif BRL
  		if (rp->r_size < vap->va_size) {
  			rp->r_size = vap->va_size;
  		} else if (vap->va_size < rp->r_size) {
***************
*** 600,606 ****
--- 605,617 ----
  			 * an dev from the mount number and an arbitrary major
  			 * number 255.
  			*/
+ #ifdef BRL
+ 			/* a better better kludge ??? */
+ 			vap->va_fsid &= 0x7ff;
+ 			vap->va_fsid |= ((vtomi(vp)->mi_mntno+1)<<11);
+ #else
  			vap->va_fsid = 0xff00 | vtomi(vp)->mi_mntno;
+ #endif BRL
  			if (rp->r_size < vap->va_size) {
  				rp->r_size = vap->va_size;
  			} else if (vap->va_size < rp->r_size) {