[net.sources] RFS design summary

toddb@tekcrl.UUCP (Todd Brunhoff) (01/13/86)
The following is a summary of the implementation details for RFS, a public
domain distributed file system which was posted recently to mod.sources
along with an announcement to net.sources and net.unix-wizards.  This
is being posted at the request of Mike Muuss (a very reasonable request,
indeed) who is the moderator for Unix-Wizards at BRL.

I will be at the Usenix Conference (I am 6'4", medium build, with graying
brown hair) for Wednesday and Thursday attending the Window tutorial and
technical session if anyone has more questions.  I will also be carrying
one copy of an RFS tape for those of you who do not receive mod.sources.
Be sure to bring your own tape: you are responsible for making arrangements
to copy it.  I currently do not have any plans for starting a tape distribution
service or for fixing any major bugs in the current distribution because
of the present demands on my time at work, but I would very much like to
receive all bug fixes so that I may review and redistribute them.

So far there has been no confusion, but I want to emphasize that although
I work at Tektronix, this software has nothing whatsoever to do with an
excellent, but proprietary distributed file system, called DFS, available
on the 6000 series Tektronix workstation.  The work I did on RFS is for
my master's degree at the University of Denver.

I might also note that RFS is not product-quality (grad students are soooo
sloppy, aren't they?).  I believe that it works very well, but neither
Tektronix, the University of Denver, nor I accept any responsibility
for any damage done directly or indirectly by this software.  Read the
disclaimer included in all the source files.  Send no money.  Modify it
to your heart's content (except for the disclaimers).


				Thanks for all the interest,

				Todd Brunhoff
				toddb%crl@tektronix
				tektronix!crl!toddb


Design Goals
------------
There were three:
	1. Very installable on 4.2/4.3 flavor unix.  Small localized
	   changes.  No changes in basic 4.2 design.
	2. Extremely low overhead.  No large code segments inserted into
	   the normal path of 4.2 execution.
	3. Complete transparency.

Installation
------------
Because I was able to meet goals #1 and #2 above, I was able to make
installation in 4.2/4.3 and Pyramid 2.5 unix completely automatic...
just run a shell script, and you end up with complete RFS kernel
sources, either modified where they lie in /usr/sys or kept in a
private directory with the bulk symbolically linked in.

RFS depends heavily on the 4.2 kernel implementation of sockets, and so
is not easily portable to System V.

List of System Calls Which Gain ``Remote'' Capability
-----------------------------------------------------
  One path name     Two path names     File descriptor      Other
  *************     **************     ***************      *****
    access()          link()              close()           fork()
    chdir()           rename()            dup()             vfork()
    chmod()           symlink()           dup2()            exit()
    chown()                               fchmod()
    creat()                               fchown()
    execv()                               fcntl()
    lstat()                               flock()
    mkdir()                               fstat()
    mknod()                               ftruncate()
    open()                                ioctl()
    rmdir()                               lseek()
    stat()                                read()
    readlink()                            readv()
    truncate()                            select()
    unlink()                              write()
    utimes()                              writev()

Profile of Remote Access
------------------------
  Pathname access:
	- Process ``A'' makes a one or two path system call.
	- the normal system call runs, and at some point calls namei()
	  to translate the path into an inode.
	- Namei() discovers that the file is ``remote'' (see Namei Changes
	  below), and returns a NULL to the kernel level system call
	  with u.u_error = EISREMOTE.  As a side effect of determining
	  the file's remoteness, the portion of the path falling after
	  the remote mount point is saved in an mbuf (chain, if necessary).
	  Also, at this point, the process has been marked as having made
	  a remote access, and which system was accessed.
	- The syscall() routine (kernel dispatcher of system calls) sees
	  this and starts up the remote version of the system call.
	- The system call is packaged and sent (see Transport Mechanism)
	  to the server.  The client process is not allowed to be interrupted
	  (but may sleep) until the reply arrives.
	- The return value or error number obtained by the server is
	  returned to the process.
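
  The pathname steps above can be sketched in user space as follows.
  This is only an illustration, not the kernel code: the sk_ names, the
  numeric value of the error code, the fixed ``/host'' mount point and
  the choice to keep the leading ``/'' in the saved remainder are all
  assumptions, and a plain character array stands in for the mbuf.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define SK_EISREMOTE 999        /* hypothetical value, not the real errno */

struct sk_proc {
    int  error;                 /* plays the role of u.u_error */
    int  made_remote_access;    /* process is marked as an RFS user */
    char remainder[128];        /* path after the mount point (an mbuf in RFS) */
};

/* Stand-in for namei(): "/host" is the only remote mount point here.
 * Returns 1 when a local "inode" was found, 0 (think NULL) otherwise. */
static int sk_namei(struct sk_proc *p, const char *path)
{
    static const char mnt[] = "/host/";
    size_t n = sizeof(mnt) - 1;

    assert(p != NULL && path != NULL);
    p->error = 0;
    if (strncmp(path, mnt, n) == 0) {
        /* save what follows the mount point, keeping its leading '/' */
        snprintf(p->remainder, sizeof(p->remainder), "%s", path + n - 1);
        p->made_remote_access = 1;
        p->error = SK_EISREMOTE;
        return 0;
    }
    return 1;
}

/* Stand-in for syscall(): 'L' if the local call ran to completion,
 * 'R' if the remote version would be started, 'E' for any other error. */
static char sk_dispatch(struct sk_proc *p, const char *path)
{
    if (sk_namei(p, path))
        return 'L';
    return p->error == SK_EISREMOTE ? 'R' : 'E';
}
```

  Here sk_dispatch(&p, "/host/etc/passwd") takes the remote branch and
  leaves ``/etc/passwd'' in p.remainder, while ``/etc/passwd'' and plain
  ``/host'' stay local.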

  File descriptor access:
	- A file descriptor returned by a remote version of creat(), open(),
	  or fcntl() is passed to one of the File Descriptor system calls.
	  Since the process is marked as having made a remote access,
	  the remote version of the system call is tried first.
	- Again, the system call is sent to the server.  If it is a write
	  or a writev, the data is sent along with it.  If a read or readv,
	  the data is read along with the reply.  Currently, the entire
	  data associated with the read or write is sent before any other
	  operation to that host can happen.  Typically, it's not bad, but
	  potentially, the connection could become rather slow.

  Other system calls:
	- fork(), vfork(), and exit() are also sent to the server so that
	  it can fork (for the sake of file descriptors and current
	  directories), or throw away state information for that
	  process.

Remote Pathname Syntax
----------------------
  While almost all distributed file systems allow the use of symbolic
  links to relocate a ``remote'' file, the final syntax is important to
  know.  Some implementations of distributed file systems choose to place
  the burden for deciding remoteness of a file on the pathname syntax
  itself.  Some examples are:

	host:/etc/passwd
	//host/etc/passwd
	/../host/etc/passwd

  Others place the burden in attributes associated with a ``mount point''
  similar to the traditional mount(2) system call.  This, of course,
  provides more natural pathnames, and is used by RFS, et al.:

	/net/host/etc/passwd
	/host/etc/passwd

  I believe that special path names, special attributes, or anything
  else other than a plain directory is needed partly because no UNIX
  program should ever find these gateways through normal perusal of a
  file system.  Imagine how long the command ``find / -print'' would take
  if it traversed every remote host as well as itself!

Namei Changes and Mount Points
------------------------------
  RFS uses a plain file as a mount point because
	- the code changes to namei() are simple.
	- no new file type has to be added for UNIX utilities to learn.
	- the test for ``remoteness'' doesn't happen in namei() unless
	  we are considering a plain file with a trailing ``/'', which
	  means that the overhead to processes not using RFS is
	  virtually none!
	- find, et al., will not ``descend'' a mount point which is
	  a plain file.

  The path name ``/host'' still remains a valid local filename, but
  /host/ or anything longer results in a special case, which namei
  labels with the error EISREMOTE.
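
  A minimal sketch of where the remoteness test sits in the per-component
  decision.  The enum names and the is_mount_point flag are hypothetical,
  and the SK_NOTDIR outcome assumes that a stock kernel would treat a
  plain file followed by more path as a not-a-directory error.  The point
  is that directories and final components never reach the test.

```c
#include <assert.h>
#include <stddef.h>

enum sk_ftype  { SK_PLAIN, SK_DIR };
enum sk_result { SK_DONE, SK_CONTINUE, SK_NOTDIR, SK_REMOTE };

/* Decide what to do after looking up one path component.
 * `rest` is the unparsed part of the path that follows the component. */
static enum sk_result sk_after_component(enum sk_ftype type,
                                         int is_mount_point,
                                         const char *rest)
{
    assert(rest != NULL);
    if (rest[0] != '/')
        return SK_DONE;         /* last component: an ordinary local name */
    if (type == SK_DIR)
        return SK_CONTINUE;     /* keep walking; the usual path is untouched */
    /* Plain file followed by '/': only now is the remote test made. */
    return is_mount_point ? SK_REMOTE : SK_NOTDIR;
}
```

  With this shape, ``/host'' alone yields SK_DONE, and ``/host/'' or
  anything longer yields SK_REMOTE.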

  When hosts are mounted (by convention, on /hostname), the mount command,
  /etc/rmtmnt, provides the internet address found in /etc/hosts.

Generic Mount Points
--------------------
  In addition to mounting a specific host, say foovax, on the mount point
  /foovax, you can have a generic mount point, say /net.  Typically, a
  reference to a pathname like ``/net/foovax/etc/passwd'' will cause the
  kernel to wake up the nameserver and pass him ``/foovax'' for
  translation to an internet address.  If ``foovax'' is a valid host name
  in /etc/hosts, the nameserver will pass this to the kernel.  After
  receiving the address to call, the remote access to that machine is
  just the same as with /host.
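
  As an illustration of the generic case, the sketch below splits a path
  that passes through a generic mount point into the host name and the
  part that would be sent to that host.  The function name and its
  interface are assumptions.  Note that it returns the bare host name,
  whereas the kernel hands the nameserver ``/foovax'', and that the
  lookup in /etc/hosts is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Split "<mnt>/<host><rest>" into host and rest.
 * Returns 0 on success, -1 if path does not pass through mnt
 * or the host name is empty or too long for the buffer. */
static int sk_split_generic(const char *path, const char *mnt,
                            char *host, size_t hostsz, const char **rest)
{
    size_t mlen, hlen;
    const char *h, *end;

    assert(path && mnt && host && rest && hostsz > 0);
    mlen = strlen(mnt);
    if (strncmp(path, mnt, mlen) != 0 || path[mlen] != '/')
        return -1;
    h = path + mlen + 1;                /* first character of the host name */
    end = strchr(h, '/');
    hlen = end ? (size_t)(end - h) : strlen(h);
    if (hlen == 0 || hlen >= hostsz)
        return -1;
    memcpy(host, h, hlen);
    host[hlen] = '\0';
    *rest = end ? end : h + hlen;       /* "" when nothing follows the host */
    return 0;
}
```

  For ``/net/foovax/etc/passwd'' with mount point ``/net'' this gives the
  host ``foovax'' and the remainder ``/etc/passwd''.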

Transport Mechanism
-------------------
  On first access, the server host is called using the internet information
  supplied by /etc/rmtmnt or the nameserver, and using a boiled down version
  of the connect() system call.  Hence, RFS depends entirely upon socket
  level IPC communication;  any host you can call with rlogin, you can set
  up to use RFS.
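
  The user-level equivalent of that first call might look like the sketch
  below.  It assumes a TCP stream socket, which the comparison with rlogin
  suggests but which is not stated above.  The port is a parameter because
  the actual RFS port is not given here, and inet_pton() is a modern
  convenience rather than anything RFS itself uses.

```c
#include <assert.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

/* Fill in an internet address from a dotted-quad string and a port.
 * Returns 0 on success, -1 if the string is not a valid IPv4 address. */
int sk_make_addr(const char *dotted, unsigned short port,
                 struct sockaddr_in *out)
{
    assert(dotted != NULL && out != NULL);
    memset(out, 0, sizeof(*out));
    out->sin_family = AF_INET;
    out->sin_port = htons(port);
    return inet_pton(AF_INET, dotted, &out->sin_addr) == 1 ? 0 : -1;
}

/* Call the server: returns a connected descriptor, or -1 on failure. */
int sk_call_server(const struct sockaddr_in *addr)
{
    int s;

    assert(addr != NULL);
    s = socket(AF_INET, SOCK_STREAM, 0);
    if (s < 0)
        return -1;
    if (connect(s, (const struct sockaddr *)addr, sizeof(*addr)) < 0) {
        close(s);
        return -1;
    }
    return s;
}
```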

Permissions Across RFS
----------------------
  Be careful not to confuse the words ``client'' and ``server''.  A
  client is the process making a remote access to machine ``x''.  The
  server is the process running on machine ``x'' which performs the
  system calls on behalf of the client.

  When the server is started (which also acts as the nameserver, by the
  way), it digests /etc/passwd, /etc/group and .rhosts files associated
  with every user.  In addition, whenever a client does an /etc/rmtmnt
  command, that command sends its /etc/passwd file to the server.
  Similarly, if a server receives a call from a client, but has not
  received an /etc/passwd file, then the server calls the client's
  server, and asks for it.

  When a server receives a message from a client process whose uid number
  is ``n'', it consults the client's /etc/passwd file (which it already
  received).  If it finds a matching uid number, then it checks to see if
  the uid name is allowed login privileges in some user's .rhosts file on
  the server's host.  If a user allows it, then the server for that
  process sets the effective user id to that user's uid number (with
  seteuid(2)) and sets the groups associated with that user (with
  setgroups(2)).

  Most of the checking and searching is done when the server first starts up,
  and is kept in LRU lists for fast access.  Mappings from client uid to
  server uid are already done by the time a client makes a connection.

  If more than one user on the server host allows login access to that
  client's user, then the last user in the server's /etc/passwd takes
  precedence.  However, if one of the users on the server host has
  the same uid name as that on the client, that mapping takes precedence
  over all other mappings.  Note that this means the user ``x'' may have
  remote login privileges for users ``y'' and ``z'' on some other host,
  but his access permissions over RFS will be for one or the other, never
  both.
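
  The precedence rule can be sketched as a single pass over the server's
  /etc/passwd entries in file order.  The struct, its ``allows'' flag
  (standing for ``this user's .rhosts admits the client user'') and the
  function name are hypothetical.  The sketch also assumes that the
  same-name user must itself grant access, which the description above
  does not settle.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct sk_user {
    const char *name;   /* login name on the server host */
    int uid;            /* uid number on the server host */
    int allows;         /* nonzero if .rhosts admits the client user */
};

/* Pick the server uid for a client user, or -1 if nobody allows access.
 * pw[] must be in the same order as the server's /etc/passwd. */
static int sk_map_uid(const char *client_name,
                      const struct sk_user *pw, size_t n)
{
    int chosen = -1;
    size_t i;

    assert(client_name != NULL && (pw != NULL || n == 0));
    for (i = 0; i < n; i++) {
        if (!pw[i].allows)
            continue;
        if (strcmp(pw[i].name, client_name) == 0)
            return pw[i].uid;   /* same name beats every other mapping */
        chosen = pw[i].uid;     /* otherwise the last match wins */
    }
    return chosen;
}
```

  With server users y, x and z all allowing client user x, the result is
  x's own uid; with only y and z allowing, it is z's, the later entry.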

  If a user changes his .rhosts file on a server, that change is not
  noticed until the server is restarted.  Fortunately, restarting the
  server is simple.

  If the server host on which the user wanted to change his .rhosts file
  is currently connected to the client, that connection must be severed
  and a new one started.

Changing Directory to a Remote Host
-----------------------------------
  This is done by simply making the current directory inode (u_cdir) for
  the process be the mount point file inode described above.  This
  means that paths beginning with ``/'' are handled normally, but
  relative pathnames ``fail'' immediately with EISREMOTE.
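
  In sketch form, the starting point of a lookup decides the outcome.
  The names below are hypothetical and the struct merely stands in for
  whatever u_cdir refers to:

```c
#include <assert.h>
#include <stddef.h>

enum sk_where { SK_LOCAL_LOOKUP, SK_REMOTE_IMMEDIATE };

struct sk_cwd {
    int is_mount_point;     /* nonzero after a chdir to a remote host */
};

/* Where does the lookup of `path` begin for a process with this cwd? */
static enum sk_where sk_start_lookup(const struct sk_cwd *cwd,
                                     const char *path)
{
    assert(cwd != NULL && path != NULL);
    if (path[0] == '/')
        return SK_LOCAL_LOOKUP;     /* starts at the local root as usual */
    /* Relative path: starts at the current directory inode. */
    return cwd->is_mount_point ? SK_REMOTE_IMMEDIATE : SK_LOCAL_LOOKUP;
}
```

  An absolute path may of course still turn out to be remote later in the
  walk if it crosses a mount point such as /host.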

Speed Improvements
------------------
  Most system calls are synchronous, i.e. information must pass both ways
  or it is important to wait for the outcome.  This means that a remote
  system call cannot go any faster than a full duplex message takes to
  travel from one host to another.  However, some system calls, like
  close(), fork(), vfork(), exit() and sometimes lseek() can be done
  asynchronously, or half-duplex, because the outcome is known ahead of
  time by the client or we just don't care.  This tends to add a
  surprising amount of speed to some programs using RFS, particularly ones
  that use lseek() heavily.
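
  A small sketch of the saving, counting how many requests in a sequence
  force the client to wait for a reply.  The enum, the outcome_known flag
  and the function names are hypothetical; in particular, when an lseek()
  outcome counts as ``known'' is left to the caller, since the conditions
  are not spelled out above.

```c
#include <assert.h>
#include <stddef.h>

enum sk_call { SK_OPEN, SK_READ, SK_WRITE, SK_LSEEK,
               SK_CLOSE, SK_FORK, SK_VFORK, SK_EXIT };

struct sk_req {
    enum sk_call call;
    int outcome_known;      /* only consulted for SK_LSEEK */
};

/* Nonzero if the client must wait for the server's reply. */
static int sk_must_wait(const struct sk_req *r)
{
    assert(r != NULL);
    switch (r->call) {
    case SK_CLOSE: case SK_FORK: case SK_VFORK: case SK_EXIT:
        return 0;                   /* sent half-duplex */
    case SK_LSEEK:
        return !r->outcome_known;   /* "sometimes" asynchronous */
    default:
        return 1;                   /* everything else is synchronous */
    }
}

/* Count the full round trips needed for a sequence of requests. */
static size_t sk_round_trips(const struct sk_req *reqs, size_t n)
{
    size_t i, waits = 0;

    assert(reqs != NULL || n == 0);
    for (i = 0; i < n; i++)
        waits += (size_t)sk_must_wait(&reqs[i]);
    return waits;
}
```

  For open, lseek, read, lseek, read, close with both seeks known in
  advance, only three of the six requests cost a round trip.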

Significant Shortfalls
----------------------
  Ioctl() and select() are not implemented.  Simple tape access (i.e. open,
  read, write, close) will probably work, but that's it.

  Other, less serious bugs, from my perspective, are listed in the
  installation document supplied in mod.sources.  All bugs that I am aware
  of are listed there.