[comp.unix.wizards] seeking information about file system details.

mcgregor@hemlock.Atherton.COM (Scott McGregor) (08/11/90)

I am curious as to how accesses to multiple concurrent types of file
systems are implemented under SysV and BSD.  I've read the Bach
book and the BSD book, as well as books on device drivers.  But what
I am more interested in is at the switchable file system layer.  Explanations
of what goes on in the following example are welcome, recommendations
of books that cover this stuff are even more heartily desired.

I know that mount takes a file type. I know how the file system
tree is traced using inodes, and I have heard of vnodes (which is
what I think I want  to know more about) but I haven't found anything
written on vnodes.  I am aware that several vendors support multiple
varieties of file systems (Bell, BSD, NFS, MS-DOS) all accessible on
the same system, and the files on them can be accessed using standard
o/s calls and stdio library routines.

I guess what I am interested in is if I have a non-unix file system
and I want to allow this file system to be mounted as a unix
file system, and accessed using open, creat, read, write, close, et al,
what interface translators would I have to create between my own
file system and the unix system calls, and how and where would I
typically add these translators.  I presume that a new type
would have to be tested for in mount, and that the open, etc. commands
would have to know what type of file system they were operating on
and that a case statement switch would have to be supported by them for each
new type of file system to be supported.  

Is this correct, or is there some other way that the switching  between
file systems is handled?

Mail can be directed to: 

mcgregor@atherton.com or ...!sun!atherton!mcgregor

Scott McGregor
Atherton Technology

chris@mimsy.umd.edu (Chris Torek) (08/13/90)

In article <28595@athertn.Atherton.COM> mcgregor@hemlock.Atherton.COM
(Scott McGregor) writes:
>I am curious as to how accesses to multiple concurrent types of file
>systems are implemented under SysV and BSD. ... I have heard of vnodes
>(which is what I think I want to know more about) but I haven't found
>anything written on vnodes.

Very little *is* written about them.

The first thing to know is that everyone does it differently.

SysV (through R3 at least) uses the File System Switch.  Each `file'
(represented as an inode? I have never seen the code) has a file system
ID attached, and instead of doing something like

	error = inode_read(struct inode *ip, parameters);

one does something like

	error = (*fssw[ip->i_fstype].fs_read)(ip, parameters);
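
A minimal sketch of that table-driven dispatch follows.  It is not the
SysV source (which, as noted, I have not seen); only fssw, i_fstype,
and fs_read come from the line above.  The struct layouts, the two toy
file system types, the extra buf/len/nread parameters, and
generic_read() are all assumptions made up for illustration.

```c
#include <errno.h>

/* Hypothetical, cut-down in-core inode: only what this sketch needs. */
struct inode {
	int	i_fstype;		/* index into fssw[] */
	const char *i_data;		/* stand-in for the file contents */
};

/* One switch entry per file system type (hypothetical layout). */
struct fsw_entry {
	int	(*fs_read)(struct inode *ip, char *buf, int len, int *nread);
};

/* Toy type 0: copies bytes out of i_data; returns 0 for success. */
static int
s5_read(struct inode *ip, char *buf, int len, int *nread)
{
	int n = 0;

	while (n < len && ip->i_data[n] != '\0') {
		buf[n] = ip->i_data[n];
		n++;
	}
	*nread = n;
	return 0;
}

/* Toy type 1: always fails, to show that each type has its own routine. */
static int
dos_read(struct inode *ip, char *buf, int len, int *nread)
{
	(void)ip; (void)buf; (void)len;
	*nread = 0;
	return EIO;
}

#define FS_S5	0			/* hypothetical type numbers */
#define FS_DOS	1

struct fsw_entry fssw[] = {
	{ s5_read },			/* FS_S5 */
	{ dos_read },			/* FS_DOS */
};

/* The generic layer never tests the type; it only indexes the table. */
int
generic_read(struct inode *ip, char *buf, int len, int *nread)
{
	return (*fssw[ip->i_fstype].fs_read)(ip, buf, len, nread);
}
```

Adding a file system type then means adding a row to the table rather
than a new case to a switch statement in every system call.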

Ultrix: uses gnodes.  Gnodes are like vnodes or inodes, only different.
(Never having dealt with them I cannot say how so, other than in name.)

SunOS: uses vnodes.  Vnodes are different in just about every release
of SunOS (they keep getting bigger).  The important part is that vnodes
have a pointer to an operations vector, so instead of

	error = vnode_read(struct vnode *vp, parameters);

one writes

	error = (*vp->v_ops->vn_read)(vp, parameters);

These are encapsulated as macros to save typing:

	error = VOP_READ(vp, parameters);
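
As a rough sketch of the operations-vector idea, assuming a vnode with
nothing in it but the ops pointer and a private data pointer: v_ops,
vn_read, and VOP_READ are the names used above, while v_data, the
buf/len/nread parameters, and the foofs and deadfs routines are
hypothetical.  Real vnodes and real VOP_READ arguments differ from
release to release.

```c
#include <errno.h>

struct vnode;

/* Operations vector: one function pointer per operation (only read here). */
struct vnodeops {
	int	(*vn_read)(struct vnode *vp, char *buf, int len, int *nread);
};

struct vnode {
	struct vnodeops *v_ops;		/* set when the vnode is created */
	void	*v_data;		/* hypothetical per-file-system data */
};

/* The macro hides the indirection, as in the text above. */
#define VOP_READ(vp, buf, len, nread) \
	((*(vp)->v_ops->vn_read)((vp), (buf), (len), (nread)))

/* Hypothetical foofs read: treats v_data as a NUL-terminated string. */
static int
foofs_read(struct vnode *vp, char *buf, int len, int *nread)
{
	const char *src = vp->v_data;
	int n = 0;

	while (n < len && src[n] != '\0') {
		buf[n] = src[n];
		n++;
	}
	*nread = n;
	return 0;
}

/* Hypothetical second type whose read always fails. */
static int
deadfs_read(struct vnode *vp, char *buf, int len, int *nread)
{
	(void)vp; (void)buf; (void)len;
	*nread = 0;
	return EIO;
}

struct vnodeops foofs_vnodeops = { foofs_read };
struct vnodeops deadfs_vnodeops = { deadfs_read };
```

The difference from the File System Switch is that no type number and
no global table are involved: each vnode carries its own pointer to the
right set of routines.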

4.3BSD-Reno:  uses vnodes that are very similar to SunOS vnodes, but
not identical.  Again vnodes contain pointers to operation vectors, and
again one uses macros to invoke them.  In this case, however,
operations are permitted to store state; callers must call VOP_ABORTOP
when `backing out' of something.  In other words, one does something
like

	- Look up the following name with intent to delete.
	- Oops, never mind, something went wrong; I will not delete the
	  file after all.

Or

	- Look up the following name with intent to delete.
	- Delete the name you found.

(SunOS does not have the `abort' step, thus every operation must be self-
contained.)
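
The two sequences can be sketched as below.  This only illustrates the
idea of state saved by a lookup and released by either the operation or
the abort; it is not the Reno code.  The ni_op and ni_saved fields, the
name argument to foofs_lookup, and foofs_pending() are hypothetical,
and the real nameidata holds a great deal more.

```c
#include <errno.h>
#include <stdlib.h>
#include <string.h>

enum nameiop { LOOKUP, CREATE, DELETE, RENAME };

/* Hypothetical, stripped-down nameidata. */
struct nameidata {
	enum nameiop ni_op;	/* what the caller intends to do */
	char	*ni_saved;	/* state kept between lookup and operation */
};

static int pending;		/* saved state not yet released (demo only) */

int
foofs_pending(void)
{
	return pending;
}

/* Lookup with intent: anything other than LOOKUP leaves state behind. */
int
foofs_lookup(struct nameidata *ndp, const char *name, enum nameiop op)
{
	ndp->ni_op = op;
	ndp->ni_saved = NULL;
	if (op != LOOKUP) {
		ndp->ni_saved = malloc(strlen(name) + 1);
		if (ndp->ni_saved == NULL)
			return ENOMEM;
		strcpy(ndp->ni_saved, name);
		pending++;
	}
	return 0;
}

static void
foofs_release(struct nameidata *ndp)
{
	if (ndp->ni_saved != NULL) {
		free(ndp->ni_saved);
		ndp->ni_saved = NULL;
		pending--;
	}
}

/* "Delete the name you found": consumes the state from the lookup. */
int
foofs_remove(struct nameidata *ndp)
{
	if (ndp->ni_op != DELETE || ndp->ni_saved == NULL)
		return EINVAL;
	/* ... the real work of removing the name would go here ... */
	foofs_release(ndp);
	return 0;
}

/* "Oops, never mind": throws the saved state away without acting. */
int
foofs_abortop(struct nameidata *ndp)
{
	foofs_release(ndp);
	return 0;
}
```

A caller that does the lookup and then neither removes nor aborts
leaves state behind, which is exactly why the abort call is mandatory
in this scheme.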

>I guess what I am interested in is if I have a non-unix file system
>and I want to allow this file system to be mounted as a unix
>file system, and accessed using open, creat, read, write, close, et al,
>what interface translators would I have to create between my own
>file system and the unix system calls, and how and where would I
>typically add these translators.

For 4.3BSD-Reno, you would:

 1. modify <sys/mount.h> to add the parameters needed for a new mount,
    along with a mount type:
	#define	MOUNT_FOOFS	5
	#define	MOUNT_MAXTYPE	5	/* nb: not 6 (go argue with kirk...) */
	struct foofs_args {
		... whatever ...
	};
 2. write VFS operations to implement the following.  `mp' is always a
    `struct mount *', and is used to name the particular foofs file
    system (except during mount, where you have to figure out which
    foofs file system is meant from the args, and store it for later
    operations).

    foofs_mount(mp, char *path, caddr_t args, struct nameidata *ndp)
	mount the foofs file system at `path' using the given arguments.

    foofs_start(mp, int flags)
	start operation on the given foofs file system (after mount completes).

    foofs_unmount(mp, int forcibly)
	unmount the given foofs file system, possibly even if something is
	going on on it.

    foofs_root(mp, struct vnode **vpp)
	set *vpp to the vnode that represents the root of the given foofs
	file system.

    foofs_quotactl(mp, int cmd, uid_t uid, caddr_t arg)
	implement quotas, or (more simply) return EOPNOTSUPP.

    foofs_statfs(mp, struct statfs *sbp)
	fill in a `statfs' structure.

    foofs_sync(mp, int waitfor)
	write all data to permanent storage, optionally waiting until
	done before returning.

    foofs_fhtovp(mp, struct fid *fidp, struct vnode **vpp)
	convert a `file identifier' (private data) to a vnode, storing
	the resulting vnode in *vpp.

    foofs_vptofh(struct vnode *vp, struct fid *fidp)
	convert a vnode (vp) to a file identifier that can later be
	used to get vp again.

    foofs_init()
	set up any private data structures at boot time (e.g., inode
	hash chains for a ufs file system).

 3. write vnode operations.  vp is always a `struct vnode *'; vpp is
    always a `struct vnode **' into which the resulting vnode is
    stored; ndp is always a `struct nameidata *'; vap is always a
    `struct vattr *' containing the vnode attributes to apply or
    to be filled in; and cred is always a `struct ucred *' containing
    the (Unix-level) credentials of the person or program doing the
    operation (uid and gids).  (Often the credentials are in the
    nameidata instead.)

    foofs_lookup(vp, ndp)
	look up a path name segment (it contains no slashes).  ndp
	has all the semantics embedded, except that the lookup
	takes place in the directory given by vp.

    foofs_create(ndp, vap)
	create a file (after a lookup with CREATE set); store the new
	vnode in ndp->ni_vp.

    foofs_mknod(ndp, vap, cred)
	create a `node' (for Unix-specific things like devices).

    foofs_open(vp, int flags, cred)
    	open a file; flags is from the open() syscall, e.g., no-delay.

    foofs_close(vp, int flags, cred)
	close a file.

    foofs_access(vp, int flags, cred)
	test to see if the given user (cred) has the given access mode
	(flags&VREAD, flags&VWRITE, flags&VEXEC).

    foofs_getattr(vp, vap, cred)
	get file attributes (store in *vap).

    foofs_setattr(vp, vap, cred)
	set file attributes (do not change those marked VNOVAL).

    foofs_read(vp, struct uio *uio, int ioflags, cred)
    	read from file; valid flags are IO_NDELAY: do not block.

    foofs_write(vp, struct uio *uio, int ioflags, cred)
	write to file; valid flags are IO_UNIT: write as atomic unit
	(if write fails, undo it); IO_APPEND: append to file; IO_SYNC:
	do synchronous writes; IO_NDELAY: do not block.

    foofs_ioctl(vp, int cmd, caddr_t data, int flags, cred)
	do ioctl operation.

    foofs_select(vp, int which, int flag, cred)
	select for read/write/exception (which==FREAD/FWRITE/0 resp).

    foofs_mmap(vp, not-defined-yet)
	reserved for 4.4BSD.  probably should be replaced with name
	and unname page operations a la SunOS.

    foofs_fsync(vp, int fflags, cred, int waitfor)
	push all blocks for the file to permanent storage.  fflags
	is the file table entry flags.  optionally, wait until done.

    foofs_seek(vp, oldoff, newoff, cred)
	I dunno what this is doing in the ops vector.  Seeks are all
	taken care of above the vnode layer.

    foofs_remove(ndp)
	remove a file (after a lookup with DELETE).

    foofs_link(vp, ndp)
	create a link to the file vp (after a lookup with CREATE).

    foofs_rename(struct nameidata *old, struct nameidata *new)
	rename the old name to the new one (after a lookup with RENAME).

    foofs_mkdir(ndp, vap)
	create a directory (after a lookup with CREATE).

    foofs_rmdir(ndp)
	remove a directory (after a lookup with DELETE).

    foofs_symlink(ndp, vap, char *target)
	create a symbolic link (after a lookup with CREATE).

    foofs_readdir(vp, struct uio *uio, cred, int *eofflag)
	read directory contents (store in `struct dirent' format).
	eofflag is not used for anything.

    foofs_readlink(vp, struct uio *uio, cred)
	read symlink contents.

    foofs_abortop(ndp)
	changed mind after operation with intent to create/remove/rename.

    foofs_inactive(vp)
	last close on file, do whatever is appropriate (e.g., write
	inode back).

    foofs_reclaim(vp)
	vnode vp being reused; disassociate from any cache.

    foofs_lock(vp)
	lock underlying object (if possible).

    foofs_unlock(vp)
	unlock underlying object.  (These are not `file locking' locks;
	POSIX locking is not yet implemented.)

    foofs_bmap(vp, daddr_t bn, vpp, daddr_t *mapbn)
	map logical block number to physical block number, for old VM
	code.  this should go away.

    foofs_strategy(struct buf *bp)
	map logical block to physical and do I/O.  (Probably also should
	go away; read/write should handle this.  I do not trust the
	current buffer cache code....)

    foofs_print(vp)
	print contents of vnode, for debugging.

    foofs_islocked(vp)
	return true if underlying object is locked.
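
To show how the pieces from steps 2 and 3 hang together, here is a
heavily reduced sketch of a VFS operations vector for foofs, with only
three of the ten entries filled in.  Everything about the layout is an
assumption: the field names (mnt_op, mnt_data, fm_root), the VFS_*
wrapper macros as written here, and the use of `void *' where the real
interface takes a caddr_t.  Consult your own kernel's <sys/mount.h> for
the real thing.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>

struct mount;

struct vnode {
	int	v_isroot;		/* hypothetical: marks the root vnode */
};

/* Reduced VFS operations vector (hypothetical layout). */
struct vfsops {
	int	(*vfs_root)(struct mount *mp, struct vnode **vpp);
	int	(*vfs_quotactl)(struct mount *mp, int cmd, uid_t uid, void *arg);
	int	(*vfs_sync)(struct mount *mp, int waitfor);
};

struct mount {
	struct vfsops *mnt_op;		/* which file system type this is */
	void	*mnt_data;		/* private per-mount data */
};

/* Private per-mount state for foofs (hypothetical). */
struct foofs_mount {
	struct vnode fm_root;
};

#define VFS_ROOT(mp, vpp)	((*(mp)->mnt_op->vfs_root)((mp), (vpp)))
#define VFS_QUOTACTL(mp, c, u, a) \
	((*(mp)->mnt_op->vfs_quotactl)((mp), (c), (u), (a)))
#define VFS_SYNC(mp, w)		((*(mp)->mnt_op->vfs_sync)((mp), (w)))

/* Set *vpp to the vnode that represents the root of this mount. */
static int
foofs_root(struct mount *mp, struct vnode **vpp)
{
	struct foofs_mount *fmp = mp->mnt_data;

	*vpp = &fmp->fm_root;
	return 0;
}

/* No quotas in foofs: the simple way out. */
static int
foofs_quotactl(struct mount *mp, int cmd, uid_t uid, void *arg)
{
	(void)mp; (void)cmd; (void)uid; (void)arg;
	return EOPNOTSUPP;
}

/* Nothing is cached in this sketch, so there is nothing to write back. */
static int
foofs_sync(struct mount *mp, int waitfor)
{
	(void)mp; (void)waitfor;
	return 0;
}

struct vfsops foofs_vfsops = { foofs_root, foofs_quotactl, foofs_sync };
```

The vnode operations from step 3 would be collected into a second
table in the same way, and each vnode handed out by foofs_root and
foofs_lookup would point at it.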
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163)
Domain:	chris@cs.umd.edu	Path:	uunet!mimsy!chris
	(New campus phone system, active sometime soon: +1 301 405 2750)

guy@auspex.auspex.com (Guy Harris) (08/13/90)

>SysV (through R3 at least) uses the File System Switch.

SysV R4: uses vnodes that are very similar to SunOS 4.x vnodes, but not
identical.  They're probably much closer to SunOS 4.x vnodes than are
4.3-Reno vnodes, though.

richard@aiai.ed.ac.uk (Richard Tobin) (08/14/90)

In article <28595@athertn.Atherton.COM> mcgregor@hemlock.Atherton.COM (Scott McGregor) writes:
>I guess what I am interested in is if I have a non-unix file system
>and I want to allow this file system to be mounted as a unix
>file system, and accessed using open, creat, read, write, close, et al,

If your system already has NFS, you can do this without kernel
modifications, by writing an NFS server for your device.  NFS mounting
works by passing the kernel the address of a socket through which it
can send and receive messages from the filesystem.  I recently hacked
up such a thing so that I can mount Minix floppies on a Sun.

I can send you the code if you're interested.

-- Richard

-- 
Richard Tobin,                       JANET: R.Tobin@uk.ac.ed             
AI Applications Institute,           ARPA:  R.Tobin%uk.ac.ed@nsfnet-relay.ac.uk
Edinburgh University.                UUCP:  ...!ukc!ed.ac.uk!R.Tobin

del@thrush.mlb.semi.harris.com (Don Lewis) (08/15/90)

In article <3199@skye.ed.ac.uk> richard@aiai.UUCP (Richard Tobin) writes:
>In article <28595@athertn.Atherton.COM> mcgregor@hemlock.Atherton.COM (Scott McGregor) writes:
>>I guess what I am interested in is if I have a non-unix file system
>>and I want to allow this file system to be mounted as a unix
>>file system, and accessed using open, creat, read, write, close, et al,
>
>If your system already has NFS, you can do this without kernel
>modifications, by writing an NFS server for your device.  NFS mounting
>works by passing the kernel the address of a socket through which it
>can send and receive messages from the filesystem.  I recently hacked
>up such a thing so that I can mount Minix floppies on a Sun.
>
>I can send you the code if you're interested.

I did this for an automatic version of /usr/hosts.  It periodically
reads the hosts YP map and emulates a directory of symbolic links
to /usr/ucb/rsh for the map entries.

I have another application in mind where I would like to build
my own filesystem type.  It would not be a complete filesystem
implementation.  The reason that I can't do it with an NFS server
is that I need to know what syscall a process is executing when
the process is doing a lookup in my filesystem.
--
Don "Truck" Lewis                      Harris Semiconductor
Internet:  del@mlb.semi.harris.com     PO Box 883   MS 62A-028
Phone:     (407) 729-5205              Melbourne, FL  32901

andrew@alice.UUCP (Andrew Hume) (08/15/90)

	i agree with richard tobin; a user level file server is the
right way to go, even if it is NFS. it is easier to write, easier to debug
and easier on the other users on your system. as an example of how
easy it can be, given the appropriate libraries, implementing
a 10th edition netb server requires about 350-400 lines of code
if the underlying base looks something like a unix filesystem.
once you understand what is going on and if you need the speed etc.,
then you should plug in the code into your variant of the [gv...]node
stuff.