[net.arch] Magic Cookies and File Systems

sjs@u1100s.UUCP (Stan Switzer) (03/04/85)

[ does this really belong in net.arch? ]

In article <538@rlgvax.UUCP> guy@rlgvax.UUCP (Guy Harris) writes:
> > The problem lies in picking an underlying model sufficiently robust
> > to do all of the things you want to do with record and file locking, ...
> 
> Well, several UNIX versions *do* have record and file locking - System V
> Release 2 Version 2, for one.  They are based on John Bass' byte-stream
> locking mechanism; you lock a given range of bytes in the file.
> 
> VMS, on the other hand, has a locking mechanism that locks magic cookies;
> cooperating applications agree on a name for a resource and they lock
> that name when they want to lock that resource. ...
> ...
> This means you can lock things other than files; perhaps locking should
> have nothing to do with the file system.

On the contrary, all magic cookies should be tied to the file system in
some way.  It really burns me the way some message passing, locking,
and semaphore systems use a globally administered pool of names (or
worse yet, numbers!!) for the names of entities.  The file system directory
structure is the ONLY logical place for global names of ANY kind to reside.

UNIX drew on this idea from MULTICS, and used it to good advantage.  As in
MULTICS, devices and executable programs are just a particular kind of
enrty in the file system.  In MULTICS it went further: memory, in fact
everything, was named in the file system.

Actually I agree about locking resources, but locking ranges of bytes in
files is useful too.  Anyway, for the purpose of locking resources, you
just need a file (special or otherwise) to represent a resource, and you
do your locking against that file.

IF SOMETHING NEEDS A NAME, PUT IT IN THE FILE SYSTEM!!!!!!!!!!!!!!!

--------------------------------------------------------------------------
Stan Switzer     |  "We're gonna have a TV party tonight.
ihnp4!u1100s!sjs |   Alright!"  -- Black Flag

herbie@watdcsu.UUCP (Herb Chong [DCS]) (03/07/85)

In article <190@u1100s.UUCP> sjs@u1100s.UUCP (Stan Switzer) writes:
>On the contrary, all magic cookies should be tied to the file system in
>some way.  It really burns me the way some message passing, locking,
>and semaphore systems use a globally administered pool of names (or
>worse yet, numbers!!) for the names of entities.  The file system directory
>structure is the ONLY logical place for global names of ANY kind to reside.
and you pay for it too unless your entire file directory resides in
memory all the time, in which case you have consistency problems, not
to mention taking up real memory which could be allocated to other
things besides file names.  numbers are perfectly reasonable things to
have as names of entities.  are you going to read and make decisions
based upon these names?  no, a program is going to, and it is just as
happy with a short name(word) as a long one, but a long one takes up so
much more space.

>UNIX drew on this idea from MULTICS, and used it to good advantage.  As in
>MULTICS, devices and executable programs are just a particular kind of
>enrty in the file system.  In MULTICS it went further: memory, in fact
>everything, was named in the file system.
and look at how much overhead MULTICS has.  not that MULTICS is bad
conceptually, but why run a system that permanently takes away more
than half your CPU for overhead when you can run Unix, which is just
about as flexible most of the time, and you get one quarter or more of
your CPU back to boot.

>Actually I agree about locking resources, but locking ranges of bytes in
>files is useful too.  Anyway, for the purpose of locking resources, you
>just need a file (special or otherwise) to represent a resource, and you
>do your locking against that file.
locking bytes, or inability to do so, is as much a function of hardware
as software.  implementation that is both efficient in memory and time
requires some help from both.

>IF SOMETHING NEEDS A NAME, PUT IT IN THE FILE SYSTEM!!!!!!!!!!!!!!!
>
>--------------------------------------------------------------------------
>Stan Switzer     |  "We're gonna have a TV party tonight.
>ihnp4!u1100s!sjs |   Alright!"  -- Black Flag

there is something very fundamentally wrong here with stan's argument.
how do i ensure no two people have locked on the same file at the same
time?  no, you can't do an i/o to find out, because the other process
may be queued just ahead of you for the device so that a query finds it
unlocked, but the locking command (i/o or whatever) encounters an
already locked file.  but this is a dumb disk, it just goes ahead and
locks it anyway.  so now you have two processes with exclusive access
to the lock at the same time.  you can imagine what will happen when
both try to write to the file.  you have to check before you actually
lock in a manner that will guarantee exclusive access by one process so
what do you do?  you lock on a bit (word or whatever) in memory and
THEN lock on the file.  but why bother locking on the file when you
couldn't care less about the file but some other shared resource.  AH,
but a SMART disk drive would stop this.  big deal, the lock is still in
memory, but it's the device controller's now.  still not on disk.
now, let's give the bit a NAME, so it's not fixed to any one location
in storage.  need i go on?

for those of you who don't know, OS/360 (really MVS) has both internal
named locks (ENQ/DEQ) and file locks.  in a multiple processor
environment without shared memory, file locks are used because there is
really no other way.  the controller has the smarts to acknowledge a
RESERVE and prevent all I/O from any other CPU's to that device until
the CPU issuing the RESERVE is finished with it. (i might mention at
this point that a small MVS system will only have 30 or 40 disk devices
attached.)  when the processors share memory, the memory management
units serialize access to shared storage so that a locking technique
works with words (actually, double words) for all shared resources.
speed is of main concern here.  you do not want to tie up a shared
resource more than you have to when a typical i/o instruction can take
tens of thousands of CPU cycles and you need to do at least two to lock
on a file by setting bits in the filesystem.  why two?  one to issue
the lock, and another to see if you really got it.  if you don't have
to do the last one, then you're already got "magic cookies" that don't
use the file system.

there is a very pragmatic reason for using using named "magic cookies"
in a system for interprocess synchronization rather than the file
system.  it takes orders of magitudes less time and the system can
continue to schedule processes instead of waiting around for a lock to
be satisfied.  also, since the lock is in memory and never needs to be
updated to disk, consistency of the locks is ensured with much less
overhead.

this really belongs in net.flame or net.religion.operating-systems, but
i am putting it here since no-one else will pay attention otherwise.

Herb Chong...

I'm user-friendly -- I don't byte, I nybble....
(but i'm sick and tired of ignorance)

UUCP:  {decvax|utzoo|ihnp4|allegra|clyde}!watmath!water!watdcsu!herbie
CSNET: herbie%watdcsu@waterloo.csnet
ARPA:  herbie%watdcsu%waterloo.csnet@csnet-relay.arpa
NETNORTH, BITNET, EARN: herbie@watdcs, herbie@watdcsu

henry@utzoo.UUCP (Henry Spencer) (03/07/85)

Nobody ever said that locks should necessarily be *implemented* using
the file system.  The point is that they should *look* like part of
the file system to the customer.  We have a perfectly good name->resource
mapping mechanism already in existence; why invent another?
-- 
				Henry Spencer @ U of Toronto Zoology
				{allegra,ihnp4,linus,decvax}!utzoo!henry

bass@dmsd.UUCP (John Bass) (03/07/85)

As the author of locking(2) and lockf(2) I have a few comments:

	First the locking system I designed IS a simple semaphore
system in which each byte of the file address space is a magic cookie
to be locked, unlocked, and tested. It so happens that I included
semantics to deal wwith ranges of magic cookies .. thus the byte
stream nature.

	The code started out as a semaphore manager for a LUNDY display
memory system to allow several processes to maintain the display lists
with races. The soft blocking was added to minimize the number of
system calls and improve performance of the graphics applications.

	In 1980 I had a stiff political argument with ONYX management and
a contractor over how to do file locking. The suggested "mail box" semaphores
with a global 64 bit cookie/resource or a P & V semaphore drive ala the
UCLA implementation in the late 70's. I resurected the lundy code and
proposed it be recoded to fit into the filesystem and after nearly being
run out of town it ended up the ONYX locking(2) call. The soft blocking
is generally known "as enforcement mode".

	In 1981 I published the work in the USENIX news letter login:
and it has been in the public domain. After much hassle it went thru the
/usr/group standard proposal with three minor changes. 1) the soft blocking
was changed to be conditionally enabled based on the SGID bit of a file
to remove distructive behavor in the hands of children. Locking /etc/passwd
is generally unhealthy and kids in the educational area like such games
although comercial systems are generally free of such bull. 2) lock/unlock
actions were defined to deal with the previous record when given negative
lengths saving two addititional lseek(2) calls. 3) A testonly call was
added to supplement the test&lock call, thus removing race conditions in
some applications.

	I did not implement shared locks of any kind for several reasons.
I have had a number of requests to do so and flames as why necessary. But
I have never been able to reach an agreement over the semantics or how
to deal with certain race conditions.

	There are several DIFFERING shared locking semantics:

	1) A shared lock to be used for an extended update process.
	The intent is that only one process can hold the update lock
	preventing other update process from gaining access. Any reader
	can access the data.

	2) A shared lock to be used by reading processes to prevent updates.
	Any reader can request a duplicate lock, all writers are blocked.

	3) locks that can be promoted from 1 to 2 and back again.

Type 1 locks are the most common implementation. A payroll record is locked
while the operator updates the name/address on a CRT. With my implementation
this application can be implemented with a clean work around. The update
process locks the record and sets a timestamp for the record (or subrecord)
to be updated then frees the lock. Other updating processes note the record
busy while the timestamp is active. The updating process checks the timestamp
to insure that it is the SAME prior to updating the record, if it changed
the operator took too long (IE went to lunch) and some other process updated
the record. The process can either have the operator redo the transaction
with the new data displayed, or the two records can be diff edited togather.
Where money amounts are involved this is easy.

Type 2 locks are the most needed but cause REAL design problems. A process
traversing an N-Way tree data base needs to protect upper nodes from becoming
split/compacted and thus commiting the traversing process into garbage land.
This effectively prevents ANY insert/delete operations in a B-tree database
since the trama can extend to the root node -- thus a transversal must
lockout add/delete operations for the duration. These type locks can be
implemented with lockf(2) semaphores indirectly or with the aid of
a deamon process -- this is a tough design problem and semaphores of
any type will not help. The solution is to do all database actions in
a deamon process with fifo's to the user interface -- or to put the entire
data base manager into the kernel where the critical regions can be
handled properly. A unix based transaction filesystem is best solution
for many problems of this class.

	The original 1981 code that was published has a race condition between
the locked call and the inode lock for the enforcement hook in rdwri.
The fix is minor but varies between the varities of unix because of
changes in the filesystem code. The fix is to move the inode lock into
the locked procedure -- systems implementors can write and I will provide
the several line fix. I will attempt to repost the lockf.c file and some
of the documents to net.source within a week or so ... It has went thru
some minor changes in the last couple months -- with the help of some
4.2 sites it now runs under 4.2 but is undefined for the network filesystem.
The 4.2 network filesystem is "stateless" to handle certain error problems.
This design choice I think will make complex databases nearly impossible
because of integrity problems after certain failures.

	Lockf can be added as RPC's to the host machine in distributed
filesystems.
-- 
John Bass
DMS Design (System Performance and Arch Consultants)
{dual,fortune,idi,hpda}!dmsd!bass     (408) 996-0557

herbie@watdcsu.UUCP (Herb Chong [DCS]) (03/09/85)

>Nobody ever said that locks should necessarily be *implemented* using
>the file system.  The point is that they should *look* like part of
>the file system to the customer.  We have a perfectly good name->resource
>mapping mechanism already in existence; why invent another?
>				Henry Spencer
--------------------------------------
>However, I believe that in this case you are making an unnecessary
>assumption that "file system" means "disk drives".  The first is a
>logical concept, while the second is a physical concept.  Part of the
>file system (the temporary part) can be kept in (virtual) memory at all
>time, and the magic cookies can be in this part.  The advantage of
>keeping locks in the file system is that the single naming domain
>allows any program to access any object, if necessary.
>Ralph Johnson

point taken, gentlemen.  i alluded to this possibility when i spoke
about keeping the file system in real memory.  why not map to virtual
memory instead?  in fact, why not have an entire file system in virtual
memory?  (here i use file system to mean the unix definition of file
system which corresponds to a logical disk pack rather than the ENTIRE
file system.) let's have a look at the pluses and minuses.  first the
pluses:

1)	consistent object naming convention (a pathname)
2)	high access speed and ease of consistency checking
3)	trivial generalization to virtual files that can be read/written,
	and otherwise manipulated using existing system calls
4)	small changes in kernel code, if any, of existing unix systems

now some minuses:
1)	if the virtual files are being used for anything other than
	locking, yur system MUST not be memory bound and MUST have
	a well tuned pageing system.  otherwise your CPU thrashes
2)	relating file systems to a lock is not as intuitively obvious as
	you think.  it's hard enough to explain interprocess synchronization
	sometimes without bringing in the complexity of using the
	file system for locking.

mostly, i am quibbling over conceptualization.  even if the lock is
viewed as a file, what is actually happening is still that a named
object in storage (usually a bit or word) is what the system
itself will be testing for existence of a lock.  you have just added a
layer of abstraction to the problem to unify a logical description of
the system.  deep down inside, it's still bits in memory that are being
massaged.  (i wonder why AT&T or Berkeley don't implement a virtual file
system for unix.  i think it would be a wonderful thing, if done right.)

as a side note, OS/360 (really MVS) implements a virtual file system if
one desires to by refering to UNIT=VIO in DD statements.  the file
(which cannot be permanent) is mapped to virtual storage and is handled
by paging devices if real I/O needs to be done.  one must not be
storage constrained, have an inefficient pageing system, nor be
accessing large amounts of storage (large files) with it or the system
will page itself to death.  MVS never thrashes due to the design of the
scheduler, but it will spend almost all it's time in page-wait if many
large VIO files are open at once.  a good MVS system administrator will
keep a careful watch on usage of VIO.  MVS never uses the VIO files for
locking, to my knowledge, though there is nothing to stop it from doing
so.  the ENQ/DEQ facility used for locking files is generalized enough
to lock on any resource.

as a further digression, i keep mentioning MVS because i know a lot
about it, and just about everything in operating system design has been
tried on it.  if nothing else, every algorithm that has even a
reasonable chance of working was tried at one time or another on OS/360.

Herb Chong...

I'm user-friendly -- I don't byte, I nybble....

UUCP:  {decvax|utzoo|ihnp4|allegra|clyde}!watmath!water!watdcsu!herbie
CSNET: herbie%watdcsu@waterloo.csnet
ARPA:  herbie%watdcsu%waterloo.csnet@csnet-relay.arpa
NETNORTH, BITNET, EARN: herbie@watdcs, herbie@watdcsu

jlg@lanl.ARPA (03/09/85)

> [ does this really belong in net.arch? ]

Yes, it pertains to the archetecture of operating systems.  There is no
other network forum for this discussion and I don't know how to make one.

> It really burns me the way some message passing, locking,
> and semaphore systems use a globally administered pool of names (or
> worse yet, numbers!!) for the names of entities.  The file system directory
> structure is the ONLY logical place for global names of ANY kind to reside.

This assumes that the shared resources are global, that processes that
don't share resources are willing to put up with the overhead of having
this capability in the file interface, and that the designer agrees with
your religious conviction that the file system should be the only global
entity.  SOME information about shared resources needs to in the system
just as SOME I/O interface must be in the system.  But the place for rarely
used or very sophisticated features is at some level higher than the system
kernal.

> IF SOMETHING NEEDS A NAME, PUT IT IN THE FILE SYSTEM!!!!!!!!!!!!!!!

And therby slow the file system down for ALL users whether they use the
additional global naming capibility or not.

J. Giles

avie@cmu-cs-spice.ARPA (Avadis Tevanian) (03/09/85)

> worse yet, numbers!!) for the names of entities.  The file system directory
> structure is the ONLY logical place for global names of ANY kind to reside.

Would you care to say *why* the file system is the only logical place for
global names to reside?  When I read articles like this, I begin to worry.
As far as I'm concerned, a file is just one of many resources in a system.
In fact, THERE IS NO REASON that the kernel need know what a file is!  Of
course, many of the readers of this newsgroup (hopefully not all) are still
living in the primitive world of "the Unix Way."

We are currently developing a new kernel based on Accent (which was
developed here at CMU).  Our kernel, conceptually, only knows how to
manipulate address maps of processes.  The file system is implemented by
user level processes, which understand the disk format of the file systems,
whether they be on a Fujitsu Eagle, RX01 floppy, or over the ethernet.  When
raw disk I/O needs to be performed, drivers are called.  Even these drivers
need not exist in the kernel proper, except that it is more efficient to put
them in the kernel.

So, in my opinion, it is better to have a general name server, which can
assign names to a variety of objects (files, semaphores, ports, whatever).

Related to locking, it seems to me that doing locking in the file system is
not such a great idea if you want any type of speed.  I wrote up a 4.2
semaphore package (in the kernel) which runs on a VAX 784 (4 780 processors
connected with shared memory).  A P/V combination takes only about 500
microseconds (that includes 2 traps, verification of valid parameters, and
all that ugly stuff).  I'd like to see you do that with file system locking.

	Avie Tevanian

pdbain@wateng.UUCP (Peter Bain) (03/09/85)

>
>point taken, gentlemen.  i alluded to this possibility when i spoke
>about keeping the file system in real memory.  why not map to virtual
>memory instead?  in fact, why not have an entire file system in virtual
>memory?  (here i use file system to mean the unix definition of file
>system which corresponds to a logical disk pack rather than the ENTIRE
>file system.)

Sorry, Herb - it's been tried. The system is Multics,
(which was designed by MIT, Bell Labs, and GE) and had absolutely everything
anyone had ever though about putting in an operating system. :-) )
The problem with files == segments in Multics was that
everything was demand paged, and so if you were going through
the file sequentially, your working set
got cluttered with stale pages. I am not saying that this has to happen,
but that it's not quite as simple as it appears.
-- 
   - peter bain
...!{allegra|decvax|clyde|ihnp4 }!watmath!wateng!pdbain
hard mail:	CCNG, CPH-2369A, University of Waterloo,
	Waterloo, Ont. Canada N2M 5G4
telephone:	(519) 885-1211 x2810

tihor@acf4.UUCP (Stephen Tihor) (03/11/85)

ELXSI did that in Embos, their own operating system, philosphically Unix-based
(much as Algol 68 was ALgol 60-based).  There is one general name service that
covers Files, Devices, even System calls.  Thus there is one general protection
mechanism that covers them all, etc.  A very clean effect.

[Of course their Ads are still filled with Vapourware and Marketing's idea of
benchmarks, but they are a fairly new company so I can expect them to even
behave up to DEC's standards.(;-))]

mat@amdahl.UUCP (Mike Taylor) (03/11/85)

> for those of you who don't know, OS/360 (really MVS) has both internal
> named locks (ENQ/DEQ) and file locks.  in a multiple processor
> environment without shared memory, file locks are used because there is
> really no other way.  the controller has the smarts to acknowledge a
> RESERVE and prevent all I/O from any other CPU's to that device until
> the CPU issuing the RESERVE is finished with it.

It's a small point, but the facts might as well be straight.
Since RESERVE/RELEASE locks an entire volume, not a file, resulting in
unwanted side-effects such as deadlocks, the internal named locks
may be propagated through all sharing processors by a facility
called Global Resource Serialization (GRS).
GRS uses channel-to-channel connection to pass a token around all
connected processors. Possession of the token allows tasks (processes)
on that system to obtain/release global locks. The token is passed every
n milliseconds (n = tunable parameter).
-- 
Mike Taylor                        ...!{ihnp4,hplabs,amd,sun}!amdahl!mat

[ This may not reflect my opinion, let alone anyone else's.  ]

bass@dmsd.UUCP (John Bass) (03/12/85)

From Mr Tevanian:
	"Would you care to say *why* the file system is the only logical
	place for global names to reside? .. In fact, THERE IS NO REASON
	that the kernel need know what a file is! Of course, many of the
	readers of this newsgroup (hopefully not all) are still living in
	the primitive world of "the Unix Way."

More Ivory Tower snobbery we can do with out ... every CSc department has
its pet students and operating system research programs to produce mindless
trash to be ignored by the rest of the sane world (how this for an attention
getter living in the primative world). OS research is the fabric of learning
what and what not do do in real life -- more the latter I feel. I will grant
him that an OS NEED NO KNOW about a file as a fundamental resource for some
applications -- on the other hand the file is the basis of most commercial
applications environments.

From the abstract point of view nami in UNIX provides the mapping from
a character string name space to a resource ID (an inode). A resource
ID in my model of lockf/locking is a semaphore point with a large number of
magic cookies in it. Of course most resource ID's are files, but others are
network interfaces, system services, drivers, etc. It is unfair to say that
every unix file descriptor is a file.

As for performance you quote for PV operations ... the lockf call operates
on a 780 with similar timings -- thus there is nothing particularly sloooow
about locking and semaphores done with inode handles.

Path/file names are a good (if not best) way of describing resource IDs --
the magic number approach is trash ---

-- 
John Bass
DMS Design (System Performance and Arch Consultants)
{dual,fortune,idi,hpda}!dmsd!bass     (408) 996-0557

wapd@houxj.UUCP (Bill Dietrich) (03/14/85)

(Let's keep the discussion friendly and factual, okay ?)


I agree with the side that says locks or resources or whatever
should look like files if possible.  The reason is that the
user should have to deal with as few naming conventions or
ideas as possible.  If everything can be unified into one
naming convention (tree-structured, path-names), the user will
have a managable number (one) of things that must be understood
on the way to solving the problem that is of interest.

So I think it is easier to know how files are named and to remember
that some files are really locks, some are devices and some are data,
than to remember that there is one way to name locks, another to
name devices, another for data.

I really don't care whether you want to define the naming convention
as names of locks (and some locks happen to have data associated
with them) or as names of files (and some files happen to have locks
associated with them).

As a side comment, at least in Unix, using file names to represent
"pure" locks shouldn't slow down access to regular files.  The kernel
already tests whether a file is regular, character special,
block special, pipe, etc.  "Lock" is just another case.


				Bill Dietrich
				houxj!wapd

jlg@lanl.ARPA (03/18/85)

> I agree with the side that says locks or resources or whatever
> should look like files if possible.  The reason is that the
> user should have to deal with as few naming conventions or
> ideas as possible.  If everything can be unified into one
> naming convention (tree-structured, path-names), the user will
> have a managable number (one) of things that must be understood
> on the way to solving the problem that is of interest.

I agree with the above philosophy but not with the terminology.  There
should indeed be only one system pool for name/resource management.  But
this is NOT the file system.  The main function of the file system is to
provide a mechanism for data transfer between a process and certain types
of resources.  Locks and semaphors (as well as perhaps other resources)
don't need this functionality of the file system and should not be
associated with the file system.

Note that the file system will use whatever resource naming convention is
implemented by the appropriate part of the OS.  The file system may also
use locks or semaphors which are implemented within the OS (to prevent two
processes from simultaneously writing to the same place, for example).
Both of these things are functions of the operating system which are
distinct from the file system.

J. Giles