[comp.unix.internals] Ideas for changes to Unix filesystem

jeremy@socs.uts.edu.au (Jeremy Fitzhardinge) (01/30/91)

I've been having a few ideas about changes to the unix filesystem that
may or may not be useful.  I'd like comments, but not flames unless
you're feeling really motivated.

When I refer to "the" unix filesystem, I'm talking about a BSD FFS since
the old "ILike14Charact" filesystem of System V R<4 seems to have faded
out; however it is simpler to explain changes to, so I may use it for
examples.

I have 3 main ideas for change:

1 - a flink(char *path, int fd) system call/operation.

It seems odd to me that you can open a file, unlink it from the
filesystem, and then not be able to put it back as a file unless you
actually copy it out.  What I was thinking about was a system call that
lets you make a new directory entry that refers to the inode of an open
file.  The syscall would allow the link to take place if the user can
create a directory entry at the specified path which can point to the
inode of the open file.  The only security problem I can think of is
this: it would be possible to link a file back into the filesystem into
a publicly accessible directory after some time, even if the path to
the original file becomes closed.  If this were a real problem, you'd
have to use a utility like fuser to see what processes have what files
open in the closed off area.  However, this situation is only marginally
different from just copying out the file, which would have fewer side
effects anyway (like not incrementing the link count).
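
A toy model of the idea, as a sketch only: one in-memory inode and a
small flat directory, with no permission checks at all.  Every name
here (toy_inode, toy_dir, toy_flink and so on) is made up for
illustration and is not taken from any real kernel.  It shows the one
property that matters: an inode whose link count has dropped to zero
stays alive while it is open, so a new directory entry can be pointed
at it.

```c
#include <assert.h>
#include <string.h>

/* Toy in-memory model; all names are hypothetical, not kernel code. */
#define MAX_ENTRIES 8
#define NAME_LEN    32

struct toy_inode {
    int nlink;      /* directory entries naming this inode         */
    int open_refs;  /* open descriptors keeping the inode alive    */
};

struct toy_dirent {
    char name[NAME_LEN];
    struct toy_inode *ino;   /* NULL means the slot is free */
};

struct toy_dir {
    struct toy_dirent ent[MAX_ENTRIES];
};

/* Return the inode a name refers to, or NULL if there is no such entry. */
struct toy_inode *toy_lookup(struct toy_dir *d, const char *name)
{
    for (int i = 0; i < MAX_ENTRIES; i++)
        if (d->ent[i].ino && strcmp(d->ent[i].name, name) == 0)
            return d->ent[i].ino;
    return NULL;
}

/* Enter a (name, inode) pair into the directory and bump the link count. */
static int toy_enter(struct toy_dir *d, const char *name, struct toy_inode *ino)
{
    assert(d && name && ino);
    if (strlen(name) >= NAME_LEN || toy_lookup(d, name))
        return -1;                       /* too long, or name already exists */
    for (int i = 0; i < MAX_ENTRIES; i++) {
        if (!d->ent[i].ino) {
            strcpy(d->ent[i].name, name);
            d->ent[i].ino = ino;
            ino->nlink++;
            return 0;
        }
    }
    return -1;                           /* directory full */
}

/* Remove a name; the inode itself survives while open_refs > 0. */
int toy_unlink(struct toy_dir *d, const char *name)
{
    for (int i = 0; i < MAX_ENTRIES; i++) {
        if (d->ent[i].ino && strcmp(d->ent[i].name, name) == 0) {
            d->ent[i].ino->nlink--;
            d->ent[i].ino = NULL;
            return 0;
        }
    }
    return -1;
}

/* The proposed operation: give a name to an inode that is reachable
 * only through an open descriptor. */
int toy_flink(struct toy_dir *d, const char *name, struct toy_inode *open_ino)
{
    if (open_ino->open_refs <= 0)
        return -1;                       /* not actually open */
    return toy_enter(d, name, open_ino);
}
```

Starting from an open inode, toy_unlink() followed by toy_flink()
under a new name leaves the same inode in the directory with a link
count of one, which is the "put it back" step that has no equivalent
today.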

2 - insertion/deletion in the middle of a file without copying

Inserting and deleting chunks from the middle of a file seems like a
pretty common operation, yet it is algorithmically quite inefficient as a
result of the way the filesystem is designed.  What I was thinking about
is having the logical size of each block in the indirect blocks, as well
as their location.  When I say "block" I'm referring to the smallest
singly writeable unit onto some disk-like device - basically a SysV FS
block as opposed to BSD's myriad of sectors/blocks/clusters etc.

When the file is being used normally (new data being appended to the
end) then all blocks but the last will have valid data in them.  However
when data is added into the middle of the file, a new block is inserted
into the blocklist.  If the insertion is in the middle of a currently
existing block, then the block's logical size is truncated to the offset
of the insertion into the block.  The remainder is copied into the newly
allocated block.  The logical size of the new block is set to the
remainder's size, and the filepointer is set to the end.  If the file is
read, then it appears exactly the same, until new data is written.  On a
write, instead of overwriting existing data, the data is written to fill
the remainder of the new block, thus increasing its logical size.  When
the logical size matches the physical size another block is inserted
into the file.
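
The splitting step can be sketched in user space.  This is a
simplified model under stated assumptions, not a filesystem: the
"disk" is a small in-memory pool, the block size is 4 bytes so the
effect is easy to see, and all names (lfile, split_at, lfile_insert,
lfile_read) are hypothetical.  Inserted data goes into fresh blocks
placed between the two halves of the split block, and at most one
block's tail is ever copied.

```c
#include <assert.h>
#include <string.h>

#define BLK   4      /* tiny physical block size, for illustration only */
#define MAXB  16     /* capacity of the pool and of the block list      */

struct lfile {
    char pool[MAXB][BLK];                        /* stand-in for disk blocks */
    struct { int phys; size_t len; } map[MAXB];  /* block list: location plus
                                                    logical size             */
    size_t nblk;                                 /* entries in use           */
};

/* Open a slot at index i by shifting block-list entries (not data) and
 * allocate a fresh physical block for it. Returns 0, or -1 if full. */
static int open_slot(struct lfile *f, size_t i)
{
    assert(i <= f->nblk);
    if (f->nblk == MAXB)
        return -1;
    memmove(&f->map[i + 1], &f->map[i], (f->nblk - i) * sizeof f->map[0]);
    f->map[i].phys = (int)f->nblk;   /* blocks are never freed in this toy */
    f->map[i].len = 0;
    f->nblk++;
    return 0;
}

/* Make logical offset `off` fall on a block boundary, splitting one block
 * if needed. Returns the block-list index that starts at `off`, or -1. */
static long split_at(struct lfile *f, size_t off)
{
    size_t i = 0;

    while (i < f->nblk && off >= f->map[i].len) {
        off -= f->map[i].len;
        i++;
    }
    if (i == f->nblk)
        return off == 0 ? (long)i : -1;   /* end of file, or past it */
    if (off == 0)
        return (long)i;                   /* already on a boundary */
    if (open_slot(f, i + 1) != 0)
        return -1;
    size_t tail = f->map[i].len - off;    /* remainder moves to the new block */
    memcpy(f->pool[f->map[i + 1].phys], f->pool[f->map[i].phys] + off, tail);
    f->map[i + 1].len = tail;
    f->map[i].len = off;                  /* truncate the logical size */
    return (long)(i + 1);
}

/* Insert n bytes at logical offset off without moving later blocks' data. */
int lfile_insert(struct lfile *f, size_t off, const char *src, size_t n)
{
    long at = split_at(f, off);

    if (at < 0)
        return -1;
    while (n > 0) {
        size_t chunk = n < BLK ? n : BLK;
        if (open_slot(f, (size_t)at) != 0)
            return -1;
        memcpy(f->pool[f->map[at].phys], src, chunk);
        f->map[at].len = chunk;
        src += chunk;
        n -= chunk;
        at++;
    }
    return 0;
}

/* Read the file as the user sees it: the valid bytes of each block in order. */
size_t lfile_read(const struct lfile *f, char *out, size_t cap)
{
    size_t n = 0;

    for (size_t i = 0; i < f->nblk; i++) {
        size_t len = f->map[i].len;
        if (n + len > cap)
            break;
        memcpy(out + n, f->pool[f->map[i].phys], len);
        n += len;
    }
    return n;
}
```

Writing "ABCDEFGH" into an empty lfile and then inserting "xy" at
offset 2 reads back as "ABxyCDEFGH" using four blocks, three of them
only half full.  That is the fragmentation cost of the scheme.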

Rather than having separate "write with insert" operations (as I implied
above), I think the best way of allowing program support would be an
"insert" system call that inserts a certain amount of empty space into
an open file at the current position.  Naturally, the blocks are only
inserted into the file, but are not actually allocated on disk. If a
negative amount is specified then the space is closed up.  If the file
becomes too fragmented, then it can be just rewritten contiguously,
which would fill up all gaps.

This mechanism saves having to copy any of the actual file larger than a
physical block size, but it does mean that there is quite a bit of
shuffling about of the indirect blocks, which could make the operation
hard to guarantee atomic.  It might also be worth making insertion an
attribute of a file when it's created so that only files that need it
have the overhead of logical block sizes in the indirect blocks.

3 - limited sized files

This idea is essentially quite similar to the above - basically I've
been sick of simple log files that grow and grow without bound, often
making serious holes in a file system.  The idea is simply this - create
a file that has a certain maximum size.  If there is a write to the end
of the file that would normally grow the file, then rather than ignoring
it, blocks from the front of the file are reallocated and reordered to
hold the new data.  I suppose the file size would best be in units of
filesystem blocks, however if implemented in conjunction with
insertion/deletion, then this need not be the case.



These are ideas that may be implemented in a filesystem that's currently
being designed.  I would quite like comments and ideas from fellow
experienced Unix users/hackers.



-- 
Jeremy Fitzhardinge:jeremy@ultima.socs.uts.edu.au jeremy@utscsd.csd.uts.edu.au
Irregular adjective:  I have a moral standpoint
		      You are assertive
		      He is aggressive

mjr@hussar.dco.dec.com (Marcus J. Ranum) (02/04/91)

jeremy@socs.uts.edu.au (Jeremy Fitzhardinge) writes:

>1 - a flink(char *path, int fd) system call/operation.

	Part of the problem with such a system call is that it would
break a lot of fairly clean and elegant interfaces. Presently, you can
(for example) write code that ignores whether it is writing to a
tty, a socket, or a disk file. An flink() system call would break
that because you'd have to generate an error if someone tried to
flink() a socket to a filename. Sure, it'd be do-able, but there
would be lots of grotty special cases to deal with. The question then
becomes "why?" - actual cases where someone would want to do such a
thing are fairly rare, I believe - not worth the cost that would be
incurred. You'd also have the same problem that you'd get an error
if you tried to flink() across a device. In the cross-device case,
copying it back is far more portable.

>2 - insertion/deletion in the middle of a file without copying

	This is another fairly special case. I don't have any hard
statistics, but I suspect most file activity is sequential or random,
and a random sequential :) file wouldn't be used a whole lot. There
are a lot of cases where this would be very nice, and typically
such functionality is fairly easily added to an application via a
set of library routines that manipulates blocks in some form of
linked list. This is probably a good way to do it, since it won't
make the inodes bigger (which means that EVERY file will waste extra
space) - it's also just a simple issue of application support. If
I write my application with a library to handle file management,
I don't have to worry that it won't run on Joe Bob's UNIX which
hasn't got kernel support for chunked files. That counts for a
lot. Generally, it's better to put stuff in the application
layer unless it *HAS* to go into the kernel, or unless it will
somehow dramatically help all the applications running on that
kernel - without breaking portability. For example, implementing
Ousterhout's log-based file system and getting a 10% write speedup
would be a bigger win for 95% of the applications on the system
than getting a 95% speedup for 10% of the applications.

>3 - limited sized files

	There are a lot of things that UNIX doesn't do that it
might be nice if it did - but a lot of those are because it'd be
unnecessarily complex or expensive to do them, and the return on
investment is fairly low. Fortunately, kernel hackers have been
one of the last bastions against the "let's just add this feature
because it'd look neat" crowd - otherwise UNIX would look like
X-window or GNUemacs.

mjr.
-- 
	Lutraphiles unite!

lm@slovax.Eng.Sun.COM (Larry McVoy) (02/04/91)

In article <1991Jan30.143326.16676@socs.uts.edu.au> jeremy@socs.uts.edu.au (Jeremy Fitzhardinge) writes:
>1 - a flink(char *path, int fd) system call/operation.

Perfectly reasonable.  Almost everything that takes a pathname as an arg is
already written as

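func(char *p, args...)
{
	struct	vnode *vp;

	lookuppn(p, &vp, dirvp...);	/* pathname -> vnode */
	cfunc(vp, args...);		/* common code works on the vnode */
}

ffunc(int fd, args...)
{
	struct	file *fp;

	fp = GETF(fd);			/* descriptor -> file table entry */
	cfunc((struct vnode *)fp->f_data, args...);
}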

>2 - insertion/deletion in the middle of a file without copying
>
>result of the way the filesystem is designed.  What I was thinking about
>is having the logical size of each block in the indirect blocks, as well
>as their location.  

This is normally known as an extent, i.e, a <bn, length> tuple.

>[much stuff about insertion alg deleted]

I'm not very interested in this idea.  While I agree that it is nice to
be able to say "vi 100MBfile", insert some junk, and write it out, and
have it all happen quickly, I question that this is a common enough
operation that you really want to cram this sort of complexity into the
file system.  If you really need it, build it into the application using
multiple files.  You can also mitigate the copy stuff (it may be that 
some editors do this already) by rewriting the data from the change on
down.

>3 - limited sized files
>
>This idea is essentially quite similar to the above - basically I've
>been sick of simple log files that grow and grow without bound, often
>making serious holes in a file system.  

There is a per process file size limit.  Find the offending processes and
crank down their limit.  That's what it is there for.  Better yet, write
a crontab entry that goes in, deletes all but the last N lines/bytes/whatever
of data.  This is an administration issue, not a file system issue.
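
A minimal sketch of cranking down that limit from inside a process,
assuming the BSD-style getrlimit()/setrlimit() interface with
RLIMIT_FSIZE.  The helper name cap_file_size is made up for
illustration.  Once the soft limit is in place, a write that would
grow a file past it raises SIGXFSZ, or fails with EFBIG if that signal
is caught or ignored.

```c
#include <sys/resource.h>

/* Lower this process's soft file-size limit to max_bytes, clamped to the
 * hard limit. The limit is inherited across fork() and exec().
 * Returns 0 on success, -1 on failure. */
int cap_file_size(rlim_t max_bytes)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_FSIZE, &rl) != 0)
        return -1;
    if (rl.rlim_max != RLIM_INFINITY && max_bytes > rl.rlim_max)
        max_bytes = rl.rlim_max;     /* cannot exceed the hard limit */
    rl.rlim_cur = max_bytes;         /* the hard limit is left untouched */
    return setrlimit(RLIMIT_FSIZE, &rl);
}
```

A wrapper that calls cap_file_size() and then execs the offending
daemon is one way to apply this without touching the daemon's source.
Note that the limit applies to every file the process writes, not
just its log.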

>These are ideas that may be implemented in a filesystem that's currently
>being designed.  I would quite like comments and ideas from fellow
>experienced Unix users/hackers.

You got 'em.  I'd like to know who/where/why this file system is being designed
and when/where/how it will be released.
---
Larry McVoy, Sun Microsystems     (415) 336-7627       ...!sun!lm or lm@sun.com

bzs@world.std.com (Barry Shein) (02/04/91)

Actually, I'm going to take exception with all these old fuddy-duddies
who seem to be defending the status quo and say that the idea of
manipulating blocks within a file is perfectly rational and would be
useful. I don't know why everyone seems to think it's a wild idea.

Think of fixed length record files and inserting into them, it would
be nice to be able to just copy/munge the block numbers rather than
the data.

You'd need operators for inserting and deleting, perhaps one function
could do both (who cares, two functions, or one with flags, easy
enough to use flags.) Moving blocks around (e.g. a sort) would be
handy also.

Of course, you could do most of this virtually (although really
freeing space is a problem) by just writing an application library
which goes through a block table.

I suppose the obvious suggestion would be to try writing and putting
such a library into common use and seeing if it gets used.

Personally, I'd be more interested in a meta-file format where you can
create files which point into other files, a file-type made up of an
(offset, length) tuple list. The hard part is reference counts. But it
is the file system equivalent of a database "view". I could think of
many uses for that, and its operators (e.g. hypertext.)
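
The read side of such a view can be sketched as follows.  This is an
illustration only: the backing file is stood in for by a byte array,
the names view_extent and view_read are invented here, and the hard
part mentioned above (reference counts on the underlying file) is not
modelled at all.

```c
#include <stddef.h>
#include <string.h>

/* One piece of a view: `len` bytes starting at `off` in the backing file. */
struct view_extent {
    size_t off;
    size_t len;
};

/* Read up to `count` bytes of the view, starting at view position `pos`,
 * into `buf`. `base` stands in for the backing file's contents.
 * Returns the number of bytes copied (0 at or past the end of the view). */
size_t view_read(const char *base, const struct view_extent *ext, size_t n_ext,
                 size_t pos, char *buf, size_t count)
{
    size_t done = 0;

    for (size_t i = 0; i < n_ext && done < count; i++) {
        if (pos >= ext[i].len) {        /* view position is past this piece */
            pos -= ext[i].len;
            continue;
        }
        size_t n = ext[i].len - pos;    /* bytes available in this piece */
        if (n > count - done)
            n = count - done;
        memcpy(buf + done, base + ext[i].off + pos, n);
        done += n;
        pos = 0;                        /* later pieces are read from their start */
    }
    return done;
}
```

With a backing buffer "0123456789" and the tuple list (6,3), (0,2),
the view reads as "67801": the same bytes, reordered, with nothing
copied until the moment of the read.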
-- 
        -Barry Shein

Software Tool & Die    | bzs@world.std.com          | uunet!world!bzs
Purveyors to the Trade | Voice: 617-739-0202        | Login: 617-739-WRLD

chip@tct.uucp (Chip Salzenberg) (02/05/91)

According to sef@kithrup.COM (Sean Eric Fagan):
>And, yeah, there have been times when I would have liked to have seen
>a "round" file (i.e., wrapping around the end).

So create a circular file subroutine library that looks at the top of
the file for "maxsize,curpos\n".  We use one here; it's very handy,
and it's usable on all UNIX implementations with file/record locks.
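
A rough sketch of what such a library might look like, using only
stdio.  This is not the library described above: the fixed-width
header layout, the function names (circ_init, circ_append,
circ_get_header) and the 9-digit size cap are all assumptions made for
the example, and the file/record locking a real version needs is left
out.

```c
#include <stdio.h>

#define CIRC_HDR_LEN 22   /* "%010ld,%010ld\n" is always 22 bytes */

/* Write the fixed-width "maxsize,curpos\n" header at the top of the file. */
static int circ_put_header(FILE *fp, long maxsize, long curpos)
{
    if (fseek(fp, 0L, SEEK_SET) != 0)
        return -1;
    return fprintf(fp, "%010ld,%010ld\n", maxsize, curpos) == CIRC_HDR_LEN
               ? 0 : -1;
}

/* Read the header back. Returns 0 on success, -1 on failure. */
int circ_get_header(FILE *fp, long *maxsize, long *curpos)
{
    if (fseek(fp, 0L, SEEK_SET) != 0)
        return -1;
    return fscanf(fp, "%ld,%ld", maxsize, curpos) == 2 ? 0 : -1;
}

/* Turn an empty file opened for update into a circular file. */
int circ_init(FILE *fp, long maxsize)
{
    if (maxsize <= 0 || maxsize > 999999999L)   /* must fit the header field */
        return -1;
    if (circ_put_header(fp, maxsize, 0L) != 0)
        return -1;
    return fflush(fp) == 0 ? 0 : -1;
}

/* Append len bytes, wrapping to the start of the data area when the
 * write position reaches maxsize. The oldest data is overwritten. */
int circ_append(FILE *fp, const char *data, long len)
{
    long maxsize, curpos;

    if (circ_get_header(fp, &maxsize, &curpos) != 0)
        return -1;
    if (maxsize <= 0 || curpos < 0 || curpos >= maxsize)
        return -1;                              /* corrupt header */
    for (long i = 0; i < len; ) {
        long room = maxsize - curpos;           /* bytes before the wrap point */
        long n = len - i < room ? len - i : room;
        if (fseek(fp, CIRC_HDR_LEN + curpos, SEEK_SET) != 0 ||
            fwrite(data + i, 1, (size_t)n, fp) != (size_t)n)
            return -1;
        i += n;
        curpos = (curpos + n) % maxsize;
    }
    if (circ_put_header(fp, maxsize, curpos) != 0)
        return -1;
    return fflush(fp) == 0 ? 0 : -1;
}
```

A reader starts at curpos and wraps once to see the data oldest-first.
As noted elsewhere in this thread, with variable-length records the
oldest surviving record will usually be a partial one.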
-- 
Chip Salzenberg at Teltronics/TCT     <chip@tct.uucp>, <uunet!pdn!tct!chip>
 "Most of my code is written by myself.  That is why so little gets done."
                 -- Herman "HLLs will never fly" Rubin

jeff@crash.cts.com (Jeff Makey) (02/05/91)

In article <1991Jan30.143326.16676@socs.uts.edu.au> jeremy@socs.uts.edu.au (Jeremy Fitzhardinge) writes:
>2 - insertion/deletion in the middle of a file without copying
 [...]
>3 - limited sized files

Both of these wishes could be granted by combining a variant of
ftruncate() that deletes bytes from arbitrary sections of a file with a
new kernel call that efficiently creates empty space in the middle of a
file.

                            :: Jeff Makey

Department of Tautological Pleonasms and Superfluous Redundancies Department
    Posting from my temporary home at ...
    Domain: jeff@crash.cts.com    UUCP: nosc!crash!jeff

igb@fulcrum.bt.co.uk (Ian G Batten) (02/05/91)

In article <BZS.91Feb4003139@world.std.com> bzs@world.std.com (Barry Shein) writes:
> Think of fixed length record files and inserting into them, it would
> be nice to be able to just copy/munge the block numbers rather than
> the data.

What's needed is a version of streams for filesystems.  With Multics,
the One True Operating System, you could attach modules (== push
modules) such as vfile_ to provide additional functionality over and
above that which you got from initiate_segment_ and its friends.  What
would be nice with Unix would be ISAM, record mode, whatever modules you
could push on top of the mmap interface.  Once you can map files into
your address space most things can be done on top of that.

ian

brnstnd@kramden.acf.nyu.edu (Dan Bernstein) (02/05/91)

In article <1991Feb04.004933.17253@kithrup.COM> sef@kithrup.COM (Sean Eric Fagan) writes:
> In article <1991Jan30.143326.16676@socs.uts.edu.au> jeremy@socs.uts.edu.au (Jeremy Fitzhardinge) writes:
> >1 - a flink(char *path, int fd) system call/operation.
> This, while not necessarily a bad idea, is not necessarily a *good* idea.
> You are not going to be able to do it for any arbitrary path and
> file descriptor (since you have problems with mount points still, just like
> normal links), and some of the objects don't make a whole lot of sense as
> files.

You're describing exactly the limitations on link(). What's wrong with
that?

Here's one use of flink(): You run ``rmprotect foo bar'', where foo and
bar are important files that you want to make sure you never delete.
rmprotect periodically checks the number of links on foo and bar; if
they ever disappear, it puts them back and sends you mail. The only way
to do this without flink() is to waste some directory space elsewhere
for extra links, and then you don't get the same reliability.
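
The detection half of rmprotect needs nothing new, as this small
sketch shows; fstat() on a descriptor that is held open reports the
current link count.  The helper name links_remaining is invented for
the example.  The recovery half is the part that would need flink().

```c
#include <sys/stat.h>

/* Number of directory entries still naming the file behind fd,
 * or -1 if fstat() fails. A result of 0 means every name has been
 * removed and only open descriptors keep the file alive. */
long links_remaining(int fd)
{
    struct stat st;

    if (fstat(fd, &st) != 0)
        return -1;
    return (long)st.st_nlink;
}
```

A watcher would open each protected file once, poll links_remaining()
on the descriptors, and react when the count reaches zero.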

---Dan

brnstnd@kramden.acf.nyu.edu (Dan Bernstein) (02/05/91)

In article <1991Feb04.045330.779@iecc.cambridge.ma.us> johnl@iecc.cambridge.ma.us (John R. Levine) writes:
> In article <1991Jan30.143326.16676@socs.uts.edu.au> jeremy@socs.uts.edu.au (Jeremy Fitzhardinge) writes:
> >1 - a flink(char *path, int fd) system call/operation.
> Currently, a setuid program can open a file, turn off its setuid-ness, and
> exec some other program which can use the open file but not link, chmod, etc.
> This is pretty pointless for disk files, but can sometimes be useful if the
> file is a device or network file.  Being able to invent a name for any open
> file passed to you introduces possible protection holes.  Ugh.

Yeah, but if you give flink() the obvious link() semantics then there's
no security problem.

> I'd be
> interested to hear what real problems this is intended to fix.

Well, you might create temporary files with fdtemp(), manipulate them
outside the directory tree, and then flink() them in their final state.
(fdtemp() would have to take a directory argument so that it could
assign a filesystem to the file.) This prevents several potential
security holes and makes it easier to synchronize separate applications.

---Dan

tchrist@convex.COM (Tom Christiansen) (02/06/91)

From the keyboard of igb@fulcrum.bt.co.uk (Ian G Batten):
:In article <BZS.91Feb4003139@world.std.com> bzs@world.std.com (Barry Shein) writes:
:> Think of fixed length record files and inserting into them, it would
:> be nice to be able to just copy/munge the block numbers rather than
:> the data.
:
:What's needed is a version of streams for filesystems.  With Multics,
:the One True Operating System, you could attach modules (== push
:modules) such as vfile_ to provide additional functionality over and
:above that which you got from initiate_segment_ and its friends.  What
:would be nice with Unix would be ISAM, record mode, whatever modules you
:could push on top of the mmap interface.  Once you can map files into
:your address space most things can be done on top of that.

I think a good watchdog (file/inode daemon) implementation would allow
that.  See the paper in the proceedings from the next-to-the-last USENIX
in Dallas (about 3 years ago) for a description of the idea and one
implementation.

--tom
--
"Still waiting to read alt.fan.dan-bernstein using DBWM, Dan's own AI
window manager, which argues with you 10 weeks before resizing your window." 
### And now for the question of the month:  How do you spell relief?   Answer:
U=brnstnd@kramden.acf.nyu.edu; echo "/From: $U/h:j" >>~/News/KILL; expire -f $U

thorinn@diku.dk (Lars Henrik Mathiesen) (02/06/91)

richard@locus.com (Richard M. Mathews) writes:
>jeremy@socs.uts.edu.au (Jeremy Fitzhardinge) writes:
>>1 - a flink(char *path, int fd) system call/operation.

>>The only security problem I can think of is
>>this: it would be possible to link a file back into the filesystem into
>>a publicly accessible directory after some time, even if the path to
>>the original file becomes closed.  If this were a real problem, you'd
>>have to use a utility like fuser to see what processes have what files
>>open in the closed off area.

As it is now, you have to use a utility like ncheck to see what hard
links there are lying around.

>Say there is a
>file protected by directory permissions which some setuid/setgid program
>lets you look at under controlled circumstances.  Your program can now
>create a name for the file not protected by the directory and reopen the
>file with more flexibility.  This may be far fetched, but it should be
>considered.

As it is now, if you get an open file descriptor for a file, you can
copy it anywhere you like; if you want to track changes, fstat every
second and re-copy as needed. This would not be a new hole, it would
just be easier to use.

>Combined with problems like only being able to link it to a name in the
>correct file system, I think this idea needs some work.

With the obvious implementation of flink, the sequence
	fd = open("foo", mode);
	flink("bar", fd);
will have _exactly_ the same effect as, and fail in the same cases as,
	fd = open("foo", mode);
	link("foo", "bar");
This includes making a hard link to /dev/tty?? or a FIFO inode if
that's what fd was opened on. (Under Sun licensed NFS, and 4.3 BSD.)

If I could think of something to use it for, I'd add it to the kernel
tonight. (Except we also have Suns, so that'd be one more
incompatibility. Or maybe they'd let us buy source if we said we
wanted to ``experiment with kernel extensions''?)

--
Lars Mathiesen, DIKU, U of Copenhagen, Denmark      [uunet!]mcsun!diku!thorinn
Institute of Datalogy -- we're scientists, not engineers.      thorinn@diku.dk

rbj@uunet.UU.NET (Root Boy Jim) (02/06/91)

In article <1991Jan30.143326.16676@socs.uts.edu.au> jeremy@socs.uts.edu.au (Jeremy Fitzhardinge) writes:
>I have 3 main ideas for change:
>
>1 - a flink(char *path, int fd) system call/operation.

Many people have complained about "security problems".
I don't see any. If you have an fd, you have the data, so you
can copy it to your own file anyway. An flink is just faster.

>2 - insertion/deletion in the middle of a file without copying

I don't like this any better than anyone else, with one exception.
I can see extending f?truncate to trim the beginning. The kernel
would keep a beginning pointer for its own internal use.
Well, the implementation gets a bit tricky, but it could work.

>3 - limited sized files

Hey, pipes already do this! They treat the ten direct block
pointers as a ring buffer. Now the question becomes, what
sizes will be supported, and how do you know where to
start scanning when the ring wraps. Almost certainly you
will be in the middle of a "record" if using variable ones.

Cron jobs to trim log files can lose log entries. You have to
rename the file, then send a signal to any process that keeps
the file open so it can open the new log. Or lock the file
before renaming it.

All of these have been discussed before. I think the consensus
is that each has its appeal, for about five minutes. However,
ideas stimulate us to look for better ways of doing things.
For example, instead of using a log file, a unix domain socket
could be written to. It could write flat files, circular files,
filter entries, update a database, send to a secure machine, whatever.

-- 

	Root Boy Jim Cottrell <rbj@uunet.uu.net>
	I got a head full of ideas
	They're driving me insane

richard@locus.com (Richard M. Mathews) (02/07/91)

brnstnd@kramden.acf.nyu.edu (Dan Bernstein) writes:

>sef@kithrup.COM (Sean Eric Fagan) writes:
>> jeremy@socs.uts.edu.au (Jeremy Fitzhardinge) writes:
>> >1 - a flink(char *path, int fd) system call/operation.
>> This, while not necessarily a bad idea, is not necessarily a *good* idea.
>> You are not going to be able to do it for any arbitrary path and
>> file descriptor (since you have problems with mount points still, just like
>> normal links), and some of the objects don't make a whole lot of sense as
>> files.

>You're describing exactly the limitations on link(). What's wrong with
>that?

With link() you have a pathname to the file, so you have some idea where
you can put the new link().  Presumably with flink(), you are using this
call because you DON'T have a pathname.  Since you don't know where the
file came from, you have to do more work to figure out where it can go.

Now I'll rebut my own statements above.  Perhaps the real reason you are
using flink() is that the file has ZERO links.  You know where it WAS,
so you know where you can put it back as well as you do with link().
Flink() has some real use here.

I agree that flink could be useful, but as I've pointed out elsewhere,
I am slightly worried about its possible use to violate security.  On
the other hand, given the weak security of most Unix systems, this small
chance of opening a hole is nothing.

Richard M. Mathews			 Freedom for Lithuania
richard@locus.com				Laisve!
lcc!richard@seas.ucla.edu
...!{uunet|ucla-se|turnkey}!lcc!richard

richard@locus.com (Richard M. Mathews) (02/07/91)

rbj@uunet.UU.NET (Root Boy Jim) writes:

>Many people have complained about "security problems".
>I don't see any. If you have an fd, you have the data, so you
>can copy it to your own file anyway. An flink is just faster.

The question isn't whether you can write your own copy; it is whether you
can write to the "system's" copy.  Say the "system" has a file with mode
666 which is protected only by directory permissions.  Certain setuid
or setgid programs are supplied which provide controlled access to the
file.  A user supplied program can be invoked with the file open for
read.  Only "system" supplied programs can access the file for write.
With flink(), the user could create a name for the file, reopen it for
write, and screw up the whole world.

("system" here refers not necessarily to the Unix system, but to whomever
or whatever is in charge of some application package)

Richard M. Mathews			D efend
richard@locus.com			 E stonian-Latvian-Lithuanian
lcc!richard@seas.ucla.edu		  I ndependence
...!{uunet|ucla-se|turnkey}!lcc!richard

kandall@sgitokyo.nsg.sgi.com (Michael Kandall) (02/07/91)

In article <G2%&-2#@uzi-9mm.fulcrum.bt.co.uk> igb@fulcrum.bt.co.uk (Ian G Batten) writes:
>In article <BZS.91Feb4003139@world.std.com> bzs@world.std.com (Barry Shein) writes:
>> Think of fixed length record files and inserting into them, it would
>> be nice to be able to just copy/munge the block numbers rather than
>> the data.
>
>What's needed is a version of streams for filesystems.  With Multics,
>the One True Operating System, you could attach modules (== push
>modules) such as vfile_ to provide additional functionality over and
>above that which you got from initiate_segment_ and its friends.  What
>would be nice with Unix would be ISAM, record mode, whatever modules you
>could push on top of the mmap interface.  Once you can map files into
>your address space most things can be done on top of that.
>
>ian

I believe SVR4 has this.  In SVR4's enhanced STREAMS, I believe you can
push STREAMS onto arbitrary file descriptors.

-- 
----
Michael Kandall
Independent Consultant
Nihon Silicon Graphics

brnstnd@kramden.acf.nyu.edu (Dan Bernstein) (02/07/91)

In article <richard.665896876@fafnir.la.locus.com> richard@locus.com (Richard M. Mathews) writes:
  [ foo is mode 700 root, foo/bar is mode 666 root, some setuid program ]
  [ opens foo/bar for reading and passes the descriptor to user code ]
> With flink(), the user could create a name for the file, reopen it for
> write, and screw up the whole world.

Nah. flink() would only work if you have the file open for writing. End
of security problems. You say this is a limitation? Well---

(The *right* way to do this is to have an entirely separate bit: O_LINK,
perhaps. The privileged program here would just make sure to leave
O_LINK out of the open. See the O_NONE discussion that crops up now and
then: people have proposed good uses for a few other bits.)

---it did occur to you that under the current system, you'd need either
read or write access to open the descriptor for flink() in the first
place. Didn't it? Until there's something like O_NONE to open files for
operations without I/O, this part of the system will never be perfectly
clean. The simplest solution is to make O_LINK synonymous with O_WRONLY.
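
The "open for writing" test is already something the system can
answer about any descriptor.  A small sketch, assuming the POSIX
fcntl(F_GETFL) interface; the function name fd_open_for_write is
made up, and the check is shown in user space only to illustrate the
rule a kernel flink() would apply.

```c
#include <fcntl.h>

/* Returns 1 if fd was opened O_WRONLY or O_RDWR, 0 if it was opened
 * read-only, and -1 if fcntl() fails (for example, a bad descriptor). */
int fd_open_for_write(int fd)
{
    int flags = fcntl(fd, F_GETFL);

    if (flags == -1)
        return -1;
    int acc = flags & O_ACCMODE;     /* isolate the access-mode bits */
    return acc == O_WRONLY || acc == O_RDWR;
}
```

Under this rule the read-only descriptor handed down by the setuid
program in the example would fail the test, so the link would be
refused.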

---Dan

chip@tct.uucp (Chip Salzenberg) (02/08/91)

According to thorinn@diku.dk (Lars Henrik Mathiesen):
>jeremy@socs.uts.edu.au (Jeremy Fitzhardinge) writes:
>>1 - a flink(char *path, int fd) system call/operation.
>
>If I could think of something to use it for, I'd add it to the kernel
>tonight.

It's a convenient way to create lock files -- if, that is, the kernel
also supports fdcreat(), which creates a plain file with no links.

Also, the obvious companion fdunlink(int fd, char *path) is something
I've always wanted.  It unlinks the given path if and only if it is a
name for fd.  With fdunlink(), the UUCP style of lock files can be
used safely and reliably, since the normal race condition -- "how do I
know that the lock I'm removing is the stale one" -- disappears.
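
For comparison, here is the closest user-space approximation, as a
sketch with an invented name (fdunlink_approx).  It compares device
and inode numbers and then unlinks, which is exactly the check-then-act
sequence that leaves the race open; the point of a kernel fdunlink()
would be to do both steps atomically.

```c
#include <sys/stat.h>
#include <unistd.h>

/* Unlink `path` only if it currently names the same file as `fd`.
 * Returns 0 on success, -1 otherwise.
 * NOT atomic: another process can replace `path` between the lstat()
 * and the unlink(). */
int fdunlink_approx(int fd, const char *path)
{
    struct stat by_fd, by_name;

    if (fstat(fd, &by_fd) != 0 || lstat(path, &by_name) != 0)
        return -1;
    if (by_fd.st_dev != by_name.st_dev || by_fd.st_ino != by_name.st_ino)
        return -1;                   /* path names some other file */
    return unlink(path);
}
```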
-- 
Chip Salzenberg at Teltronics/TCT     <chip@tct.uucp>, <uunet!pdn!tct!chip>
 "Most of my code is written by myself.  That is why so little gets done."
                 -- Herman "HLLs will never fly" Rubin

dhesi%cirrusl@oliveb.ATC.olivetti.com (Rahul Dhesi) (02/08/91)

In <20190:Feb712:13:4391@kramden.acf.nyu.edu>
brnstnd@kramden.acf.nyu.edu (Dan Bernstein) writes:

>Nah. flink() would only work if you have the file open for writing.

Well, writing but not O_APPEND.
--
Rahul Dhesi <dhesi%cirrusl@oliveb.ATC.olivetti.com>
UUCP:  oliveb!cirrusl!dhesi

barmar@think.com (Barry Margolin) (02/08/91)

In article <richard.665896415@fafnir.la.locus.com> richard@locus.com (Richard M. Mathews) writes:
>With link() you have a pathname to the file, so you have some idea where
>you can put the new link().  Presumably with flink(), you are using this
>call because you DON'T have a pathname.  Since you don't know where the
>file came from, you have to do more work to figure out where it can go.

What do pathnames have to do with "where you can put the new link"?  You
can put the new link anywhere on the same file system as the file.  Due to
symbolic links, it is not possible to determine whether two files are on
the same file system simply by looking at the pathnames.

Presumably one of the first things that the link() system call does is
translate the old pathname to a device and inode (or vnode or whatever is
appropriate for the file system).  It then does the same thing for the
directory portion of the new pathname, and compares the device portion.
The device/inode information is presumably stored in the file table entry
that the file descriptor references, so the differences are trivial.  The
kernel code would presumably be structured something like:

link(const char *old_path, const char *new_path)
{
	file_info = namei(old_path);		/* path -> device/inode */
	file_link(file_info, new_path);
}

flink(unsigned int fd, const char *new_path)
{
	file_info = lookup_fd_info(fd);		/* fd -> device/inode */
	if (file_info.device_type != file)
	    link_wrong_type_attempt();	/* can't flink a pipe, etc. */
	else
	    file_link(file_info, new_path);
}

file_link(FILE_INFO fi, const char *new_path)
{
	device = get_pathname_device(new_path);
	if (device != fi.device)
	    cross_device_link_attempt();
	else
	    create_file(fi, new_path);	/* enter the new name for this inode */
}
--
Barry Margolin, Thinking Machines Corp.

barmar@think.com
{uunet,harvard}!think!barmar

bzs@world.std.com (Barry Shein) (02/08/91)

>The question isn't whether you can write your own copy; it is whether you
>can write to the "system's" copy.  Say the "system" has a file with mode
>666 which is protected only by directory permissions.  Certain setuid
>or setgid programs are supplied which provide controlled access to the
>file.  A user supplied program can be invoked with the file open for
>read.  Only "system" supplied programs can access the file for write.
>With flink(), the user could create a name for the file, reopen it for
>write, and screw up the whole world.

Since all flink() would do is enter a string/i-num pair into a
directory I can't see how any of this applies.

I was trying to think of some trick along the lines of a setuid
program which opens a protected file and then execs a non-priv process
handing down only the open fd, some software does this sort of thing.

Inetd is analogous to this, as an example, since it takes privilege to
bind() a low-numbered port for accepts() but the processes it execs
need not be priv'd in any way (I realize these are sockets, not plain
files, but just in case anyone thought this sort of thing I am
describing is unlikely...)

But if it can be flink()'d at all then we assume you could seek to
zero and copy all the data out of the file to your own file anyhow, so
that's not a new opportunity. And whether you can read or write is
dictated by the setting of the inode and how the original fd was
opened which is independent of flink() entirely.

----------

Hmm, it would also increase the link count of the file. I suppose that
could be a weak security problem. It also would change the change date
in the inode, even if the file and/or directory were otherwise
inaccessible for any modification by other means.

So I suppose someone could use this on a read-only fd handed down from
a priv'd process to maliciously force the file to appear to need
a back-up.
-- 
        -Barry Shein

Software Tool & Die    | bzs@world.std.com          | uunet!world!bzs
Purveyors to the Trade | Voice: 617-739-0202        | Login: 617-739-WRLD

lupienj@hpwadac.hp.com (John Lupien) (02/09/91)

In article <422@bria> uunet!bria!mike (Michael Stefanik) writes:
>In an article, socs.uts.edu.au!jeremy (Jeremy Fitzhardinge) writes:
>>3 - limited sized files
>Although this has some merit, I would much prefer to have cron fire
>up a script that simply trims down my growing log files, rather than
>burden the kernel with the job.

I quite agree that this does not need to be done in the kernel.
The original posting was talking about a kind of "fifo" file with
the data falling off a cliff when it gets to the end. This could
be done quite nicely at the user level with a file-based ring buffer.

---
John R. Lupien
lupienj@hpwarq.hp.com

richard@locus.com (Richard M. Mathews) (02/09/91)

bzs@world.std.com (Barry Shein) writes:
>Since all flink() would do is enter a string/i-num pair into a
>directory I can't see how any of this applies.
>....
>But if it can be flink()'d at all then we assume you could seek to
>zero and copy all the data out of the file to your own file anyhow, so
>that's not a new opportunity. And whether you can read or write is
>dictated by the setting of the inode and how the original fd was
>opened which is independent of flink() entirely.

Whether you can read or write is dictated by the mode setting of the inode
and the effective uid/gid AND THE MODES OF THE DIRECTORIES YOU MUST PASS
THROUGH TO GET TO THE FILE.  All of this is determined AT THE TIME OF
AN OPEN, and never again.  By allowing creation of a link, you create an
opportunity to do an open call which would otherwise have been prevented
by the directory permissions.  For example, say directory /user/joe/foo
is mode 700 and file /user/joe/foo/bar is mode 666.  Despite the mode
of the file, I can't open it.  If, however, a setuid-joe program lets
me run a program I wrote while it has /user/joe/foo/bar open for read,
then I can flink the file to /user/richard/gotcha which I can then open
for write.
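
The "checked at open time, and never again" behaviour can be seen with
a small experiment, sketched here under the assumption of an ordinary
POSIX system and a non-root caller (root bypasses directory modes). The
function name and the scratch directory argument are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Create dir/bar, open it, then take all permissions off dir.
 * *reopen_err gets errno from a second open() by path (0 if that open
 * succeeded).  Returns the number of bytes read through the descriptor
 * obtained before the directory was closed off, or -1 on setup error. */
ssize_t open_then_close_off(const char *dir, int *reopen_err)
{
    char path[256], buf[16];
    ssize_t n = -1;
    int fd, fd2;

    assert(dir != NULL && reopen_err != NULL);
    snprintf(path, sizeof path, "%s/bar", dir);
    if (mkdir(dir, 0700) < 0)
        return -1;

    fd = open(path, O_RDWR | O_CREAT, 0666);
    if (fd >= 0 && write(fd, "abc", 3) == 3 &&
        lseek(fd, 0, SEEK_SET) == 0 && chmod(dir, 0) == 0) {
        fd2 = open(path, O_RDONLY);     /* path walk: modes are checked */
        *reopen_err = (fd2 < 0) ? errno : 0;
        if (fd2 >= 0)
            close(fd2);
        n = read(fd, buf, sizeof buf);  /* no path involved, no check */
    }

    if (fd >= 0)
        close(fd);
    chmod(dir, 0700);                   /* restore access for cleanup */
    unlink(path);
    rmdir(dir);
    return n;
}
```

For an unprivileged caller the second open() is expected to fail with
EACCES while the read on the old descriptor still succeeds. A flink()
on that descriptor would create a fresh path that skips the directory
check entirely, which is the hole described above.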

Adding a restriction that you can do flink only if the file is open
for write is an interesting idea.

Richard M. Mathews			 Freedom for Lithuania
richard@locus.com				Laisve!
lcc!richard@seas.ucla.edu
...!{uunet|ucla-se|turnkey}!lcc!richard

richard@locus.com (Richard M. Mathews) (02/09/91)

barmar@think.com (Barry Margolin) writes:

>In article <richard.665896415@fafnir.la.locus.com> richard@locus.com (Richard M. Mathews) writes:
>>With link() you have a pathname to the file, so you have some idea where
>>you can put the new link().  Presumably with flink(), you are using this
>>call because you DON'T have a pathname.  Since you don't know where the
>>file came from, you have to do more work to figure out where it can go.

>What do pathnames have to do with "where you can put the new link"?  You
>can put the new link anywhere on the same file system as the file.  Due to
>symbolic links, it is not possible to determine whether two files are on
>the same file system simply by looking at the pathnames.

>Presumably one of the first things that the link() system call does is

I get the feeling that you thought I meant that the flink system call
wouldn't be able to figure out where the link is allowed to go.  That
isn't what I meant at all.  The "you" in my quote above refers to
application programs which try to make general use of flink on a random
file descriptor.  I meant that a possible application of flink would NOT
include a general purpose program which would flink its stdin to some file
to allow it to reopen the file in a different mode (e.g., "more" wants to
read stderr (at least it did at one time), so if it gets a stderr which
is open for write-only, it might want an opportunity to reopen the file
(terminal) for read-write).  This would not be a practical use of flink
because of the single file system restriction of link and flink.

On the other hand, a program which knows where a file is because it
created it (and thus it also knows that the original path name was not
through a symbolic link) can make good use of flink to put back a file
even after the link count goes to zero.

Sorry if I wasn't clear before.

Richard M. Mathews			D efend
richard@locus.com			 E stonian-Latvian-Lithuanian
lcc!richard@seas.ucla.edu		  I ndependence
...!{uunet|ucla-se|turnkey}!lcc!richard

xtdn@levels.sait.edu.au (02/09/91)

bzs@world.std.com (Barry Shein) writes:
> But if it can be flink()'d at all then we assume you could seek to
> zero and copy all the data out of the file to your own file anyhow, so
> that's not a new opportunity. And whether you can read or write is
> dictated by the setting of the inode and how the original fd was
> opened which is independent of flink() entirely.

Copying a file now gives access to the current contents later.  Flinking
a file now could give access to the future contents of that file.  This
may not be desirable.  It is, however, unlikely to be a major problem to
the aware programmer: an appropriate file mode solves all!

I do like the idea of flink().  The suggestion (I forget whose it was)
of fdunlink() seems appropriate, too; it sounds quite balanced.

David Newall, who no longer works       Phone:  +61 8 344 2008
for SA Institute of Technology          E-mail: xtdn@lux.sait.edu.au
                "Life is uncertain:  Eat dessert first"

bzs@world.std.com (Barry Shein) (02/10/91)

From: xtdn@levels.sait.edu.au
>bzs@world.std.com (Barry Shein) writes:
>> But if it can be flink()'d at all then we assume you could seek to
>> zero and copy all the data out of the file to your own file anyhow, so
>> that's not a new opportunity. And whether you can read or write is
>> dictated by the setting of the inode and how the original fd was
>> opened which is independent of flink() entirely.
>
>Copying a file now gives access to the current contents later.  Flinking
>a file now could give access to the future contents of that file.  This
>may not be desirable.  It is, however, unlikely to be a major problem to
>the aware programmer: an appropriate file mode solves all!

Ok, granted, if the path to the original file was not accessible and
was opened by a priv'd program and the fd handed down (e.g. thru an
open(), setuid() and then an exec()), then an flink() would bypass the
original directory protection. Thus, an accessible file in an
inaccessible directory would become accessible. If it were a changing
file it could then be opened any time later, even long after the
program exited, w/o need for the original priv. Whew.

So you're right, that *is* a potential problem. And just subtle enough
that it might bite someone in a big way.

Just thought I'd lay that out before we got a hundred "huh?" messages.

I'd say that pretty much condemns flink() as an idea unless someone
can think of a way around that. I can't think of anything that's not
awful.
-- 
        -Barry Shein

Software Tool & Die    | bzs@world.std.com          | uunet!world!bzs
Purveyors to the Trade | Voice: 617-739-0202        | Login: 617-739-WRLD

brnstnd@kramden.acf.nyu.edu (Dan Bernstein) (02/10/91)

In article <27B18AD8.2F15@tct.uucp> chip@tct.uucp (Chip Salzenberg) writes:
  [ on flink() ]
> It's a convenient way to create lock files -- if, that is, the kernel
> also supports fdcreat(), which creates a plain file with no links.

A few months ago I briefly discussed with Keith Bostic the top three
calls on my BSD-extensions list: fdlink(), fdtemp(), and fdunlink(). The
first was the same as flink(); the second was the same as fdcreat(); the
third was the same as unlink(), but returned a file descriptor pointing
to the removed file. He didn't believe that my examples of race
conditions were problems in practice.

> Also, the obvious companion fdunlink(int fd, char *path) is something
> I've always wanted.  It unlinks the given path if and only if it is a
> name for fd.

Hmmm. I already have fdunldilink() listed; it only removes a file if it
has a specified number of hard links, device, and inode, with 0 for no
restriction. I think fdunldilink(0,st.st_dev,st.st_ino,path) would do
the trick after an fstat(fd,&st).
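
For comparison, here is a user-level approximation of what such a call
would check, assuming the argument order given above and 0 as the "no
restriction" value. The name fdunldilink_sketch is hypothetical. The
sketch is deliberately NOT atomic: the name can be re-pointed between
the lstat() and the unlink(), and closing that window is the reason to
want the operation in the kernel.

```c
#include <assert.h>
#include <errno.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Unlink path only if it currently has the given link count, device
 * and inode; 0 for any of the three means "don't check this one".
 * Returns 0 on success, -1 on error or mismatch.
 * RACE: another process may replace path between lstat() and unlink(). */
int fdunldilink_sketch(nlink_t nlink, dev_t dev, ino_t ino, const char *path)
{
    struct stat st;

    assert(path != NULL);
    if (lstat(path, &st) < 0)
        return -1;
    if ((nlink != 0 && st.st_nlink != nlink) ||
        (dev != 0 && st.st_dev != dev) ||
        (ino != 0 && st.st_ino != ino)) {
        errno = EINVAL;     /* arbitrary choice for "does not match" */
        return -1;
    }
    return unlink(path);
}
```

After fstat(fd,&st), calling
fdunldilink_sketch(0, st.st_dev, st.st_ino, path) removes path only if
it still names the file behind fd, subject to the race noted above.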

---Dan

jim@segue.segue.com (Jim Balter) (02/11/91)

In article <BZS.91Feb7163251@world.std.com> bzs@world.std.com (Barry Shein) writes:
>Since all flink() would do is enter a string/i-num pair into a
>directory I can't see how any of this applies.

You mean, since all flink would do is let you create an accessible path to a
file that you didn't previously have an accessible path to, you can't see how
any of this applies?  Strange.

"Since all setuid(0) does is clear a few bits somewhere, I can't see how a
discussion of the consequences of letting anyone do so applies."

guy@auspex.auspex.com (Guy Harris) (02/12/91)

In article <1991Feb7.064348.1873@sgitokyo.nsg.sgi.com> kandall@sgitokyo.nsg.sgi.com (Michael Kandall) writes:
>In article <G2%&-2#@uzi-9mm.fulcrum.bt.co.uk> igb@fulcrum.bt.co.uk (Ian G Batten) writes:
>>In article <BZS.91Feb4003139@world.std.com> bzs@world.std.com (Barry Shein) writes:
>>> Think of fixed length record files and inserting into them, it would
>>> be nice to be able to just copy/munge the block numbers rather than
>>> the data.
>>
>>What's needed is a version of streams for filesystems.  With Multics,
>>the One True Operating System, you could attach modules (== push
>>modules) such as vfile_ to provide additional functionality over and
>>above that which you got from initiate_segment_ and its friends.  What
>>would be nice with Unix would be ISAM, record mode, whatever modules you
>>could push on top of the mmap interface.  Once you can map files into
>>your address space most things can be done on top of that.
>>
>>ian
>
>I believe SVR4 has this.  In SVR4's enhanced STREAMS, I believe you can
>push STREAMS onto arbitrary file descriptors.

You believe incorrectly.  What S5R4 *does* have is the ability to attach
a STREAMS-device file descriptor to a "node in the file system name
space", using "fattach()".  This does *NOT* magically turn a regular
file into a STREAMS device; it turns it into a name for a STREAMS
device.  I.e., anybody who opens the file after you've "fattach()"ed
something to it will *NOT* get a file descriptor that reads from or
writes to the underlying file; the underlying file merely provides a
*name* for the stream.

What Ian was describing sounds more like the stuff Apollo did - with the
name "Extensible Streams", but where "Streams" has nothing to do with
"streams" in the Research UNIX sense or "STREAMS" in the S5 sense.  The
low-level means of accessing a file is by mapping into a process's
address space; atop that is built a mechanism for more "conventional"
file access, with each file having an "object type UID".  The "object
type UID" indicates what code acts as a "type manager" for the file;
that "type manager" code implements operations such as "open", "read",
"write", etc.  Dunno if they were "stackable" like (streams|STREAMS)
modules.  (Then again, I don't remember whether modules were stackable
in any of Multics's I/O subsystems, either "ios_" or "iox_".)
I think the type managers all lived in user-mode code.

That doesn't necessarily give you the stuff Barry was referring to; the
"containers" provided by the (probably kernel-level) file system are
arrays of pages, similar to UNIX files.  Unless there was an interface
to that file system that let you insert pages into the middle of a
container, you wouldn't be able to do an insert like that.

richard@locus.com (Richard M. Mathews) (02/12/91)

dhesi%cirrusl@oliveb.ATC.olivetti.com (Rahul Dhesi) writes:
>brnstnd@kramden.acf.nyu.edu (Dan Bernstein) writes:
>>Nah. flink() would only work if you have the file open for writing.
>Well, writing but not O_APPEND.

I don't think an O_APPEND check would be necessary.  Since fcntl() can
be used to change the O_APPEND flag, anything which depends on it for
security would already be broken (unless you have a system which has
O_APPEND but doesn't have fcntl(F_SETFL)).
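
A short sketch of the fcntl() point, using only standard POSIX calls;
the helper name clear_append is made up for the example. Any holder of
the descriptor can do this, so an O_APPEND test inside flink() would
not add protection on systems that have F_SETFL.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Turn off O_APPEND on an already-open descriptor.
 * Returns the resulting status flags, or -1 on error. */
int clear_append(int fd)
{
    int flags;

    assert(fd >= 0);
    flags = fcntl(fd, F_GETFL);
    if (flags < 0)
        return -1;
    if (fcntl(fd, F_SETFL, flags & ~O_APPEND) < 0)
        return -1;
    return fcntl(fd, F_GETFL);          /* re-read to confirm */
}
```

Once the flag is cleared, an lseek() followed by write() on the same
descriptor can overwrite earlier data instead of appending.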

Richard M. Mathews			 Freedom for Lithuania
richard@locus.com				Laisve!
lcc!richard@seas.ucla.edu
...!{uunet|ucla-se|turnkey}!lcc!richard

chip@tct.uucp (Chip Salzenberg) (02/14/91)

According to bzs@world.std.com (Barry Shein):
>Ok, granted, if the path to the original file was not accessible and
>was opened by a priv'd program and the fd handed down (e.g. thru an
>open(), setuid() and then an exec()), then an flink() would bypass the
>original directory protection.

What if flink() were permitted only on file descriptors open for
O_RDWR without O_APPEND?  After all, if you have a file descriptor
meeting that description, there's almost nothing bad you can do with
the file that you couldn't do with the file descriptor, slower.
-- 
Chip Salzenberg at Teltronics/TCT     <chip@tct.uucp>, <uunet!pdn!tct!chip>
 "I want to mention that my opinions whether real or not are MY opinions."
             -- the inevitable William "Billy" Steinmetz

chip@tct.uucp (Chip Salzenberg) (02/14/91)

According to brnstnd@kramden.acf.nyu.edu (Dan Bernstein):
>I already have fdunldilink() listed; it only removes a file if it
>has a specified number of hard links, device, and inode, with 0 for
>no restriction.  I think fdunldilink(0,st.st_dev,st.st_ino,path)
>would do the trick after an fstat(fd,&st).

Yes; fdunldilink() [what a name!] can simulate my fdunlink(), but not
vice versa; so fdunldilink() is the better choice.  I would suggest -1
for the "don't care" value, though, since st_dev could easily be zero.

It's too bad that Keith doesn't see the need for these operations.
But then, adding features to BSD never made my life any easier.  :-)
-- 
Chip Salzenberg at Teltronics/TCT     <chip@tct.uucp>, <uunet!pdn!tct!chip>
 "I want to mention that my opinions whether real or not are MY opinions."
             -- the inevitable William "Billy" Steinmetz

gsteckel@vergil.East.Sun.COM (Geoff Steckel - Sun BOS Hardware CONTRACTOR) (02/14/91)

In article <BZS.91Feb7163251@world.std.com> bzs@world.std.com (Barry Shein) writes:
>Since all flink() would do is enter a string/i-num pair into a
>directory I can't see how any of this applies.

Ummm... pipes USED to be implemented as `nameless' files on PIPEDEV, with
some strange semantics to make the read/write pointers wrap around at 10
blocks (or whatever size pipes were).  You wanna link one of THOSE into
the file system?  (:-)
	regards,
	geoff steckel (gwes@wjh12.harvard.EDU)
			(...!husc6!wjh12!omnivore!gws)
Disclaimer: I am not affiliated with Sun Microsystems, despite the From: line.
This posting is entirely the author's responsibility.

bzs@world.std.com (Barry Shein) (02/14/91)

>What if flink() were permitted only on file descriptors open for
>O_RDWR without O_APPEND?  After all, if you have a file descriptor
>meeting that description, there's almost nothing bad you can do with
>the file that you couldn't do with the file descriptor, slower.

Except now you can come back later, say a week later, and re-open the
file (assuming the file protexns were ok), without the setuid program.
So there would be no point at which an applications writer or admin
could be sure that no one could open that file (w/o checking links and
searching the file system.) As a more concrete example, let's say the
program which let you in used a list of valid users, and you were
removed from that list. If you flink()'d it, you could still get at
it.

Another way of putting it is, if we allow flink(), how do we ever have
a file which cannot be flink()'d?  You'd need to invent a new bit for
open() I guess. I suppose if you had such a bit one could argue that
it's up to the application writer to set it (or unset it) if s/he
cares. Perhaps it should be off (disallowed) by default.
-- 
        -Barry Shein

Software Tool & Die    | bzs@world.std.com          | uunet!world!bzs
Purveyors to the Trade | Voice: 617-739-0202        | Login: 617-739-WRLD

xtdn@levels.sait.edu.au (02/15/91)

chip@tct.uucp (Chip Salzenberg) writes:
> What if flink() were permitted only on file descriptors open for
> O_RDWR without O_APPEND?  After all, if you have a file descriptor
> meeting that description, there's almost nothing bad you can do with
> the file that you couldn't do with the file descriptor, slower.

I think we've already discussed that, in time, the file contents could
be changed: Just because we're allowed to read the file now doesn't
mean that we should be allowed to read the future contents.

Nevertheless I like the idea of flink() and I don't see much benefit
in discussions along the lines of "only allow flink() on fd's open for
O_xxx".  Sorry to stifle this thread, people, but I believe that we can
say little more than: flink() does have some security implications that
one must be aware of, but being aware of them probably goes most of the
way to circumventing those problems.

Enough said?


David Newall, who no longer works       Phone:  +61 8 344 2008
for SA Institute of Technology          E-mail: xtdn@lux.sait.edu.au
                "Life is uncertain:  Eat dessert first"

chip@tct.uucp (Chip Salzenberg) (02/16/91)

According to bzs@world.std.com (Barry Shein):
>>What if flink() were permitted only on file descriptors open for
>>O_RDWR without O_APPEND?
>
>Except now you can come back later, say a week later, and re-open the
>file (assuming the file protexns were ok), without the setuid program.

Well, okay.

Idea #2: Allow flink() only if you are the owner of the file or the
superuser.
-- 
Chip Salzenberg at Teltronics/TCT     <chip@tct.uucp>, <uunet!pdn!tct!chip>
 "I want to mention that my opinions whether real or not are MY opinions."
             -- the inevitable William "Billy" Steinmetz

xtdn@levels.sait.edu.au (02/16/91)

bzs@world.std.com (Barry Shein) writes:
> Except now you can come back later, say a week later, and re-open the
> file (assuming the file protexns were ok), without the setuid program.

But Barry, you have put your finger on a very salient point; which is
that one can always protect the file thus disallowing it from being
opened later.  I don't see that flink() would cause any major security
problems.

Just to put this into context: as things stand one could leave the
recipient process running for a week and then read the file.  Really,
excepting that processes can be killed and that machines do sometimes
go down, flink() would not allow any access that one cannot now obtain.


David Newall, who no longer works       Phone:  +61 8 344 2008
for SA Institute of Technology          E-mail: xtdn@lux.sait.edu.au
                "Life is uncertain:  Eat dessert first"

bzs@world.std.com (Barry Shein) (02/17/91)

>>Except now you can come back later, say a week later, and re-open the
>>file (assuming the file protexns were ok), without the setuid program.
>
>Well, okay.
>
>Idea #2: Allow flink() only if you are the owner of the file or the
>superuser.

That still bypasses directory protections. I suppose one would be
hard-pressed to come up with an example of why there would be a file
which was owned by you but otherwise not accessible, but I don't like
to consider that sort of argument, it only belies the limits of
imagination.

However, if it were "only the superuser" it would be possible to pass
fd's around to setuid programs and let them flink() them.

Then it would be up to the setuid application writer to figure out
what rules should be imposed, which sounds about right. The potential
security pitfalls could be explained in a short paragraph in the
manual page, at least in the abstract. The only problem is that the
pitfalls are very hard to work around (e.g. does this fd point at a
file in an otherwise inaccessible path???), so the result would
probably be to not do it on behalf of a non-priv'd program. But it
might be of some direct use to priv'd programs. Particularly in
combination with that BSD feature of passing fd's around on sockets.

(oh boy, now 100 people are gonna say, huh??? look into the access
rights stuff in send(2)/recv(2) on a BSD system.)
-- 
        -Barry Shein

Software Tool & Die    | bzs@world.std.com          | uunet!world!bzs
Purveyors to the Trade | Voice: 617-739-0202        | Login: 617-739-WRLD

brnstnd@kramden.acf.nyu.edu (Dan Bernstein) (02/17/91)

In article <BZS.91Feb16112615@world.std.com> bzs@world.std.com (Barry Shein) writes:
> Particularly in
> combination with that BSD feature of passing fd's around on sockets.
> (oh boy, now 100 people are gonna say, huh??? look into the access
> rights stuff in send(2)/recv(2) on a BSD system.)

More precisely, that BSD 4.2-and-later-but-generally-buggy-before-4.3
feature of file descriptor passing on UNIX-domain sockets, usable with
sendmsg(2) and recvmsg(2). Working example: pty's reconnect feature.
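
For readers who have not seen it, a sketch of descriptor passing over a
UNIX-domain socket. This uses the later POSIX-style interface
(msg_control with an SCM_RIGHTS control message); as far as I know the
4.2/4.3BSD systems under discussion spelled this with msg_accrights
instead, so treat the details as an assumption about a modern system.
The names send_fd and recv_fd are hypothetical.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Control buffer sized and aligned for exactly one descriptor. */
union fd_ctl {
    struct cmsghdr hdr;
    char buf[CMSG_SPACE(sizeof(int))];
};

/* Send descriptor fd over UNIX-domain socket sock.  Returns 0 or -1. */
int send_fd(int sock, int fd)
{
    char byte = 'F';                    /* one data byte carries the message */
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union fd_ctl ctl;
    struct msghdr msg;
    struct cmsghdr *cm;

    assert(fd >= 0);
    memset(&msg, 0, sizeof msg);
    memset(&ctl, 0, sizeof ctl);
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctl.buf;
    msg.msg_controllen = sizeof ctl.buf;

    cm = CMSG_FIRSTHDR(&msg);
    cm->cmsg_level = SOL_SOCKET;
    cm->cmsg_type = SCM_RIGHTS;         /* "access rights" = descriptors */
    cm->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cm), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one descriptor from sock.  Returns the new fd, or -1. */
int recv_fd(int sock)
{
    char byte;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union fd_ctl ctl;
    struct msghdr msg;
    struct cmsghdr *cm;
    int fd;

    memset(&msg, 0, sizeof msg);
    memset(&ctl, 0, sizeof ctl);
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctl.buf;
    msg.msg_controllen = sizeof ctl.buf;

    if (recvmsg(sock, &msg, 0) != 1)
        return -1;
    cm = CMSG_FIRSTHDR(&msg);
    if (cm == NULL || cm->cmsg_level != SOL_SOCKET ||
        cm->cmsg_type != SCM_RIGHTS)
        return -1;
    memcpy(&fd, CMSG_DATA(cm), sizeof(int));
    return fd;                          /* new number, same open file */
}
```

The receiver ends up with its own descriptor for the same open file,
with whatever access mode the sender's open() obtained. That is the
"handed-down fd" situation the flink() security arguments in this
thread are about.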

---Dan

brnstnd@kramden.acf.nyu.edu (Dan Bernstein) (02/17/91)

In article <27BC2E07.673D@tct.uucp> chip@tct.uucp (Chip Salzenberg) writes:
> Idea #2: Allow flink() only if you are the owner of the file or the
> superuser.

Idea #2': Allow link() only if you are the owner of the file or the
superuser.

Of course, these are both subsumed by a link protection bit (and O_LINK,
O_EXEC, O_NONE, etc. bits on open()).

---Dan

jr@oglvee.UUCP (Jim Rosenberg) (02/21/91)

In <BZS.91Feb16112615@world.std.com> bzs@world.std.com (Barry Shein) writes:
>I suppose one would be
>hard-pressed to come up with an example of why there would be a file
>which was owned by you but otherwise not accessible, but I don't like
>to consider that sort of argument, it only belies the limits of
>imagination.

If I remember it correctly, the BRL spooler, MDQS, *routinely* sets up files
for which the owner has no access by virtue of lack of a directory permission.
MDQS protects its spool directory by a "lock" directory.  You have to have
"spooler permissions" to traverse this directory.  But having done that, its
actual spool files have the uid and gid of the submitting user.  [Aside:  MDQS
is a *nice* spooler!  It's amazing to me that it hasn't joined elm and smail
et al among the cast of characters of PD software packages that replace the
respective "standard" package that comes with the operating system.  MDQS
surely beats the socks off the usual System V spooler.]
-- 
Jim Rosenberg             #include <disclaimer.h>      --cgh!amanue!oglvee!jr
Oglevee Computer Systems                                        /      /
151 Oglevee Lane, Connellsville, PA 15425                    pitt!  ditka!
INTERNET:  cgh!amanue!oglvee!jr@dsi.com                      /      /