[comp.unix.wizards] Directory compression

m5@lynx.uucp (Mike McNally) (08/16/90)

Does 4.3BSD (or anything else) dynamically compress directories when
entries are removed?  If so, how does the system deal with
consistency?  Like if I'm reading while something else is unlinking and
the directory is being compressed, how does the world go on living?
-- 
Mike McNally                                    Lynx Real-Time Systems
uucp: {voder,daver}!lynx!m5                     phone: 408 370 2233

            Where equal mind and contest equal, go.

bzs@world.std.com (Barry Shein) (08/19/90)

>Does 4.3BSD (or anything else) dynamically compress directories when
>entries are removed?  If so, how does the system deal with
>consistency?  Like if I'm reading while something else is unlinking and
>the directory is being compressed, how does the world go on living?
>-- 
>Mike McNally                                    Lynx Real-Time Systems

I believe all that is done is the freeing of (completely) emptied
blocks, so there's no problem. There's really no need, that I can see,
to do any further squeezing (I suppose you could try to smash now-small
dirs into frags, but that's not really worthwhile.)
-- 
        -Barry Shein

Software Tool & Die    | {xylogics,uunet}!world!bzs | bzs@world.std.com
Purveyors to the Trade | Voice: 617-739-0202        | Login: 617-739-WRLD

chris@mimsy.umd.edu (Chris Torek) (08/19/90)

In article <7913@lynx.UUCP> m5@lynx.uucp (Mike McNally) writes:
>Does 4.3BSD (or anything else) dynamically compress directories when
>entries are removed?  If so, how does the system deal with
>consistency?

I cannot speak for `anything else', but the answer for 4BSD is `no'.
Recent versions of 4BSD will reduce the sizes of directories, but not
when files are removed: only when files are created.  (There is another
answer to the above question, which appears below.)

The code goes something like this (the order is a bit different):

	if (this operation is not CREATE or RENAME)
		we can use the name cache;
	else {
		we have to scan the entire directory anyway;
		we need a slot with enough space for this name;
	}
	if (we can use the cache && we should use the cache) {
		if (cache lookup yields valid result)
			return result;
		remove invalid entry, if any, from cache;
	}
	enduseful = 0;
	for (offset = 0; offset < dirsize; offset += cur_record_len) {
		get current record stuff;
		if (entry is valid) {	/* found a nonempty slot */
			if (this entry matches) {
				if (we can use the cache)
					add it to the cache;
				return it;
			}
			enduseful = offset + cur_record_len;
		} else {		/* found an empty slot */
			if (this slot is big enough && we need a slot)
				remember this slot;
		}
	}
	/* entry not found */
	if (we can use the cache) {
		/* 4.3BSD-Reno only: */
		add a `known not present' entry to the cache;
	}
	save slot information, if any;
	return (NOTFOUND);

Later, if we decide to write to the directory:

	if (enduseful < dirsize) {
		/* directory can be shrunk */
		itrunc(inode for directory, enduseful);
	}

The logic is this:  if we are creating a new entry (CREATE or RENAME
operations), a cache lookup will only tell us that we cannot proceed,
not that we can.  This is rarely helpful (the typical case is that we
can proceed, and looking in the cache will just mean spending some time
only to get a `no information' result).  Otherwise (for a LOOKUP or a
DELETE) we can use the cache; however, profiles gathered on ucbvax
showed that cacheing DELETE operations was not a win, so it is not
done.

In any case, if the cache lookup fails or cannot be done, the whole
directory must be scanned.  (This is actually done in two parts:
rather than from 0 to N, the scan goes from M to N, then 0 to M, where
M is the place the last search ended.  This means that, e.g., if you
read a directory and then stat each file in directory order, the loop
always hits the next entry `perfectly'.)  While we are
scanning it, however, we might as well find the place (if any) that a
new name could be stashed, and---since it only takes one more
variable---the last place there are any non-empty slots in the
directory.  If we later decide to scribble on the directory (which
means we will have to write back the inode), we may as well trim the
directory at the same time.  Since this was all `detritus', any user
program that is about to read this part of the directory is not losing
any data if it gets EOF instead.
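
To make the two-part scan concrete, here is a user-level sketch over an
in-core copy of a directory.  The struct follows the general shape of
the <sys/dir.h> declaration, but the function is illustrative only (it
assumes well-formed records), not the actual ufs lookup code:

	#include <sys/types.h>
	#include <string.h>

	#define	MAXNAMLEN	255

	struct direct {				/* roughly as in <sys/dir.h> */
		u_long	d_ino;			/* inode number; 0 = slot free */
		u_short	d_reclen;		/* total length of this record */
		u_short	d_namlen;		/* length of d_name */
		char	d_name[MAXNAMLEN + 1];	/* name, NUL-terminated */
	};

	/*
	 * Scan from `start' (where the last search ended) to the end,
	 * then wrap to 0 and scan back up to `start'.  Returns the
	 * offset of the matching record, or -1 if the name is absent.
	 */
	off_t
	scan(char *buf, off_t size, off_t start, char *name)
	{
		struct direct *ep;
		off_t off;
		int wrapped = 0;

		if (size == 0)
			return (-1);
		if (start >= size)
			start = 0;
		for (off = start; !wrapped || off < start; ) {
			ep = (struct direct *)(buf + off);
			if (ep->d_ino != 0 && ep->d_namlen == strlen(name) &&
			    strncmp(ep->d_name, name, ep->d_namlen) == 0)
				return (off);		/* found it */
			off += ep->d_reclen;		/* records are variable length */
			if (off >= size) {		/* hit the end: wrap around */
				off = 0;
				wrapped = 1;
			}
		}
		return (-1);				/* scanned everything: not found */
	}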

This still leaves the other possible question:  `Since records within
4BSD-style directories are variable length, it would be possible to
take an existing directory and make it smaller by packing the names in
a better way.  That is, we have a bin-packing problem, and when the
contents of the bins change, a different packing might be closer to optimal.
Does any 4BSD system do such a repacking?'  The answer to this question
is `sort of'.
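
Where that slack comes from: a record's d_reclen may be larger than
the space its name actually needs, and the difference is reusable.
Below is a sketch of that computation, patterned on the DIRSIZ() macro
in <sys/dir.h> and reusing the illustrative struct direct above;
details may differ between releases.

	/*
	 * Bytes an entry really needs: the fixed header plus the name
	 * and its NUL, rounded up to a 4-byte boundary (cf. DIRSIZ()).
	 */
	#define DIRSIZ(dp) \
	    ((sizeof (struct direct) - (MAXNAMLEN + 1)) + \
	     (((dp)->d_namlen + 1 + 3) &~ 3))

	/* Reusable space trailing an entry; a free slot is all slack. */
	int
	slack(struct direct *ep)
	{
		if (ep->d_ino == 0)
			return (ep->d_reclen);
		return (ep->d_reclen - DIRSIZ(ep));
	}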

In the pseudo-code above, there really should be three states for the
`found a big enough slot' variable: `no', `yes', and `yes, but only
after moving stuff around'.  (In the code these are the enumeration
constants NONE, FOUND, and COMPACT, respectively.)  The `compaction' state,
however, applies only to a single directory block (512 bytes on the
VAX, 1024 bytes on the Tahoe).  Some records in one block might get
moved (uniformly towards the front, in this case) in order to combine
free space from two different records.  That is:

	entry for foo+free space|entry for bar+free space
	<----200----> <---30--->|<----200----> <---82---> (bytes)
	<-------record 0------->|<-------record 1------->
	    (total size 230)	      (total size 282)

(assuming VERY long file names :-) ).  If we wanted to add a 95-byte
entry, it would fit in this 512-byte block, but only if we moved record
1 somewhat leftwards.  In this case we might get a slot status of
`COMPACT'.
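
Here is a rough sketch of that shuffle, patterned on the real
directory-writing loop but with simplified bookkeeping (it reuses the
illustrative struct direct and DIRSIZ() above; memmove stands in for
bcopy):

	/*
	 * Compact one region of a directory block (an in-core copy):
	 * shrink each live record to the size it needs and slide the
	 * following record leftwards onto the slack, so that all the
	 * free bytes collect into one hole at the end of the region.
	 */
	void
	compact(char *dirbuf, int count)	/* count = bytes in the region */
	{
		struct direct *ep, *nep;
		int dsize, spacefree, loc;

		ep = (struct direct *)dirbuf;
		dsize = DIRSIZ(ep);
		spacefree = ep->d_reclen - dsize;
		for (loc = ep->d_reclen; loc < count; ) {
			nep = (struct direct *)(dirbuf + loc);
			if (ep->d_ino) {	/* keep ep: trim it, step past it */
				ep->d_reclen = dsize;
				ep = (struct direct *)((char *)ep + dsize);
			} else			/* ep was empty: its bytes join the hole */
				spacefree += dsize;
			dsize = DIRSIZ(nep);
			spacefree += nep->d_reclen - dsize;
			loc += nep->d_reclen;
			memmove((char *)ep, (char *)nep, dsize);	/* slide nep left */
		}
		/* the collected free space (at least spacefree bytes) now
		   follows the data at ep; the new entry is written there */
	}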

When this happens, the directory rewriting code moves entries leftwards
as necessary, then fills in the (newly expanded) hole.  This means that
a program that (stupidly) reads directories one byte at a time may get
an inconsistent picture of the directory block.  Any program that reads
a whole directory block at a time, however, will get a consistent
picture; and any program that uses getdirentries()---this is required
for NFS, among other things, and is used by the current readdir()---will
likewise get a consistent picture.
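
For ordinary programs, the practical rule that falls out of this is to
read directories through the library interface rather than a few bytes
at a time.  A minimal user-level example (on POSIX systems the header
is <dirent.h> and the type is struct dirent):

	#include <sys/types.h>
	#include <sys/dir.h>
	#include <stdio.h>

	/*
	 * List a directory via opendir()/readdir(), which hand back
	 * whole entries and so never show a half-rewritten record.
	 */
	int
	main(int argc, char **argv)
	{
		DIR *dirp;
		struct direct *dp;

		if ((dirp = opendir(argc > 1 ? argv[1] : ".")) == NULL) {
			perror("opendir");
			return (1);
		}
		while ((dp = readdir(dirp)) != NULL)
			printf("%s\n", dp->d_name);
		closedir(dirp);
		return (0);
	}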
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163)
Domain:	chris@cs.umd.edu	Path:	uunet!mimsy!chris
	(New campus phone system, active sometime soon: +1 301 405 2750)

edward@ucbarpa.Berkeley.EDU (Edward Wang) (08/19/90)

In article <26078@mimsy.umd.edu> chris@mimsy.umd.edu (Chris Torek) writes:
>The logic is this:  if we are creating a new entry (CREATE or RENAME
>operations), a cache lookup will only tell us that we cannot proceed,
>not that we can.  This is rarely helpful (the typical case is that we
>can proceed, and looking in the cache will just mean spending some time
>only to get a `no information' result).  Otherwise (for a LOOKUP or a
>DELETE) we can use the cache; however, profiles gathered on ucbvax
>showed that cacheing DELETE operations was not a win, so it is not
>done.

This brings up a related question.  In a study I did a while ago,
I found that failed lookups often repeat (stats of mail boxes, for
example), and that caching this information brings the hit rate
(cache hits / total lookups) up quite a bit.
(The cache would contain files known not to exist.)
My impression is that BSD does not do this.
Does anyone here know why?
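
To pin down what I mean, here is a toy sketch (names made up, nothing
like real kernel code) of the kind of entry such a cache would hold and
how a lookup would consult it; entries for a directory would have to be
purged whenever a name is created there:

	#include <string.h>

	struct inode;				/* opaque here */

	struct ncentry {
		struct ncentry	*nc_next;	/* hash chain */
		struct inode	*nc_dir;	/* directory that was searched */
		char		nc_name[255 + 1];
		int		nc_negative;	/* 1: name known not to exist */
		struct inode	*nc_ip;		/* result when nc_negative == 0 */
	};

	/* Returns 1 for a hit, -1 for a `known absent' hit, 0 for a miss. */
	int
	nc_lookup(struct ncentry *chain, struct inode *dir, char *name,
	    struct inode **ipp)
	{
		struct ncentry *ncp;

		for (ncp = chain; ncp != 0; ncp = ncp->nc_next) {
			if (ncp->nc_dir != dir || strcmp(ncp->nc_name, name))
				continue;
			if (ncp->nc_negative)
				return (-1);	/* fail without scanning the directory */
			*ipp = ncp->nc_ip;
			return (1);
		}
		return (0);			/* no information: must scan */
	}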