[comp.sys.amiga] Large directory or many small directories?

rokicki@rocky.STANFORD.EDU (Tomas Rokicki) (09/03/87)

Let's first calculate how many blocks must be read to locate an
executable in a particular directory.  On the Amiga, directories
are organized as bucket hash tables of size, say, N.  (I think
N is something like 74 or 90 or so, but I don't have my manuals
handy.)  Under such a scheme, you have to read the directory, and
you have to read M additional header blocks, where the file you
want is the Mth entry in that particular bucket.  If you have F files in the
directory randomly distributed,

		M_mean = a * (F/2N + 1/2)

where a is some number greater than but close to 1.  (This number
takes into account the fact that the buckets do not fill exactly
evenly.)  If the file isn't found, you need to probe on the average

		M_mean = a * F/N

before you realize that.  So let us say you have f total files,
distributed over d directories.  Each directory has f/d files.
On the average, you will search d/2-1/2 directories before
searching the correct one.  Thus, your total probes are:

	P = (d/2-1/2)(1 + af/dN) + 1 + a(f/2dN + 1/2)
	  = a/2 + 1/2 + d/2 + af/2N

Since f, a, and N are positive constants, P grows linearly with d,
so to minimize probes you want fewer directories.  Let's put some typical
numbers in there.  If a = 1.1, f = 150, and N = 100, then the above
reduces to 1.875 + d/2.
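
If you'd rather let the machine do the arithmetic, the whole thing fits
in a few lines of C (untested, and typed in as extemporaneously as the
math above):

    #include <stdio.h>

    /* Expected header blocks read in ONE directory of F files hashed
     * into N buckets; a is the fudge factor for uneven bucket fill. */
    double hit(double F, double N, double a)   /* file is present */
    { return a * (F / (2.0 * N) + 0.5); }
    double miss(double F, double N, double a)  /* file is absent  */
    { return a * F / N; }

    int main(void)
    {
        double a = 1.1, f = 150.0, N = 100.0;  /* the example numbers above */
        int d;

        for (d = 1; d <= 5; d++) {
            double perdir = f / d;             /* files in each directory */
            /* (d-1)/2 wrong directories on average, each costing one
             * directory-block read plus a full (missing) bucket scan,
             * then one read plus a hit scan in the right directory.   */
            double P = (d - 1) / 2.0 * (1.0 + miss(perdir, N, a))
                     + 1.0 + hit(perdir, N, a);
            printf("d = %d   P = %.3f\n", d, P);
        }
        return 0;
    }

Each additional directory costs another half a probe; that's the d/2
term above.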

Now we start adding some real world considerations.  First of all,
searching a directory such as ram: or vd0: is very nearly free, so
put any files there that you can.  Secondly, directories might be
on different disks or different partitions of the same disk, so
there is some time involved in going from one directory to the
next.  Thirdly, you want to put the most likely directory first.
Thus, if we assume that, because the likeliest directories come first,
you only need to search d/5-1/5 directories on average before finding
the correct one, then the equation becomes

	P = (d/5-1/5)(1 + af/dN) + 1 + a(f/2dN + 1/2)
	  = a/2 + 4/5 + d/5 + af/5N + 3af/10dN

which is slightly more complex.  We can minimize this function
with a little bit of calculus, to show that P is minimized when
d is equal to sqrt(1.5af/N), which, for our example numbers,
is about 1.58.  So again, few large directories.
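
A few more lines of C (same caveats, same example numbers) show how flat
the curve is around that minimum:

    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
        double a = 1.1, f = 150.0, N = 100.0;
        int d;

        /* weighted case: the likeliest directories are searched first */
        for (d = 1; d <= 4; d++)
            printf("d = %d   P = %.3f\n", d,
                   a / 2.0 + 0.8 + d / 5.0
                   + a * f / (5.0 * N) + 3.0 * a * f / (10.0 * d * N));

        printf("minimum at d = %.2f\n", sqrt(1.5 * a * f / N));
        return 0;
    }

The curve barely moves between d = 1 and d = 2, so one or two
well-chosen directories is about as good as it gets.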

There may be errors in this analysis, as I did it extemporaneously
(and all of the math in my head) but I believe it's fairly
accurate.

-tom

ewhac@well.UUCP (Leo 'Bols Ewhac' Schwab) (09/03/87)

In article <550@rocky.STANFORD.EDU> rokicki@rocky.STANFORD.EDU (Tomas
Rokicki) writes an industrial-strength article about relative efficiency in
scanning many directories versus fewer directories.

	This should not be taken to mean that you should not use directories
at all.  Witness Activision's "Portal".  Do a 'dir' on it some time.  You
should get a file list in about three minutes.

	There are over 300 files in that root directory.  When the program
tries to open one, the DOS takes about five hashing collisions before it
finds it.  Evidently, the programmer never heard of a directory before.

	So use the bloody things, and cut down on hashing collisions.
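
	For the curious, here is what Tom's formulas predict for a 300-file
directory.  This is a couple of lines of untested C; N = 72 is my guess at
the size of the hash table in a directory block:

    #include <stdio.h>

    int main(void)
    {
        double a = 1.1, N = 72.0, F = 300.0;  /* N is a guess at the table size */

        /* Tom's per-directory figures, applied to Portal's root */
        printf("hit:  %.1f header blocks\n", a * (F / (2.0 * N) + 0.5));
        printf("miss: %.1f header blocks\n", a * F / N);
        return 0;
    }

	That's several extra block reads on every lookup, before you've read
a single byte of the file.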

[ Note:  The "Portal" disk is an extreme case, but still a real-world one. ]

_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_
Leo L. Schwab -- The Guy in The Cape	ihnp4!ptsfa -\
 \_ -_	 Bike shrunk by popular demand,	      dual ---> !{well,unicom}!ewhac
O----^o	 But it's still the only way to fly.  hplabs / (pronounced "AE-wack")
"Work FOR?  I don't work FOR anybody!  I'm just having fun."  -- The Doctor

jdm@pnet02.CTS.COM (John Mesiavech) (09/04/87)

This is just an aside to Tomas Rokicki's excellent breakdown of which to
use, large directories or small... BUT there is a kicker, and that is the
special case of an Amiga user with a hard disk.  When you get
>= 100 files per directory, things become slow, simply because of the large
number of files on the disk.  So, on a hard disk, keeping directories under 100
files would be best.  'Sides, with truly large directories (I have a directory
with 212 files in it), there's not a DirUtil in the known universe that will
handle it! 8^>
 
John


UUCP: {ihnp4!crash, hplabs!hp-sdd!crash}!gryphon!pnet02!jdm
INET: jdm@pnet02.CTS.COM

hcm@hpclla.HP.COM (Harry Muttart) (09/05/87)

There actually is a (generic) "Dirutil" that will handle large directories...
The one written in Modula-2!  If you happen to have a TDI compiler, you can
bump it up from the default 300-files-per-directory limit, trivially.  

Incidentally, the author (sorry, can't remember his name) indicated that he
wrote this tool, in part, to be able to list the M2: (module library) directory
for TDI, which contains over 200 files.  Other dirutils seem to quietly stop
reading entries!

Oh yeah...the program with source can be found on one of the Fish disks in the
50's or 60's.  The name is DuM2, which is probably why few have heard of it 
(who would bother browsing a directory whose name suggests "dumb, too").

Harry Muttart         hpda!hcmutt      

(Just my opinions above, not my employer's.)