[comp.unix.wizards] small files and big directories

dricejb@drilex.UUCP (Craig Jackson drilex1) (05/26/90)

This discussion has been about why Unix tends to have small directories
and relatively deep trees, versus other systems such as VMS, which tend
to have large directories and shallow trees.

Chris Torek pointed out that his home directory was quite small. This is
really to be expected--humans have trouble with large directories.

He also said that his application directories tend to be small, but that
the inefficiency of large directories on Unix was one of his reasons.
I would assert that the inefficiency of large directories on Unix is
just as limiting as the awkwardness of deep directory trees on VMS.

My best anecdote on this comes from BIX.  BIX is implemented using
the CoSy software from the University of Guelph.  CoSy was first
implemented in the early '80s, under version 7.  It had a wide
user community at Guelph, where it ran on Amdahl's UTS, which was then
a Version 7 implementation.  The editors at Byte liked it, so they
bought an Arete box (nice character I/O performance) to run it.
This ran SVR2 (no hissing, please).  CoSy was installed, and they
began testing BIX internally.  Everything basically worked fine.

Then they opened the system up to the public for beta-test.  Lots of
people signed up.  Soon, they found their first scale-up problem:
CoSy kept its per-user information as one directory per user.  Each
of these directories had the name of the user's login name, and lived
in the users/ directory.  Well, System V at the time didn't allow more
than 1000 links to an inode, and since every subdirectory's '..' entry
is a hard link back to users/, the link count on users/ grew with each
new account.  BIX quickly went over 1000 users, and all of those '..'
links killed it.  So, some midnight (literally) programming was done,
and the next day joeuser's per-user files were in users/j/joeuser,
rather than users/joeuser.  That got them going again, but they still
had a wall at 26,000 users (26 one-letter buckets times the 1000-link
limit) to worry about.
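
The fix amounts to fanning the user directories out by the first letter
of the login name, so that no single directory collects a link for every
account.  In outline it looks something like this (a sketch only; the
function name and exact layout here are mine, not the actual CoSy code):

    #include <stdio.h>

    /*
     * Fan user directories out by the first letter of the login name,
     * so no single directory has to hold a link for every account.
     * Illustrative only; not taken from the actual CoSy code.
     */
    static void user_path(const char *login, char *buf, size_t len)
    {
        (void) snprintf(buf, len, "users/%c/%s", login[0], login);
    }

    int main(void)
    {
        char path[256];

        user_path("joeuser", path, sizeof path);
        printf("%s\n", path);   /* prints users/j/joeuser */
        return 0;
    }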

I lost touch with the details after this, but I'm pretty sure that they've
gone over the 26,000 user limit since.  I think today, they've bagged using
the Unix file system completely--the per-user data is now in some sort
of database.

This effect also shows up in things like SysV's terminfo database,
where you get somedirectory/a/adm-3 kinds of things.


My point on this is that some implementations probably should do something
about the large-directory problem.  If your application works most
naturally with 2000 subdirectories, or 20,000 subdirectories, in a single
directory, you shouldn't have to recode to get around system inefficiencies
(here, the linear search the traditional filesystem makes on every name
lookup, which on average reads through half of a 20,000-entry directory).
Now, maybe the random workstation doesn't need this capability.  But for
the future, at least some implementation of Unix will need to do large
directories well.
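
If you want to see the cost for yourself, a crude timing harness is easy
enough to write (this is my own throwaway sketch, not code from any real
application; fill a scratch directory with as many entries as you like
and point it at one of them):

    #include <stdio.h>
    #include <sys/types.h>
    #include <sys/stat.h>
    #include <time.h>

    /*
     * Time repeated stat() calls on one entry of a big directory.
     * On a filesystem that searches directories linearly (and without
     * help from a kernel name cache), per-lookup time grows with the
     * number of entries in the containing directory.
     */
    int main(int argc, char **argv)
    {
        struct stat sb;
        clock_t t0, t1;
        long i, n = 10000;

        if (argc != 2) {
            fprintf(stderr, "usage: %s path/to/one/entry\n", argv[0]);
            return 1;
        }
        t0 = clock();
        for (i = 0; i < n; i++)
            if (stat(argv[1], &sb) == -1) {
                perror(argv[1]);
                return 1;
            }
        t1 = clock();
        printf("%ld lookups in %.2f seconds\n", n,
            (double)(t1 - t0) / CLOCKS_PER_SEC);
        return 0;
    }

Try it against an entry in a 50-entry directory and then against one in a
20,000-entry directory and compare.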
-- 
Craig Jackson
dricejb@drilex.dri.mgh.com
{bbn,axiom,redsox,atexnet,ka3ovk}!drilex!{dricej,dricejb}

emv@math.lsa.umich.edu (Edward Vielmetti) (05/28/90)

(bix has plenty of users)

> all of those '..' links killed it.  So, some midnight (literally)
> programming was done, and the next day joeuser's per-user files
> were in users/j/joeuser, rather than users/joeuser.  That got them
> going again, but they still had the wall at 26,000 users to worry
> about.

the umich.edu afs cell has taken this one step further -- my account
there is users/e/m/emv.  unfortunately the top /afs level is totally
flat; I expect that'll change as more and more cells come on line
on some horrible and painful re-naming day.

one world, one filesystem -- did I hear "one vendor" too?  

--Ed

Edward Vielmetti, U of Michigan math dept <emv@math.lsa.umich.edu>
"security through diversity"

tp@decwrl.dec.com (t patterson) (05/29/90)

In article <24523@mimsy.umd.edu> chris@mimsy.umd.edu (Chris Torek) writes:
>>In article <24483@mimsy.umd.edu> I wrote:
>>>... The ratio of small to big should suggest whether special handling
>>>for small, big, or `very big' files and/or directories would be useful.
>
>In article <1709@cirrusl.UUCP> dhesi%cirrusl@oliveb.ATC.olivetti.com
>(Rahul Dhesi) writes:
>>What you will find will probably reflect the fact that big directories on
>>UNIX machines are already known to be undesirable ... a conclusion like
>>"big directories seldom exist, so we need not worry too much about them" is
>>natural, but probably only because it is a self-fulfilling prophecy.

this may depend on what you consider to be "big", but I can think of 2 
applications where "big" directories are produced:

1. news:
    for example, an ls -ld on /usr/spool/news/comp/unix (minus the files 
    in comp.unix) on my news server:

	   drwxrwxr-x  2 news         2560 May 28 08:08 aix
	   drwxrwxr-x  2 news         2560 May 28 00:33 aux
	   drwxrwxr-x  2 news          512 May 24 00:31 cray
	   drwxrwxr-x  2 news        16384 May 28 16:16 i386
	   drwxrwxr-x  2 news         1024 May 27 23:33 microport
	   drwxrwxr-x  2 news        18432 May 28 17:20 questions
	   drwxrwxr-x  2 news         6144 May 28 10:07 ultrix
	   drwxrwxr-x  2 news        11776 May 28 17:20 wizards
	   drwxrwxr-x  2 news         9728 May 28 14:40 xenix

    comp.unix.{i386,questions,wizards,xenix}, for example, are getting big
    to the point of being unwieldy. (yeah, we could shrink them by rebuilding
    the directories, but that's pretty awkward; see the sketch after these
    two items.) the others aren't exactly scrawny either.

2. mh-based mail
     we have more than a few users who have thousands of files (messages)
     per folder; this can be really slow. moreover, when you have been
     accumulating mail for a decade, it is _easy_ for some people to get
     themselves in this kind of bind.
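
(a quick way to see how bad a given spool directory or mail folder has
gotten is to count its entries and compare that against the size of the
directory file itself; this is my own throwaway sketch, not part of the
news or mh software:)

    #include <stdio.h>
    #include <dirent.h>
    #include <sys/types.h>
    #include <sys/stat.h>

    /*
     * Count the entries in a directory and report the size of the
     * directory file itself.
     */
    int main(int argc, char **argv)
    {
        DIR *dp;
        struct stat sb;
        long n = 0;

        if (argc != 2) {
            fprintf(stderr, "usage: %s directory\n", argv[0]);
            return 1;
        }
        if ((dp = opendir(argv[1])) == NULL || stat(argv[1], &sb) == -1) {
            perror(argv[1]);
            return 1;
        }
        while (readdir(dp) != NULL)
            n++;
        closedir(dp);
        printf("%s: %ld entries, %ld bytes of directory\n",
            argv[1], n, (long)sb.st_size);
        return 0;
    }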

In article <24523@mimsy.umd.edu> chris@mimsy.umd.edu (Chris Torek) writes:
>Perhaps---but I would note the following:
>
> a. Personally, I prefer to keep my own directories relatively small.
...
> b. As a programmer, I prefer to keep my programs' directories relatively
>    small, ...
...
>All in all, though, I am still unconvinced that adding a special case
>for big directories would help overall.

   well, I would agree that

      a. the problem needs more study -- how can we quantify how much
	 impact the current directory setup has on overall usage?

      b. most naive people learn about directories eventually; in general,
	 people find it unwieldy to try to manage directories with
	 thousands of entries. more sophisticated people also seem to
	 avoid this problem.

   but

      when the application (news, mh) hides some of that complexity from
      you, it is easy for things to get out of hand.

  maybe the application needs fixing, maybe it's the filesystem.

  (this thread is kinda curious to me, because it stirs memories of an
old Chris Torek posting which I think stated "the filesystem IS the
database" and discussed the advantages of using hashing to build directory
entries.)
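
(for the curious, the hashing idea comes down to something like the toy
fragment below; this is my own illustration of the concept, not anything
out of that posting or any real filesystem:)

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /*
     * Toy picture of a hashed directory: entries are spread across
     * buckets, so a name lookup walks one short chain instead of
     * scanning the whole directory.  Purely illustrative.
     */
    #define NBUCKET 64

    struct dent {
        char name[32];
        long inum;
        struct dent *next;
    };

    static struct dent *bucket[NBUCKET];

    static unsigned hash(const char *s)
    {
        unsigned h = 0;

        while (*s)
            h = h * 31 + (unsigned char)*s++;
        return h % NBUCKET;
    }

    static void enter(const char *name, long inum)
    {
        struct dent *d = malloc(sizeof *d);
        unsigned h = hash(name);

        if (d == NULL) {
            perror("malloc");
            exit(1);
        }
        strncpy(d->name, name, sizeof d->name - 1);
        d->name[sizeof d->name - 1] = '\0';
        d->inum = inum;
        d->next = bucket[h];
        bucket[h] = d;
    }

    static long lookup(const char *name)
    {
        struct dent *d;

        for (d = bucket[hash(name)]; d != NULL; d = d->next)
            if (strcmp(d->name, name) == 0)
                return d->inum;
        return -1;
    }

    int main(void)
    {
        enter("adm3a", 1042);
        enter("vt100", 1043);
        printf("vt100 is inode %ld\n", lookup("vt100"));
        return 0;
    }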
--
t. patterson		domain: tp@wsl.dec.com    path: decwrl!tp
            		enet:	wsl::tp
     			icbm:	122 9 41 W / 37 26 35 N