[comp.unix.questions] unix question: files per directory

dxxb@beta.lanl.gov (David W. Barts) (04/11/89)

How many files can there be in a single UNIX directory
(I realize this may depend on the variety of UNIX; I expect
the Berkeley fast file system would allow more)?  I need
a better answer than "a lot" or "at least 2000", if possible.

(This concerns an application program we are currently running
on an Apollo under Aegis; it depends on a LOT of files being
in a single directory and Aegis's limit of 1500 or so can
be a pain.)

I realize that as directories get bigger, they slow down, but
how much?  Just what IS the maximum directory size?

			Thanks in advance,
David W. Barts  N5JRN, Ph. 509-376-1718 (FTS 444-1718), dxxb@beta.lanl.GOV
BCS Richland Inc.                  |       603 1/2 Guernsey St.
P.O. Box 300, M/S A2-90            |       Prosser, WA  99350
Richland, WA  99352                |       Ph. 509-786-1024

grr@cbmvax.UUCP (George Robbins) (04/11/89)

In article <24110@beta.lanl.gov> dxxb@beta.lanl.gov (David W. Barts) writes:
> 
> How many files can there be in a single UNIX directory
> (I realize this may depend on the variety of UNIX; I expect
> the Berkeley fast file system would allow more)?  I need
> a better answer than "a lot" or "at least 2000", if possible.

At least 33,000  8-) 

I recently played with an archive of comp.sys.amiga from day 1 and
it was on this order.  
 
> I realize that as directories get bigger, they slow down, but
> how much?  Just what IS the maximum directory size?

Yeah, it gets real slow and turns the whole system into a dog when
you are accessing the directories.   Still, the time is finite, and
the whole restore took maybe 16 hours (I had other stuff going on).

The tape went from almost continual motion to twitching several times
a minute...

I seem to recall that the Mach people at CMU were dabbling with some
kind of hashed directories or auxiliary hashing scheme, which would
make it lots quicker.

I don't know if there is a theoretical maximum, except that the
directory must be smaller than the maximum possible file size.  I am
curious, though, about what constitutes an efficient limit: if I build
a directory tree with n entries at each level, what is a reasonable
tradeoff between tree depth and search time?

This was with Ultrix/BSD; I don't know what limits might pertain to
Sys V and other variants.

-- 
George Robbins - now working for,	uucp: {uunet|pyramid|rutgers}!cbmvax!grr
but no way officially representing	arpa: cbmvax!grr@uunet.uu.net
Commodore, Engineering Department	fone: 215-431-9255 (only by moonlite)

chris@mimsy.UUCP (Chris Torek) (04/11/89)

In article <24110@beta.lanl.gov> dxxb@beta.lanl.gov (David W. Barts) writes:
>How many files can there be in a single UNIX directory ....
>I realize that as directories get bigger, they slow down, but
>how much?  Just what IS the maximum directory size?

The maximum size is the same as for files, namely 2^31 - 1 (2147483647)
bytes.  (This is due to the use of a signed 32 bit integer for off_t.
The limit is larger in some Unixes [Cray], but is usually smaller due
to disk space limits.)  Directory performance falls off somewhat at
single indirect blocks, moreso at double indirects, and still more at
triple indirects.  It takes about 96 kbytes to go to single indirects
in a 4BSD 8K/1K file system.

Each directory entry requires a minimum of 12 bytes (4BSD) or exactly
16 bytes (SysV); 16 is a nice `typical' size, so divide 96*1024 by 16 to
get 6144 entries before indirecting on a BSD 8K/1K file system.
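
For the curious, here is that arithmetic as a throwaway C program; the 12
direct blocks, 8K block size, and 16-byte `typical' entry are just the
assumptions stated above, not numbers the program discovers for itself:

#include <stdio.h>

/* Assumed 4BSD 8K/1K layout: 12 direct block pointers per inode,
 * 8192-byte blocks, and a "typical" 16-byte directory entry. */
#define NDIRECT	12
#define BLKSIZE	8192
#define ENTSIZE	16

int
main(void)
{
	long direct_bytes = (long)NDIRECT * BLKSIZE;

	printf("bytes before indirecting:   %ld\n", direct_bytes);            /* 98304 = 96K */
	printf("entries before indirecting: %ld\n", direct_bytes / ENTSIZE);  /* 6144 */
	return 0;
}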

The actual slowdown due to indirect blocks is not clear; you will have
to measure that yourself.
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163)
Domain:	chris@mimsy.umd.edu	Path:	uunet!mimsy!chris

rikki@macom1.UUCP (R. L. Welsh) (04/11/89)

From article <24110@beta.lanl.gov>, by dxxb@beta.lanl.gov (David W. Barts):
> 
> How many files can there be in a single UNIX directory
...

You will undoubtedly run out of inodes before you reach any theoretical
limit.  Every new file you create will use up one inode.  If you are
seriously contemplating having a huge number of files (be they in one
directory or many), you may have to remake a filesystem to have enough
inodes -- see mkfs(1M), in particular the argument blocks:inodes.  The
optional ":inodes" part is often left off and the defaults taken.  My
manual (old ATT Sys V) says that the maximum number of inodes is
65500.

Also (on Sys V) do "df -t" to check how many inodes your filesystem
currently accommodates.
-- 
	- Rikki	(UUCP: grebyn!macom1!rikki)

dxxb@beta.lanl.gov (David W. Barts) (04/11/89)

Thanks to everyone who responded to my question.  As several 
responses have pointed out, the only limit is imposed by file
size; however, things get painfully slow well before the directory
size reaches the maximum file size.

David W. Barts  N5JRN, Ph. 509-376-1718 (FTS 444-1718), dxxb@beta.lanl.GOV
BCS Richland Inc.                  |       603 1/2 Guernsey St.
P.O. Box 300, M/S A2-90            |       Prosser, WA  99350
Richland, WA  99352                |       Ph. 509-786-1024

lm@snafu.Sun.COM (Larry McVoy) (04/12/89)

>In article <24110@beta.lanl.gov> dxxb@beta.lanl.gov (David W. Barts) writes:
>>How many files can there be in a single UNIX directory ....
>>I realize that as directories get bigger, they slow down, but
>>how much?  Just what IS the maximum directory size?

If you are on a POSIX system, try this

#include <unistd.h>

/* pathconf() returns a long; -1 means the limit is indeterminate. */
long
dirsentries(dirpath)
	char *dirpath;
{
	return pathconf(dirpath, _PC_LINK_MAX);
}

Unfortunately, on systems that allow entries up to the file size, pathconf
will almost certainly return -1 (indicating "infinity").  But machines with
a hard limit should give you that limit.
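
One wrinkle worth noting: pathconf() also returns -1 when the call itself
fails, so the usual trick is to clear errno first; -1 with errno still zero
means "no fixed limit".  A minimal caller, just as a sketch:

#include <stdio.h>
#include <errno.h>
#include <unistd.h>

int
main(int argc, char **argv)
{
	const char *dir = argc > 1 ? argv[1] : ".";
	long n;

	errno = 0;
	n = pathconf(dir, _PC_LINK_MAX);
	if (n == -1 && errno != 0)
		perror(dir);			/* the call itself failed */
	else if (n == -1)
		printf("%s: no fixed limit\n", dir);
	else
		printf("%s: limit is %ld\n", dir, n);
	return 0;
}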

Larry McVoy, Lachman Associates.			...!sun!lm or lm@sun.com

kremer@cs.odu.edu (Lloyd Kremer) (04/12/89)

In article <24110@beta.lanl.gov> dxxb@beta.lanl.gov (David W. Barts) writes:
>How many files can there be in a single UNIX directory

In article <16839@mimsy.UUCP> chris@mimsy.UUCP (Chris Torek) writes:
>The maximum size is the same as for files, namely 2^31 - 1 (2147483647)
>bytes.  (This is due to the use of a signed 32 bit integer for off_t.


Point of curiosity:

Why was it decided that off_t should be signed?  Why should it not be
unsigned long where unsigned longs are supported, or unsigned int where int
is a 32 bit quantity?  It seems that signed long imposes an unnecessary
2GB limit on file size.

There are many devices having a capacity greater than 4 or 5 GB.  It seems
reasonable that one might want a file greater than 2GB on such a device,
such as the product of something akin to 'tar -cf' of a whole filesystem.

And it doesn't make sense to have a negative offset into a file.  The
only exception that comes to mind is that of returning an error code from
a function like lseek(), and this special case could be macro'd like

	#define SEEK_ERR ((off_t)(-1))

in <sys/types.h> or <sys/stat.h>.

					Just curious,

					Lloyd Kremer
					Brooks Financial Systems
					{uunet,sun,...}!xanth!brooks!lloyd

andrew@alice.UUCP (Andrew Hume) (04/13/89)

in the fifth edition, directories that could no longer fit in the directly
mapped blocks caused unix to crash.

nowadays, the only reason not to have huge directories is that they
make a lot of programs REAL slow; it takes time to scan all those dirents.

rec@dg.dg.com (Robert Cousins) (04/14/89)

In article <9195@alice.UUCP> andrew@alice.UUCP (Andrew Hume) writes:
>
>
>in the fifth edition, directories that could no longer fit in the directly
>mapped blocks caused unix to crash.
>
>nowadays, the only reason not have huge directories is that they
>make a lot of programs REAL slow; it takes time to scan all those dirents.

There is a more real limit to directory sizes in the System V file system:
There can only be 64K inodes per file system.  As I recall (and it has
been a while since I actually looked at it), the directory entry was
something like this:

	struct dirent {
		unsigned short inode; /* or some special 16 bit type */
		char filename[14];
	};

which yielded a 16-byte entry.  Since there is a maximum number of links
to a file (2^10 or 1024?), the absolute maximum would be:

	64K * 1024 * 16 = 2 ^ 16 * 2 ^ 10 * 2 ^ 4 = 2 ^ 30 = 1 gigabyte

This brings up one of the major physical limitations of the System V
file system:  if you can have 2 ^ 24 blocks, and only 2 ^ 16 discrete
files, then to harness the entire file system space, each file will
(on average) have to be 2 ^ 8 blocks long or 128 K.  Since we know that
about 85% of all files on most unix systems are less than 8K and about
half are under 1K, I personally feel that the 16 bit inode number is
a severe handicap.

Comments?

Robert Cousins

Speaking for myself alone.

andrew@frip.wv.tek.com (Andrew Klossner) (04/15/89)

Larry McVoy writes:

>> How many files can there be in a single UNIX directory ....

> If you are on a POSIX system, try this
> #include <unistd.h>
> dirsentries(dirpath)
> 	char *dirpath;
> {
> 	return pathconf(dirpath, _PC_LINK_MAX);
> }

This will tell you how many directories a directory can contain, not
how many files.  Adding a file to a directory does not increment its
link count.
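
If the real question is how many entries a directory already holds, the
portable answer is simply to read it; a minimal sketch using the POSIX
directory routines (it skips "." and ".." and counts everything else,
files and subdirectories alike):

#include <string.h>
#include <dirent.h>

/* Count the entries in a directory, skipping "." and "..".
 * Returns -1 if the directory cannot be opened. */
long
count_entries(const char *path)
{
	DIR *d = opendir(path);
	struct dirent *dp;
	long n = 0;

	if (d == NULL)
		return -1;
	while ((dp = readdir(d)) != NULL)
		if (strcmp(dp->d_name, ".") != 0 &&
		    strcmp(dp->d_name, "..") != 0)
			n++;
	closedir(d);
	return n;
}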

  -=- Andrew Klossner   (uunet!tektronix!orca!frip!andrew)      [UUCP]
                        (andrew%frip.wv.tek.com@relay.cs.net)   [ARPA]

bph@buengc.BU.EDU (Blair P. Houghton) (04/15/89)

In article <127@dg.dg.com> rec@dg.UUCP (Robert Cousins) writes:
>
>This brings up one of the major physical limitations of the System V 
>file system:  if you can have 2 ^ 24 blocks, and only 2 ^ 16 discrete
>files, then to harness the entire file system space, each file will
>(on average) have to be 2 ^ 8 blocks long or 128 K.  Since we know that
>about 85% of all files on most unix systems are less than 8K and about 
>half are under 1K, I personally feel that the 16 bit inode number is
>a severe handicap.
>
>Robert Cousins
>
>Speaking for myself alone.

I'll stand behind you, big guy.  I just hacked up a program to check out my
filesizes, and I'll be damned if I didn't think my thing was real big...

On the system I checked (the only one where I'm remotely "typical" :),
I have 854 files, probably two dozen of them zero-length (the result of
some automated VLSI-data-file processing).  The mean is 10.2k, stdev is 60k
(warped by a few megabyte-monsters), and the median is 992 bytes (do
you also guess people's weight? :)  Of these 854 files of mine, 84% are
under 8000 bytes, and a paltry eight exceed the 128k "manufacturer's
suggested inode load" you compute above.

For another machine:
1740 files
Median 1304 bytes
Mean   7752
StDev 28857
77% < 8kB
And only 4 (that's FOUR) over the 128k optimal mean.

Hrmph.  And I thought I was more malevolent than that.  At least the
sysadmins can't accuse me of being a rogue drain on the resources...

Consider that a "block" can be 1, 2, 4 kB or more, and you're talking some
BIIIG files we have to generate to be efficient with those block numbers.
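
For anyone who wants to repeat the exercise, here is a rough single-directory
version of such a size survey (the original program presumably walked a whole
account's worth of directories; the 8000-byte and 128K thresholds are just
the ones quoted above):

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <dirent.h>
#include <sys/stat.h>

/* Build with: cc sizes.c -lm */

#define MAXFILES 100000			/* plenty for the counts discussed here */

static long sizes[MAXFILES];

static int
cmp(const void *a, const void *b)
{
	long x = *(const long *)a, y = *(const long *)b;
	return (x > y) - (x < y);
}

int
main(int argc, char **argv)
{
	const char *path = argc > 1 ? argv[1] : ".";
	DIR *d = opendir(path);
	struct dirent *dp;
	struct stat st;
	char name[4096];
	long n = 0, small = 0, big = 0;
	double sum = 0.0, sumsq = 0.0, mean;

	if (d == NULL) {
		perror(path);
		return 1;
	}
	while ((dp = readdir(d)) != NULL && n < MAXFILES) {
		snprintf(name, sizeof name, "%s/%s", path, dp->d_name);
		if (stat(name, &st) == -1 || !S_ISREG(st.st_mode))
			continue;		/* regular files only */
		sizes[n++] = (long)st.st_size;
		sum += (double)st.st_size;
		sumsq += (double)st.st_size * (double)st.st_size;
		if (st.st_size < 8000)
			small++;
		if (st.st_size > 128L * 1024)
			big++;
	}
	closedir(d);
	if (n == 0) {
		printf("no regular files in %s\n", path);
		return 0;
	}
	qsort(sizes, (size_t)n, sizeof sizes[0], cmp);
	mean = sum / n;
	printf("%ld files: mean %.0f, median %ld, stdev %.0f\n",
	    n, mean, sizes[n / 2], sqrt(sumsq / n - mean * mean));
	printf("%.0f%% under 8000 bytes, %ld over 128K\n",
	    100.0 * small / n, big);
	return 0;
}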

				--Blair
				  "...gon' go lick my wounded ego...
				   and ponder ways to make file
				   systems more efficient, or at least
				   more crowded.  ;-)"

allbery@ncoast.ORG (Brandon S. Allbery) (04/18/89)

As quoted from <6576@cbmvax.UUCP> by grr@cbmvax.UUCP (George Robbins):
+---------------
| In article <24110@beta.lanl.gov> dxxb@beta.lanl.gov (David W. Barts) writes:
| > How many files can there be in a single UNIX directory
| > (I realize this may depend on the variety of UNIX; I expect
| > the Berkeley fast file system would allow more)?  I need
| > a better answer than "a lot" or "at least 2000", if possible.
| 
| At least 33,000  8-) 
| 
| I recently played with an archive of comp.sys.amiga from day 1 and
| it was on this order.  
+---------------

System V has no limit, aside from maximum file size (as modified by ulimit,
presumably).  As a PRACTICAL limit, when your directory goes triple-indirect,
it is too slow to search in a reasonable amount of time.  Assuming the
standard 2K block size of SVR3, this is (uhh, let's see... 2048 bytes/block
/ 16 bytes/dirent = 128 dirent/block; times 10 is 1280 dirent direct, add
single-indirect = 128 * 512 pointers/block [2048 / 4 bytes/pointer] = 65,536
entries through the single-indirect block; multiply that by 512 to get double-indirect)
33,621,248 directory entries before you go triple-indirect.  (I personally
think that even going single-indirect gets too slow; 1280 directory entries
is more than I ever wish to see in a single directory!  But even limiting to
single-indirect blocks, you get 66,816 directory entries.)  (I included the
math deliberately; that number looks way too large to me, even though I
worked the math twice.  Maybe someone else in this newsgroup can
double-check.  Of course, I'm no Obnoxious Math Grad Student ;-)
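
Since the article asks for a double-check, here is the same arithmetic
spelled out as a small program (the 2K blocks, 16-byte entries, 4-byte block
pointers, and 10 direct blocks per inode are the SVR3 figures assumed above,
not universal constants):

#include <stdio.h>

#define BLKSIZE	2048	/* assumed SVR3 block size */
#define ENTSIZE	16	/* bytes per directory entry */
#define PTRSIZE	4	/* bytes per block pointer */
#define NDIRECT	10	/* direct blocks per inode */

int
main(void)
{
	long per_blk = BLKSIZE / ENTSIZE;		/* 128 dirents/block  */
	long per_ind = BLKSIZE / PTRSIZE;		/* 512 pointers/block */
	long direct  = NDIRECT * per_blk;		/* 1,280              */
	long single  = per_ind * per_blk;		/* 65,536             */
	long dbl     = per_ind * per_ind * per_blk;	/* 33,554,432         */

	printf("direct blocks only:      %ld\n", direct);
	printf("through single-indirect: %ld\n", direct + single);
	printf("through double-indirect: %ld\n", direct + single + dbl);
	return 0;
}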

The Berkeley FFS is still based on direct and indirect blocks (it's how
they're arranged on the disk that speeds things up); however, directory
entries are not fixed in size in the standard FFS.  (I have seen FFS with
System V directory entries; the two aren't necessarily linked.  But they
usually are, as flexnames are nicer than a 14-character maximum.)  You can't
simply calculate a number; you must figure the lengths of filenames -- and
the order of deletions and additions combined with file name lengths can
throw in jokers, at least on systems without directory compaction.

I have no doubt that if I screwed up somewhere, we'll both hear about it. ;-)

++Brandon
-- 
Brandon S. Allbery, moderator of comp.sources.misc	     allbery@ncoast.org
uunet!hal.cwru.edu!ncoast!allbery		    ncoast!allbery@hal.cwru.edu
      Send comp.sources.misc submissions to comp-sources-misc@<backbone>
NCoast Public Access UN*X - (216) 781-6201, 300/1200/2400 baud, login: makeuser

root@helios.toronto.edu (Operator) (04/19/89)

In article <4822@macom1.UUCP> rikki@macom1.UUCP (R. L. Welsh) writes:
>From article <24110@beta.lanl.gov>, by dxxb@beta.lanl.gov (David W. Barts):
>> 
>> How many files can there be in a single UNIX directory
>
>You will undoubtedly run out of inodes before you reach any theoretical
>limit.  

Another thing you may run into is that some UNIX utilities seem to store
the names of all of the files somewhere before they do anything with them,
and if there are a lot of files in the directory, you won't be able to
run the utility on all of them at once. (This won't prevent you from creating
them, though). In particular I am thinking of "rm". When cleaning up after
installing the NAG library, I tried to "rm *" in the source code directory.
It refused (I think the error was "too many files"). I had to go through and 
"rm a*", "rm b*" etc. until it was down to a level that rm would accept. I 
found this surprising. In at least the case of wildcard matching, why wouldn't 
it just read each name from the directory file in sequence, comparing each for 
a match, and deleting it if it was? Having to buffer *all* the names builds in 
an inherent limit such as the one I ran into, unless one uses a linked list 
or some such.

Does anyone know:
     1. why "rm" does it this way, and
     2. are there other utilities similarly affected?

I don't know exactly how many files were in the directory, but it was many
hundreds.
-- 
 Ruth Milner          UUCP - {uunet,pyramid}!utai!helios.physics!sysruth
 Systems Manager      BITNET - sysruth@utorphys
 U. of Toronto        INTERNET - sysruth@helios.physics.utoronto.ca
  Physics/Astronomy/CITA Computing Consortium

les@chinet.chi.il.us (Leslie Mikesell) (04/20/89)

In article <776@helios.toronto.edu> sysruth@helios.physics.utoronto.ca (Ruth Milner) writes:

[ rm * fails with large number of files..]

>Does anyone know:
>     1. why "rm" does it this way, and
>     2. are there other utilities similarly affected?

Actually the shell expands the * and then can't pass the resulting list to
rm, because there is a fixed limit on the total size of command line
arguments.  All programs are affected in the same way, except those where
you quote the wildcard to prevent shell expansion (find -name '*' is the
common case; its -exec operator can be used to operate on each file, or,
if you have xargs, you can pipe find's -print output into xargs command).

However, if your version of unix doesn't automatically compress
directories (SysV doesn't), you should rm -r the whole directory, or the
empty entries will continue to waste space.

Les Mikesell

rwhite@nusdhub.UUCP (Robert C. White Jr.) (04/21/89)

> In article <776@helios.toronto.edu> sysruth@helios.physics.utoronto.ca (Ruth Milner) writes:
>>When cleaning up after
>>installing the NAG library, I tried to "rm *" in the source code directory.
>>It refused (I think the error was "too many files").

The shell can't make an argument list that long... do the following:

ls | xargs rm

The ls will produce a list of files on standard output, and xargs will
repeatedly invoke its arguments as a command with as many additional
arguments as it can fit, taking those additional arguments from its
standard input...

WALLHAH! rm of a long directory.

weaver@prls.UUCP (Michael Weaver) (04/22/89)

Note that although Aegis 9 and below had strict limits on the number 
of directory entries, Aegis 10, the latest version, is supposed to 
allow any number of files, as long as you've got the disk space.
'Just like real Unix' (almost, no inodes).


-- 
Michael Gordon Weaver                     Phone: (408) 991-3450
Signetics/Philips Components              Usenet: ...!mips!prls!weaver
811 East Arques Avenue, Bin 75
Sunnyvale CA 94086 USA

news@brian386.UUCP (Wm. Brian McCane) (04/27/89)

In article <776@helios.toronto.edu> sysruth@helios.physics.utoronto.ca (Ruth Milner) writes:
>In article <4822@macom1.UUCP> rikki@macom1.UUCP (R. L. Welsh) writes:
=>From article <24110@beta.lanl.gov>, by dxxb@beta.lanl.gov (David W. Barts):
==> 
==> How many files can there be in a single UNIX directory
=>
=>You will undoubtedly run out of inodes before you reach any theoretical
=>limit.  
>
>Another thing you may run into is that some UNIX utilities seem to store
>the names of all of the files somewhere before they do anything with them,
>and if there are a lot of files in the directory, you won't be able to
>run the utility on all of them at once. (This won't prevent you from creating
>them, though). In particular I am thinking of "rm". When cleaning up after
>installing the NAG library, I tried to "rm *" in the source code directory.
>It refused (I think the error was "too many files"). I had to go through and 
>"rm a*", "rm b*" etc. until it was down to a level that rm would accept. I 
>
>Does anyone know:
>     1. why "rm" does it this way, and
>     2. are there other utilities similarly affected?
>
> Ruth Milner          UUCP - {uunet,pyramid}!utai!helios.physics!sysruth


You didn't actually run into an "rm" bug/feature; you hit a shell
FEECHER.  The shell expands for the regexp, and then passes the
generated list to the exec'd command as the arguments.  "rm" can only
handle a limited number of files (or it may be that the shell will only pass
a limited number; who knows, it's a FEECHER after all ;-), so rm then
gave the error message of too many filenames.  I would like it if "rm"
were similar to most other commands, i.e. you could rm "*", preventing
the expansion of the * to all file names until "rm" got it, but it
returns the message "rm: * non-existent" on my machine, Sys5r3.0.

	brian

(HMmmm.  That new version of "rm" I mentioned sounded kinda useful, I
wonder if anyone out there has 1 already?? HINT ;-)


-- 
Wm. Brian McCane                    | Life is full of doors that won't open
                                    | when you knock, equally spaced amid
Disclaimer: I don't think they even | those that open when you don't want
            admit I work here.      | them to. - Roger Zelazny "Blood of Amber"

guy@auspex.auspex.com (Guy Harris) (05/02/89)

>I would like it if "rm" were similar to most other commands, ie. you
>could rm "*", preventing the expansion of the * to all file names
>until "rm" got it,

Uhh, to what other commands are you referring?  Most UNIX commands don't
know squat about expanding "*"; they rely on the shell to do so, and
merely know about taking lists of file names as arguments.  Other OSes
do things differently; perhaps that's what you're thinking of?

allbery@ncoast.ORG (Brandon S. Allbery) (05/05/89)

As quoted from <432@brian386.UUCP> by news@brian386.UUCP (Wm. Brian McCane):
+---------------
| >Another thing you may run into is that some UNIX utilities seem to store
| >the names of all of the files somewhere before they do anything with them,
| 
| You didn't actually run into an "rm" bug/feature; you hit a shell
| FEECHER.  The shell expands for the regexp, and then passes the
| generated list to the exec'd command as the arguments.  "rm" can only
| handle a limited number of files (or it may be that the shell will only pass
| a limited number; who knows, it's a FEECHER after all ;-), so rm then
+---------------

Sorry, it's a kernel limitation.

The combined size of all elements of argv[] must be less than some size
(I have seen 1024, 5120, and 10240 bytes on various systems).  This limit is
enforced by the execve() system call (from which all the other exec*() calls
are derived).  If the argument list is longer than this limit, exec()
returns an error which the shell (NOT rm) reports back to the user.
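
On a POSIX system the limit in question can at least be queried rather than
guessed at; a minimal sketch (note that the value covers the environment
strings as well as argv[], and it still varies from system to system):

#include <stdio.h>
#include <unistd.h>

int
main(void)
{
	/* ARG_MAX: the combined size of argv[] and envp[] that
	 * execve() will accept. */
	long argmax = sysconf(_SC_ARG_MAX);

	if (argmax == -1)
		printf("no fixed ARG_MAX on this system\n");
	else
		printf("ARG_MAX = %ld bytes\n", argmax);
	return 0;
}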

+---------------
| gave the error message of too many filenames.  I would like it if "rm"
| were similar to most other commands, i.e. you could rm "*", preventing
| the expansion of the * to all file names until "rm" got it, but it
| returns the message "rm: * non-existent" on my machine, Sys5r3.0.
+---------------

Most other WHAT commands?  MS-DOS?  VMS?  *Certainly* not Unix commands.

The advantage of making the shell expand wildcards like * is that the code
need only be in the shell, rather than enlarging every utility which
might have to parse filenames.  In these days of shared libraries, that may
not be as necessary as it used to be; however, having it in one place does
ensure that all utilities expand filenames in the same consistent way
without any extra work on the part of the programmer.

++Brandon
-- 
Brandon S. Allbery, moderator of comp.sources.misc	     allbery@ncoast.org
uunet!hal.cwru.edu!ncoast!allbery		    ncoast!allbery@hal.cwru.edu
      Send comp.sources.misc submissions to comp-sources-misc@<backbone>
NCoast Public Access UN*X - (216) 781-6201, 300/1200/2400 baud, login: makeuser