[comp.unix.wizards] Filename length statistics

bet@orion.mc.duke.edu (Bennett Todd) (06/15/89)

In article <4530@ficc.uu.net>, peter@ficc (Peter da Silva) writes:
>I've added a cumulative total
>
> 392  1   392   0.57%
>   ...
>1533 14 65848  94.96%   <-- Covers most bases.
>   ...
>  23 30 69284  99.92%   <-- Covers virtually all bases.
>   ...
>   1 51 69340 100.00%
>
>14 corresponds to SysV. 30 corresponds to SysV with DIRSIZ doubled. There were
>56 files, or 0.08%, that were longer than this.

Out of curiosity I ran this over the 2187 files under my home directory;
some of the statistics came out a little differently. Specifically, the
ones shown above come out like so for me:

 1   237 10.84%   237  10.84%
  ...
14    55  2.51%  1805  82.53%
  ...
30     4  0.18%  2166  99.04%
  ...
53     1  0.05%  2187 100.00%

(I just noticed my column ordering is different: length, count,
percentage, cumulative count, cumulative percentage. I used the awk
program someone posted, which I append at the end.)
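
For anyone lining the two tables up side by side, the columns can be
shuffled with one more awk pass. This is only a sketch: the function
name reorder_columns is made up for illustration, and it assumes rows
in the five-column layout shown above (length, count, percent,
cumulative count, cumulative percent), which it rearranges into the
quoted order (count, length, cumulative count, cumulative percent).

```sh
#!/bin/sh
# Hypothetical helper: rearrange "length count pct cum cum%" rows into
# "count length cum cum%" order. Rows without exactly five fields (such
# as the "..." elision lines) are passed through unchanged.
reorder_columns() {
	awk 'NF == 5 { printf("%5d %2d %5d %7s\n", $2, $1, $4, $5); next }
	     { print }'
}

# Example: one row from the table above.
echo '14    55  2.51%  1805  82.53%' | reorder_columns
```

The per-length percentage is dropped because the quoted table has no
such column.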

The 14-character long names only handle ~83% of my filenames (this
includes directory names, and in particular includes "." and ".." for
every directory, so there is some structural weighting acting against my
statistics here).  Further, 30-character names would still leave nearly 1%
of my filenames, 21 out of 2187, truncated. Some of our users would show
distributions skewed toward much longer filenames, others shorter. Having a shell
with filename completion certainly removes much of the incentive for
short, cryptic filenames.
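
The truncation counts can be read straight off the cumulative column:
2187 - 1805 = 382 names exceed 14 characters, and 2187 - 2166 = 21
exceed 30. A minimal sketch for measuring this directly follows. The
function name count_longer and its single "limit" argument are
assumptions made for the example, and it relies on an awk that
supports -v.

```sh
#!/bin/sh
# Hypothetical helper: read pathnames (one per line) on stdin and report
# how many final path components are longer than the given limit.
count_longer() {
	awk -F/ -v limit="$1" '
		length($NF) > limit { n++ }
		END {
			if (NR > 0)
				printf("%d of %d names (%.2f%%) exceed %d characters\n", n, NR, n / NR * 100, limit)
		}'
}

# Example: names under the current directory that would not fit in
# 14 characters.
find . -print | count_longer 14
```

Running it with 14 and then 30 gives the two figures discussed above
for whatever tree it is pointed at.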

Also, I personally think that statistics like this should be collected
over home directories, not over everything below root, since many of
the filenames in the root and /usr filesystems are inherited from the
original UNIX system rather than chosen more recently. Further, the most
useful place for really large filenames I've seen is in organizing
personal archives, where you can make the name sufficiently descriptive
to make it easier to find later.

For completeness, here's the program I used (a shell script I wrapped
around an awk program someone else posted):

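#!/bin/sh
# Print a histogram of filename lengths: length, count, percent,
# cumulative count, cumulative percent.

progname=`basename "$0"`
awkprg=/tmp/$progname$$

# Remove the temporary awk program on exit, hangup, interrupt, or quit.
trap 'rm -f "$awkprg"; exit 1' 0 1 2 3

cat >"$awkprg" <<'EOF'
# Split each pathname on "/" so that $NF is the final component.
BEGIN {FS = "/"}
{
	l = length($NF)
	c[l]++
	if(l>max) max=l
}
END {
	# Lengths are listed from 1; an empty final component (as in the
	# bare path "/") is counted in NR but not listed.
	for(i=1; i<=max; i++) {
		s += c[i]
		printf("%2d %5d %5.2f%% %5d %6.2f%%\n", i, c[i], c[i]/NR*100, s, s/NR*100)
	}
}
EOF

# Default to the current directory when no arguments are given.
if test $# -eq 0
then
	set '.'
fi

find "$@" -print | awk -f "$awkprg"

# Normal exit: clean up, then clear the trap so it cannot force status 1.
rm -f "$awkprg"
trap "" 0 1 2 3
exit 0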
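
To see what the report looks like on a known input, the same counting
logic can be fed a handful of made-up pathnames instead of find output.
This is an illustrative sketch, not part of the posted script: the
function name length_histogram and the sample paths are invented, and
the loop here starts at 0 rather than 1 so that an empty final
component (as in the bare path "/") is included in the cumulative
total.

```sh
#!/bin/sh
# Hypothetical wrapper around the same counting logic, reading pathnames
# from stdin so it can be tried without walking the filesystem.
length_histogram() {
	awk -F/ '
		{ l = length($NF); c[l]++; if (l > max) max = l }
		END {
			# Start at 0 so an empty last component is not dropped.
			for (i = 0; i <= max; i++) {
				s += c[i]
				printf("%2d %5d %5.2f%% %5d %6.2f%%\n", i, c[i], c[i] / NR * 100, s, s / NR * 100)
			}
		}'
}

# Invented sample input: four names of lengths 1, 3, 3 and 5.
printf '%s\n' . ./foo ./bar ./foo/hello | length_histogram
```

With this input the report has one row per length from 0 to 5, and the
last row's cumulative column reaches all four names.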

-Bennett
bet@orion.mc.duke.edu
P.S. Tonight I'm going to run the same thing over everyone's home
directories on our system, as well as over everything from the root
down; I'll post the results tomorrow if all goes well.

bet@orion.mc.duke.edu (Bennett Todd) (06/15/89)

In article <14749@duke.cs.duke.edu>, I wrote:
>[...]
>The 14-character long names only handle ~83% of my filenames (this
>includes directory names, and in particular includes "." and ".." for
>every directory, so there is some structural weighting acting against my
>statistics here).

...which is of course completely wrongo. Thanks to Matt Crawford for
pointing this out to me in a very polite letter. Sorry about this
misinformation. Find(1) is of course smart enough to refrain from
reporting "." and ".."; indeed, I shouldn't have even had to check to
see what its behavior is. Upon thinking about it even briefly, it
becomes obvious that many, even most of the uses to which find(1) is put
would be broken if it didn't omit "." and ".." (and emitted them :-).

-Bennett
bet@orion.mc.duke.edu

	(1) Start brain
	(2) Engage mouth
	Do not perform in reverse order.

bet@orion.mc.duke.edu (Bennett Todd) (06/15/89)

I said I would post statistics over all our home directories, and over
the whole system, when the runs finished. Well, they finished much more
quickly than I had feared, and I looked them over. Basically, they
agreed much more closely with Peter da Silva's figures than with those
over my home directory. I guess I'm an anomaly :-).

I still think fixed-length filenames belong in the same category as
fixed-length line buffers in editors and suchlike: a reasonable design
compromise for a first revision, to prove the concept and keep the
prototype implementation simple, but not a desirable limitation in a
final production version. And, as with line length limits in editors, I
believe that a GOOD implementation won't inflict either undue code bulk
or undue speed degradation as the price of going to a dynamically
allocated, variable-length implementation.

-Bennett
bet@orion.mc.duke.edu

dik@cwi.nl (Dik T. Winter) (06/15/89)

In article <14752@duke.cs.duke.edu> bet@orion.mc.duke.edu (Bennett Todd) writes:
 > ...which is of course completely wrongo. Thanks to Matt Crawford for
 > pointing this out to me in a very polite letter. Sorry about this
 > misinformation. Find(1) is of course smart enough to refrain from
 > reporting "." and ".."; indeed,
...
Wrongo again.  Of course find is smart enough to include ".".
-- 
dik t. winter, cwi, amsterdam, nederland
INTERNET   : dik@cwi.nl
BITNET/EARN: dik@mcvax

dik@cwi.nl (Dik T. Winter) (06/15/89)

In article <8192@boring.cwi.nl> I write:
 > Wrongo again.  Of course find is smart enough to include ".".

Wrong of course.  I should never have written this.  If I could cancel,
I would.

allbery@ncoast.ORG (Brandon S. Allbery) (06/20/89)

As quoted from <8192@boring.cwi.nl> by dik@cwi.nl (Dik T. Winter):
+---------------
| In article <14752@duke.cs.duke.edu> bet@orion.mc.duke.edu (Bennett Todd) writes:
|  > misinformation. Find(1) is of course smart enough to refrain from
|  > reporting "." and ".."; indeed,
| ...
| Wrongo again.  Of course find is smart enough to include ".".
+---------------

Sigh.  Find includes "." ONLY if you say "find . (...)".  ONE instance,
maximum.  (If you were correct then find would have output like:

		/foo/bar
		/foo/bar/.
		/foo/bar/baz
		...

and watch everything that uses find break!)
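
This is easy to check in a scratch directory. The sketch below builds a
small throwaway tree and lists it with find. The directory layout is
invented for the demonstration, and it assumes mktemp -d is available
on the system.

```sh
#!/bin/sh
# Build a throwaway tree: foo/bar/baz, where baz is a plain file.
tmp=$(mktemp -d) || exit 1
mkdir -p "$tmp/foo/bar"
: > "$tmp/foo/bar/baz"

# Starting from ".", find prints "." once (the starting point itself)
# and never prints "..".
listing=$(cd "$tmp" && find . -print | LC_ALL=C sort)
echo "$listing"

rm -rf "$tmp"
```

The listing contains one line for the starting point and one line for
each of foo, foo/bar and foo/bar/baz, with no "/." or "/.." entries
under the subdirectories.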

++Brandon
-- 
Brandon S. Allbery, moderator of comp.sources.misc	     allbery@ncoast.org
uunet!hal.cwru.edu!ncoast!allbery		    ncoast!allbery@hal.cwru.edu
      Send comp.sources.misc submissions to comp-sources-misc@<backbone>
NCoast Public Access UN*X - (216) 781-6201, 300/1200/2400 baud, login: makeuser