[comp.unix.wizards] Long filenames

barnett@crdgw1.crd.ge.com (Bruce G. Barnett) (06/06/89)

In article <9422@alice.UUCP>, andrew@alice (Andrew Hume) writes:

>	my point is that if you have structured names like
><machine><report>-<option>.<time_period> (barnett's example),
>the "lazy part" is putting that in the filename and using the
>shell's pattern matching to select files. The alternative
>(there are obviously bunches) is to put this database in a file
>that looks like
><datafile>\t<machine>\t<report>\t<option>\t<time_period>
>and use awk (or cut) to select on arbitrary fields. e.g.
>	more `awk '$2=="mymachine"{print $1}' datafile`
>
>this is only slightly more work, much more flexible AND
>doesn't require the kernel to support gargantuan filenames.

Wrong. It would require much more work and be much less flexible.

Example:
	If I wanted to print out all weekly sa -in and sa -im reports
for machines vaxA and sunB that occurred in January, I could type:
My Method:
	print {vaxA,sunB}*sa-i[nm]*Jan*WEEK
Your method:
	print `awk '$2 ~ /vaxA|sunB/ && $3 == "sa" && $4 ~/i[nm]/ && \
		$5 ~ /Jan.*WEEK/  {print $1}' data `

Disadvantages with your method:
	1. Simple queries now require either an AWK programmer or a
	   sophisticated script.
	2. The file "data" must be kept up to date. If 50 files were created
	   a day, and files could be deleted whenever the disk filled up,
	   keeping this file up to date requires an extra step (sketched
	   after this list).
	3. The biggest disadvantage is that if dozens of scripts were written,
	   and it became necessary to change the database, all of the scripts
	   would have to be re-written.
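
To be concrete about the extra step in point 2, here is a rough
sketch of the rebuild such a scheme needs whenever files come and go.
The dot/hyphen field layout is my guess at a name like
vaxA.sa-in.Jan02.WEEK, and it needs an awk that takes a regular
expression for -F:

	# regenerate "data" from the filenames themselves; the fields
	# come out as filename, machine, report, option, time_period
	ls | awk -F'[.-]' \
	    '{ printf "%s\t%s\t%s\t%s\t%s.%s\n", $0, $1, $2, $3, $4, $5 }' > data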

If I had to re-implement my report scheme on a system with filenames
limited to 14 characters, it would have taken me twice as long to do it.

There are so many advantages to long filenames:

I have never had a problem with a shell script that did
	mv $1 $1.orig
(Under a 14-character limit, $1.orig overflows whenever $1 is longer
than nine characters: nine, plus five for ".orig", is fourteen.)

I have enabled GNU emacs's numbered-backups mechanism, so that
old files are renamed file.~1~ file.~2~ ... file.~12~ etc.
(By default GNU emacs keeps the two oldest and two newest versions.)
Under a 14-character limit, this scheme would restrict all of my
non-SCCS awk scripts to five-character names: 14 characters, minus
four for ".awk" and five for ".~12~", leaves five (e.g. abcde.awk.~12~).

I can also use filenames to indicate the function of the script.
(e.g. "archive-to-tape-old-newsgroups" vs. "ar2tape-oldng")

But the biggest win is the ability to use the filename for the data.

Another example is the large USENET archive I keep.
First of all, I store old articles using the format
	./news.group/yy-mm/article-id
(The top directory is the newsgroup. The next directory tells me the year
and month of the posting. The filename is the article-ID of the article).
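
In shell terms the filing step is trivial.  A minimal sketch, with
made-up variable names ($group, $yymm and $id stand for values pulled
out of the article's Newsgroups:, Date: and Message-ID: headers):

	# file one article as ./news.group/yy-mm/article-id
	test -d ./$group       || mkdir ./$group
	test -d ./$group/$yymm || mkdir ./$group/$yymm
	if test -f ./$group/$yymm/$id
	then echo "$id: already archived"
	else mv $article ./$group/$yymm/$id
	fi

Because the filename IS the article-ID, the duplicate check falls out
for free.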

There is a one-line summary of the subject line in the file
	./LOGS/news.group
which contains the filename and subject line. When I archive the big-old
newsgroups to tape, the log file is renamed (appending the current date
to the filename).
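
The log step is just as small.  A sketch, using the same made-up
variables as above (note the literal tab in the echo):

	# append "filename<TAB>subject" to the group's summary file
	subj=`sed -n 's/^Subject: //p' ./$group/$yymm/$id | sed 1q`
	echo "./$group/$yymm/$id	$subj" >> ./LOGS/$group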

There are so many advantages to this scheme. Articles are always a known
depth from the top (comp.binaries.ibm.pc.d vs. comp/binaries/ibm/pc/d).
There is a simple way to determine if an article is archived twice.
The filename contains all of the information needed. I don't have to
search another file to determine the name for the archive.

I now have the following pieces of data available:
	The newsgroup
	The year and month of the posting
	The article-ID
	The machine the article was posted from
	The filename of the article on the disk or tape
If the article has been archived to tape, I also have:
	The name of the tape the archive is on
	When I created the above tape

Now ALL of the above information is stored in filenames. The log file
is the filename and the subject line. I don't need files to keep information
about files, especially when I have to keep track of 400,000 files.

My database queries are done with grep, cat and find. Once I find the
file I am interested in, I use a very simple awk command to do something
with the files.
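
For example (group name and search pattern made up):

	# which archived postings mention compress?  grep the summaries:
	grep compress ./LOGS/comp.sources.unix
	# what arrived in June of 1989 for that group?  ls the month:
	ls ./comp.sources.unix/89-06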

In short, the fact that I have hundreds of thousands of files with
names up to 30 characters long (which is NOT gargantuan in my mind)
gives me a simple, elegant method of organizing data.

Doing the same on a machine with the archaic limit of 14 characters
would have made the task more difficult, more complex, less flexible,
and less efficient.

--
Bruce G. Barnett	<barnett@crdgw1.ge.com>  a.k.a. <barnett@[192.35.44.4]>
			uunet!crdgw1.ge.com!barnett barnett@crdgw1.UUCP

peter@ficc.uu.net (Peter da Silva) (06/07/89)

In article <629@crdgw1.crd.ge.com>, barnett@crdgw1.crd.ge.com (Bruce G. Barnett) writes:
> In article <9422@alice.UUCP>, andrew@alice (Andrew Hume) writes:
> Example:
> 	If I wanted to print out all weekly sa -in and sa -im reports
> for machines vaxA and sunB that occurred in January, I could type:

> My Method:
> 	print {vaxA,sunB}*sa-i[nm]*Jan*WEEK

> Your method:
> 	print `awk '$2 ~ /vaxA|sunB/ && $3 == "sa" && $4 ~/i[nm]/ && \
> 		$5 ~ /Jan.*WEEK/  {print $1}' data `

My method:
	print {vaxA,sunB}/sa/i[nm]/Jan/*WEEK

Disadvantages with your method:

	You need lots of long file names.

	'ls -C' is useless.

	'ls' takes forever.

> If I had to re-implement my report scheme on a system with filenames
> limited to 14 characters, it would have taken me twice as long to do it.

Not at all. It would take you no longer... hierarchical directories are
a wonderful tool.

14 characters is getting a little cramped, but I've never run out of 30.

> Another example is the large USENET archive I keep.
> First of all, I store old articles using the format
> 	./news.group/yy-mm/article-id

Why not /news/group/...?

> There is a one-line summary of the subject line in the file
> 	./LOGS/news.group

./LOGS/news/group...

> There are so many advantages to this scheme. Articles are always a known
> depth from the top (comp.binaries.ibm.pc.d vs. comp/binaries/ibm/pc/d).

The only problem with this is that UNIX wildcards don't support ellipses.
One of the very few VMS features I genuinely miss... and a much more useful
tool than superlongfilenames. But surely you don't find "find" to be THAT
hard to use.
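
For instance, the recursive match that VMS spells with [...] is one
find away:

	# VMS's  [...]*WEEK;  in UNIX, a pattern at any depth is:
	find . -name '*WEEK' -print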

> [14 characters] would have made the task more difficult, more complex,
> less flexible and less efficient.

Not at all.
-- 
Peter da Silva, Xenix Support, Ferranti International Controls Corporation.

Business: uunet.uu.net!ficc!peter, peter@ficc.uu.net, +1 713 274 5180.
Personal: ...!texbell!sugar!peter, peter@sugar.hackercorp.com.

flee@shire.cs.psu.edu (Felix Lee) (06/09/89)

In article <4439@ficc.uu.net>,
   peter@ficc.uu.net (Peter da Silva) writes:
>	You need lots of long file names.
>	'ls -C' is useless.
>	'ls' takes forever.

'ls -C' is only useless on pathologically small terminals (such as
24x80 :-).  The columnizer could also be improved to let exceptionally
long names violate column boundaries.

'ls' takes forever only if you use the '-l', '-F', or similar options
that have to stat() each file.  'ls' without those options is
marginally slower than 'sort'.

Curiously, 'csh' filecompletion listing is quite fast, much faster
than 'ls -F'.  This is mostly illusion.  'ls' stats each file as it
reads the directory and then prints the list out.  'csh' reads the
directory and then stats each file as it's printed out.  The effect is
that 'csh' seems faster because of the better average response time.
'ls' could do something similar with special-case code (albeit for the
most common case).  Might be worth it.
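
You can fake the csh ordering in the shell, for what it's worth:
stat each name only at the moment it is printed, so the first names
show up at once (this marks directories only, not the rest of -F):

	ls | while read f
	do
		test -d "$f" && f="$f/"	# one stat() per name, then print
		echo "$f"
	done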
--
Felix Lee	flee@shire.cs.psu.edu	*!psuvax1!shire!flee

fuat@cunixc.cc.columbia.edu (Fuat C. Baran) (06/09/89)

In article <FLEE.89Jun8143358@shire.cs.psu.edu> flee@shire.cs.psu.edu (Felix Lee) writes:
>'ls' takes forever only if you use the '-l', '-F', or similar options
>that have to stat() each file.  'ls' without those options is
>marginally slower than 'sort'.

One reason that "ls -l" is slow on some systems is the getpwuid()
call used to figure out each file owner's name.  This is a problem on
systems where these lookups don't go through dbm but instead open
/etc/passwd and scan it entry by entry.  To see how slow it can get,
try doing an "ls -l" in /usr/spool/mail on a large system with lots of
unread mail.  I remember seeing this take 25 minutes on our VAX 8700
running Ultrix 2.0.  We then replaced the getpw* family of routines
with dbm'ized versions and the time dropped to under a minute.
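
If your ls has -n (numeric uids and gids, hence no getpwuid() at
all), the two timings isolate the passwd searching nicely:

	time ls -l  /usr/spool/mail > /dev/null		# with name lookups
	time ls -ln /usr/spool/mail > /dev/null		# without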

						--Fuat
-- 
INTERNET: fuat@columbia.edu          U.S. MAIL: Columbia University
BITNET:   fuat@cunixc.cc.columbia.edu           Center for Computing Activities
USENET:   ...!rutgers!columbia!cunixc!fuat      712 Watson Labs, 612 W115th St.
PHONE:    (212) 854-5128                        New York, NY 10025

ugkamins@sunybcs.uucp (Dr. R. Chandra) (06/09/89)

First I wish to respond to some of the letters I have received and
some postings about why I don't favor symlinks.  The bulk of all this
is complaining about linking across filesystems.  Yes, true, I knew
that, but as long as filesystems can be unmounted and remounted in
different spots, this can cause inconsistencies.  I still am unsure
about the efficiency of such constructs, which was the subject of the
original post.

My main gripe is that I wished (and knew it probably wouldn't work,
but I tried it anyway) that a symlink in my home dir would hold onto
some data in /tmp/john after its referent was trashed in a cleanout
of /tmp, without taking up my quota.  Dream on, I thought, and dream
on I would have to do.  Now if there were some way to keep those
referenced inodes and disk blocks alive, that would be wonderful.
But alas, they would probably be added to my quota the instant mine
was the last link to the file.
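
For contrast, the two kinds of link in two lines (the path is made up):

	ln -s /tmp/john/data $HOME/data	# symlink: dies with the /tmp cleanout
	ln /tmp/john/data $HOME/data	# hard link: keeps the blocks alive,
					# but only within one filesystem, and
					# (as I said) presumably on my quota
					# once mine is the last link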

barnett@crdgw1.ge.com (Bruce G. Barnett) wrote:
.
.
.
=>The filename contains all of the information needed. I don't have to
=>search another file to determine the name for the archive.

Another plus not necessarily stressed enough is that the information
is current, as long as the system "survives" to write everything to
disk.  A while back I wrote what amounted to a BBS for a Prime 750
using CPL for the computer science club at Erie (County) Community
College, North Campus.  I used the Primos file system to store and
keep track of almost everything, so that in the event of the failure
of my program, or of the system, everything would be basically
intact.  If I had used my own files to keep track of everything, as
many BBS programs do, the list of available files according to my
records and the files actually present could be two totally different
things.  Although this did not totally eliminate the need for error
checking upon opening a file, for instance, it did eliminate the need
for a utility program to go through the filesystem and update the
BBS's idea of what the real world looked like.

Perhaps the most important part in all this is that a filing system
is there to USE, not just to store your information.  Do you want a
new message area?  Simply create a new directory, and when the
program lists the message areas by asking for the directories in that
particular subdirectory, it is there.  No need to update an
"available message areas" data file.  No source code modification (if
I were stupid or inept enough to hard-code the list of message areas
into the program).  (Now that I think of it, I could have slowed
things down a bit but gained generality by doing the same with the
list of available commands.)
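
The same idea in UNIX shell terms (paths made up):

	# the message-area menu IS the directory listing:
	ls /bbs/areas
	# a new area is one mkdir; there is no data file to update:
	mkdir /bbs/areas/new.area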

Use the filesystem to its fullest advantage, just as make does by
checking the timestamps on files (oh, no...did I just reopen the
make/gnumake/nmake war?).
---
From one "super" user to another:
Where do we keep our bison and yacc?  In a zoo of course!
"Answer my question, tell me no lies.
 Is this the real real world, or a fool's paradise?" 
  -- Eric Woolfson & Alan Parsons
(Lately, I'm beginning to believe the truth is the second case.)
ugkamins@sunybcs.UUCP

barnett@crdgw1.crd.ge.com (Bruce G. Barnett) (06/09/89)

In article <4439@ficc.uu.net>, peter@ficc (Peter da Silva) writes:
>My method:
>	print {vaxA,sunB}/sa/i[nm]/Jan/*WEEK
>
>Disadvantages with your method:
>
>	You need lots of long file names.

So? What's the problem with that? I didn't see any smoke coming out of
the disk.

I said:
>> If I had to re-implement my report scheme on a system with filenames
>> limited to 14 characters, it would have taken me twice as long to do it.
>
>Not at all. It would take you no longer...

and I said:

>> [14 characters] would have made the task more difficult, more complex,
>> less flexible and less efficient.
>
>Not at all.

I am amazed that you know *SO* much about my programs, and the conditions
I had to develop them under.

>hierarchical directories are a wonderful tool.

That's why most operating systems have them. The only systems I have
ever used that didn't have them were bootstrapped from paper tape.

I am tired and I'm afraid I am repeating myself, but it should be
obvious that if you have to write 50 scripts that are tightly
integrated (I'm talking about reports used as data for reports that
are used in other reports: summaries of summaries of summaries), and
you change the database around (i.e. change the depth of the
directories, the locations of the files, the patterns used to match
the filenames, etc.), the scripts will break.

On the other hand, if I wanted to add a "field" in the middle of a
filename, the regular expressions I use to "query" the database remain
the same.

Understand? I can change the database, and my scripts don't break.
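
A concrete case (the extra "building" field is made up):

	old name:   vaxA.sa-in.Jan02.WEEK
	new name:   vaxA.bldg5.sa-in.Jan02.WEEK
	same query: print {vaxA,sunB}*sa-i[nm]*Jan*WEEK

The * between the machine and the report soaks up the new field, so
the old queries match both forms of the name.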

And since I had to do this project in 1/10th the time I would have
preferred to allocate, while my boss kept asking for reports that
required dozens of modifications to the "database", I believe I am
more of an authority on the effort required than you are.

If I had to do it all over again, I would have used the same mechanism
to organize the database.

Of course if I had to develop a portable, maintainable, and CPU
efficient package, I would have designed a completely different system.
But that's not what I am talking about, nor the point I am trying to make.

The point is, if you never had a system with long filenames,
you are never given an opportunity to discover how useful long
filenames can be.

--
Bruce G. Barnett	<barnett@crdgw1.ge.com>  a.k.a. <barnett@[192.35.44.4]>
			uunet!crdgw1.ge.com!barnett barnett@crdgw1.UUCP