barnett@crdgw1.crd.ge.com (Bruce G. Barnett) (06/06/89)
In article <9422@alice.UUCP>, andrew@alice (Andrew Hume) writes:

> my point is that if you have structured names like
> <machine><report>-<option>.<time_period> (barnett's example),
> the "lazy part" is putting that in the filename and using the
> shell's pattern matching to select files. The alternative
> (there are obviously bunches) is to put this database in a file
> that looks like
> <datafile>\t<machine>\t<report>\t<option>\t<time_period>
> and use awk (or cut) to select on arbitrary fields. e.g.
>         more `awk '$2=="mymachine"{print $1}'`
>
> this is only slightly more work, much more flexible AND
> doesn't require the kernel to support gargantuan filenames.

Wrong. It would require much more work and be much less flexible.

Example: If I wanted to print out all weekly sa -in and sa -im reports
for machines vaxA and SunB that occurred in January, I could type:

My method:

        print {vaxA,sunB}*sa-i[nm]*Jan*WEEK

Your method:

        print `awk '$2 ~ /vaxA|sunB/ && $3 == "sa" && $4 ~ /i[nm]/ && \
                $5 ~ /Jan.*WEEK/ {print $1}' data `

Disadvantages with your method:

1. Simple queries now require either an AWK programmer or a
   sophisticated script.

2. The file "data" must be kept up to date. If 50 files were created a
   day, and files could be deleted whenever the disk filled up, keeping
   this file up to date requires an extra step.

3. The biggest disadvantage is that if dozens of scripts were written,
   and it became necessary to change the database, all of the scripts
   would have to be rewritten.

If I had to re-implement my report scheme on a system with filenames
limited to 14 characters, it would have taken me twice as long to do it.

There are so many advantages to long filenames:

I have never had a problem with a shell script that did

        mv $1 $1.orig

I have enabled GNU Emacs's numbered-backup mechanism, so that old files
are renamed file.~1~ file.~2~ ... file.~12~ etc. (By default GNU Emacs
keeps the two oldest and two newest versions.) If I used this scheme on
a 14-character system, then all of my non-SCCS awk scripts would be
limited to five-character names (e.g. abcde.awk.~12~ is exactly 14
characters).

I can also use filenames to indicate the function of the script
(e.g. "archive-to-tape-old-newsgroups" vs. "ar2tape-oldng").

But the biggest win is the ability to use the filename for the data.

Another example is the large USENET archive I keep. First of all, I
store old articles using the format

        ./news.group/yy-mm/article-id

(The top directory is the newsgroup. The next directory tells me the
year and month of the posting. The filename is the article-ID of the
article.) There is a one-line summary of the subject line in the file

        ./LOGS/news.group

which contains the filename and subject line. When I archive the big
old newsgroups to tape, the log file is renamed (appending the current
date to the filename).

There are so many advantages to this scheme. Articles are always a
known depth from the top (comp.binaries.ibm.pc.d vs.
comp/binaries/ibm/pc/d). There is a simple way to determine if an
article is archived twice. The filename contains all of the information
needed. I don't have to search another file to determine the name for
the archive.

I now have the following pieces of data available:

        The newsgroup
        The year and month of the posting
        The article-ID
        The machine the article was posted from
        The filename of the article on the disk or tape

If the article has been archived to tape, I also have:

        The name of the tape the archive is on
        When I created that tape

Now ALL of the above information is stored in filenames. The log file
is the filename and the subject line.
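For example (a sketch: the article-ID here is made up, and it assumes
the angle brackets are stripped when the ID becomes a filename):

        # every comp.sources.unix article archived for June 1989
        ls comp.sources.unix/89-06

        # has this article been archived twice, anywhere?
        find . -name '123@host.UUCP' -print

        # which tape holds an old article?  grep the dated log files
        grep '123@host.UUCP' LOGS/comp.sources.unix.*

The last query works because each renamed log carries the filename,
the subject line, and (in its own name) the date the tape was cut.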
I don't need files to keep information about files, especially when I
have to keep track of 400,000 files. My database queries are done with
grep, cat and find. Once I find the file I am interested in, I use a
very simple awk command to do something with the files.

In short, the fact that I have hundreds of thousands of files with
names 30 characters long (which is NOT gargantuan in my mind) allows me
a simple, elegant method of organizing data. The same task on a machine
with the archaic limit of 14 characters would have been more difficult,
more complex, more inflexible and more inefficient.
--
Bruce G. Barnett        <barnett@crdgw1.ge.com> a.k.a. <barnett@[192.35.44.4]>
                        uunet!crdgw1.ge.com!barnett     barnett@crdgw1.UUCP
peter@ficc.uu.net (Peter da Silva) (06/07/89)
In article <629@crdgw1.crd.ge.com>, barnett@crdgw1.crd.ge.com (Bruce G. Barnett) writes:
> In article <9422@alice.UUCP>, andrew@alice (Andrew Hume) writes:
>
> Example: If I wanted to print out all weekly sa -in and sa -im reports
> for machines vaxA and SunB that occurred in January, I could type:
>
> My method:
>         print {vaxA,sunB}*sa-i[nm]*Jan*WEEK
>
> Your method:
>         print `awk '$2 ~ /vaxA|sunB/ && $3 == "sa" && $4 ~ /i[nm]/ && \
>                 $5 ~ /Jan.*WEEK/ {print $1}' data `

My method:

        print {vaxA,sunB}/sa/i[nm]/Jan/*WEEK

Disadvantages with your method:

        You need lots of long file names.
        'ls -C' is useless.
        'ls' takes forever.

> If I had to re-implement my report scheme on a system with filenames
> limited to 14 characters, it would have taken me twice as long to do it.

Not at all. It would take you no longer... hierarchical directories are
a wonderful tool. 14 characters is getting a little cramped, but I've
never run out of 30.

> Another example is the large USENET archive I keep. First of all, I
> store old articles using the format
>         ./news.group/yy-mm/article-id

Why not /news/group/...?

> There is a one-line summary of the subject line in the file
>         ./LOGS/news.group

./LOGS/news/group...

> There are so many advantages to this scheme. Articles are always a
> known depth from the top (comp.binaries.ibm.pc.d vs.
> comp/binaries/ibm/pc/d).

The only problem with this is that UNIX wildcards don't support
ellipses. One of the very few VMS features I genuinely miss... and a
much more useful tool than superlongfilenames. But surely you don't
find "find" to be THAT hard to use.

> [14 characters] would have made the task more difficult, more complex,
> more inflexible and more inefficient.

Not at all.
--
Peter da Silva, Xenix Support, Ferranti International Controls Corporation.
Business: uunet.uu.net!ficc!peter, peter@ficc.uu.net, +1 713 274 5180.
Personal: ...!texbell!sugar!peter, peter@sugar.hackercorp.com.
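What VMS spells with [comp...] the Unix toolbox spells with find. A
sketch, supposing the yy-mm level is kept under the slash-separated
group directories:

        # every article archived under comp, at any depth, for June 1989
        find comp -type d -name 89-06 -exec ls {} \;

Clumsier than an ellipsis inside a wildcard, but it composes with
everything else in the pipeline.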
flee@shire.cs.psu.edu (Felix Lee) (06/09/89)
In article <4439@ficc.uu.net>, peter@ficc.uu.net (Peter da Silva) writes:
> You need lots of long file names.
> 'ls -C' is useless.
> 'ls' takes forever.

'ls -C' is only useless on pathologically small terminals (such as
24x80 :-). The columnizer could also be improved to let exceptionally
long names violate column boundaries.

'ls' takes forever only if you use the '-l', '-F', or similar options
that have to stat() each file. 'ls' without those options is
marginally slower than 'sort'.

Curiously, 'csh' file-completion listing is quite fast, much faster
than 'ls -F'. This is mostly illusion. 'ls' stats each file as it
reads the directory and then prints the list out. 'csh' reads the
directory and then stats each file as it's printed out. The effect is
that 'csh' seems faster because of the better average response time.
'ls' could do something similar with special-case code (albeit for the
most common case). Might be worth it.
--
Felix Lee       flee@shire.cs.psu.edu   *!psuvax1!shire!flee
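The csh trick is easy to mimic in the Bourne shell (a rough sketch of
the idea, not of csh's actual code): each name is stat()ed only at the
moment it is printed, so output starts after one stat() rather than
after a full pass over the directory.

        # ls -F style marking, one stat (via test) per name at print time
        for f in *
        do
                if test -d "$f"; then
                        echo "$f/"
                elif test -x "$f"; then
                        echo "$f*"
                else
                        echo "$f"
                fi
        done

The total work is the same; only the perceived response time improves,
which is exactly the point above.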
fuat@cunixc.cc.columbia.edu (Fuat C. Baran) (06/09/89)
In article <FLEE.89Jun8143358@shire.cs.psu.edu> flee@shire.cs.psu.edu (Felix Lee) writes:
> 'ls' takes forever only if you use the '-l', '-F', or similar options
> that have to stat() each file. 'ls' without those options is
> marginally slower than 'sort'.

One reason that "ls -l" is slow on some systems is the getpwuid() call
used to figure out each file owner's name. This is a problem on systems
where these lookups don't go through dbm but are done by opening
/etc/passwd and reading entries sequentially. To see how slow it can
get, try doing an "ls -l" in /usr/spool/mail on a large system with
lots of unread mail. I remember seeing this take 25 minutes on our VAX
8700 running Ultrix 2.0. We then replaced the getpw* family of routines
with dbm'ized versions and the time dropped to under a minute.

                                                --Fuat
--
INTERNET: fuat@columbia.edu             U.S. MAIL: Columbia University
BITNET:   fuat@cunixc.cc.columbia.edu              Center for Computing Activities
USENET:   ...!rutgers!columbia!cunixc!fuat         712 Watson Labs, 612 W115th St.
PHONE:    (212) 854-5128                           New York, NY 10025
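The dbm fix lives in libc, but the same economy can be had in a
pipeline (a sketch: it assumes an ls that supports -n for numeric
owners and an awk new enough to have getline):

        # resolve owners with ONE pass over /etc/passwd
        # instead of one scan per file; columns get re-spaced
        ls -ln | awk '
                BEGIN {
                        while ((getline line < "/etc/passwd") > 0) {
                                split(line, f, ":")   # f[1]=name, f[3]=uid
                                name[f[3]] = f[1]
                        }
                }
                $3 in name { $3 = name[$3] }
                { print }'

One pass builds the whole uid-to-name table; an ls -l backed by a
sequential-scan getpwuid() pays for a scan on every single file.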
ugkamins@sunybcs.uucp (Dr. R. Chandra) (06/09/89)
First I wish to respond to some of the letters I have received and some
postings about why I don't favor symlinks. The bulk of all this was
complaints about linking across filesystems. Yes, true, I knew that,
but as long as filesystems can be unmounted and remounted in different
spots, this can cause inconsistencies. I am still unsure about the
efficiency of such constructs, which was the discussion of the original
post.

My main gripe is that I wished (and knew it probably wouldn't work, but
I tried it anyway) that a symlink in my home dir would hold on to some
data in /tmp/john after its referent was trashed by a cleanout of /tmp,
without taking up my quota. Dream on, I thought, and dream on I would
have to do. Now if there were some way to still keep those referenced
inodes and disk blocks in use, that would be wonderful. But alas, the
space would probably be added to my quota the instant mine was the last
link to the file.

barnett@crdgw1.ge.com (Bruce G. Barnett) wrote:
. . .
=>The filename contains all of the information needed. I don't have to
=>search another file to determine the name for the archive.

Another plus not necessarily stressed enough is that the information is
current, as long as the system "survives" to write everything to disk.
A while back I wrote what amounted to a BBS for a Prime 750 using CPL
for the computer science club at Erie (County) Community College, North
Campus. I used the Primos file system to store and keep track of almost
everything, so that in the event of the failure of my program, or the
failure of the system, everything would be basically intact. If I had
used my own files to keep track of everything, as many BBS programs do,
my files' version of what was available and the version of what was
actually there could be two totally different things. Although this did
not totally eliminate the need for error checking upon opening a file,
for instance, it did eliminate the need for a utility program to go
through the filesystem and update the BBS's idea of what the real world
looked like.

Perhaps the most important part in all this is that a filing system is
there to USE, not just to store your information. Do you want a new
message area? Simply create a new directory, and when the program goes
to list the message areas available by asking for a list of directories
in that particular subdirectory, it is there. No need to update an
"available message area" data file. No source code modification (if I
were stupid or inept or whatever enough to hard-code the list of
message areas into the program). (Now that I think of it, I could have
slowed things down a bit but enhanced generality by doing similar
things with the available-commands list.)

Use the filesystem to its fullest advantage, just as make does by
checking the timestamps on files (oh, no... did I just reopen the
make/gnumake/nmake war?).
---
From one "super" user to another: Where do we keep our bison and yacc?
In a zoo of course!
"Answer my question, tell me no lies. Is this the real real world, or
a fool's paradise?" -- Eric Woolfson & Alan Parsons
(Lately, I'm beginning to believe the truth is the second case.)
ugkamins@sunybcs.UUCP
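In Unix shell terms (the original was CPL on Primos, and /usr/bbs/areas
is a made-up path), the whole "available areas" query is one line:

        # the list of message areas IS the list of subdirectories:
        # mkdir publishes a new area, and no data file needs updating
        ls -F /usr/bbs/areas | sed -n 's,/$,,p'

ls -F marks directories with a trailing slash; the sed keeps only
those entries and strips the slash off.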
barnett@crdgw1.crd.ge.com (Bruce G. Barnett) (06/09/89)
In article <4439@ficc.uu.net>, peter@ficc (Peter da Silva) writes:
> My method:
>         print {vaxA,sunB}/sa/i[nm]/Jan/*WEEK
>
> Disadvantages with your method:
>
>         You need lots of long file names.

So? What's the problem with that? I didn't see any smoke coming out of
the disk.

I said:
>> If I had to re-implement my report scheme on a system with filenames
>> limited to 14 characters, it would have taken me twice as long to do it.
>
> Not at all. It would take you no longer...

and I said:
>> [14 characters] would have made the task more difficult, more complex,
>> more inflexible and more inefficient.
>
> Not at all.

I am amazed that you know *SO* much about my programs, and the
conditions I had to develop them under.

> hierarchical directories are a wonderful tool.

That's why most operating systems have them. The only systems I have
ever used that didn't have them were bootstrapped from paper tape.

I am tired and I'm afraid I am repeating myself, but it should be
obvious that if you have to write 50 scripts that are tightly
integrated (I'm talking about reports used as data for reports that are
used in other reports. Summaries of summaries of summaries.) and you
change the database around (i.e. change the depth of the directories,
the locations of the files, the pattern used to match the filenames,
etc.), the scripts will break.

On the other hand, if I wanted to add a "field" in the middle of a
filename, the regular expressions I use to "query" the database remain
the same. Understand? I can change the database, and my scripts don't
break.

And since I had to do this project in 1/10th the time I would prefer to
allocate, while my boss kept asking for reports that required dozens of
modifications to the "database", I believe I am more of an authority on
the effort required than you are.

If I had to do it all over again, I would have used the same mechanism
to organize the database. Of course, if I had to develop a portable,
maintainable, and CPU-efficient package, I would have designed a
completely different system. But that's not what I am talking about,
nor the point I am trying to make.

The point is, if you have never had a system with long filenames, you
are never given an opportunity to discover how useful long filenames
can be.
--
Bruce G. Barnett        <barnett@crdgw1.ge.com> a.k.a. <barnett@[192.35.44.4]>
                        uunet!crdgw1.ge.com!barnett     barnett@crdgw1.UUCP
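The field argument is easy to see with a made-up pair of names: the
glob never counts fields, so it survives a schema change.

        old name:   vaxA.sa-in.Jan-02-1989.WEEK
        new name:   vaxA.bldg5.sa-in.Jan-02-1989.WEEK   (a "site" field added)

        # the same query matches both generations of the database
        print {vaxA,sunB}*sa-i[nm]*Jan*WEEK

The awk alternative addresses columns by position, so inserting one
field turns every $3 into $4 and breaks every script that says $3.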