pag@hao.UUCP (07/13/83)
echo extracting: echo READ.ME cat > READ.ME << 'NEWFILE' FAST FIND MODIFICATIONS (see ;login: 3/83) Run the shell on this file to separate into components (poor man's network archive style). They are: READ.ME news.item -- brief announcement for local news find.1 -- manual page Makefile find.squeeze -- csh script to build the pathname database find.bigram.c -- called from find.squeeze to list bigrams find.code.c -- does the actual filename coding additions to find.c: find.c.mod1 -- header find.c.mod2 -- invocation to fastfind() find.c.mod3 -- fastfind() itself Procedure (to be done in a local source directory): (1) in 'Makefile': (a) define BINDIR to be the directory to contain the new executable 'find', which overrides the standard (Bell/Berkeley) command. (b) LIBDIR is to house the auxiliary programs 'find.squeeze', 'find.bigram', and 'find.code'. Create this directory if it does not already exist. (c) by changing 'chown root' to 'chown daemon', and by running as 'daemon' instead of 'root' in step (4) below, the contents of non-searchable directories will be kept out of the filelist. We do NOT do this. in 'find.squeeze': (a) specify LIBDIR as above. (b) set STDFIND to the standard 'find' command. (c) set FINDHONCHO to the name of some user who can act on any [rare] error messages emanating from the script. (d) fix a daily time for the 'at' command (ATTIME). Alternatively, wire it into /usr/lib/crontab for finer control. The script takes about 20 min. on our 11/70 running v7. (e) on 4.1/4.2 systems where 'at' runs the Berkeley shell as standard, take out the lines containing 'EOF'. (2) cp /usr/src/cmd/find.c find.c.old Integrate the three find.c mods. Start with cat find.c.mod1 find.c.old find.c.mod3 > find.c Then edit find.c to incorporate find.c.mod2--place this after the main() declarations and before the time() call. FCODES, as in the Makefile, tells where the database lives. Our 20000 or so files (100 MB content) takes up ~100K bytes in /etc. Remove find.c.mod[1-3] when done. (3) Run the Makefile as root. (4) Do (as root) at [suitable time] find.squeeze & Come back in awhile, then try something like find find or maybe find / | wc (to yield total number of files on system) (5) Install the 'man' page. Announce 'find' to the world with something like 'news.item'. NEWFILE echo news.item cat > news.item << 'NEWFILE' >From jaw Tue Jun 9 12:35:48 1981 To: news Subj: The Ames fast file finder To locate files on this system in seconds, type find <name> where <name> is any character string. E.g. find xyz lists all files whose pathname contains 'xyz'. Shell filename "globbing" is also permitted, i.e.: find '????' will list all four-letter files on the system. vtroff -man `find '*man*ls.[0-9u]'` would typeset all manual pages for 'ls' variants. The code is actually a superfast implementation of find / -mtime +0 -name "*<name>*" -print with a much simplified form. Note that the two-argument invocation of 'find' does not conflict with the syntax of the standard utility, so that the more verbose 'find' will behave as expected. The .cshrc idiom alias f 'set noglob; find \*; unset noglob' proves handy for saving keystrokes. NEWFILE echo find.1 cat > find.1 << 'NEWFILE' .TH FIND AMES .SH NAME find \- find files .SH SYNOPSIS .B find pathname-list expression .br .B find name .SH DESCRIPTION .I Find recursively descends the directory hierarchy for each pathname in the .I pathname-list (i.e., one or more pathnames) seeking files that match a boolean .I expression written in the primaries given below. In the descriptions, the argument .I n is used as a decimal integer where .I +n means more than .I n, .I \-n means less than .I n and .I n means exactly .IR n . .PP The second simplified form will list all files on the system whose pathname contains .I name. This is similar to find / -mtime +0 -name "*<name>*" -print but much faster. As with .B -name below, shell syntax may be used for .I name. .TP 10n .BR \-name " filename" True if the .I filename argument matches the current file name. Normal Shell argument syntax may be used if escaped (watch out for `[', `?' and `*'). .TP .BR \-perm " onum" True if the file permission flags exactly match the octal number .I onum (see .IR chmod (1)). If .I onum is prefixed by a minus sign, more flag bits (017777, see .IR stat (2)) become significant and the flags are compared: .IR (flags&onum)==onum . .TP .BR \-type " c" True if the type of the file is .I c, where .I c is .B "b, c, d" or .B f for block special file, character special file, directory or plain file. .TP .BR \-links " n" True if the file has .I n links. .TP .BR \-user " uname" True if the file belongs to the user .I uname (login name or numeric user ID). .TP .BR \-group " gname" True if the file belongs to group .I gname (group name or numeric group ID). .TP .BR \-size " n" True if the file is .I n blocks long (512 bytes per block). .TP .BR \-inum " n" True if the file has inode number .I n. .TP .BR \-atime " n" True if the file has been accessed in .I n days. .TP .BR \-mtime " n" True if the file has been modified in .I n days. .TP .BR \-exec " command" True if the executed command returns a zero value as exit status. The end of the command must be punctuated by an escaped semicolon. A command argument `{}' is replaced by the current pathname. .TP .BR \-ok " command" Like .B \-exec except that the generated command is written on the standard output, then the standard input is read and the command executed only upon response .BR y . .TP .B \-print Always true; causes the current pathname to be printed. .TP .BR \-newer " file" True if the current file has been modified more recently than the argument .I file. .PP The primaries may be combined using the following operators (in order of decreasing precedence): .TP 4 1) A parenthesized group of primaries and operators (parentheses are special to the Shell and must be escaped). .TP 4 2) The negation of a primary (`!' is the unary .I not operator). .TP 4 3) Concatenation of primaries (the .I and operation is implied by the juxtaposition of two primaries). .TP 4 4) Alternation of primaries .RB "(`" \-o "' is the" .I or operator). .SH EXAMPLES To generate nancy's file types: .IP find nancy | pump file .PP To typeset all variants of manual pages for 'ls': .IP vtroff -man `find '*man*/ls.?'` .PP To remove all files named `a.out' or `*.o' that have not been accessed for a week: .IP find / \\( \-name a.out \-o \-name '*.o' \\) \-atime +7 \-exec rm {} \\; .SH FILES /etc/passwd .br /etc/group .br /etc/find.codes coded filenames .SH "SEE ALSO" sh(1), test(1), filsys(5) .br Relevant paper in February, 1983 issue of .I ;login:. .SH BUGS The syntax (except for the second form), is painful. NEWFILE echo Makefile cat > Makefile << 'NEWFILE' BINDIR=/usr/local LIBDIR=/usr/lib/find find: find.c find.bigram.c find.code.c cc -n -s -O find.c -o $(BINDIR)/find cc -s -O find.bigram.c -o $(LIBDIR)/find.bigram cc -s -O find.code.c -o $(LIBDIR)/find.code cp find.squeeze $(LIBDIR) chown root $(LIBDIR)/find.squeeze chmod og-wx $(LIBDIR)/find.squeeze NEWFILE echo find.squeeze cat > find.squeeze << 'NEWFILE' #! /bin/csh --this line for VAX BSD--next and last line for V7 only /bin/csh << 'EOF' set LIBDIR = /usr/lib/find # for subprograms set STDFIND = /bin/find # in /usr/bin on some systems set ATTIME = 0130 # daily set FINDHONCHO = root # for error messages set FCODES = /etc/find.codes # the database set path = ( $LIBDIR /usr/ucb /bin /usr/bin ) set bigrams = /tmp/f.bigrams$$ set filelist = /tmp/f.list$$ set errs = /tmp/f.errs$$ # Make a file list and compute common bigrams. # Alphabetize '/' before any other char with 'tr'. # If the system is very short of sort space, 'find.bigram' can be made # smarter to accumulate common bigrams directly without sorting # ('awk', with its associative memory capacity, can do this in several # lines, but is too slow, and runs out of string space on small machines). at $ATTIME $LIBDIR/find.squeeze nice +10 $STDFIND / -print | tr '/' '\001' | \ (sort -f; echo $status > $errs) | \ tr '\001' '/' | tee $filelist | $LIBDIR/find.bigram | \ (sort; echo $status >> $errs) | uniq -c | sort -nr | \ awk '{ if (NR <= 128) print $2 }' | tr -d '\012' > $bigrams # code the file list if { grep -s -v 0 $errs } then echo 'find.squeeze error: out of sort space' | mail $FINDHONCHO else $LIBDIR/find.code $bigrams < $filelist > $FCODES rm $bigrams $filelist $errs endif 'EOF' NEWFILE echo find.bigram.c cat > find.bigram.c << 'NEWFILE' /* ******************************************************************************** find.bigram < text > bigrams List bigrams for 'find.squeeze' script. Use 'find.code' to encode a file using this output. ******************************************************************************** */ #include <stdio.h> #define MAXPATH 1024 /* maximum pathname length */ char path[MAXPATH]; char oldpath[MAXPATH] = " "; main ( ) { register int count, j; while ( gets ( path ) != NULL ) { count = prefix_length ( oldpath, path ); /* output post-residue bigrams only */ for ( j = count; path[j] != NULL; j += 2 ) { if ( path[j + 1] == NULL ) break; putchar ( path[j] ); putchar ( path[j + 1] ); putchar ( '\n' ); } strcpy ( oldpath, path ); } } prefix_length ( s1, s2 ) /* return length of longest common prefix */ char *s1, *s2; /* ... of strings s1 and s2 */ { register char *start; for ( start = s1; *s1 == *s2; s1++, s2++ ) if ( *s1 == NULL ) break; return ( s1 - start ); } NEWFILE echo find.code.c cat > find.code.c << 'NEWFILE' /* ******************************************************************************** NAME: find.code PURPOSE: sorted list compressor (works with a modified 'find' to encode/decode a filename database) USAGE: find.bigram < list > bigrams process bigrams (see find.squeeze) > common_bigrams find.code common_bigrams < list > squozen_list METHOD: Uses 'front compression' (see ";login:", March 1983, p. 8 ). Output format is, per line, an offset differential count byte followed by a partially bigram-encoded ascii residue. The codes are: 0-28 likeliest differential counts + offset to make nonnegative 30 escape code for out-of-range count to follow in next word 128-255 bigram codes, (128 most common, as determined by 'find.squeeze') 32-127 single character (printable) ascii residue SEE ALSO: find.squeeze, find.bigram.c, find.c AUTHOR: James A. Woods, Informatics General Corp., NASA Ames Research Center, 10/82 ******************************************************************************** */ #include <stdio.h> #define MAXPATH 1024 /* maximum pathname length */ #define RESET 30 /* switch code */ char path[MAXPATH]; char oldpath[MAXPATH] = " "; char bigrams[257] = { 0 }; main ( argc, argv ) int argc; char *argv[]; { int count, oldcount, diffcount; int j, code; char bigram[3]; FILE *fp; oldcount = 0; bigram[2] = NULL; if ( (fp = fopen ( argv[1], "r" )) == NULL ) printf ( "Usage: find.code common_bigrams < list > coded_list\n" ), exit ( 1 ); fgets ( bigrams, 257, fp ); fwrite ( bigrams, 1, 256, stdout ); while ( gets ( path ) != NULL ) { /* squelch unprintable chars so as not to botch decoding */ for ( j = 0; path[j] != NULL; j++ ) { path[j] &= 0177; if ( path[j] < 040 || path[j] == 0177 ) path[j] = '?'; } count = prefix_length ( oldpath, path ); diffcount = count - oldcount; if ( (diffcount < -14) || (diffcount > 14) ) { putc ( RESET, stdout ); putw ( diffcount + 14, stdout ); } else putc ( diffcount + 14, stdout ); for ( j = count; path[j] != NULL; j += 2 ) { if ( path[j + 1] == NULL ) { putchar ( path[j] ); break; } bigram[0] = path[j]; bigram[1] = path[j + 1]; /* linear search for specific bigram in string table */ if ( (code = strindex ( bigrams, bigram )) % 2 == 0 ) putchar ( (code / 2) | 0200 ); else fputs ( bigram, stdout ); } strcpy ( oldpath, path ); oldcount = count; } } strindex ( string, pattern ) /* return location of pattern in string or -1 */ char *string, *pattern; { register char *s, *p, *q; for ( s = string; *s != NULL; s++ ) if ( *s == *pattern ) { /* fast first char check */ for ( p = pattern + 1, q = s + 1; *p != NULL; p++, q++ ) if ( *q != *p ) break; if ( *p == NULL ) return ( q - strlen ( pattern ) - string ); } return ( -1 ); } prefix_length ( s1, s2 ) /* return length of longest common prefix */ char *s1, *s2; /* ... of strings s1 and s2 */ { register char *start; for ( start = s1; *s1 == *s2; s1++, s2++ ) if ( *s1 == NULL ) break; return ( s1 - start ); } NEWFILE echo find.c.mod1 cat > find.c.mod1 << 'NEWFILE' /* ******************************************************************************* NAME: find.c -- locate files USAGE: find pathname-list expression, or find <name> (Ames modification) INSTALL: edit/run Makefile run find.squeeze as root FILES: /etc/find.codes SEE ALSO: find.squeeze, find.bigram.c, find.code.c Usenix ;login:, February/March, 1983, p. 8. REVISIONS: James A. Woods, Informatics General Corporation, NASA Ames Research Center, 6/81. The second form searches a pre-computed filelist (constructed nightly by /usr/lib/crontab) which is compressed by find.squeeze (v.i.z.) The effect of find <name> is similar to find / +0 -name "*<name>*" -print but much faster. 8/82 faster yet + incorporation of bigram coding -- jaw 1/83 incorporate glob-style matching -- jaw ******************************************************************************** */ #define AMES 1 NEWFILE echo find.c.mod2 cat > find.c.mod2 << 'NEWFILE' #ifdef AMES if ( argc < 2 ) { fprintf ( stderr, "Usage: find name, or find path-list predicate-list\n" ); exit ( 1 ); } if ( argc == 2 ) { fastfind ( argv[1] ); exit ( 0 ); } #endif NEWFILE echo find.c.mod3 cat > find.c.mod3 << 'NEWFILE' #ifdef AMES /* 'fastfind' scans a file list for the full pathname of a file given only a piece of the name. The list has been processed with with "front-compression" and bigram coding. Front compression reduces space by a factor of 4-5, bigram coding by a further 20-25%. The codes are: 0-28 likeliest differential counts + offset to make nonnegative 30 escape code for out-of-range count to follow in next word 128-255 bigram codes, (128 most common, as determined by 'find.squeeze') 32-127 single character (printable) ascii residue A novel two-tiered string search technique is employed: First, a metacharacter-free subpattern and partial pathname is matched BACKWARDS to avoid full expansion of the pathname list. The time savings is 40-50% over forward matching, which cannot efficiently handle overlapped search patterns and compressed path residue. Then, the actual shell glob-style regular expression (if in this form) is matched against the candidate pathnames using the slower routines provided in the standard 'find'. */ #define FCODES "/etc/find.codes" #define YES 1 #define NO 0 #define OFFSET 14 #define ESCCODE 30 fastfind ( pathpart ) char pathpart[]; { register char *p, *s; register int c; char *q, *index(), *patprep(); int i, count, globflag; FILE *fp, *fopen(); char *patend, *cutoff; char path[1024]; char bigram1[128], bigram2[128]; int found = NO; if ( (fp = fopen ( FCODES, "r" )) == NULL ) { fprintf ( "find: can't open %s\n", FCODES ); exit ( 1 ); } for ( i = 0; i < 128; i++ ) bigram1[i] = getc ( fp ), bigram2[i] = getc ( fp ); if ( index ( pathpart, '*' ) || index ( pathpart, '?' ) || index ( pathpart, '[' ) ) globflag = YES; patend = patprep ( pathpart ); c = getc ( fp ); for ( ; ; ) { count += ( (c == ESCCODE) ? getw ( fp ) : c ) - OFFSET; for ( p = path + count; (c = getc ( fp )) > ESCCODE; ) /* overlay old path */ if ( c < 0200 ) *p++ = c; else /* bigrams are parity-marked */ *p++ = bigram1[c & 0177], *p++ = bigram2[c & 0177]; if ( c == EOF ) break; *p-- = NULL; cutoff = ( found ? path : path + count); for ( found = NO, s = p; s >= cutoff; s-- ) if ( *s == *patend ) { /* fast first char check */ for ( p = patend - 1, q = s - 1; *p != NULL; p--, q-- ) if ( *q != *p ) break; if ( *p == NULL ) { /* success on fast match */ found = YES; if ( globflag == NO || amatch ( path, pathpart ) ) puts ( path ); break; } } } } /* extract first glob-free subpattern for fast pre-match; prepend NULL for backwards match; return end of pattern */ static char globfree[100]; char * patprep ( p ) register char *p; { register char *subp = globfree; while ( *p == '*' || *p == '?' ) p++; if ( *p == '[' ) { while ( *p != ']' && *p != NULL ) p++; p++; } *subp++ = NULL; if ( *p != NULL ) /* copy first noglob string */ while ( *p != '*' && *p != '?' && *p != '[' && *p != NULL && subp < (globfree + sizeof(globfree)) ) *subp++ = *p++; else /* pattern has noglob chars only */ *subp++ = '/'; /* ... check every path */ *subp = NULL; return ( --subp ); } #endif NEWFILE