[net.sources] Fast "find" sources

pag@hao.UUCP (07/13/83)

echo extracting: 
echo READ.ME
cat > READ.ME << 'NEWFILE'

		FAST FIND MODIFICATIONS (see ;login: 3/83)

Run the shell on this file to separate into components
(poor man's network archive style).  They are:

	READ.ME 
	news.item -- brief announcement for local news 
	find.1 -- manual page
	Makefile 
	find.squeeze -- csh script to build the pathname database
	find.bigram.c -- called from find.squeeze to list bigrams
	find.code.c -- does the actual filename coding

		additions to find.c:

	find.c.mod1 -- header 
	find.c.mod2 -- invocation to fastfind()
	find.c.mod3 -- fastfind() itself
	
Procedure (to be done in a local source directory):

(1) in 'Makefile':

	(a) define BINDIR to be the directory to contain the new executable
    	    'find', which overrides the standard (Bell/Berkeley) command.

	(b) LIBDIR is to house the auxiliary programs
	    'find.squeeze', 'find.bigram', and 'find.code'.
            Create this directory if it does not already exist. 

 	(c) by changing 'chown root' to 'chown daemon', and by running
	    as 'daemon' instead of 'root' in step (4) below,	
	    the contents of non-searchable directories will be kept out
	    of the filelist.  We do NOT do this.

    in 'find.squeeze':

	(a) specify LIBDIR as above.

	(b) set STDFIND to the standard 'find' command.

	(c) set FINDHONCHO to the name of some user who can act on
	    any [rare] error messages emanating from the script.

    	(d) fix a daily time for the 'at' command (ATTIME).
	    Alternatively, wire it into /usr/lib/crontab for finer control.
	    The script takes about 20 min. on our 11/70 running v7.

	(e) on 4.1/4.2 systems where 'at' runs the Berkeley shell as standard,
	    take out the lines containing 'EOF'. 
   
(2) cp /usr/src/cmd/find.c find.c.old

    Integrate the three find.c mods.  Start with

    	cat find.c.mod1 find.c.old find.c.mod3 > find.c

    Then edit find.c to incorporate find.c.mod2--place this after the
    main() declarations and before the time() call.
    FCODES, as in the Makefile, tells where the database lives.
    Our 20000 or so files (100 MB content) takes up ~100K bytes in /etc.

    Remove find.c.mod[1-3] when done.  

(3) Run the Makefile as root. 

(4) Do (as root)

	at [suitable time] find.squeeze & 

    Come back in awhile, then try something like

	find find	or maybe
	find / | wc	(to yield total number of files on system) 

(5) Install the 'man' page.
    Announce 'find' to the world with something like 'news.item'.

NEWFILE

echo news.item
cat > news.item << 'NEWFILE'
>From jaw Tue Jun  9 12:35:48 1981
To: news
Subj: The Ames fast file finder

To locate files on this system in seconds, type

	find <name>

where <name> is any character string.  E.g.

	find xyz

lists all files whose pathname contains 'xyz'.
Shell filename "globbing" is also permitted, i.e.:

	find '????'  		

will list all four-letter files on the system.

	vtroff -man `find '*man*ls.[0-9u]'`

would typeset all manual pages for 'ls' variants.
The code is actually a superfast implementation of

	find / -mtime +0 -name "*<name>*" -print

with a much simplified form.  Note that the two-argument
invocation of 'find' does not conflict with the syntax of 
the standard utility, so that the more verbose 'find'
will behave as expected.  The .cshrc idiom

	alias f 'set noglob; find \*; unset noglob'

proves handy for saving keystrokes.
NEWFILE

echo find.1
cat > find.1 << 'NEWFILE'
.TH FIND AMES 
.SH NAME
find \- find files
.SH SYNOPSIS
.B find
pathname-list  expression
.br 
.B find
name
.SH DESCRIPTION
.I Find
recursively descends
the directory hierarchy for
each pathname in the
.I pathname-list
(i.e., one or more pathnames)
seeking files that match a boolean
.I expression
written in the primaries given below.
In the descriptions, the argument
.I n
is used as a decimal integer
where
.I +n
means more than
.I n,
.I \-n
means less than
.I n
and
.I n
means exactly
.IR n .
.PP
The second simplified form will list all files on the system
whose pathname contains
.I name.
This is similar to

	find / -mtime +0 -name "*<name>*" -print

but much faster.
As with
.B -name
below, shell syntax may be used for
.I name.
.TP 10n
.BR \-name " filename"
True if the
.I filename
argument matches the current file name.
Normal
Shell
argument syntax may be used if escaped (watch out for
`[', `?' and `*').
.TP
.BR \-perm " onum"
True if the file permission flags
exactly
match the
octal number
.I onum
(see
.IR  chmod (1)).
If
.I onum
is prefixed by a minus sign,
more flag bits (017777, see
.IR   stat (2))
become significant and
the flags are compared:
.IR (flags&onum)==onum .
.TP
.BR \-type " c"
True if the type of the file
is
.I c,
where
.I c
is
.B "b, c, d"
or
.B f
for
block special file, character special file,
directory or plain file.
.TP
.BR \-links " n"
True if the file has
.I n
links.
.TP
.BR \-user " uname"
True if the file belongs to the user
.I uname
(login name or numeric user ID).
.TP
.BR \-group " gname"
True if the file belongs to group
.I gname
(group name or numeric group ID).
.TP
.BR \-size " n"
True if the file is
.I n
blocks long (512 bytes per block).
.TP
.BR \-inum " n"
True if the file has inode number
.I n.
.TP
.BR \-atime " n"
True if the file has been accessed in
.I n
days.
.TP
.BR \-mtime " n"
True if the file has been modified in
.I n
days.
.TP
.BR \-exec " command"
True if the executed command returns
a zero value as exit status.
The end of the command must be punctuated by an escaped
semicolon.
A command argument `{}' is replaced by the
current pathname.
.TP
.BR \-ok " command"
Like
.B \-exec
except that the generated command is written on
the standard output, then the standard input is read
and the command executed only upon response
.BR y .
.TP
.B  \-print
Always true;
causes the current pathname to be printed.
.TP
.BR \-newer " file"
True if
the current file has been modified more recently than the argument
.I file.
.PP
The primaries may be combined using the following operators
(in order of decreasing precedence):
.TP 4
1)
A parenthesized group of primaries and operators
(parentheses are special to the Shell and must be escaped).
.TP 4
2)
The negation of a primary
(`!' is the unary
.I not
operator).
.TP 4
3)
Concatenation of primaries
(the
.I and
operation
is implied by the juxtaposition of two primaries).
.TP 4
4)
Alternation of primaries
.RB "(`" \-o "' is the"
.I or
operator).
.SH EXAMPLES
To generate nancy's file types:
.IP 
find nancy | pump file
.PP
To typeset all variants of manual pages for 'ls':
.IP 
vtroff -man `find '*man*/ls.?'`
.PP
To remove all files named
`a.out' or `*.o' that have not been accessed for a week:
.IP 
find / \\( \-name a.out \-o \-name '*.o' \\)
\-atime +7 \-exec rm {} \\;
.SH FILES
/etc/passwd
.br
/etc/group
.br
/etc/find.codes		coded filenames
.SH "SEE ALSO"
sh(1), test(1), filsys(5)
.br
Relevant paper in February, 1983 issue of
.I ;login:.
.SH BUGS
The syntax (except for the second form), is painful.
NEWFILE

echo Makefile
cat > Makefile << 'NEWFILE'
BINDIR=/usr/local
LIBDIR=/usr/lib/find

find:  find.c find.bigram.c find.code.c
	cc -n -s -O find.c -o $(BINDIR)/find
	cc -s -O find.bigram.c -o $(LIBDIR)/find.bigram
	cc -s -O find.code.c -o $(LIBDIR)/find.code
	cp find.squeeze $(LIBDIR)
	chown root $(LIBDIR)/find.squeeze
	chmod og-wx $(LIBDIR)/find.squeeze
NEWFILE

echo find.squeeze
cat > find.squeeze << 'NEWFILE'
#! /bin/csh --this line for VAX BSD--next and last line for V7 only
/bin/csh << 'EOF' 
set LIBDIR = /usr/lib/find	# for subprograms
set STDFIND = /bin/find		# in /usr/bin on some systems
set ATTIME = 0130		# daily 
set FINDHONCHO = root		# for error messages
set FCODES = /etc/find.codes	# the database 

set path = ( $LIBDIR /usr/ucb /bin /usr/bin )
set bigrams = /tmp/f.bigrams$$
set filelist = /tmp/f.list$$
set errs = /tmp/f.errs$$

# Make a file list and compute common bigrams.
# Alphabetize '/' before any other char with 'tr'.
# If the system is very short of sort space, 'find.bigram' can be made
# smarter to accumulate common bigrams directly without sorting
# ('awk', with its associative memory capacity, can do this in several
# lines, but is too slow, and runs out of string space on small machines).

at $ATTIME $LIBDIR/find.squeeze

nice +10
$STDFIND / -print | tr '/' '\001' | \
   (sort -f; echo $status > $errs) | \
   tr '\001' '/' | tee $filelist | $LIBDIR/find.bigram | \
   (sort; echo $status >> $errs) | uniq -c | sort -nr | \
   awk '{ if (NR <= 128) print $2 }' | tr -d '\012' > $bigrams

# code the file list

if { grep -s -v 0 $errs } then
	echo 'find.squeeze error: out of sort space' | mail $FINDHONCHO
else
	$LIBDIR/find.code $bigrams < $filelist > $FCODES
	rm $bigrams $filelist $errs
endif
'EOF'
NEWFILE

echo find.bigram.c
cat > find.bigram.c << 'NEWFILE'
/*
********************************************************************************
  find.bigram < text > bigrams

  List bigrams for 'find.squeeze' script.
  Use 'find.code' to encode a file using this output.
********************************************************************************
*/
#include <stdio.h>
#define MAXPATH	1024		/* maximum pathname length */

char path[MAXPATH];
char oldpath[MAXPATH] = " ";	

main ( )
{
  	register int count, j;

     	while ( gets ( path ) != NULL ) {

		count = prefix_length ( oldpath, path );
		/*
		   output post-residue bigrams only
		*/
		for ( j = count; path[j] != NULL; j += 2 ) {
			if ( path[j + 1] == NULL ) 
				break;
			putchar ( path[j] );
			putchar ( path[j + 1] );
			putchar ( '\n' );
		}
		strcpy ( oldpath, path );
   	}
}

prefix_length ( s1, s2 )	/* return length of longest common prefix */
	char *s1, *s2;		/* ... of strings s1 and s2 */
{
	register char *start;

    	for ( start = s1; *s1 == *s2; s1++, s2++ )	
		if ( *s1 == NULL )		
	    		break;
    	return ( s1 - start );
}
NEWFILE

echo find.code.c 
cat > find.code.c << 'NEWFILE'
/*
********************************************************************************
  NAME:		find.code

  PURPOSE:	sorted list compressor (works with a modified 'find'
		to encode/decode a filename database)

  USAGE:	find.bigram < list > bigrams
		process bigrams (see find.squeeze) > common_bigrams
		find.code common_bigrams < list > squozen_list

  METHOD:	Uses 'front compression' (see ";login:", March 1983, p. 8 ).
		Output format is, per line, an offset differential count byte
		followed by a partially bigram-encoded ascii residue. 

   	The codes are:

	0-28	likeliest differential counts + offset to make nonnegative 
	30	escape code for out-of-range count to follow in next word
	128-255 bigram codes, (128 most common, as determined by 'find.squeeze')
	32-127  single character (printable) ascii residue

  SEE ALSO:	find.squeeze, find.bigram.c, find.c
 
  AUTHOR:	James A. Woods, Informatics General Corp.,
		NASA Ames Research Center, 10/82
********************************************************************************
*/
#include <stdio.h>
#define MAXPATH 1024		/* maximum pathname length */
#define	RESET	30		/* switch code */

char path[MAXPATH];
char oldpath[MAXPATH] = " ";	
char bigrams[257] = { 0 };

main ( argc, argv )
	int argc; char *argv[];
{
  	int count, oldcount, diffcount;
	int j, code;
	char bigram[3];
	FILE *fp;

	oldcount = 0;
	bigram[2] = NULL;

	if ( (fp = fopen ( argv[1], "r" )) == NULL )
		printf ( "Usage: find.code common_bigrams < list > coded_list\n" ), exit ( 1 );
	fgets ( bigrams, 257, fp );
	fwrite ( bigrams, 1, 256, stdout );

     	while ( gets ( path ) != NULL ) {
		/*
		   squelch unprintable chars so as not to botch decoding
		*/
		for ( j = 0; path[j] != NULL; j++ ) {	
			path[j] &= 0177;		
			if ( path[j] < 040 || path[j] == 0177 )
				path[j] = '?';
		}
		count = prefix_length ( oldpath, path );
		diffcount = count - oldcount;
		if ( (diffcount < -14) || (diffcount > 14) ) {
			putc ( RESET, stdout );
			putw ( diffcount + 14, stdout );
		}
		else
			putc ( diffcount + 14, stdout );	

		for ( j = count; path[j] != NULL; j += 2 ) {
			if ( path[j + 1] == NULL ) {
				putchar ( path[j] );
				break;
			}
			bigram[0] = path[j];
			bigram[1] = path[j + 1];
			/*
			    linear search for specific bigram in string table
			*/
			if ( (code = strindex ( bigrams, bigram )) % 2 == 0 )
				putchar ( (code / 2) | 0200 );	
			else 		
				fputs ( bigram, stdout );
		}
		strcpy ( oldpath, path );	
		oldcount = count;
	}
}

strindex ( string, pattern )	/* return location of pattern in string or -1 */
	char *string, *pattern;
{
	register char *s, *p, *q;

	for ( s = string; *s != NULL; s++ ) 
		if ( *s == *pattern ) {		/* fast first char check */
			for ( p = pattern + 1, q = s + 1; *p != NULL; p++, q++ )
				if ( *q != *p )
					break;
			if ( *p == NULL )	
				return ( q - strlen ( pattern ) - string );
		}
	return ( -1 );
}

prefix_length ( s1, s2 )	/* return length of longest common prefix */
	char *s1, *s2;		/* ... of strings s1 and s2 */
{
	register char *start;

    	for ( start = s1; *s1 == *s2; s1++, s2++ )	
		if ( *s1 == NULL )		
	    		break;
    	return ( s1 - start );
}
NEWFILE

echo find.c.mod1
cat > find.c.mod1 << 'NEWFILE'
/*
*******************************************************************************
  NAME:		find.c -- locate files

  USAGE: 	find pathname-list expression, or
		find <name> (Ames modification)

  INSTALL:	edit/run Makefile
		run find.squeeze as root

  FILES:	/etc/find.codes

  SEE ALSO:	find.squeeze, find.bigram.c, find.code.c
		Usenix ;login:, February/March, 1983, p. 8.

  REVISIONS: 	James A. Woods, Informatics General Corporation,
		NASA Ames Research Center, 6/81.

		The second form searches a pre-computed filelist
		(constructed nightly by /usr/lib/crontab) which is
		compressed by find.squeeze (v.i.z.)  The effect of
			find <name>
		is similar to
			find / +0 -name "*<name>*" -print
		but much faster.

		8/82 faster yet + incorporation of bigram coding -- jaw

		1/83 incorporate glob-style matching -- jaw
********************************************************************************
*/
#define	AMES	1
NEWFILE

echo find.c.mod2
cat > find.c.mod2 << 'NEWFILE'
#ifdef  AMES
	if ( argc < 2 ) {
		fprintf ( stderr, "Usage: find name, or find path-list predicate-list\n" );
		exit ( 1 );
	}
	if ( argc == 2 ) {
		fastfind ( argv[1] );
		exit ( 0 );
	}
#endif
NEWFILE

echo find.c.mod3 
cat > find.c.mod3 << 'NEWFILE'
#ifdef	AMES
/*
   'fastfind' scans a file list for the full pathname of a file
   given only a piece of the name.  The list has been processed with
   with "front-compression" and bigram coding.  Front compression reduces
   space by a factor of 4-5, bigram coding by a further 20-25%.
   The codes are:

	0-28	likeliest differential counts + offset to make nonnegative 
	30	escape code for out-of-range count to follow in next word
	128-255 bigram codes, (128 most common, as determined by 'find.squeeze')
	32-127  single character (printable) ascii residue

   A novel two-tiered string search technique is employed: 

   First, a metacharacter-free subpattern and partial pathname is
   matched BACKWARDS to avoid full expansion of the pathname list.
   The time savings is 40-50% over forward matching, which cannot efficiently
   handle overlapped search patterns and compressed path residue.

   Then, the actual shell glob-style regular expression (if in this form)
   is matched against the candidate pathnames using the slower routines
   provided in the standard 'find'.
*/

#define	FCODES 	"/etc/find.codes"
#define	YES	1
#define	NO	0
#define	OFFSET	14
#define	ESCCODE	30

fastfind ( pathpart )	
	char pathpart[];
{
	register char *p, *s;
	register int c; 
	char *q, *index(), *patprep();
	int i, count, globflag;
	FILE *fp, *fopen();
	char *patend, *cutoff;
	char path[1024];
	char bigram1[128], bigram2[128];
	int found = NO;

	if ( (fp = fopen ( FCODES, "r" )) == NULL ) {
		fprintf ( "find: can't open %s\n", FCODES );
		exit ( 1 );
	}
	for ( i = 0; i < 128; i++ ) 
		bigram1[i] = getc ( fp ),  bigram2[i] = getc ( fp );
	
	if ( index ( pathpart, '*' ) || index ( pathpart, '?' ) || index ( pathpart, '[' ) )
		globflag = YES;
	patend = patprep ( pathpart );

	c = getc ( fp );
	for ( ; ; ) {

		count += ( (c == ESCCODE) ? getw ( fp ) : c ) - OFFSET;

		for ( p = path + count; (c = getc ( fp )) > ESCCODE; )	/* overlay old path */
			if ( c < 0200 )	
				*p++ = c;
			else		/* bigrams are parity-marked */
				*p++ = bigram1[c & 0177],  *p++ = bigram2[c & 0177];
		if ( c == EOF )
			break;
		*p-- = NULL;
		cutoff = ( found ? path : path + count);

		for ( found = NO, s = p; s >= cutoff; s-- ) 
			if ( *s == *patend ) {		/* fast first char check */
				for ( p = patend - 1, q = s - 1; *p != NULL; p--, q-- )
					if ( *q != *p )
						break;
				if ( *p == NULL ) {	/* success on fast match */
					found = YES;
					if ( globflag == NO || amatch ( path, pathpart ) )
						puts ( path );
					break;
				}
			}
	}
}

/*
    extract first glob-free subpattern for fast pre-match;
    prepend NULL for backwards match; return end of pattern
*/
static char globfree[100];

char *
patprep ( p )
	register char *p;
{
	register char *subp = globfree;

	while ( *p == '*' || *p == '?' )
		p++;
	if ( *p == '[' ) {
		while ( *p != ']' && *p != NULL )
			p++;
		p++;
	}
	*subp++ = NULL;
	if ( *p != NULL )		/* copy first noglob string */
		while ( *p != '*' && *p != '?' && *p != '[' && *p != NULL &&
		      subp < (globfree + sizeof(globfree)) )
			*subp++ = *p++;
	else				/* pattern has noglob chars only */
		*subp++ = '/';		/* ... check every path */
	*subp = NULL;
	return ( --subp );
}
#endif
NEWFILE