[comp.sys.sun] Setting the record straight on SunOS 4.0 'fastfind'

mlandau@bbn.com (Matt Landau) (12/30/88)

In comp.sys.sun (<2359@kalliope.rice.edu>), Our Moderator, in response
to some questions about find, says:

>Uhhh, I think you both came from the twilight zone.  Neither "find
>filename" nor "find '*filename*'" will have much effect.  The first
>argument(s) to find are the directories in which to start the recursive
>search.  

Actually, if you give find a single non-option argument, it will look in
/usr/lib/find/find.codes for strings that contain that argument as a
substring.  The find.codes file is essentially a pre-cached view of the
filesystem space, arranged for very fast lookup.
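
The matching rule itself is simple, and can be sketched in a few lines of C.  This is only an illustration: the array of path names stands in for the decoded contents of find.codes (the real file is compressed), and match_paths() is a hypothetical name, not a routine from find.c.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Count the entries of `paths` that contain `pattern` as a substring,
 * and store pointers to the first `outcap` of them in `out` (if non-NULL).
 * `paths` is a stand-in for the decoded name list; this shows the
 * matching rule only, not the on-disk format.
 */
size_t
match_paths(const char *const *paths, size_t npaths, const char *pattern,
    const char **out, size_t outcap)
{
	size_t i, nfound = 0;

	assert(paths != NULL && pattern != NULL);
	for (i = 0; i < npaths; i++) {
		if (strstr(paths[i], pattern) != NULL) {
			if (out != NULL && nfound < outcap)
				out[nfound] = paths[i];
			nfound++;
		}
	}
	return nfound;
}
```

Because the match is against the whole stored string, a pattern such as "core" would also hit names like /home/x/corefile.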

The trick is knowing how to build /usr/lib/find/find.codes, since there is
absolutely no documentation on it!  However, if you look in the
/usr/lib/find directory, you'll find a csh script called "updatedb", which
builds find.codes for you.  I arrange to have it run every night from
cron, after which I can go to some server and say "find core" to list
every file whose pathname contains the string "core", without waiting for
find to traverse all the filesystems.

Updatedb only works on type 4.2 filesystems, so you have to run it on each
of your servers, and it only builds a cache for 4.2 filesystems, so you
have to do "find string" on each server to find all instances of what
you're looking for.  In spite of that, it's a big win over waiting for
find to walk 3 gigabytes of disk every time you want to hunt something
down.

[[ Yeah, I kind of blew it.  I was unaware of the "fast find" code that
had been incorporated in 4.3.  I've been so busy using Suns that I have
not been able to keep up with the rest of the Unix world.  Sun did indeed
incorporate the fast find code in their distributions of "find", but they
could not come up with an easy way of making the updatedb stuff work
properly in a distributed (NFS) environment.  So, they never documented
that usage of find and never set up "updatedb" in crontab.  That's why I
got confused.  --wnl ]]

jaw@eos.arc.nasa.gov (James A. Woods) (12/30/88)

Under BSD 4.3 Unix, "find filename" is short for

	find / -name '*filename*' -print

(by far the most common usage of "find"), except that it runs in seconds
rather than minutes.  It is documented in the manual page.

I wrote this code years ago; it was actually part of BSD 4.2, but got lost
in the shuffle at Sun when they went the SVID route.  Though undocumented
in the manual page, it half-way works under SunOS 4.0, after first
installing a nightly call to 'updatedb' from crontab.  Unfortunately, as
distributed, the compressed filename database is not portable across
architectures because of byte-order problems with calls to putw()/getw().
[In the days of the DEC PDP, these word writes permitted the code to work
on both DEC architectures-- NFS, with disparate machines reading one
database, didn't yet exist.]

However, the Sun 4 / Sun 3 'fastfind' problem is easily fixed by replacing
the reference to

	c = getw(fp)

in find.c with something like:

	c = getc (fp);
	c = ((unsigned char) c << 8) | getc(fp);

plus a similar change to putw() in code.c.
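
Spelled out as a matched pair, the idea might look like the sketch below.  The names put_be16() and get_be16() are hypothetical (they are not in find.c or code.c), and the sketch assumes the value being stored fits in 16 bits and that int is wider than 16 bits.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Write the low 16 bits of v, most significant byte first, so the file
 * reads the same on any architecture.  Returns 0 on success, EOF on error.
 */
int
put_be16(unsigned int v, FILE *fp)
{
	assert(v <= 0xFFFFu);
	if (putc((int)((v >> 8) & 0xFF), fp) == EOF)
		return EOF;
	if (putc((int)(v & 0xFF), fp) == EOF)
		return EOF;
	return 0;
}

/*
 * Read two bytes, most significant first.  Returns a value in 0..65535,
 * or EOF if either byte is missing.
 */
int
get_be16(FILE *fp)
{
	int hi, lo;

	if ((hi = getc(fp)) == EOF)
		return EOF;
	if ((lo = getc(fp)) == EOF)
		return EOF;
	return (hi << 8) | lo;
}
```

Unlike the two-line fragment above, get_be16() checks each getc() for EOF, so a truncated database shows up as an error rather than as a bogus value.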

As to the syntax issues, it could be argued that filename matching "glob
style" should be like 'egrep' rather than 'sh' -- this takes somewhat more
work.

Another area for improvement is database build time, now slow partly due
to the use of 'awk'.  Finally, the largest crime committed in my design of
five-year-old ffind is that it is not "eight-bit clean" for international
character sets.  I may remedy this someday (at the slight expense of
compression efficiency), and donate the resulting code to the GNU project,
unless someone has already done such.
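
How a seven-bit design trips over 8-bit names can be shown with a small sketch.  It assumes, purely for illustration, that the coder flags a bigram code by setting the high bit of the output byte; BIGRAM_FLAG and both function names are hypothetical, and the actual on-disk encoding is not spelled out here.

```c
#include <assert.h>

#define BIGRAM_FLAG 0x80	/* hypothetical: high bit marks a bigram code */

/* Encode the index (0..127) of a common bigram as one flagged byte. */
unsigned char
encode_bigram(int index)
{
	assert(index >= 0 && index < 128);
	return (unsigned char)(index | BIGRAM_FLAG);
}

/* A decoder has only the high bit to tell a bigram code from a literal. */
int
looks_like_bigram(unsigned char byte)
{
	return (byte & BIGRAM_FLAG) != 0;
}
```

Under that assumption a literal byte such as 0xAE in a file name is indistinguishable from the code for bigram number 0x2E, which is why making the format eight-bit clean costs some compression: the high bit can no longer be spent on the flag.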

James A. Woods (ames!jaw)
NASA Ames Research Center

[[ My thanks to other readers who have also pointed out my gaffe.  Good to
see that everyone is still on their toes!  --wnl ]]

ndd@sunbar.mc.duke.edu (Ned Danieley) (12/30/88)

(To the theme from "The Twilight Zone") Do de do do, do de do do.

It turns out that, at least under 3.5 [[ as early as 3.2, actually
--wnl ]], 'find' allowed (but Sun did not document) the 4.3 behaviour of

	find filename

which depends on running /usr/lib/find/updatedb periodically; this sets up
a database of names, allowing 'find' to work VERY quickly. I found this
about a year ago, and told Sun; I think it even made it into an STB. Note
that /usr/lib/find exists under 4.0, but that the man page still doesn't
mention it, and that

	find filename

only works as

	find '*filename*'

As jbm says, Sun knows about it, and has acknowledged that it is a bug. It
seems to work under Sys4-3.2, so you probably could get 'find' from that
release and have it work.

Ned Danieley (ndd@sunbar.mc.duke.edu)
Basic Arrhythmia Laboratory
Box 3140, Duke University Medical Center
Durham, NC  27710
(919) 684-6807 or 684-6942

guy@uunet.uu.net (Guy Harris) (01/12/89)

>I wrote this code years ago; it was actually part of BSD 4.2, but got lost
>in the shuffle at Sun when they went the SVID route.

4.2 or 4.3?  It was added into SunOS 3.2 when the 4.3BSD "find" stuff was
merged into the S5R2 "find" to make the 3.2 "find".  (I know that for a
fact; I did the merging.)

>Though undocumented in the manual page,

...because the claim that

	find <file>

will find all files whose names match "<file>" is an assertion about the
local system administrator's policy as much as it is a statement about the
behavior of "find"; Sun wasn't in a position to control the former,
especially given that the "fast find" stuff doesn't scale in an
immediately obvious way for NFS.  (Has anybody actually tried

	1) putting the appropriate "crontab" entry in on *all* machines
	   on a network with many diskless workstations

and

	2) changing "updatedb" not to stop at NFS mount points?

If so, how much of a load does it impose?)

[[ Which is precisely why Sun neither documented fast find nor put
updatedb in crontab for their distributions.  --wnl ]]

It's also not clear how it should work if you use the automounter....

>Finally, the largest crime committed in my design of five-year-old
>ffind is that it is not "eight-bit clean" for international
>character sets.

Which is a problem in SunOS 4.0, which supports 8-bit characters in file
names (such as the symlink "/UNIX(R)" that I had to "/vmunix", where "(R)"
is the ISO Latin #1 "registered trademark symbol" character).

[[ The Unix kernel has always supported 8 bit characters in file names (I
*know* that BSD 4.1 did, and I think that pretty much every version of BSD
and Bell Unix did).  It's just that certain shells stepped on the eighth
bit for their own devious reasons.  But in C you've always been able to do
'creat("A\302C\304", 0644);'.  --wnl ]]

david@sun.com (01/13/89)

In article <12397@silica.BBN.COM> mlandau@bbn.com (Matt Landau) writes:
>Updatedb only works on type 4.2 filesystems, so you have to run it on each
>of your servers, and it only builds a cache for 4.2 filesystems, so you
>have to do "find string" on each server to find all instances of what
>you're looking for.

Well, not really.  Updatedb is a (pretty simple) shell script and you can
make it do whatever you want.  For example, I have a diskful workstation
but my home directory is on a server.  Here's the updatedb I use; I only
run it once a week, but actually it isn't that big a load on the server...

#!/bin/csh -f
#
#	@(#)updatedb.csh 1.1 86/07/08 SMI; from UCB 4.6 85/04/22
#
set SRCHPATHS = ( / /usr )		# directories to be put in the database
set EXCLUDE = '^/tmp|^/dev|^/usr/tmp'	# directories to exclude
set NFSPATHS = ~david			# NFS directories
set NFSUSER = daemon			# userid for NFS find
set LIBDIR = /usr/lib/find		# for subprograms
set FINDHONCHO = root			# for error messages
set FCODES = $LIBDIR/find.codes		# the database 

set path = ( $LIBDIR /usr/ucb /bin /usr/bin )
set bigrams = /tmp/f.bigrams$$
set filelist = /tmp/f.list$$
set errs = /tmp/f.errs$$

# Make a file list and compute common bigrams.
# Alphabetize '/' before any other char with 'tr'.
# If the system is very short of sort space, 'bigram' can be made
# smarter to accumulate common bigrams directly without sorting
# ('awk', with its associative memory capacity, can do this in several
# lines, but is too slow, and runs out of string space on small machines).

nice +6

( find ${SRCHPATHS} -xdev -print ; \
	su $NFSUSER -c "find ${NFSPATHS} -xdev -print" -f ) | \
	egrep -v "$EXCLUDE" | \
	tr '/' '\001' | \
	(sort -f; echo $status > $errs) | \
	tr '\001' '/' > $filelist

$LIBDIR/bigram <$filelist | \
	(sort; echo $status >> $errs) | uniq -c | sort -nr | \
	awk '{ if (NR <= 128) print $2 }' | tr -d '\012' > $bigrams

if { grep -s -v 0 $errs } then
	echo "Subject: updatedb failed on `hostname`" | \
	/bin/mail $FINDHONCHO
	exit 1
endif

# code the file list
$LIBDIR/code $bigrams < $filelist > $FCODES
chmod 644 $FCODES
rm -f $bigrams $filelist $errs
exit 0
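
The comment in the script about accumulating common bigrams directly, without sorting, can be sketched in C as below.  This is a minimal sketch, not the shipped 'bigram' program: it tallies every adjacent byte pair in each path name (the real program may select pairs differently), and the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

static unsigned long counts[256][256];	/* one counter per byte pair */

/* Clear all counters before a new run. */
void
bigram_reset(void)
{
	memset(counts, 0, sizeof counts);
}

/* Tally every adjacent byte pair in one path name. */
void
bigram_add(const char *path)
{
	const unsigned char *p = (const unsigned char *)path;

	assert(path != NULL);
	for (; p[0] != '\0' && p[1] != '\0'; p++)
		counts[p[0]][p[1]]++;
}

/*
 * Store up to n of the most frequent bigrams in out (two bytes each,
 * no terminator), most frequent first; ties go to the pair with the
 * lower byte values.  Returns the number stored.  Zeroes the counters
 * it selects.
 */
size_t
bigram_top(unsigned char *out, size_t n)
{
	size_t k;

	for (k = 0; k < n; k++) {
		unsigned long best = 0;
		int i, j, bi = 0, bj = 0;

		for (i = 0; i < 256; i++)
			for (j = 0; j < 256; j++)
				if (counts[i][j] > best) {
					best = counts[i][j];
					bi = i;
					bj = j;
				}
		if (best == 0)
			break;		/* fewer than n distinct bigrams */
		out[2 * k] = (unsigned char)bi;
		out[2 * k + 1] = (unsigned char)bj;
		counts[bi][bj] = 0;
	}
	return k;
}
```

Calling bigram_add() once per line of the file list and then bigram_top() with n = 128 would stand in for the "sort | uniq -c | sort -nr | awk" stage, at the cost of a fixed 65536-entry table instead of sort space.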

--
David DiGiacomo, Sun Microsystems, Mt. View, CA  sun!david david@sun.com