[comp.unix.misc] Fast file scan

pen@lysator.liu.se (Peter Eriksson) (09/26/90)

I'd like to know how to scan all files in a directory (and its sub-
directories) for a specific string (without regular expressions) as fast
as possible with standard Unix tools and/or some special programs.

(I've written such a program, and would like to compare my implementation
of it with others.)

(Using the find+fgrep combination is slooooow....)

Any ideas?


--
Peter Eriksson                                              pen@lysator.liu.se
Lysator Computer Club                             ...!uunet!lysator.liu.se!pen
University of Linkoping, Sweden                               "Seize the day!"

jtc@van-bc.wimsey.bc.ca (J.T. Conklin) (09/28/90)

In article <299@lysator.liu.se> pen@lysator.liu.se (Peter Eriksson) writes:
>I'd like to know how to scan all files in a directory (and its sub-
>directories) for a specific string (without regular expressions) as fast
>as possible with standard Unix tools and/or some special programs.
>
>(Using the find+fgrep combination is slooooow....)

I have found find+xargs+fgrep to be quite reasonable when I need
to perform a task like yours.   Are you possibly omitting the xargs,
thus forcing find to exec an fgrep for each file?
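The difference is easy to see side by side. A minimal sketch over a throwaway directory (all the names below are illustrative, not from the thread):

```shell
#!/bin/sh
# Build a tiny throwaway tree so both commands have something to scan.
tmp=$(mktemp -d)
mkdir "$tmp/sub"
echo 'the needle is in here' > "$tmp/sub/a.txt"
echo 'nothing to see'        > "$tmp/b.txt"

# Without xargs: find forks and execs a fresh fgrep for every file.
find "$tmp" -type f -exec fgrep needle {} \;

# With xargs: filenames are packed onto as few fgrep command lines as
# possible, so only a handful of processes are started.
find "$tmp" -type f -print | xargs fgrep needle

rm -rf "$tmp"
```

(With xargs, each match is also prefixed by its filename whenever a batch contains more than one file.)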

	--jtc


-- 
J.T. Conklin	UniFax Communications Inc.
		...!{uunet,ubc-cs}!van-bc!jtc, jtc@wimsey.bc.ca

pfalstad@phoenix.Princeton.EDU (Paul John Falstad) (09/28/90)

In article <2165@van-bc.wimsey.bc.ca> jtc@van-bc.wimsey.bc.ca (J.T. Conklin) writes:
>In article <299@lysator.liu.se> pen@lysator.liu.se (Peter Eriksson) writes:
>>I'd like to know how to scan all files in a directory (and its sub-
>>directories) for a specific string (without regular expressions) as fast
>>as possible with standard Unix tools and/or some special programs.
>>
>>(Using the find+fgrep combination is slooooow....)
>I have found find+xargs+fgrep to be quite reasonable when I need
>to perform a task like yours.   Are you possibly omitting the xargs,
>thus forcing find to exec an fgrep for each file?

I agree that find+xargs should do the job, but fgrep...  On our system,
at least (SunOS 4.1), the 'f' in fgrep does NOT stand for fast.  fgrep is
designed to grep for one or more Fixed strings.  egrep is actually the
fastest if you have a lot of input.  I just tested fgrep and egrep by
making a file consisting of 42 copies of /usr/dict/words and grepping
for the string 'dictionar'.  egrep did it in 1/8 the time of fgrep.
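A rough re-creation of that measurement. The dictionary's path varies between systems (and may be absent), so this sketch synthesizes a stand-in word list; substitute your real /usr/dict/words to repeat the original test:

```shell
#!/bin/sh
# Build a large input: many junk words plus one hit, then repeat the
# whole list 42 times as in the test described above.
words=$(mktemp)
i=0
while [ $i -lt 2000 ]; do
    echo "word$i"
    i=$((i + 1))
done > "$words"
echo dictionary >> "$words"

big=$(mktemp)
i=0
while [ $i -lt 42 ]; do
    cat "$words"
    i=$((i + 1))
done > "$big"

# Compare the two greps on identical input; the relative timings will
# vary with your system and grep implementation.
time fgrep dictionar "$big" > /dev/null
time egrep dictionar "$big" > /dev/null

rm -f "$words" "$big"
```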

If Lafontaine's elk would spurn Tom Jones, the engine must be our head, the
dining car our esophagus, the guardsvan our left lung, the kettle truck our
shins, the first class compartment the piece of skin at the nape of the neck,
and the level crossing an electric elk called Simon.

lwall@jpl-devvax.JPL.NASA.GOV (Larry Wall) (09/28/90)

In article <1990Sep27.195749.2552@iwarp.intel.com> merlyn@iwarp.intel.com (Randal Schwartz) writes:
: In article <299@lysator.liu.se>, pen@lysator (Peter Eriksson) writes:
: | I'd like to know how to scan all files in a directory (and its sub-
: | directories) for a specific string (without regular expressions) as fast
: | as possible with standard Unix tools and/or some special programs.
: | 
: | (I've written such a program, and would like to compare my implementation
: | of it with others.)
: | 
: | (Using the find+fgrep combination is slooooow....)
: | 
: | Any ideas?
: 
: I assume your objection to find+fgrep is that you must start an fgrep
: on each filename (or set of filenames if you use xargs).  Here's a
: solution in Perl that uses 'find' to spit out the filenames (because
: it is fast at that) and Perl to scan the text files.
: 
: ================================================== snip
: #!/usr/bin/perl
: $lookfor = shift;
: open(FIND,"find topdir -type f -print|") || die "Cannot find: $!";
: MAIN: while($FILE = <FIND>) {
: 	open(FILE) || next MAIN; # skip it if I can't open it for read
: 	{ local($/); undef $/; # slurp fast
: 		while (<FILE>) {
: 			(print "$FILE\n"), next MAIN if index($_,$lookfor);
: 		}
: 	}
: }
: ================================================== snip
: 
: This will have a problem if $lookfor straddles buffers, but it'll find
: everything else.  (A small matter of programming to fix the straddling
: problem.)

A bit hasty there, Randal.  First, you have to say index($_,$lookfor) >= 0.
Second, there's no buffer straddling problem--undeffing $/ will cause
it to slurp in the whole file.  Thus, the inner while loop is
unnecessary.  Third, you didn't chop the filename.

Another consideration is that Perl doesn't do Boyer-Moore indexing (currently)
except on literals.  So you probably want to use an eval around the loop,
to tell Perl it's okay to compile the BM search table for the string.

I think I'd do it something like this:

#!/usr/bin/perl

($lookfor = shift) =~ s/(\W)/\\$1/g;	# quote metas

open(FIND,"find topdir -type f -print|") || die "Can't run find: $!\n";

eval <<EOB
    while(<FIND>) {
	chop;
	open(FILE, \$_) || next;
	\$size = (stat(FILE))[7];
	read(FILE,\$buf,\$size) || next;
	print "\$_\n" if \$buf =~ /$lookfor/;
    }
EOB

Using read() may or may not beat using <FILE>--it depends on how efficiently
your fread() function is coded.  Some are real smart, and read directly
into your variable.  Others are real stupid, and copy the data a time or
two first.

You might also want to throw a -T FILE test in there after the open if
you want to reject binary files outright.

Larry

cpcahil@virtech.uucp (Conor P. Cahill) (09/28/90)

In article <299@lysator.liu.se> pen@lysator.liu.se (Peter Eriksson) writes:
>I'd like to know how to scan all files in a directory (and its sub-
>directories) for a specific string (without regular expressions) as fast
>as possible with standard Unix tools and/or some special programs.
>
>(Using the find+fgrep combination is slooooow....)

Try find + xargs + fgrep.

Try find + xargs + gnu-grep


-- 
Conor P. Cahill            (703)430-9247        Virtual Technologies, Inc.,
uunet!virtech!cpcahil                           46030 Manekin Plaza, Suite 160
                                                Sterling, VA 22170 

lm@slovax.Sun.COM (Larry McVoy) (10/01/90)

In article <299@lysator.liu.se> pen@lysator.liu.se (Peter Eriksson) writes:
>I'd like to know how to scan all files in a directory (and its sub-
>directories) for a specific string (without regular expressions) as fast
>as possible with standard Unix tools and/or some special programs.
>
>(I've written such a program, and would like to compare my implementation
>of it with others.)
>
>(Using the find+fgrep combination is slooooow....)
>
>Any ideas?

I would probably use ftw(3).  I've found it a useful tool.  
---
Larry McVoy, Sun Microsystems     (415) 336-7627       ...!sun!lm or lm@sun.com

bruce@balilly.UUCP (Bruce Lilly) (10/02/90)

In article <143198@sun.Eng.Sun.COM> lm@sun.UUCP (Larry McVoy) writes:
>In article <299@lysator.liu.se> pen@lysator.liu.se (Peter Eriksson) writes:
>>I'd like to know how to scan all files in a directory (and its sub-
>>directories) for a specific string (without regular expressions) as fast
>>as possible with standard Unix tools and/or some special programs.
>>
>>(I've written such a program, and would like to compare my implementation
>>of it with others.)
>>
>>(Using the find+fgrep combination is slooooow....)

Depends on how you do it. Try:
find . -print | xargs fgrep string

>>
>>Any ideas?
>
>I would probably use ftw(3).  I've found it a useful tool.  

I believe find uses ftw.

--
	Bruce Lilly		blilly!balilly!bruce@sonyd1.Broadcast.Sony.COM

mikep@dirty.csc.ti.com (Michael A. Petonic) (10/02/90)

In article <143198@sun.Eng.Sun.COM> lm@slovax.Sun.COM (Larry McVoy) writes:
>In article <299@lysator.liu.se> pen@lysator.liu.se (Peter Eriksson) writes:
>>I'd like to know how to scan all files in a directory (and its sub-
>>directories) for a specific string (without regular expressions) as fast
>>as possible with standard Unix tools and/or some special programs.
>>
>>(I've written such a program, and would like to compare my implementation
>>of it with others.)
>>
>>(Using the find+fgrep combination is slooooow....)
>>
>>Any ideas?
>
>I would probably use ftw(3).  I've found it a useful tool.  

I can't see how 
	find . -type f -print | xargs fgrep foobar 
is too slow...  If there were any speed advantage in hand crafting
a program to do the same, I'm not sure it would be worth doing
so.

-MikeP

georgn@sco.COM (Georg Nikodym) (10/03/90)

In article <299@lysator.liu.se> pen@lysator.liu.se (Peter Eriksson) writes:
>I'd like to know how to scan all files in a directory (and its sub-
>directories) for a specific string (without regular expressions) as fast
>as possible with standard Unix tools and/or some special programs.
>
>(I've written such a program, and would like to compare my implementation
>of it with others.)
>
>(Using the find+fgrep combination is slooooow....)
>
>Any ideas?

Well, if there aren't too many files (i.e., fewer than 200-300) you can do
the following:

	fgrep "SEARCH_STRING" `find . -type f -print`

This will save the overhead of restarting fgrep for each file.

Another way is:
	FILES=`find . -type f -print`
	for FILENAME in $FILES
	do
	  fgrep "SEARCH_STRING" $FILENAME
	done

This method is immune to list size limitations (the file names are never
passed to a single exec), but fork/exec's a new fgrep for each file.

Another way is:
	find . -type f -exec fgrep "SEARCH_STRING" {} \;

This method is probably the one you first tried.  It's similar to the
example above and probably takes just as long (but I won't bet money
on either ;-) .

But the best method (ie. the one I used to use):

	dirlist=`find . -type d -print`

	for dir in $dirlist
	do
	  fgrep "SEARCH_STRING" $dir/*
	done

This method has a speed advantage over the previous two examples,
without the list size limitation of the first (of course this limitation
really depends on your implementation of UNIX).
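A runnable sketch of that last, per-directory method over a throwaway tree (all names here are illustrative; note that $dir/* also expands to subdirectory names, so fgrep's complaints about those are discarded):

```shell
#!/bin/sh
# Throwaway tree to run the per-directory method against.
tmp=$(mktemp -d)
mkdir "$tmp/d1" "$tmp/d2"
echo 'SEARCH_STRING lives here' > "$tmp/d1/a"
echo 'something else entirely'  > "$tmp/d2/b"

dirlist=`find "$tmp" -type d -print`
for dir in $dirlist
do
  # $dir/* includes subdirectory names; fgrep's errors on those
  # are thrown away here.
  fgrep "SEARCH_STRING" $dir/* 2>/dev/null
done

rm -rf "$tmp"
```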

Hope that satisfied your curiosity,

Georg S. Nikodym  --  (416) 922-1937	|
SCO Canada, Inc.			| "Language is virus from outer space"
Toronto, Ontario			|             -William S. Burroughs
georgn@sco.COM				|

None of this information has been tested at all, USE CAUTION!!!!   ;-)

gt0178a@prism.gatech.EDU (Jim Burns) (10/03/90)

in article <MIKEP.90Oct2103839@dirty.csc.ti.com>, mikep@dirty.csc.ti.com (Michael A. Petonic) says:

> In article <143198@sun.Eng.Sun.COM> lm@slovax.Sun.COM (Larry McVoy) writes:

>>I would probably use ftw(3).  I've found it a useful tool.  

> I can't see how 
> 	find . -type f -print | xargs fgrep foobar 
> is too slow...  If there were any speed advantage in hand crafting
> a program to do the same, I'm not sure it would be worth doing so.

Be warned that (usually) find (and du) don't follow symbolic links -
ftw(3) does.

-- 
BURNS,JIM
Georgia Institute of Technology, Box 30178, Atlanta Georgia, 30332
uucp:	  ...!{decvax,hplabs,ncar,purdue,rutgers}!gatech!prism!gt0178a
Internet: gt0178a@prism.gatech.edu

guy@auspex.auspex.com (Guy Harris) (10/05/90)

>>I would probably use ftw(3).  I've found it a useful tool.  
>
>I believe find uses ftw.

Unless you're talking about S5R4 or 4.4BSD or maybe 4.3-Reno, you
believe incorrectly; while the "find" in those releases may be built
atop "ftw" (or atop an enhanced version of same), in most UNIXes out
there "find" doesn't use "ftw".

What *is* the case is that the code of "ftw" is basically the guts of
"find" ripped out and converted into a subroutine, but that's a
different matter.

tif@doorstop.austin.ibm.com (Paul Chamberlain) (10/06/90)

In article <1990Oct2.041451.3929@blilly.UUCP> bruce@balilly.UUCP (Bruce Lilly) writes:
>In article <299@lysator.liu.se> pen@lysator.liu.se (Peter Eriksson) writes:
>>(Using the find+fgrep combination is slooooow....)
>find . -print | xargs fgrep string

Hello?  Anybody home?

Fgrep means Fixed, egrep means Exponential (like in Exponentially faster)!
If you care one tiny bit about the speed you'll only use fgrep when you
have to.  Now can we stop discussing the fastest way to use fgrep?

This lesson is a review for those that didn't believe it the first time.

Paul Chamberlain | I do NOT represent IBM.     tif@doorstop, sc30661 at ausvm6
512/838-7008     | ...!cs.utexas.edu!ibmaus!auschs!doorstop.austin.ibm.com!tif

fuchs@it.uka.de (Harald Fuchs) (10/09/90)

lwall@jpl-devvax.JPL.NASA.GOV (Larry Wall) writes:
>open(FIND,"find topdir -type f -print|") || die "Can't run find: $!\n";
Shouldn't that read something like
  $p = open(FIND,"find topdir -type f -print|");
  $x = $!;
  die "Can't run find: $x\n" unless kill 0, $p;
--

Harald Fuchs <fuchs@it.uka.de> <fuchs%it.uka.de@relay.cs.net> ...

shawn@ncrcae.Columbia.NCR.COM (Shawn Shealy) (10/11/90)

>In article <299@lysator.liu.se> pen@lysator.liu.se (Peter Eriksson) writes:
>>I'd like to know how to scan all files in a directory (and its sub-
>>directories) for a specific string (without regular expressions) as fast
>>as possible with standard Unix tools and/or some special programs.

My favorite is:

find <Directory_Name> -type f -print | xargs fgrep "<Search_String>"

but I would not dare say that it is "as fast as possible".

jbm@celebr.uucp (John B. Milton) (10/18/90)

In article <1990Oct2.192028.29731@sco.COM> georgn@sco.COM (Georg Nikodym) writes:
>In article <299@lysator.liu.se> pen@lysator.liu.se (Peter Eriksson) writes:
>>I'd like to know how to scan all files in a directory (and its sub-
>>directories) for a specific string (without regular expressions) as fast
>>as possible with standard Unix tools and/or some special programs.
>>
>>(I've written such a program, and would like to compare my implementation
>>of it with others.)
>>
>>(Using the find+fgrep combination is slooooow....)
>
...
>	fgrep "SEARCH_STRING" `find . -type f -print`
...
>	FILES=`find . -type f -print`
>	for FILENAME in $FILES
>	do
>	  fgrep "SEARCH_STRING" $FILENAME
>	done
...
>	find . -type f -exec fgrep "SEARCH_STRING" {} \;
...
>	dirlist=`find . -type d -print`
>	for dir in $dirlist
>	do
>	  fgrep "SEARCH_STRING" $dir/*
>	done

And the winner is:

find . -type f -print | xargs grep SEARCH

Use whichever grep works best (fastest, does what you want, etc.)

It will work on an unlimited number of files. The fork/exec of the grep is not
too bad (xargs does NOT build 5k arg lists, but rather 470-character ones, so
that part could be improved). The grep will put out file names with each match,
which the last example won't do. The list of files is easily reduced with
filters between the find and the xargs (grep -v spool/news). All the find
qualifiers are available. You can watch the progress with ps -f. If there are
a lot of files in a directory, the $dir/* in the last example will blow up.
Watch quoting on the grep args: variable and command substitution still
happen inside double quotes ("), but not inside single quotes (').

Another fun one is to avoid binary executables, which can slow down grep a lot:

find . -type f -print | xargs file | grep -v ':.*executable' | cut -d: -f1 |
  xargs grep SEARCH

You may have to tune the "executable" to whatever file(1) puts out. You could
also add a "! -name '*.[aZz]'" to the find to avoid more non-text stuff.
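A runnable sketch of that filtering pipeline over a throwaway directory (the exact wording file(1) prints, and hence the 'executable' pattern, differs between systems, so verify it on yours; all names below are illustrative):

```shell
#!/bin/sh
# One text file and one (copied) binary in a throwaway directory.
tmp=$(mktemp -d)
echo 'needle in plain text' > "$tmp/plain.txt"
cp /bin/sh "$tmp/some-binary"

# file(1) labels each name; names whose label mentions "executable"
# are dropped before grep ever sees them.
find "$tmp" -type f -print | xargs file |
  grep -v ':.*executable' | cut -d: -f1 |
  xargs grep needle

rm -rf "$tmp"
```

(One limitation worth knowing: cut -d: -f1 misparses filenames that themselves contain a colon.)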

John

-- 
John Bly Milton IV, jbm@uncle.UUCP, n8emr!uncle!jbm@osu-cis.cis.ohio-state.edu
(614) h:252-8544, w:469-1990; N8KSN, AMPR: 44.70.0.52; Don't FLAME, inform!

wcs (Bill Stewart) (10/21/90)

In article <1990Oct17.183310.29684@celebr.uucp>, jbm@celebr.uucp (John B. Milton) writes:
> >In article <299@lysator.liu.se> pen@lysator.liu.se (Peter Eriksson) writes:
> >>I'd like to know how to scan all files in a directory (and its sub-
> >>directories) for a specific string (without regular expressions) as fast
> >>as possible with standard Unix tools and/or some special programs.
> And the winner is:
> find . -type f -print | xargs grep SEARCH
> Use whichever grep works best (fastest, does what you want, etc.)

For this case, you probably want a grep based on the Boyer-Moore
pattern-matching algorithm; several were posted to *.sources a few years ago.
Essentially, what B-M does is preprocess the search pattern, and
start looking in the text where the END of the pattern would be, e.g.
	Pat.:	foobarbaz
	Ptr:            |
	Text:	Now is the time for all good parties to foobarbaz to an end.
The 'h' in the text doesn't match the 'z', and since there are no 'h's
anywhere in the pattern, you can advance the pointer 9 positions,
avoiding a lot of comparisons.  The next look gets you:
	Pat:		 foobarbaz
	Ptr:	                 |
	Text:	Now is the time for all good parties to foobarbaz to an end.
'o' doesn't match 'z', but IS in the pattern, so you advance by 6.
	Pat:		       foobarbaz
	Ptr:	                       |
	Text:	Now is the time for all good parties to foobarbaz to an end.
And so on, until you find it or reach the end.
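The skip distances above come from what is usually called Boyer-Moore's bad-character rule. A small sketch of just that rule, computing only the last-position shift used in the example (the helper name bm_shift is mine):

```shell
#!/bin/sh
# Bad-character rule: on a mismatch at the pattern's last position,
# shift by (pattern length - 1 - rightmost index of the text character
# in the pattern), or by the whole pattern length if it is absent.
pat=foobarbaz
patlen=${#pat}

bm_shift() {            # bm_shift <mismatched text character>
    c=$1
    i=0 last=-1
    rest=$pat
    while [ -n "$rest" ]; do
        first=${rest%"${rest#?}"}   # peel off the first character
        [ "$first" = "$c" ] && last=$i
        rest=${rest#?}
        i=$((i + 1))
    done
    if [ "$last" -lt 0 ]; then
        echo "$patlen"              # absent: skip the whole pattern
    else
        echo $((patlen - 1 - last))
    fi
}

bm_shift h    # 'h' is not in "foobarbaz": prints 9
bm_shift o    # rightmost 'o' is at index 2: prints 6
```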
-- 
					Thanks; Bill
# Bill Stewart 908-949-0705 erebus.att.com!wcs AT&T Bell Labs 4M-312 Holmdel NJ
Government is like an elephant on drugs: It's very confused, makes lots of noise,
can't do anything well, stomps on anyone in its way, and it sure eats a lot.

cudcv@warwick.ac.uk (Rob McMahon) (10/23/90)

In article <1990Oct17.183310.29684@celebr.uucp> jbm@celebr.uucp (John B. Milton) writes:
>find . -type f -print | xargs grep SEARCH

I always use

	find ... -print | xargs grep SEARCH /dev/null

in case the last xargs batch holds only a single file, in which case grep
wouldn't bother to report the filename.
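The /dev/null trick in one runnable sketch (the file names are illustrative): grep only prefixes matches with the filename when it is given more than one file, so /dev/null guarantees at least two arguments.

```shell
#!/bin/sh
tmp=$(mktemp -d)
echo 'needle here' > "$tmp/only.txt"

fgrep needle "$tmp/only.txt"            # one file: no filename prefix
fgrep needle /dev/null "$tmp/only.txt"  # two files: match is prefixed

rm -rf "$tmp"
```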

Rob
--
UUCP:   ...!mcsun!ukc!warwick!cudcv	PHONE:  +44 203 523037
JANET:  cudcv@uk.ac.warwick             INET:   cudcv@warwick.ac.uk
Rob McMahon, Computing Services, Warwick University, Coventry CV4 7AL, England