pen@lysator.liu.se (Peter Eriksson) (09/26/90)
I'd like to know how to scan all files in a directory (and its sub-
directories) for a specific string (without regular expressions) as fast
as possible with standard Unix tools and/or some special programs.

(I've written such a program, and would like to compare my implementation
of it with others.)

(Using the find+fgrep combination is slooooow....)

Any ideas?
--
Peter Eriksson                                     pen@lysator.liu.se
Lysator Computer Club                    ...!uunet!lysator.liu.se!pen
University of Linkoping, Sweden                      "Seize the day!"
jtc@van-bc.wimsey.bc.ca (J.T. Conklin) (09/28/90)
In article <299@lysator.liu.se> pen@lysator.liu.se (Peter Eriksson) writes:
>I'd like to know how to scan all files in a directory (and its sub-
>directories) for a specific string (without regular expressions) as fast
>as possible with standard Unix tools and/or some special programs.
>
>(Using the find+fgrep combination is slooooow....)

I have found find+xargs+fgrep to be quite reasonable when I need to
perform such a task as yours.  Are you possibly omitting the xargs, thus
forcing find to exec an fgrep for each file?

	--jtc
--
J.T. Conklin	UniFax Communications Inc.
		...!{uunet,ubc-cs}!van-bc!jtc, jtc@wimsey.bc.ca
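The difference being described can be sketched on a throwaway tree (a
minimal illustration, not from the thread: the mktemp scratch directory,
the file names, and the `-l` flag are all made up here):

```shell
# Build a small scratch tree to search.
d=$(mktemp -d)
mkdir -p "$d/sub"
echo "needle here" > "$d/a.txt"
echo "nothing"     > "$d/sub/b.txt"

# Without xargs: find fork/execs one fgrep per file found.
find "$d" -type f -exec fgrep -l needle {} \;

# With xargs: filenames are batched, so far fewer fgrep processes run.
find "$d" -type f -print | xargs fgrep -l needle

rm -rf "$d"
```

Both pipelines print the same matching filename; the second simply
starts fgrep once per batch of arguments instead of once per file.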
pfalstad@phoenix.Princeton.EDU (Paul John Falstad) (09/28/90)
In article <2165@van-bc.wimsey.bc.ca> jtc@van-bc.wimsey.bc.ca (J.T. Conklin) writes:
>In article <299@lysator.liu.se> pen@lysator.liu.se (Peter Eriksson) writes:
>>I'd like to know how to scan all files in a directory (and its sub-
>>directories) for a specific string (without regular expressions) as fast
>>as possible with standard Unix tools and/or some special programs.
>>
>>(Using the find+fgrep combination is slooooow....)
>I have found find+xargs+fgrep to be quite reasonable when I need
>to perform such a task as yours.  Are you possibly omitting the xargs,
>thus forcing find to exec an fgrep for each file?

I agree that find+xargs should do the job, but fgrep...  On our system,
at least (SunOS 4.1), the 'f' in fgrep does NOT stand for fast.  fgrep is
designed to grep for one or more Fixed strings.  egrep is actually the
fastest if you have a lot of input.

I just tested fgrep and egrep by making a file consisting of 42 copies of
/usr/dict/words and grepping for the string 'dictionar'.  egrep did it in
1/8 the time of fgrep.

If Lafontaine's elk would spurn Tom Jones, the engine must be our head, the
dining car our esophagus, the guardsvan our left lung, the kettle truck our
shins, the first class compartment the piece of skin at the nape of the neck,
and the level crossing an electric elk called Simon.
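A rough harness along the lines of the experiment above (a sketch only:
the generated word list stands in for 42 copies of /usr/dict/words, and
the sizes are made up; prepend `time` to each grep to reproduce the speed
comparison on your own system):

```shell
# Build a scratch "dictionary" ending in the one line that matches.
d=$(mktemp -d)
i=0
while [ $i -lt 1000 ]; do
  echo "word$i"
  i=$((i+1))
done > "$d/words"
echo "dictionary" >> "$d/words"

# Both count the same matching lines; their speed may differ widely.
fgrep -c dictionar "$d/words"   # fixed-string search
egrep -c dictionar "$d/words"   # extended-RE search, same result here

rm -rf "$d"
```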
lwall@jpl-devvax.JPL.NASA.GOV (Larry Wall) (09/28/90)
In article <1990Sep27.195749.2552@iwarp.intel.com> merlyn@iwarp.intel.com (Randal Schwartz) writes:
: In article <299@lysator.liu.se>, pen@lysator (Peter Eriksson) writes:
: | I'd like to know how to scan all files in a directory (and its sub-
: | directories) for a specific string (without regular expressions) as fast
: | as possible with standard Unix tools and/or some special programs.
: |
: | (I've written such a program, and would like to compare my implementation
: | of it with others.)
: |
: | (Using the find+fgrep combination is slooooow....)
: |
: | Any ideas?
:
: I assume your objection to find+fgrep is that you must start an fgrep
: on each filename (or set of filenames if you use xargs).  Here's a
: solution in Perl that uses 'find' to spit out the filenames (because
: it is fast at that) and Perl to scan the text files.
:
: ================================================== snip
: #!/usr/bin/perl
: $lookfor = shift;
: open(FIND,"find topdir -type f -print|") || die "Cannot find: $!";
: MAIN: while($FILE = <FIND>) {
:     open(FILE) || next MAIN; # skip it if I can't open it for read
:     { local($/); undef $/; # slurp fast
:         while (<FILE>) {
:             (print "$FILE\n"), next MAIN if index($_,$lookfor);
:         }
:     }
: }
: ================================================== snip
:
: This will have a problem if $lookfor straddles buffers, but it'll find
: everything else.  (A small matter of programming to fix the straddling
: problem.)

A bit hasty there, Randal.  First, you have to say index($_,$lookfor) >= 0.
Second, there's no buffer straddling problem--undeffing $/ will cause it
to slurp in the whole file.  Thus, the inner while loop is unnecessary.
Thirdly, you didn't chop the filename.

Another consideration is that Perl doesn't do Boyer-Moore indexing
(currently) except on literals.  So you probably want to use an eval
around the loop, to tell Perl it's okay to compile the BM search table
for the string.
I think I'd do it something like this:

#!/usr/bin/perl
($lookfor = shift) =~ s/(\W)/\\$1/g;    # quote metas
open(FIND,"find topdir -type f -print|") || die "Can't run find: $!\n";
eval <<EOB;
while (<FIND>) {
    chop;
    open(FILE, \$_) || next;
    \$size = (stat(FILE))[7];
    read(FILE,\$buf,\$size) || next;
    print "\$_\n" if \$buf =~ /$lookfor/;
}
EOB

Using read() may or may not beat using <FILE>--it depends on how
efficiently your fread() function is coded.  Some are real smart, and
read directly into your variable.  Others are real stupid, and copy the
data a time or two first.

You might also want to throw a -T FILE test in there after the open if
you want to reject binary files outright.

Larry
cpcahil@virtech.uucp (Conor P. Cahill) (09/28/90)
In article <299@lysator.liu.se> pen@lysator.liu.se (Peter Eriksson) writes:
>I'd like to know how to scan all files in a directory (and its sub-
>directories) for a specific string (without regular expressions) as fast
>as possible with standard Unix tools and/or some special programs.
>
>(Using the find+fgrep combination is slooooow....)

Try find + xargs + fgrep.

Try find + xargs + gnu-grep.
--
Conor P. Cahill            (703)430-9247        Virtual Technologies, Inc.,
uunet!virtech!cpcahil                           46030 Manekin Plaza, Suite 160
                                                Sterling, VA 22170
lm@slovax.Sun.COM (Larry McVoy) (10/01/90)
In article <299@lysator.liu.se> pen@lysator.liu.se (Peter Eriksson) writes:
>I'd like to know how to scan all files in a directory (and its sub-
>directories) for a specific string (without regular expressions) as fast
>as possible with standard Unix tools and/or some special programs.
>
>(I've written such a program, and would like to compare my implementation
>of it with others.)
>
>(Using the find+fgrep combination is slooooow....)
>
>Any ideas?

I would probably use ftw(3).  I've found it a useful tool.
---
Larry McVoy, Sun Microsystems     (415) 336-7627 ...!sun!lm or lm@sun.com
bruce@balilly.UUCP (Bruce Lilly) (10/02/90)
In article <143198@sun.Eng.Sun.COM> lm@sun.UUCP (Larry McVoy) writes:
>In article <299@lysator.liu.se> pen@lysator.liu.se (Peter Eriksson) writes:
>>I'd like to know how to scan all files in a directory (and its sub-
>>directories) for a specific string (without regular expressions) as fast
>>as possible with standard Unix tools and/or some special programs.
>>
>>(I've written such a program, and would like to compare my implementation
>>of it with others.)
>>
>>(Using the find+fgrep combination is slooooow....)

Depends on how you do it.  Try:

	find . -print | xargs fgrep string

>>
>>Any ideas?
>
>I would probably use ftw(3).  I've found it a useful tool.

I believe find uses ftw.
--
Bruce Lilly		blilly!balilly!bruce@sonyd1.Broadcast.Sony.COM
mikep@dirty.csc.ti.com (Michael A. Petonic) (10/02/90)
In article <143198@sun.Eng.Sun.COM> lm@slovax.Sun.COM (Larry McVoy) writes:
>In article <299@lysator.liu.se> pen@lysator.liu.se (Peter Eriksson) writes:
>>I'd like to know how to scan all files in a directory (and its sub-
>>directories) for a specific string (without regular expressions) as fast
>>as possible with standard Unix tools and/or some special programs.
>>
>>(I've written such a program, and would like to compare my implementation
>>of it with others.)
>>
>>(Using the find+fgrep combination is slooooow....)
>>
>>Any ideas?
>
>I would probably use ftw(3).  I've found it a useful tool.

I can't see how

	find . -type f -print | xargs fgrep foobar

is too slow...  If there were any speed advantage in hand crafting a
program to do the same, I'm not sure it would be worth doing so.

-MikeP
georgn@sco.COM (Georg Nikodym) (10/03/90)
In article <299@lysator.liu.se> pen@lysator.liu.se (Peter Eriksson) writes:
>I'd like to know how to scan all files in a directory (and its sub-
>directories) for a specific string (without regular expressions) as fast
>as possible with standard Unix tools and/or some special programs.
>
>(I've written such a program, and would like to compare my implementation
>of it with others.)
>
>(Using the find+fgrep combination is slooooow....)
>
>Any ideas?

Well, if there aren't too many files (i.e., less than 200-300) you can do
the following:

	fgrep "SEARCH_STRING" `find . -type f -print`

This will save the overhead of restarting fgrep for each file.

Another way is:

	FILES=`find . -type f -print`
	for FILENAME in $FILES
	do
		fgrep "SEARCH_STRING" $FILENAME
	done

This method is immune to list size limitations (because there is no
argument list passed to a single command), but fork/exec's a new fgrep
for each file.

Another way is:

	find . -type f -exec fgrep "SEARCH_STRING" {} \;

This method is probably the one you first tried.  It's similar to the
example above and probably takes just as long (but I won't bet money on
either ;-).

But the best method (i.e. the one I used to use):

	dirlist=`find . -type d -print`
	for dir in $dirlist
	do
		fgrep "SEARCH_STRING" $dir/*
	done

This method has a speed advantage over the second two examples, without
the list size limitation of the first (of course, this limitation really
depends on your implementation of UNIX).

Hope that satisfied your curiosity,
--
Georg S. Nikodym -- (416) 922-1937 | SCO Canada, Inc.
Toronto, Ontario                   | "Language is virus from outer space"
georgn@sco.COM                     |                -William S. Burroughs
None of this information has been tested at all, USE CAUTION!!!! ;-)
gt0178a@prism.gatech.EDU (Jim Burns) (10/03/90)
in article <MIKEP.90Oct2103839@dirty.csc.ti.com>, mikep@dirty.csc.ti.com (Michael A. Petonic) says:
> In article <143198@sun.Eng.Sun.COM> lm@slovax.Sun.COM (Larry McVoy) writes:
>>I would probably use ftw(3).  I've found it a useful tool.
> I can't see how
> find . -type f -print | xargs fgrep foobar
> is too slow...  If there were any speed advantage in hand crafting
> a program to do the same, I'm not sure it would be worth doing so.

Be warned that (usually) find (and du) don't follow symbolic links -
ftw(3) does.
--
BURNS,JIM
Georgia Institute of Technology, Box 30178, Atlanta Georgia, 30332
uucp:	  ...!{decvax,hplabs,ncar,purdue,rutgers}!gatech!prism!gt0178a
Internet: gt0178a@prism.gatech.edu
guy@auspex.auspex.com (Guy Harris) (10/05/90)
>>I would probably use ftw(3).  I've found it a useful tool.
>
>I believe find uses ftw.

Unless you're talking about S5R4 or 4.4BSD or maybe 4.3-Reno, you believe
incorrectly; while the "find" in those releases may be built atop "ftw"
(or atop an enhanced version of same), in most UNIXes out there "find"
doesn't use "ftw".

What *is* the case is that the code of "ftw" is basically the guts of
"find" ripped out and converted into a subroutine, but that's a different
matter.
tif@doorstop.austin.ibm.com (Paul Chamberlain) (10/06/90)
In article <1990Oct2.041451.3929@blilly.UUCP> bruce@balilly.UUCP (Bruce Lilly) writes:
>In article <299@lysator.liu.se> pen@lysator.liu.se (Peter Eriksson) writes:
>>(Using the find+fgrep combination is slooooow....)
>find . -print | xargs fgrep string

Hello?  Anybody home?  Fgrep means Fixed, egrep means Exponential (like
in Exponentially faster)!  If you care one tiny bit about the speed,
you'll only use fgrep when you have to.  Now can we stop discussing the
fastest way to use fgrep?

This lesson is a review for those that didn't believe it the first time.

Paul Chamberlain | I do NOT represent IBM.     tif@doorstop, sc30661 at ausvm6
512/838-7008     | ...!cs.utexas.edu!ibmaus!auschs!doorstop.austin.ibm.com!tif
fuchs@it.uka.de (Harald Fuchs) (10/09/90)
lwall@jpl-devvax.JPL.NASA.GOV (Larry Wall) writes:

>open(FIND,"find topdir -type f -print|") || die "Can't run find: $!\n";

Shouldn't that read something like

	$p = open(FIND,"find topdir -type f -print|");
	$x = $!;
	die "Can't run find: $x\n" unless kill 0, $p;
--
Harald Fuchs <fuchs@it.uka.de> <fuchs%it.uka.de@relay.cs.net> ...
shawn@ncrcae.Columbia.NCR.COM (Shawn Shealy) (10/11/90)
>In article <299@lysator.liu.se> pen@lysator.liu.se (Peter Eriksson) writes:
>>I'd like to know how to scan all files in a directory (and its sub-
>>directories) for a specific string (without regular expressions) as fast
>>as possible with standard Unix tools and/or some special programs.

My favorite is:

	find <Directory_Name> -type f -print | xargs fgrep "<Search_String>"

but I would not dare say that it is "as fast as possible".
jbm@celebr.uucp (John B. Milton) (10/18/90)
In article <1990Oct2.192028.29731@sco.COM> georgn@sco.COM (Georg Nikodym) writes:
>In article <299@lysator.liu.se> pen@lysator.liu.se (Peter Eriksson) writes:
>>I'd like to know how to scan all files in a directory (and its sub-
>>directories) for a specific string (without regular expressions) as fast
>>as possible with standard Unix tools and/or some special programs.
>>
>>(I've written such a program, and would like to compare my implementation
>>of it with others.)
>>
>>(Using the find+fgrep combination is slooooow....)
> ...
>	fgrep "SEARCH_STRING" `find . -type f -print`
> ...
>	FILES=`find . -type f -print`
>	for FILENAME in $FILES
>	do
>		fgrep "SEARCH_STRING" $FILENAME
>	done
> ...
>	find . -type f -exec fgrep "SEARCH_STRING" {} \;
> ...
>	dirlist=`find . -type d -print`
>	for dir in $dirlist
>	do
>		fgrep "SEARCH_STRING" $dir/*
>	done

And the winner is:

	find . -type f -print | xargs grep SEARCH

Use whichever grep works best (fastest, does what you want, etc.).  It
will work on an unlimited number of files.  The fork/exec of the grep is
not too bad (xargs does NOT build 5k arg lists, but rather 470
characters, so that part could be improved).  The grep will put out file
names with each match, which the last example won't do.  The list of
files is easily reduced with filters between the find and the xargs
(grep -v spool/news).  All the find qualifiers are available.  You can
watch the progress with ps -f.

If there are a lot of files in a directory, the $dir/* in the last
example will blow up.  Watch (") on grep args, as shell wildcards are
expanded inside ("), but not (').

Another fun one is to avoid binary executables, which can slow down grep
a lot:

	find . -type f -print | xargs file | grep -v ':.*executable' |
		cut -d: -f1 | xargs grep SEARCH

You may have to tune the "executable" to whatever file(1) puts out.  You
could also add a "! -name '*.[aZz]'" to the find to avoid more non-text
stuff.
John
--
John Bly Milton IV, jbm@uncle.UUCP, n8emr!uncle!jbm@osu-cis.cis.ohio-state.edu
(614) h:252-8544, w:469-1990; N8KSN, AMPR: 44.70.0.52; Don't FLAME, inform!
wcs (Bill Stewart) (10/21/90)
In article <1990Oct17.183310.29684@celebr.uucp>, jbm@celebr.uucp (John B. Milton) writes:
> >In article <299@lysator.liu.se> pen@lysator.liu.se (Peter Eriksson) writes:
> >>I'd like to know how to scan all files in a directory (and its sub-
> >>directories) for a specific string (without regular expressions) as fast
> >>as possible with standard Unix tools and/or some special programs.
> And the winner is:
> find . -type f -print | xargs grep SEARCH
> Use whichever grep works best (fastest, does what you want, etc.)

For this case, you probably want a grep based on the Boyer-Moore
pattern-matching algorithm; several were posted to *.sources a few years
ago.  Essentially, what B-M does is preprocess the search pattern, and
start looking in the text where the END of the pattern would be, e.g.

	Pat:  foobarbaz
	Ptr:          |
	Text: Now is the time for all good parties to foobarbaz to an end.

The 'h' in the text doesn't match the 'z', and since there are no 'h's
anywhere in the pattern, you can advance the pointer 9 positions,
avoiding a lot of comparisons.  The next look gets you:

	Pat:           foobarbaz
	Ptr:                   |
	Text: Now is the time for all good parties to foobarbaz to an end.

'o' doesn't match 'z', but IS in the pattern, so you advance by 6.

	Pat:                 foobarbaz
	Ptr:                         |
	Text: Now is the time for all good parties to foobarbaz to an end.

And so on, until you find it or reach the end.
--
Thanks;  Bill
# Bill Stewart 908-949-0705 erebus.att.com!wcs AT&T Bell Labs 4M-312 Holmdel NJ
Government is like an elephant on drugs: It's very confused, makes lots of
noise, can't do anything well, stomps on anyone in its way, and it sure
eats a lot.
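The shift distances in the walkthrough above can be checked mechanically.
This is a sketch only (awk under sh, with the pattern hard-coded for
illustration): it computes the bad-character shift for each text
character examined, which is the full pattern length when the character
is absent from the pattern, otherwise the distance from its last
occurrence to the end of the pattern.

```shell
pat=foobarbaz
# 'h' (absent), 'o' (present), and 'z' (the final character) are the
# characters examined in the walkthrough above.
for c in h o z; do
  echo "$pat" | awk -v c="$c" '{
    n = length($0); last = -1
    # Find the last occurrence of c in the pattern (0-based index).
    for (i = 1; i <= n; i++) if (substr($0, i, 1) == c) last = i - 1
    # Absent character: shift by the whole pattern length.
    # Present character: align its last occurrence under the text.
    shift = (last < 0) ? n : n - 1 - last
    print c, shift
  }'
done
```

This prints shifts of 9 for 'h' and 6 for 'o', matching the article; a
shift of 0 (as for 'z') means the end characters match, so a full
comparison is attempted there instead.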
cudcv@warwick.ac.uk (Rob McMahon) (10/23/90)
In article <1990Oct17.183310.29684@celebr.uucp> jbm@celebr.uucp (John B. Milton) writes:
>find . -type f -print | xargs grep SEARCH

I always use

	find ... -print | xargs grep SEARCH /dev/null

in case the last xargs invocation ends up with a single file on its own,
in which case grep wouldn't bother to report the filename.

Rob
--
UUCP:   ...!mcsun!ukc!warwick!cudcv	PHONE:  +44 203 523037
JANET:  cudcv@uk.ac.warwick             INET:   cudcv@warwick.ac.uk
Rob McMahon, Computing Services, Warwick University, Coventry CV4 7AL, England
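A quick demonstration of why the trailing /dev/null matters (a minimal
sketch on a scratch file; the directory and file name are made up):

```shell
d=$(mktemp -d)
echo "needle" > "$d/only.txt"

# One file argument: grep prints just the matching line, no filename.
grep needle "$d/only.txt"

# Two arguments (the file plus /dev/null): grep prefixes the filename,
# exactly as it would if xargs had handed it several files.
grep needle "$d/only.txt" /dev/null

rm -rf "$d"
```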