[comp.sources.d] A Gaggle of GREPs

jaw@eos.UUCP (James A. Woods) (07/26/88)

#  wild card (n.)  1. A symbol which defaults to anything and is therefore
   mandatory at moments of doubt.  See DEFAULT.  2. A tab card randomly
   inserted in a pack to enliven the action.  See also LUDDITE.
	-- Stan Kelly-Bootle, The Devil's DP Dictionary

I write this note as a veteran of the "GREP WARS."  To those
users of my venerable 2-year-old fast [ef]?grep from comp.sources,
this message is a teaser for yet another grep replacement.

You may be aware of Andrew Hume's laudable work unifying the various
Bell Labs grep versions.  It is especially nice to see AT&T adapt
the speedup ideas concerning hybridization of the Boyer/Moore/Gosper
algorithm with regular expression search, as manifested in the
comp.sources package.  Hume has a forthcoming paper in "Software
Practice and Experience" detailing a bit of his rewrite, which I
had the pleasure of formally reviewing.  (The SP&E paper, however,
discusses I/O tweaks more than it does algorithms.)

If you follow the GNU project, you may also be aware of Mike Haertel's
valiant generic grep/egrep clone (with added value of sorts, particularly
in the way of a line context option).  It is free of AT&T license 
restrictions.  

My old public-domain effort is also unrestricted, except for the
odd choice of calling blackbox AT&T binaries from certain weird corners
of the program.  This has presented difficulties for GNU, now remedied.

Without going into myriad detail at this point, I, armed with
invaluable assistance from Arthur David Olson, have sped up
Haertel's automaton to incorporate BMG for a much broader class
of expressions than "old" fast egrep.
We've tested the GNU/Haertel code (plus our additions) on everything
from Scott Anderson's infamous "Khadaffi" matcher to the 121 egrep
regression microtests from Henry Spencer.
I am tracking the official GNU distribution with the supplement
until such time as the improvements are folded in (pending legal
signoff, new thoughts from Haertel, etc.)

Why all this hash?  If you are an AT&T loyalist, little of this
matters as long as you can wait for some kind vendor to port
grep from Unix V9 to System 5 rel4 on the 3B series to your
favorite desktop or supercomputer.  This could be well over a year.
(Witness the two-year hiatus from BSD 4.3 code to SunOS 4.0.)
Better yet would be access from the AT&T Toolchest, which would be glad 
to nickel-and-dime you with yet another license agreement.

Thus GNU egrep, with fast algorithms.  Stay tuned ...

James A. Woods
NASA Ames Research Center
ames!jaw
jaw@ames.arc.nasa.gov

P.S. State-of-the-art GNU 'fgrep', with source, must wait until one
of you intrepid souls implement it.  See the 1980 Commentz-Walter paper
"A String Matching Algorithm Fast on the Average", in 
"Automata, Languages, and Programming", 
Springer Verlag Lecture Notes in Computer Science #71,
Goos & Hartmanis, eds., Graz, Austria, July 1979.
The general Commentz-Walter method is 4-5X faster than Aho 'fgrep',
so it deserves re-implementation for GNU.