[comp.sources.d] Get Hep to Kanji-Ready Five-Algorithm [ef]?grep

jaw@ames.UUCP (03/12/87)

#  "I need very little,
      and of that little,
         I need very little."  -- St. Francis of Assisi

     Hybrid blitz 'egrep', whose urquell is a veritable chimaera of at least
five string search techniques, is now further tuned.

     Posted to USENET (and the mod.sources archive) some months ago, our
implementation of "plug-compatible" egrep.c has been extended to support:

	transparent 'grep' expression translation, to bring the speed of
	hyper-'egrep' to bear upon searches specified via the less capable
	'grep' syntax.

	interception of 'fgrep' for alternations of low (<= 7) cardinality,
	using a novel method of Boyer-Moore table superimposition and
	pre-computation magic.  the resulting speedup applies also to simple
	metacharacter-free 'egrep'-style alternations.

	(the above two improvements are made invisible by linking the
	grep/egrep/fgrep triumvirate.)

	a revised strategy of fallback to standard 'egrep' for hard
	cases, which eliminates costly popen() plumbing in favor of a
	statistically-based re-exec() dynamic.

	more complete application of fast match to standard options,
	including -n line numbering.

	preparation for Kanji pattern input, based upon parity-marked EUC
	codes.  new egrep.c is "eight-bit clean".  unfortunately, the fast
	algorithms currently apply only to meta-free patterns and
	simple alternations; full Kanji regular expression treatment
	remains problematic.  however, the new code will pass difficult
	input through to [ef]?grep utilities in the UNIX Japan standard
	substrate.

     Kanji capability is perhaps the most important addition, as the
number of UNIX systems in the Orient proliferates, providing a "new market"
for Boyer-Moore-style search.  However, actual search efficacy remains
unknown until the Gaijin author gains feedback from JUNET or points beyond.
For all we know, Nippon text search utilities may already incorporate
the methods.  Tests conducted so far have been with artificial Kanji files.

     In case you are w(o|a)ndering, be reminded that no vendor source
changes are required to use the code.  It is configured as a turbo-charged
"front-end" to the existing section one commands, though it is (barely)
conceivable to adapt things, at a loss in worst-case efficiency, for
(heaven forfend!) non-Unix systems running C.  And, since we do not offer
a minimalist grep, do not expect it to run well on minimalist UNIX clones.

     Below appears a brief timing run on Webster's 2nd wordlist.  Notes
on implementation changes since original release follow in the next message.
If you want to skip the rest, fine.  The easy instructions -- download
from comp.sources [or 'anonymous' ftp of egrep.shar.Z from ames-aurora.arpa
for the (im)patient], and run:

	make
	sh eg.shell	# regression test amusement
	make install

after perusing README.FIRST.  Though the bundle in ~ftp/pub at NASA Ames
Research Center contains prerequisite support from Univ. of Toronto's
Henry Spencer, we are not re-posting regcomp()/regexec() to comp.sources.
John Gilmore of the GNU project has a modified regexec(), but it is not
mandatory for running the new egrep.

     Contrary to popular belief, old egrep is not I/O bound for large
files on most machines.  The new version is.  One sample:

	time egrep 'u.*nix' /usr/dict/web2	(2.5 MB)
	  (yielding {Coturnix, Pseudophoenix, [T]turnix}), shows

		        user	 sys      real	 (load ave. < 1)

	VAX 11/750, 4.3 BSD, Fujitsu Eagles
	   (new)	6.8      3.8       11
	   (old)       70.0      5.5       87

	Sun 3/160, 3.2 OS, Eagle on SMD
	   (new)	1.7      2.2	    5
	   (old)       14.7      1.5       16

	Cray 2, Sys 5, no disk striping
	   (new)        .93      .11        1
	   (old)       8.07      .21        8

notes:  New egrep was actually run with -i option, but not old egrep.
Also, fumbling for three-character residue is not [ef]?grep's forte,
so the example is conservative.

Sun 3 has higher sys time for some unknown reason (a guess:  VAX 4.3 kernel
handles read() calls with oddsize buffers differently?).  Cray 2 reportedly
does disk I/O at 5-10 megabytes per second, but the architecture/compiler
is bad at byte addressing -- no cache, no vectors here.  Unfair comparison:
new egrep on Sun beats old egrep on Cray 2, even with fast Cray I/O!

Speculation:  the code might be useful on the Macintosh II, even if the Unix
filesystem (Sys 5?) were to waste 3/4 of the 1 MB/sec SCSI disk bandwidth.
Mac 2 testers please forward info to ames!jaw.

     Another metric is inner loop efficiency:

				# instructions
	VAX Berkeley cc			5
	Sun 68020 3.2 cc		6
	Stallman's GNU 68020 cc		4
	MIPS R2000			6
	Cray 2			       25

Thanks go to mips!dce (David Elliott) for his testing time, as well as
to RMS for two-upping Sun's C compiler.

     Of course, if you have a Connection Machine dedicated to running their
admirable full-text keyworder on "in-core" text, you won't need [ef]?grep at
all.  And, for unindexed text on fine-grained parallel machines, reg. expr.
search algorithms can be made to run with a lower time bound (see the Hillis
book).  But if your files are on disk, even a specialized search chip won't help
once things become I/O or bus limited.  For this reason, vectorization on a
Cray(ette) would be a bust, though Cray buffs may write the author for other
scalar speedup ideas...

[continued]

jaw@ames.UUCP (03/12/87)

#  How long a time lies in one little word! -- Shakespeare, Richard II, I, iii

#  Fine words butter no parsnips. -- Southern proverb

		[ef]?grep Implementation Changes

'grep' r.e. translation:

     To buy speed for the novice 'grep' user who deigns not to learn the
extended 'egrep' syntax, we translate 'grep' r.e.'s to the 'egrep' superset.
It is straightforward enough to escape search-pattern characters meaningful
to 'egrep' but not to 'grep'.  Odd cases include the -w option, not implemented
in standard 'egrep', the defunct -y option, and "tagged expressions", which
are done via an exec() of /bin/grep.  Tagged exprs, like

	grep '\(....\).*\1' /usr/dict/words

which outputs chaff like "beriberi", "couscous", "hodgepodge", and
"lightweight", are weird.  The irregularity these exprs lend coupled with
a low complexity/utility ratio kept them from being part of 'egrep'.
But for this feature, old 'grep' code could be thrown away.

'fgrep' improvement / (partial) unification:

     In the new release, we trap low-complexity disjunctions such as

		[ef]grep 'boyer|moore' file

(or -f patfile in place of the pattern) with a method to superimpose the
non-terminals within the Boyer/Moore table.  When scanning text backwards,
other programming tricks short-circuit some tests against the pattern.
Sparing further details, which might make for a more formal writeup, it
suffices to say that although worst-case complexity here is O(Rn) with string
length 'n', and R == r.e. size, average-case for text is still sublinear.  E.g.

	egrep 'silver|orange|purple' /usr/dict/words	# hard-to-rhyme color test in eg.shell

looks at ~55000 chars in /usr/dict/words, whereas (three) separate invocations
of egrep on the individual color words make the code look at ~40000 bytes per
word.  Aho/Corasick's 'fgrep', in contrast, must look at all 200KB in the
dictionary.  The elegant "trie" construction within "fgrep" excels, however,
for medium/large R.

     Since the syntax for [ef]grep is similar, we thought of making egrep
hand off to fgrep for sufficiently large metacharacter-free R, as there is no
strong reason to make the user conscious of the separate algorithms.  Certain
technicalities prevent this.  For one, we are not willing to invent an 'egrep'
option telling the code that a file of (say) a hundred word alternatives
containing some innocent metacharacter is literal 'fgrep' input, rather than
a closure-containing 'egrep' pattern which would otherwise make egrep
explode.  More work could be done here.

     Our motivation?  Is this not all overblown?  Perhaps, but now you can
build a simple fast "NSA filter", or search for the seven dwarfs at leisure.
Besides, the final nail needed to be driven into 'bm/match', which tried
to do parallel match, but actually shuffled things out of order during its
simplistic block-based scheme.  These programs, also part of the source archive,
are now historical curiosities.

Kanji egrep:

     Copious notes are in README.kanji.mods.  The March 1987 Unix Review
was indispensable for pointing out the embedded "SS2" Katakana pitfalls.
The modularity of our code as a semi-dependent filter was necessary for this
exploration, as we have no access to AT&T/Unisoft Kanji code.  Again, JUNET
or Sigma project people -- please respond with grep war stories or usage notes.

Worst-case r.e. handling:

     The first code release elaborately called upon a function mcilroy()
to pipe partial match output to old egrep for tough expressions, whose
backtracking might swamp regexp().  Some details of the DFA/NDFA tradeoff
were discussed in pep4grep.doc[12].  Due largely to feedback from John Gilmore
of the GNU project, the strategy was revised.  egrep.c function kernighan()
("let others do the hard part") now usurps calls to costly popen() by invoking
exec() on old egrep when necessary.  Rough partial match statistics gathered
on the fly determine the handoff.  You may revise the time reported previously
for
	egrep 'hoe.*g' /usr/dict/words

from 1.2 user, 1.1 sys seconds (VAX 11/750, 4.3BSD, Fuji disks) to 0.8u, 0.4s.
For those public-spirited souls who really want to build a PD egrep out of
what we offer, sans fallback from regexp() to an AT&T /usr/bin/egrep, the
slippery test "egrep 'g+h+o' /usr/dict/words" will prove enlightening.

Faster -n option:

     By popular demand.  Though Boyer/Moore techniques subvert line numbering,
we've made it faster with brute force (loop unrolling helps VAXEN, but not
CRISPS).  Timing tests for this and other options appear in the eg.shell script.

Not so fast:

	-v, -b, -w, various r.e.'s with no regexp() "residue"

(you'll still have to use the layered "grep host /etc/hosts | grep -w host"
for speed.)

Other contra-indications for new [ef]?grep:

	Monster patterns

     The amazing expressions output by /usr/lib/calendar still beg for
the lazy evaluation technique rolled into edition 8 egrep by Prof. Aho of
Princeton.  Hinted at on p. 105 in Kernighan & Pike, lazy evaluation reduces
standard egrep r.e. compile time.  Here the possible O(R**2) cost of
machine construction is not paid up front; the work is amortized at run
time and incurred only if a bad match actually happens.  Whew!  Fortunately,
this is not so important for simple r.e. fare, where H. Spencer's regexp()
works well, but it certainly helps calendar(1).

     The catch with lazy eval. is that it slows down simple matching (15-20%
for /usr/dict/words on VAX), so it hasn't been adopted by System V egrep.
Note that our egrep, deferring to the underlying one in /usr/bin, doesn't
care much about these hideous beasts -- it just doesn't do better on them.
However, [ef]?grep does well by the Kadhafi matcher (eg.shell, again).

	Long lines, small alphabets

     Finally, a comment on one rapidly burgeoning application area
where new egrep should not be blindly prescribed -- genome scanning.
Though line limits have been raised (to 8192 byte buffers), much of
GENBANK has no newlines.  The code would need modification for searching
freestyle.  Also, locating ACGT sequences with "superimposed BMG" over
a 4-letter alphabet might actually be worse, but the global homology
crowd probably uses a >20 letter protein alphabet (for other reasons).
At any rate, genetic string search generally relies on more sophisticated
methods such as dynamic programming a la Sankoff/Kruskal.

     On the other hand, large alphabets such as Kanji probably help
performance.  As do parallel transfer disks, MACH file mapping, ...
Your suggestions welcome.

     James A. Woods (ames!jaw)
     NASA Ames Research Center

P.S.  Preserving author credit, [ef]?grep may be redistributed as you wish.

rs@mirror.UUCP (03/12/87)

The summary says it all:  The super-fast does-it-all Kanji-ready
new grep will be published in mod.sources in two to three weeks;
there are a few large items ahead of it in the queue.  Please don't
send me mail asking for it before then.

BTW, did anyone else notice a fast, breathless feeling as they
read James's articles? :-)

	/rich $alz
-- 
Rich $alz					"Drug tests p**s me off"
Mirror Systems, Cambridge Massachusetts		rs@mirror.TMC.COM
{adelie, mit-eddie, ihnp4, harvard!wjh12, cca, cbosgd, seismo}!mirror!rs