[comp.sys.atari.st] More on Egrep

bba@mtgzz.UUCP (XMRJ40000[pjc]-b.banerjee) (03/22/88)
Hi,

Enough! I have  received ~ 30 requests for James  Wood's Egrep. I
am  cleaning  it   up  and  sending  it  to   the  moderators  of
Comp.{sources,binaries}.amiga.  Some comments  on  the letters  I
received.

a. I received 1 letter from an identifiable ST user. This doesn't
justify a posting to the atari groups. Atari people who want this
will have to keep an eye on the amiga source groups.

b. One person asked me if  they could send me a diskette instead.
YES --  provided you send  a self addressed STAMPED  mailer along
with it. My post office is about  5 miles away, and I'll keep the
diskette unless I can just copy it and drop it in the mailbox. Do
this only if you cannot  receive them through the amiga moderated
newsgroups. I have  a 1 drive Amiga  500, and I'm not  set up for
massive disk  copying. Nor  do I have  the facilities  for making
disks for other machines.

c. Quite a few people suggested that  I send it to Fred Fish. I'm
not going to do that. The reason  is that I am did not write this
--  and  I  am  not  particularly  proud  of  the  port.  I  just
transferred it  over, and banged  on it  till it worked.  I would
prefer that one of the  Amiga wizards made it suitably amiga-ized
before we allow  it out into the non usenet  community. It should
also  be back  ported to  unix with  all the  amiga modifications
conditionally compiled  out -- and  have regression tests  run on
it.

By the way, I am in the  processing of porting GNU awk (gawk) the
right way. I  have ported it to  SysV (suitably conditionalized),
and am working on getting it  past lint -hp before downloading it
to the  Amiga.  Send me mail if you're interested.

One last comment. I'm certainly not  doing this for money. If you
drop a line, take the time to say a few words more than "I'd like
a copy".  Everyone needs  ego strokes --  especially if  there is
nothing else in it for them.

Regards,

	Binayak Banerjee
	{ihnp4,allegra}!mtgzz!bba
	bba@mtgzz.ATT.COM

PS. There seem  to be versions of egrep already  extant. In order
to  clear up  any confusion  as  to the  nature of  what I'll  be
posting.. I  am appending a  note regarding greps in  general and
Boyer Moore in particular.
================================================================
Some ramblings on grep.

This note  is intended to clear  up some smoke about  the various
versions of grep...  including the ames!jaw version  which I have
ported to the Amiga.

In the beginning  there was grep. Here the  regular expression is
traslated to  an NFA (Non Deterministic  Finite Automaton), which
is  interpreted  on  the  fly.  This  is  the  way  that  regular
expressions are usually implemented in text editors.

Then there came egrep. Here  the regular expression is parsed and
translated into  a DFA (Deterministic Finite  Automaton) which is
executed. The trouble is that the startup time is greater (due to
the translation  required). Also, some regular  expressions cause
the dfa to grow exponentially.

Fgrep (An oxymoron,  since it stands for "Fixed  Grep") turns its
strings into  a 'Trie'. This  is another  form of a  finite state
machine -- but it can't  handle closures, Kleene star, etc. Fgrep
is the slowest of the three.  It's strength is that it can handle
multiple strings.

Boyer &  Moore (who are  also noted for  a Theorem Prover  in AI)
came up  with another scheme  for fixed string  matches. Consider
matching the string AMIGA against some input...

........AMOEBA...
	AMIGA
	54321

Now, the  numbers below denote the  amount that would have  to be
skipped  if  a mismatch  occurred  at  that position.  Since  the
mismatch occurs at the I, we can skip 3 positions before retrying
the  match.  In practice,  we  would  build  a 'skip  table'  (My
terminology) that looked like this:

A:	1	(Why?)
M:	4
I:	3
G:	2

We would then perform a  character by character comparison of the
pattern and the  input. When a mismatch occurs --  we look up the
character mismatched in  the skip table and advance  the input by
that many  characters. Some  things intuitively obvious  are that
this  is  sublinear  (since  we  don't  have  to  look  at  every
character), and  that (in  general) the  longer the  pattern, the
better the performance.

Via  some  simple  extensions/devices, we  can  perform  anchored
matches (to  the beginning/end  of a  line) and  case independant
matching.  We can  also  superimpose several  words  in the  skip
table. The performance of this superimposed BM is worse than with
an ordinary  BM, but is  still sub-linear. In fact  it approaches
linearity asymptotically.

The James Wood Egrep:

This is  composed of a Boyer  Moore string matcher plus  an egrep
compatible NFA, courtesy of Henry Spencer (Remember that the real
egrep is a  DFA). The input pattern(s) are analyzed.  There are 3
possible cases:

a.  No regular  expressions. Here  the pattern  match is  done by
Boyer Moore only.

b. Regular  expressions + fixed strings  (eg. Boyer.*Moore). Here
the input lines  are first run through the BM  matcher. The lines
that match are then matched via regex (The NFA).

c. Regular expressions only. These are  handed off via an exec to
the real egrep.

What I have done is to modify the sources so that all matching is
done via the NFA, rather than by execing a system egrep. This has
some  pretty  severe  implications   as  far  as  performance  is
concerned for patterns such as: '^[^a-z ]', etc.

One last note concerns fgrep. Recall that fgrep shines when there
are  many  words to  search  for.  This  version finesses  it  by
performing a  superimposed BM  when the number  of words  is less
than 7,  forking off a  Unix fgrep  to handle other  cases. Since
there is no Unix fgrep on the Amiga, this version will fail under
those conditions.