bba@mtgzz.UUCP (XMRJ40000[pjc]-b.banerjee) (03/22/88)
Hi,
Enough! I have received ~ 30 requests for James Wood's Egrep. I
am cleaning it up and sending it to the moderators of
Comp.{sources,binaries}.amiga. Some comments on the letters I
received.
a. I received 1 letter from an identifiable ST user. This doesn't
justify a posting to the atari groups. Atari people who want this
will have to keep an eye on the amiga source groups.
b. One person asked me if they could send me a diskette instead.
YES -- provided you send a self addressed STAMPED mailer along
with it. My post office is about 5 miles away, and I'll keep the
diskette unless I can just copy it and drop it in the mailbox. Do
this only if you cannot receive them through the amiga moderated
newsgroups. I have a 1 drive Amiga 500, and I'm not set up for
massive disk copying. Nor do I have the facilities for making
disks for other machines.
c. Quite a few people suggested that I send it to Fred Fish. I'm
not going to do that. The reason is that I am did not write this
-- and I am not particularly proud of the port. I just
transferred it over, and banged on it till it worked. I would
prefer that one of the Amiga wizards made it suitably amiga-ized
before we allow it out into the non usenet community. It should
also be back ported to unix with all the amiga modifications
conditionally compiled out -- and have regression tests run on
it.
By the way, I am in the processing of porting GNU awk (gawk) the
right way. I have ported it to SysV (suitably conditionalized),
and am working on getting it past lint -hp before downloading it
to the Amiga. Send me mail if you're interested.
One last comment. I'm certainly not doing this for money. If you
drop a line, take the time to say a few words more than "I'd like
a copy". Everyone needs ego strokes -- especially if there is
nothing else in it for them.
Regards,
Binayak Banerjee
{ihnp4,allegra}!mtgzz!bba
bba@mtgzz.ATT.COM
PS. There seem to be versions of egrep already extant. In order
to clear up any confusion as to the nature of what I'll be
posting.. I am appending a note regarding greps in general and
Boyer Moore in particular.
================================================================
Some ramblings on grep.
This note is intended to clear up some smoke about the various
versions of grep... including the ames!jaw version which I have
ported to the Amiga.
In the beginning there was grep. Here the regular expression is
traslated to an NFA (Non Deterministic Finite Automaton), which
is interpreted on the fly. This is the way that regular
expressions are usually implemented in text editors.
Then there came egrep. Here the regular expression is parsed and
translated into a DFA (Deterministic Finite Automaton) which is
executed. The trouble is that the startup time is greater (due to
the translation required). Also, some regular expressions cause
the dfa to grow exponentially.
Fgrep (An oxymoron, since it stands for "Fixed Grep") turns its
strings into a 'Trie'. This is another form of a finite state
machine -- but it can't handle closures, Kleene star, etc. Fgrep
is the slowest of the three. It's strength is that it can handle
multiple strings.
Boyer & Moore (who are also noted for a Theorem Prover in AI)
came up with another scheme for fixed string matches. Consider
matching the string AMIGA against some input...
........AMOEBA...
AMIGA
54321
Now, the numbers below denote the amount that would have to be
skipped if a mismatch occurred at that position. Since the
mismatch occurs at the I, we can skip 3 positions before retrying
the match. In practice, we would build a 'skip table' (My
terminology) that looked like this:
A: 1 (Why?)
M: 4
I: 3
G: 2
We would then perform a character by character comparison of the
pattern and the input. When a mismatch occurs -- we look up the
character mismatched in the skip table and advance the input by
that many characters. Some things intuitively obvious are that
this is sublinear (since we don't have to look at every
character), and that (in general) the longer the pattern, the
better the performance.
Via some simple extensions/devices, we can perform anchored
matches (to the beginning/end of a line) and case independant
matching. We can also superimpose several words in the skip
table. The performance of this superimposed BM is worse than with
an ordinary BM, but is still sub-linear. In fact it approaches
linearity asymptotically.
The James Wood Egrep:
This is composed of a Boyer Moore string matcher plus an egrep
compatible NFA, courtesy of Henry Spencer (Remember that the real
egrep is a DFA). The input pattern(s) are analyzed. There are 3
possible cases:
a. No regular expressions. Here the pattern match is done by
Boyer Moore only.
b. Regular expressions + fixed strings (eg. Boyer.*Moore). Here
the input lines are first run through the BM matcher. The lines
that match are then matched via regex (The NFA).
c. Regular expressions only. These are handed off via an exec to
the real egrep.
What I have done is to modify the sources so that all matching is
done via the NFA, rather than by execing a system egrep. This has
some pretty severe implications as far as performance is
concerned for patterns such as: '^[^a-z ]', etc.
One last note concerns fgrep. Recall that fgrep shines when there
are many words to search for. This version finesses it by
performing a superimposed BM when the number of words is less
than 7, forking off a Unix fgrep to handle other cases. Since
there is no Unix fgrep on the Amiga, this version will fail under
those conditions.