bba@mtgzz.UUCP (XMRJ40000[pjc]-b.banerjee) (03/22/88)
Hi, Enough! I have received ~ 30 requests for James Wood's Egrep. I am cleaning it up and sending it to the moderators of Comp.{sources,binaries}.amiga. Some comments on the letters I received. a. I received 1 letter from an identifiable ST user. This doesn't justify a posting to the atari groups. Atari people who want this will have to keep an eye on the amiga source groups. b. One person asked me if they could send me a diskette instead. YES -- provided you send a self addressed STAMPED mailer along with it. My post office is about 5 miles away, and I'll keep the diskette unless I can just copy it and drop it in the mailbox. Do this only if you cannot receive them through the amiga moderated newsgroups. I have a 1 drive Amiga 500, and I'm not set up for massive disk copying. Nor do I have the facilities for making disks for other machines. c. Quite a few people suggested that I send it to Fred Fish. I'm not going to do that. The reason is that I am did not write this -- and I am not particularly proud of the port. I just transferred it over, and banged on it till it worked. I would prefer that one of the Amiga wizards made it suitably amiga-ized before we allow it out into the non usenet community. It should also be back ported to unix with all the amiga modifications conditionally compiled out -- and have regression tests run on it. By the way, I am in the processing of porting GNU awk (gawk) the right way. I have ported it to SysV (suitably conditionalized), and am working on getting it past lint -hp before downloading it to the Amiga. Send me mail if you're interested. One last comment. I'm certainly not doing this for money. If you drop a line, take the time to say a few words more than "I'd like a copy". Everyone needs ego strokes -- especially if there is nothing else in it for them. Regards, Binayak Banerjee {ihnp4,allegra}!mtgzz!bba bba@mtgzz.ATT.COM PS. There seem to be versions of egrep already extant. In order to clear up any confusion as to the nature of what I'll be posting.. I am appending a note regarding greps in general and Boyer Moore in particular. ================================================================ Some ramblings on grep. This note is intended to clear up some smoke about the various versions of grep... including the ames!jaw version which I have ported to the Amiga. In the beginning there was grep. Here the regular expression is traslated to an NFA (Non Deterministic Finite Automaton), which is interpreted on the fly. This is the way that regular expressions are usually implemented in text editors. Then there came egrep. Here the regular expression is parsed and translated into a DFA (Deterministic Finite Automaton) which is executed. The trouble is that the startup time is greater (due to the translation required). Also, some regular expressions cause the dfa to grow exponentially. Fgrep (An oxymoron, since it stands for "Fixed Grep") turns its strings into a 'Trie'. This is another form of a finite state machine -- but it can't handle closures, Kleene star, etc. Fgrep is the slowest of the three. It's strength is that it can handle multiple strings. Boyer & Moore (who are also noted for a Theorem Prover in AI) came up with another scheme for fixed string matches. Consider matching the string AMIGA against some input... ........AMOEBA... AMIGA 54321 Now, the numbers below denote the amount that would have to be skipped if a mismatch occurred at that position. Since the mismatch occurs at the I, we can skip 3 positions before retrying the match. In practice, we would build a 'skip table' (My terminology) that looked like this: A: 1 (Why?) M: 4 I: 3 G: 2 We would then perform a character by character comparison of the pattern and the input. When a mismatch occurs -- we look up the character mismatched in the skip table and advance the input by that many characters. Some things intuitively obvious are that this is sublinear (since we don't have to look at every character), and that (in general) the longer the pattern, the better the performance. Via some simple extensions/devices, we can perform anchored matches (to the beginning/end of a line) and case independant matching. We can also superimpose several words in the skip table. The performance of this superimposed BM is worse than with an ordinary BM, but is still sub-linear. In fact it approaches linearity asymptotically. The James Wood Egrep: This is composed of a Boyer Moore string matcher plus an egrep compatible NFA, courtesy of Henry Spencer (Remember that the real egrep is a DFA). The input pattern(s) are analyzed. There are 3 possible cases: a. No regular expressions. Here the pattern match is done by Boyer Moore only. b. Regular expressions + fixed strings (eg. Boyer.*Moore). Here the input lines are first run through the BM matcher. The lines that match are then matched via regex (The NFA). c. Regular expressions only. These are handed off via an exec to the real egrep. What I have done is to modify the sources so that all matching is done via the NFA, rather than by execing a system egrep. This has some pretty severe implications as far as performance is concerned for patterns such as: '^[^a-z ]', etc. One last note concerns fgrep. Recall that fgrep shines when there are many words to search for. This version finesses it by performing a superimposed BM when the number of words is less than 7, forking off a Unix fgrep to handle other cases. Since there is no Unix fgrep on the Amiga, this version will fail under those conditions.