[alt.sources] [comp.lang.c] Re: strstr sources: a summary of responses

chris@mimsy.umd.edu (Chris Torek) (08/28/90)
Archive-name: torek-boyer-moore/27-Aug-90
Original-posting-by: chris@mimsy.umd.edu (Chris Torek)
Original-subject: Re: strstr sources: a summary of responses
Reposted-by: emv@math.lsa.umich.edu (Edward Vielmetti)

[Reposted from comp.lang.c.
Comments on this service to emv@math.lsa.umich.edu (Edward Vielmetti).]

In article <1158@umvlsi.ecs.umass.edu> srivasta@umvlsi.ecs.umass.edu
(Manoj Srivastava) posts a Boyer-Moore based strstr() by John Lacey.

>And finally, an implementation of the Boyer-Moore algorithm. I
>am not sure offhand, but while the worst case complexity remains
>O(N+M), the avg case behaviour is O(N/M) ???

(where N is the length of the string being searched and M is the length
of the substring, aka pattern).  Yes, it is N/M.

Next, the code:

>/* strstr, Copyright (c) 1990 by John Lacey.  -*- C -*-
 [some stuff deleted]
>int ch, d[SCHAR_MAX + 1];
 [more stuff deleted]
>      i += d[text[i]];
 [still more deleted]

If `char' is signed, and a string containing a negative character value
is passed to strstr(), thid code can attempt to fetch, e.g., d[-40].  I
found the code overy difficult to read, so I rewrote it.  I also wrote
a long comment describing the algorithm.  This has been only lightly
tested, but is simple enough that it should work.  Note that this
version handles strings with embedded NUL characters as well.

(A real B-M search routine should take the deltas table as an argument,
and have a separate routine to set it up, so that the same pattern can
be applied to multiple texts.  E.g., a `Boyer-Moore grep' would read a
line at a time and apply this search to each line.  See comment.)

#include <limits.h>
#include <stddef.h>
typedef void *PTR;		/* `char *' for old compilers */

/*
 * Boyer-Moore search: input is `text' (a string) and its length,
 * and a `pattern' (another string) and its length.
 *
 * The linear setup cost of this function is approximately 256 + patlen.
 * Afterwards, however, the average cost is O(textlen/patlen), and the
 * worst case is O(textlen+patlen).
 *
 * The Boyer-Moore algorithm works by observing that, for each position
 * in the text, if the character there does *not* occur somewhere in the
 * search pattern, no comparisons including that character will match.
 * That is, given the text "hello world..." and the pattern "goodbye", the
 * `w' in `world' means that none of `hello w', `ello wo', `llo wor',
 * `lo worl', `o world', ` world.', and `world..' can match.  In fact,
 * exactly patlen strings are certain not to match.  We can discover this
 * simply by looking at the patlen'th character.  Furthermore, even if
 * the text character does occur, it may be that it rules out some number
 * of other matches.  Again, we can discover this by doing the match
 * `backwards'.
 *
 * We set up a table of deltas for each possible character, with
 * delta[character] being patlen for characters not in the pattern,
 * less for characters in the pattern, growing progressively smaller
 * as we near the end of the pattern.  Matching then works as follows:
 *
 *	 0         1         2         3
 *	 01234567890123456789012345678901234567
 *	"Here is the string being searched into"	(text)
 *	 ------						(pos = [0..5])
 *	"string"					(pat)
 *	654321-						(deltas)
 *
 * (the delta for `-' will be derived below).
 *
 * Positions 0..5 end with `i', which is not the `g' we want.  `i' does
 * appear in `string', but two characters before the end.  We skip
 * forward so as to make the `i's match up:
 *
 *	"Here is the string being searched into"	(text)
 *	  "string"					(pos = [2..7])
 *
 * Next we find that ` ' and `g' do not match.  Since ` ' does not appear
 * in the pattern at all, we can skip forward 6:
 *
 *	"Here is the string being searched into"	(text)
 *		"string"				(pos = [8..13])
 *
 * Comparing `t' vs `g', we again find no match, and so we obtain the
 * delta for `t', which is 4.  We skip to position 17:
 *
 *	"Here is the string being searched into"	(text)
 *		    "string"				(pos = [12..17])
 *
 * It thus takes only four steps to move the search point forward to the
 * match, in this case.
 *
 * If the pattern has a recurring character, we must set the delta for
 * that character to the distance of the one closest to the end:
 *
 * 	"befuddle the cat"	(text)
 *	"fuddle"		(pos = [0..5])
 *	654321-			(delta)
 *
 * We want the next search to line the `d's up like this:
 *
 * 	"befuddle the cat"	(text)
 *	  "fuddle"		(pos = [2..7])
 *
 * and not like this:
 *
 * 	"befuddle the cat"	(text)
 *	   "fuddle"		(pos = [3..8])
 *
 * so we take the smaller delta for d, i.e., 2.
 *
 * The last task is computing the delta we have noted above as `-':
 *
 *	"candlesticks"		(text)
 *	"hand"			(pos = [0..3])
 *	4321-			(delta)
 *
 * Here the `d' in `hand' matches the `d' in `candlesticks', but the
 * strings differ.  Since there are no other `d's in `hand', we know
 * that none of (cand,andl,ndle,dles) can match, and thus we want this
 * delta to be 4 (the length of the pattern).  But if we had, e.g.:
 *
 *	"candlesticks"		(text)
 *	"deed"			(pos = [0..3])
 *	4321-			(delta)
 *
 * then we should advance to line up the other `d':
 *
 *	"candlesticks"		(text)
 *	   "deed"		(pos = [3..6])
 *
 * As this suggestes, the delta should be that for the `d' nearest the
 * end, but not including the end.  This is easily managed by setting up
 * a delta table as follows:
 *
 *	for int:c in [0..255] { delta[c] = patlen; };
 *	for int:x in [0..patlen-1) { delta[pat[x]] = patlen - (x + 1); };
 *
 * delta[pat[patlen-1]] is never written, so the last letter inherits the
 * delta from an earlier iteration or from the previous loop.
 *
 * NB: the nonsense with `deltaspace' below exists merely because gcc
 * does a horrible job of common subexpression elimination (it does not
 * notice that the array is at a constant stack address).
 */
char *
boyer_moore_search(text, textlen, pat, patlen)
	char *text;
	size_t textlen;
	char *pat;
	size_t patlen;
{
	register unsigned char *p, *t;
	register size_t i, p1, j, *delta;
	size_t deltaspace[UCHAR_MAX + 1];

	/* algorithm fails if pattern is empty */
	if ((p1 = patlen) == 0)
		return (text);

	/* code below fails (whenever i is unsigned) if pattern too long */
	if (p1 > textlen)
		return (NULL);

	/* set up deltas */
	delta = deltaspace;
	for (i = 0; i <= UCHAR_MAX; i++)
		delta[i] = p1;
	for (p = (unsigned char *)pat, i = p1; --i > 0;)
		delta[*p++] = i;

	/*
	 * From now on, we want patlen - 1.
	 * In the loop below, p points to the end of the pattern,
	 * t points to the end of the text to be tested against the
	 * pattern, and i counts the amount of text remaining, not
	 * including the part to be tested.
	 */
	p1--;
	p = (unsigned char *)pat + p1;
	t = (unsigned char *)text + p1;
	i = textlen - patlen;
	for (;;) {
		if (*p == *t && memcmp((PTR)(p - p1), (PTR)(t - p1), p1) == 0)
			return ((char *)t - p1);
		j = delta[*t];
		if (i < j)
			break;
		i -= j;
		t += j;
	}
	return (NULL);
}
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 405 2750)
Domain:	chris@cs.umd.edu	Path:	uunet!mimsy!chris