[comp.lang.c] strstr sources: a summary of responses

srivasta@umvlsi.ecs.umass.edu (Manoj Srivastava) (08/26/90)

	I want to thank all the people who replied to my request for a
 srtstr function, I really appreciated the response I got.

	Most of the answers were  some  variation of the linear (brute
 force) sub-string match algorithm O(NM) in the  worst case [where N =
 strlen (text) and M = strlen (substring)]. Here is a gem:

--------cut here--------
#include <string.h>

char *strstr(register char const *s, register char const *t) {
    do {
	register char const *ss = s;
	register char const *tt = t;
	do {
	    if (*tt == '\0') return ((char *)s);
	} while (*ss++ == *tt++);
    } while (*s++ != '\0');
    return (NULL);
}
--------cut here--------

	There  were other examples,  but  since all  of  them  are   a
 variation  of  the same theme, I   shall  not waste net bandwidth  by
 posting them all.

	In the mean while I had hacked up  the  following based on the
 Knuth-Morris-Pratt method [I guess I should have made it clear that I
 was looking for "safeguards" which I (naively)  assume are built into
 library routines as  opposed  to  code by "mere" programmers ;-) From
 the lib excerpts which were posted I guess lib  programmers are human
 too.;-) ;-) ]. This is O(N+M) in the worst case.

--------cut here--------
#include <string.h>

char *
  strstr (char * cs, char * ct)
{
  register int i;
  register int j;
  int M = strlen (ct);
  int N = strlen (cs);
  int * next = (int *) malloc ((M + 1) * sizeof(int));

  /* initialize the next array */
  next[0] = -1;
  
  for (i = 0, j = -1; i < M; i++, j++, next[i] = (ct[i] == ct[j]) ?
       next[j] : j)
      while ((j >= 0) && (ct[i] != ct[j]))
        j = next[j];
  
  /*now for the pattern match  */
  for (i = 0, j = 0; j < M && i < N - M + j + 1; i++, j++)
    while ((j >= 0) && (cs[i] != ct[j]))
      j = next[j];

  free (next);
  
  if (j == M)
    return & cs [i - M];
  else 
    return NULL;
}
--------cut here--------


	And finally, an implementation of the Boyer-Moore algorithm. I
 am  not  sure offhand, but  while  the worst case complexity  remains
 O(N+M), the avg case behaviour is O(N/M) ???


------8-<---Cut here---8-<-----
/* strstr, Copyright (c) 1990 by John Lacey.  -*- C -*-

   This program is the ANSI C strstr() string routine.  It is
   basically a string search routine.

   This implementation uses the Boyer-Moore algorithm.  

   Some rough timings found this implementation about twice as fast as 
   the following straight string search.  

   char *
   strstr( const char *text, const char * pattern )
   {
   int i = 0, j = 0;
   
   while ( text[i] && pattern[j] )
     {
       i++;
       j = ( text[i] == pattern[j] ) ? j + 1 : 0;
     }
   
   return (char *) ( ( pattern[j] ) ? NULL : text + i - j + 1 );
   }

   So, while the algorithm itself is more complicated, that complexity
   is worth it. 
*/

#include <limits.h>
#include <stdio.h>
#include <string.h>

char *
strstr( const char *text, const char * pattern )
{
int N = strlen( text );
int M = strlen( pattern );
int i = M - 1;
int i0, j, k;
int ch, d[SCHAR_MAX + 1];

for ( ch = 0; ch <= SCHAR_MAX; ch++ )
  d[ch] = M;

for ( j = 0; j < M-1; j++ )
  d[pattern[j]] = M - j - 1;

do
    {
      j = M - 1;
      k = i;
      while ( j >= 0 && pattern[j] == text[k] )
	{
	  k--; j--;
	}
      i0 = i;
      i += d[text[i]];
    }
while ( j >= 0 && i < N );
return (char *)(( j < 0 ) ? text + i0 - M + 1 : NULL);
}

#ifdef TEST
main( int argc, char *argv[] )
{
char *status = NULL;

if ( argc != 3 )
  {
    puts( "Try again, with a text string and a pattern string." );
    exit( 1 );
  }
else
  {
    status = strstr( argv[1], argv[2] );
    if ( status != NULL )
      printf( "Found substring: %s\n", status );
    else
      printf( "Nothing doing, boss.  I can't see a thing.\n" );
  }
}
#endif /* TEST */
------8-<---Cut here---8-<-----

Man weeps to think that he will die so soon; woman, that she was born so long
 ago.  -- H. L. Mencken

chris@mimsy.umd.edu (Chris Torek) (08/27/90)

In article <1158@umvlsi.ecs.umass.edu> srivasta@umvlsi.ecs.umass.edu
(Manoj Srivastava) posts a Boyer-Moore based strstr() by John Lacey.

>And finally, an implementation of the Boyer-Moore algorithm. I
>am not sure offhand, but while the worst case complexity remains
>O(N+M), the avg case behaviour is O(N/M) ???

(where N is the length of the string being searched and M is the length
of the substring, aka pattern).  Yes, it is N/M.

Next, the code:

>/* strstr, Copyright (c) 1990 by John Lacey.  -*- C -*-
 [some stuff deleted]
>int ch, d[SCHAR_MAX + 1];
 [more stuff deleted]
>      i += d[text[i]];
 [still more deleted]

If `char' is signed, and a string containing a negative character value
is passed to strstr(), thid code can attempt to fetch, e.g., d[-40].  I
found the code overy difficult to read, so I rewrote it.  I also wrote
a long comment describing the algorithm.  This has been only lightly
tested, but is simple enough that it should work.  Note that this
version handles strings with embedded NUL characters as well.

(A real B-M search routine should take the deltas table as an argument,
and have a separate routine to set it up, so that the same pattern can
be applied to multiple texts.  E.g., a `Boyer-Moore grep' would read a
line at a time and apply this search to each line.  See comment.)

#include <limits.h>
#include <stddef.h>
typedef void *PTR;		/* `char *' for old compilers */

/*
 * Boyer-Moore search: input is `text' (a string) and its length,
 * and a `pattern' (another string) and its length.
 *
 * The linear setup cost of this function is approximately 256 + patlen.
 * Afterwards, however, the average cost is O(textlen/patlen), and the
 * worst case is O(textlen+patlen).
 *
 * The Boyer-Moore algorithm works by observing that, for each position
 * in the text, if the character there does *not* occur somewhere in the
 * search pattern, no comparisons including that character will match.
 * That is, given the text "hello world..." and the pattern "goodbye", the
 * `w' in `world' means that none of `hello w', `ello wo', `llo wor',
 * `lo worl', `o world', ` world.', and `world..' can match.  In fact,
 * exactly patlen strings are certain not to match.  We can discover this
 * simply by looking at the patlen'th character.  Furthermore, even if
 * the text character does occur, it may be that it rules out some number
 * of other matches.  Again, we can discover this by doing the match
 * `backwards'.
 *
 * We set up a table of deltas for each possible character, with
 * delta[character] being patlen for characters not in the pattern,
 * less for characters in the pattern, growing progressively smaller
 * as we near the end of the pattern.  Matching then works as follows:
 *
 *	 0         1         2         3
 *	 01234567890123456789012345678901234567
 *	"Here is the string being searched into"	(text)
 *	 ------						(pos = [0..5])
 *	"string"					(pat)
 *	654321-						(deltas)
 *
 * (the delta for `-' will be derived below).
 *
 * Positions 0..5 end with `i', which is not the `g' we want.  `i' does
 * appear in `string', but two characters before the end.  We skip
 * forward so as to make the `i's match up:
 *
 *	"Here is the string being searched into"	(text)
 *	  "string"					(pos = [2..7])
 *
 * Next we find that ` ' and `g' do not match.  Since ` ' does not appear
 * in the pattern at all, we can skip forward 6:
 *
 *	"Here is the string being searched into"	(text)
 *		"string"				(pos = [8..13])
 *
 * Comparing `t' vs `g', we again find no match, and so we obtain the
 * delta for `t', which is 4.  We skip to position 17:
 *
 *	"Here is the string being searched into"	(text)
 *		    "string"				(pos = [12..17])
 *
 * It thus takes only four steps to move the search point forward to the
 * match, in this case.
 *
 * If the pattern has a recurring character, we must set the delta for
 * that character to the distance of the one closest to the end:
 *
 * 	"befuddle the cat"	(text)
 *	"fuddle"		(pos = [0..5])
 *	654321-			(delta)
 *
 * We want the next search to line the `d's up like this:
 *
 * 	"befuddle the cat"	(text)
 *	  "fuddle"		(pos = [2..7])
 *
 * and not like this:
 *
 * 	"befuddle the cat"	(text)
 *	   "fuddle"		(pos = [3..8])
 *
 * so we take the smaller delta for d, i.e., 2.
 *
 * The last task is computing the delta we have noted above as `-':
 *
 *	"candlesticks"		(text)
 *	"hand"			(pos = [0..3])
 *	4321-			(delta)
 *
 * Here the `d' in `hand' matches the `d' in `candlesticks', but the
 * strings differ.  Since there are no other `d's in `hand', we know
 * that none of (cand,andl,ndle,dles) can match, and thus we want this
 * delta to be 4 (the length of the pattern).  But if we had, e.g.:
 *
 *	"candlesticks"		(text)
 *	"deed"			(pos = [0..3])
 *	4321-			(delta)
 *
 * then we should advance to line up the other `d':
 *
 *	"candlesticks"		(text)
 *	   "deed"		(pos = [3..6])
 *
 * As this suggestes, the delta should be that for the `d' nearest the
 * end, but not including the end.  This is easily managed by setting up
 * a delta table as follows:
 *
 *	for int:c in [0..255] { delta[c] = patlen; };
 *	for int:x in [0..patlen-1) { delta[pat[x]] = patlen - (x + 1); };
 *
 * delta[pat[patlen-1]] is never written, so the last letter inherits the
 * delta from an earlier iteration or from the previous loop.
 *
 * NB: the nonsense with `deltaspace' below exists merely because gcc
 * does a horrible job of common subexpression elimination (it does not
 * notice that the array is at a constant stack address).
 */
char *
boyer_moore_search(text, textlen, pat, patlen)
	char *text;
	size_t textlen;
	char *pat;
	size_t patlen;
{
	register unsigned char *p, *t;
	register size_t i, p1, j, *delta;
	size_t deltaspace[UCHAR_MAX + 1];

	/* algorithm fails if pattern is empty */
	if ((p1 = patlen) == 0)
		return (text);

	/* code below fails (whenever i is unsigned) if pattern too long */
	if (p1 > textlen)
		return (NULL);

	/* set up deltas */
	delta = deltaspace;
	for (i = 0; i <= UCHAR_MAX; i++)
		delta[i] = p1;
	for (p = (unsigned char *)pat, i = p1; --i > 0;)
		delta[*p++] = i;

	/*
	 * From now on, we want patlen - 1.
	 * In the loop below, p points to the end of the pattern,
	 * t points to the end of the text to be tested against the
	 * pattern, and i counts the amount of text remaining, not
	 * including the part to be tested.
	 */
	p1--;
	p = (unsigned char *)pat + p1;
	t = (unsigned char *)text + p1;
	i = textlen - patlen;
	for (;;) {
		if (*p == *t && memcmp((PTR)(p - p1), (PTR)(t - p1), p1) == 0)
			return ((char *)t - p1);
		j = delta[*t];
		if (i < j)
			break;
		i -= j;
		t += j;
	}
	return (NULL);
}
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 405 2750)
Domain:	chris@cs.umd.edu	Path:	uunet!mimsy!chris

brad@SSD.CSD.HARRIS.COM (Brad Appleton) (08/27/90)

In article <26215@mimsy.umd.edu> chris@mimsy.umd.edu (Chris Torek) writes:
>In article <1158@umvlsi.ecs.umass.edu> srivasta@umvlsi.ecs.umass.edu
>(Manoj Srivastava) posts a Boyer-Moore based strstr() by John Lacey.
>
>>And finally, an implementation of the Boyer-Moore algorithm. I
>>am not sure offhand, but while the worst case complexity remains
>>O(N+M), the avg case behaviour is O(N/M) ???
>
>(where N is the length of the string being searched and M is the length
>of the substring, aka pattern).  Yes, it is N/M.

Interestingly enough ... This August's issue of Communications of The ACM
has an article on page 132 entitled "A Very Fast Substring Search Algorithm".

The abstract reads as follows:

	This article describes a substring search algorithm that is
	faster than the Boyer-Moore algorithm. This algorithm does 
	not depend on scanning the pattern string in any particular
	order. Three variations of the algorithm are given that use
	three different pattern scan orders. These include:
	   (1) a "Quick Search" algorithm;
	   (2) a "Maximal Shift" algorithm; and
	   (3) an "Optimal Mismatch" algorithm;

The article comes complete with an implementation in C!
I strongly encourage people to take a peek at it.

ANYWAY -- IMHO, unless you are going to be looking for the same
          string multiple times... it is not worth the effort
          to get fancy (a la Knuth-Morris-Pratt or Boyer-Moore)!
          I think most strstr() function use the classic
          "Brute-Force" approach.
______________________ "And miles to go before I sleep." ______________________
 Brad Appleton        brad@travis.ssd.csd.harris.com   Harris Computer Systems
                          ...!uunet!hcx1!brad          Fort Lauderdale, FL USA
~~~~~~~~~~~~~~~~~~~~ Disclaimer: I said it, not my company! ~~~~~~~~~~~~~~~~~~~

karl@haddock.ima.isc.com (Karl Heuer) (08/28/90)

In article <1158@umvlsi.ecs.umass.edu> srivasta@umvlsi.ecs.umass.edu writes:
>[From the comments introducing John Lacey's Boyer-Moore code:]
>   Some rough timings found this implementation about twice as fast as 
>   the following straight string search.  ...
>   while ( text[i] && pattern[j] ) {
>       i++;
>       j = ( text[i] == pattern[j] ) ? j + 1 : 0;
>     }

I hope it's also more correct.  As written, this won't even handle
strstr("a", "a"); and even if you put the increment in a more likely place,
there's no provision for doing less than a full reset of the search--so
strstr("ababac", "abac") loses.

Karl W. Z. Heuer (karl@kelp.ima.isc.com or ima!kelp!karl), The Walking Lint

andrew@alice.UUCP (Andrew Hume) (08/28/90)

people interested in using REAL algorithms for strstr should check
out the fast substring algorithm in the august CACM

johnb@srchtec.UUCP (John Baldwin) (08/29/90)

In article <11246@alice.UUCP> andrew@alice.UUCP (Andrew Hume) writes:
>
>people interested in using REAL algorithms for strstr should check
>out the fast substring algorithm in the august CACM

Great article.  But I digress before I even start.  :-)

As a result of the comments being tossed around, I went home last night and
checked the source code for strstr() in the Microsoft C ver 5.10 standard
library.  Not that I'm claiming that "the Microsoft way is the Right Way"
(in fact, I think I'd have to argue *against* it! :) :) :)), but I figured
its production code in a heavily-used production compiler, and they would
want their library to perform well in benchmarks.

They used the brute-force algorithm.

Just to double-check, I looked at several other library functions at random.
All seemed to be very good choices as far as algorithms go, so it seems like
they *didn't* just use whatever was familiar, but instead put a good bit
of thought into what they were doing.  To be sure, strstr() was written in
assembler, and they used a machine instruction to scan the "search-in" string
for the next character of the "search-for" string.

Probably, for the most-general case, letting the hardware do the looking,
and avoiding the setup costs of the arrays necessary for Boyer-Moore or
Knuth-et-al, ends up being faster.

In fact, I can think of another good illustration of this effect: a colleague
was trying to figure out what sorting and storage methods to use for certain
information in his part of a very large program.  We talked about whether
the information should be maintained in sorted order, or sorted just before
use, sorted-on-entry, et cetera ad infinitum.  Then it occurred to me to ask
about the scale of the problem.  No more than a dozen items would ever be
"active" at any one time.  The fastest mechanism turned out to be the most
simple: store things FIFO (unordered) and do a linear search!
-- 
John T. Baldwin                      |  johnb%srchtec.uucp@mathcs.emory.edu
Search Technology, Inc.              | 
                                     | "... I had an infinite loop,
My opinions; not my employers'.      |  but it was only for a little while..."