[comp.sys.mac.programmer] Hashing....

athos@apple.com (Rick Eames) (09/25/89)

I am a relatively new programmer and I am searching for a good algorithim 
for finding one pattern in a source string.  A friend of mine says that 
hashing it is a good way.  Having figured out what hashing is...does any 
one have a good hash function for finding a string inside of another 
string?  

Thanks!

Rick Eames

sam@neoucom.UUCP (Scott A. Mason) (09/26/89)

In article <4345@internal.Apple.COM> athos@apple.com (Rick Eames) writes:
>I am a relatively new programmer and I am searching for a good algorithim 
>for finding one pattern in a source string...
>
It may not be a hashing function, but it does find a substring in a given
string.  Kind of a quick and dirty type utility one writes when needed.
----------cut here-----------
/* searchfor returns 1 if substring is a substring of string.  It returns 0
otherwise. The theory behind searchfor is to search for the first character
of the substring in the source string, and then compare from there.  This
helps speed of the process reducing the calls to strncmp. */

searchfor (substring, string)
char substring[], string[];
{
  int sublen, len, position;
  char *index, firstchar;

  len = strlen (string);
  sublen = strlen (substring);
         /* only process substrings with the right first character */
  firstchar = substring[0];
  position = 0;
  while (position < len) {
    if (string[position] != firstchar) { 
      position++;
      continue;
    }
    index = &(string[position]);
    if (! (strncmp (index, substring, sublen)) ) {
	return (1);	/* the string was found */
    }
    position++;
  }
  return (0);	/* the string was not found */
} /* searchfor */
-- 
--------------------------------------------------------------------------------
"If it ain't broke, don't fix it," and certainly don't blame me.
UUCP:  {pitt,scooter,hal,cwjcc,aablue}!neoucom!sam   INTERNET:  sam@neoucom.UUCP
Scott A. Mason, Coordinator of Systems Operations, NEOUCOM

henry@chinet.chi.il.us (Henry C. Schmitt) (09/28/89)

In article <1728@neoucom.UUCP> sam@neoucom.UUCP (Scott A. Mason) writes:
>In article <4345@internal.Apple.COM> athos@apple.com (Rick Eames) writes:
>>I am a relatively new programmer and I am searching for a good
>>algorithim for finding one pattern in a source string...
>>
>It may not be a hashing function, but it does find a substring in a given
>string.  Kind of a quick and dirty type utility one writes when needed.
>----------cut here-----------
>/* searchfor returns 1 if substring is a substring of string.  It
>returns 0 otherwise. The theory behind searchfor is to search for
>the first character of the substring in the source string, and then
>compare from there.  This helps speed of the process reducing the
>calls to strncmp. */ 
> [algorithm deleted!]
I don't mean to be rude but "Ack! Thpppt!"  Creeping along looking
for the first character is just about the worst way to find a
string!  (It does, however have the advantage of being simple!)

Since I am currently in a graduate course in algorithms and we are
currently doing (you guessed it) string searching, I recommend the
Boyer-Moore algorithm.  It has the interesting property of working
faster the longer the substring is!

Here it is (directly from my notes and untested) in pseudo-code:
BEGIN B-M
	Tstart := 1
	For i := 1 to max-char
		Skip[i] = len(sub)
	For i := 1 to len(sub)
		Skip[sub[i]] := i
	OUT-LOOP: DO
		s := len(sub)
		t := Tstart + len(sub) - 1
		IN-LOOP: While sub[s] = str[t] and s > 0
			s := s - 1
			t := t - 1
		END IN-LOOP
		IF s = 0 then return Tstart
		slide := Skip[str[t]] - len(sub) + s
		If slide < 0 then slide := 1
		Tstart := Tstart + slide
	While Tstart < len(str) - len(sub)
	END OUT-LOOP
	Return 0 /* Not found */
END B-M

I highly recommend the book _Algorithms_ by Robert Sedgewick for all
the gory details of Boyer-Moore.  In fact this book has over 40
chapters of exellent algorithms for doing just about anything.  Any
good Comp. Sci. bookstore will have it (I've seen it in B.Dalton's
Software Etc.)

Hope this helps!

-- 
  H3nry C. Schmitt     | CompuServe: 72275,1456  (Rarely)
                       | GEnie: H.Schmitt  (Occasionally)
 Royal Inn of Yoruba   | UUCP: Henry@chinet.chi.il.us  (Best Bet)

dowdy@apple.com (Tom Dowdy) (09/28/89)

In article <9690@chinet.chi.il.us> henry@chinet.chi.il.us (Henry C. 
Schmitt) writes:
> Since I am currently in a graduate course in algorithms and we are
> currently doing (you guessed it) string searching, I recommend the
> Boyer-Moore algorithm.  It has the interesting property of working
> faster the longer the substring is!

> [ Interesting Runga-Kutta Hashing (it's a *joke*) algorithm deleted ]

Comment on algorithm:  Sure seems to call the len() function quite a bit, 
which is a rather expensive operation on C strings, esp when searching 
long blocks.  (This analysis may not be correct)

> [H3nry then recommends a good book for us to read ]

I'd like to add in here that when adapting algorithms such as this from CS 
books, Knuth or what have you, that one should use the Script Manager 
where applicable.   Also, be wary of "clever" algorithms that take 
advantage of assuming ASCII, or those that by their cleverness introduce 
overhead when combined with the Script Manager.  Let us examine this 
algorithm in this light:

For example, string search and other similar operations should always use 
ParseTable() or CharByte() when comparing and dividing characters.  It's 
not a good idea to assume one byte per char.  In fact, one can't really 
assume that a random run of bytes being in the middle of a block of bytes 
even is aligned on "character" boundries without calling CharByte() on the 
first one to make sure.

I believe a final sanity check after you "think" you've found the 
substring you want with a call to CharByte() is the only addition needed 
for this particular algorithm.  Of course, one would need to implement it 
in a way such that if the sanity check failed, one would continue the 
search.  I'll add that since CharByte() can end up doing some scanning 
(depends on the particular Script System which is active)  of it's own, 
you might wish to just use ParseTable() or CharByte()from the start and 
convert to a more "boring" algorithm.  Left as an exercise for the reader. 
(Gosh, I've always wanted to say that!)

I'd recommend that people pick up the latest Script Manager chapter, which 
I believe was distributed with the Tech Notes some months back and is 
probably available from APDA.  (Mark?  Is this true?)  More interesting 
and useful routines.  

Also, for all of you people who've got it in your head to make the next 
great compiler or scripting language or what have you, be sure to take a 
look at the IntlTokenize() routine, which does much of the hard work for 
you, in a Script Manager compatible way.  A fun routines to call, everyone 
should run out and call it today.

p.s.  No, I'm not in the International Group here.

 Tom Dowdy                 Internet:  dowdy@apple.COM
 Apple Computer MS:27AJ    UUCP:      {sun,voder,amdahl,decwrl}!apple!dowdy
 20525 Mariani Ave         AppleLink: DOWDY1
 Cupertino, CA 95014       
 "The 'Ooh-Ah' Bird is so called because it lays square eggs."

pfluegerm@valley.UUCP (Mike Pflueger) (09/29/89)

In article <1728@neoucom.UUCP>, sam@neoucom.UUCP (Scott A. Mason) writes:
 > ----------cut here-----------
 > /* searchfor returns 1 if substring is a substring of string.  It returns 0
 > otherwise. The theory behind searchfor is to search for the first character
 > of the substring in the source string, and then compare from there.  This
 > helps speed of the process reducing the calls to strncmp. */
 > 
 > searchfor (substring, string)
 > char substring[], string[];
 > {
 >   int sublen, len, position;
 >   char *index, firstchar;
 > 
 >   len = strlen (string);
 >   sublen = strlen (substring);
 >          /* only process substrings with the right first character */
 >   firstchar = substring[0];
 >   position = 0;
 >   while (position < len) {
 >     if (string[position] != firstchar) { 
 >       position++;
 >       continue;
 >     }
 >     index = &(string[position]);
 >     if (! (strncmp (index, substring, sublen)) ) {
 > 	return (1);	/* the string was found */
 >     }
 >     position++;
 >   }
 >   return (0);	/* the string was not found */
 > } /* searchfor */

A simple variation on this is often highly useful - if the substr is found,
return POSITION so the calling routine can get at the substr easily.

You can still check for 0 (or NOT zero) to find out if the substr is contained.

-- 
Mike Pflueger @ AG Communication Systems (formerly GTE Comm. Sys.), Phoenix, AZ
  UUCP: {...!ames!ncar!noao!asuvax | uunet!zardoz!hrc | att}!gtephx!pfluegerm
  Work: 602-582-7049        FAX: 602-581-4850
Packet: WD8KPZ @ W1FJI

lins@Apple.COM (Chuck Lins) (09/29/89)

In article <4434@internal.Apple.COM> dowdy@apple.com (Tom Dowdy) writes:
>In article <9690@chinet.chi.il.us> henry@chinet.chi.il.us (Henry C. 
>Also, for all of you people who've got it in your head to make the next 
>great compiler or scripting language or what have you, be sure to take a 
>look at the IntlTokenize() routine, which does much of the hard work for 
>you, in a Script Manager compatible way.  A fun routines to call, everyone 
>should run out and call it today.

You mean, that I can set up a tokenizing scheme where all the keywords and
operators are translated into 'tokens', read the whole source file into a
big buffer, pass the buffer to the tokenizer, and get back a stream of tokens/
identifiers (for the stuff that wasn't tokenized)? If this is so, I may very
well have to write a routine tonight to do just that!

Now all I have to do is understand the Script Manager documentation ;-)
(I can see those flames coming now!)


-- 
Chuck Lins               | "Exit left to funway."
Apple Computer, Inc.     | Internet: lins@apple.com
20525 Mariani Avenue     | AppleLink: LINS
Mail Stop 41-K           | 
Cupertino, CA 95014      | "Self-proclaimed Object Oberon Evangelist"
I speak for myself and no one else.

sam@neoucom.UUCP (Scott A. Mason) (09/29/89)

In article <9690@chinet.chi.il.us> henry@chinet.chi.il.us (Henry C. Schmitt) writes:
>I don't mean to be rude but "Ack! Thpppt!"  Creeping along looking
>for the first character is just about the worst way to find a
>string!  (It does, however have the advantage of being simple!)

Well, Excuuuuuuse Me! I only offered A solution to the query, stating that
it was not a very pretty routine, but it worked and was easy to understand.
(It also took a very short time to write)
Maybe next time I'll think twice about posting something that works, and
post a fancy algorithm that I haven't tested (or coded in a real
language).
-- 
--------------------------------------------------------------------------------
"If it ain't broke, don't fix it," and certainly don't blame me.
UUCP:  {pitt,scooter,hal,cwjcc,aablue}!neoucom!sam   INTERNET:  sam@neoucom.UUCP
Scott A. Mason, Coordinator of Systems Operations, NEOUCOM

jimc@iscuva.ISCS.COM (Jim Cathey) (10/03/89)

In article <1728@neoucom.UUCP> sam@neoucom.UUCP (Scott A. Mason) writes:
>In article <4345@internal.Apple.COM> athos@apple.com (Rick Eames) writes:
>>I am a relatively new programmer and I am searching for a good algorithim 
>>for finding one pattern in a source string...

>It may not be a hashing function, but it does find a substring in a given
>string.  Kind of a quick and dirty type utility one writes when needed...

For a simple byte searcher that ignores international issues consider the
Munger Toolbox call.  

+----------------+
! II      CCCCCC !  Jim Cathey
! II  SSSSCC     !  ISC-Bunker Ramo
! II      CC     !  TAF-C8;  Spokane, WA  99220
! IISSSS  CC     !  UUCP: uunet!iscuva!jimc (jimc@iscuva.iscs.com)
! II      CCCCCC !  (509) 927-5757
+----------------+
			"With excitement like this, who is needing enemas?"

mjm@eleazar.dartmouth.edu (Michael McClennen) (10/04/89)

In article <4345@internal.Apple.COM> athos@apple.com (Rick Eames) writes:
>I am a relatively new programmer and I am searching for a good algorithim
>for finding one pattern in a source string...
>

How about using the toolbox trap Munger?  It's documented in Inside Macintosh
I-468, and does string find and replace.

Michael McClennen (mjm@dartmouth.edu)
Dartmouth College Academic Computing

athos@apple.com (Rick Eames) (10/04/89)

In article <15915@dartvax.Dartmouth.EDU> mjm@eleazar.dartmouth.edu 
(Michael McClennen) writes:
> How about using the toolbox trap Munger?  It's documented in Inside 
Macintosh
> I-468, and does string find and replace.

I was looking for a solution that would encompass other machines as 
well....

Rick Eames