athos@apple.com (Rick Eames) (09/25/89)
I am a relatively new programmer and I am searching for a good algorithim for finding one pattern in a source string. A friend of mine says that hashing it is a good way. Having figured out what hashing is...does any one have a good hash function for finding a string inside of another string? Thanks! Rick Eames
sam@neoucom.UUCP (Scott A. Mason) (09/26/89)
In article <4345@internal.Apple.COM> athos@apple.com (Rick Eames) writes: >I am a relatively new programmer and I am searching for a good algorithim >for finding one pattern in a source string... > It may not be a hashing function, but it does find a substring in a given string. Kind of a quick and dirty type utility one writes when needed. ----------cut here----------- /* searchfor returns 1 if substring is a substring of string. It returns 0 otherwise. The theory behind searchfor is to search for the first character of the substring in the source string, and then compare from there. This helps speed of the process reducing the calls to strncmp. */ searchfor (substring, string) char substring[], string[]; { int sublen, len, position; char *index, firstchar; len = strlen (string); sublen = strlen (substring); /* only process substrings with the right first character */ firstchar = substring[0]; position = 0; while (position < len) { if (string[position] != firstchar) { position++; continue; } index = &(string[position]); if (! (strncmp (index, substring, sublen)) ) { return (1); /* the string was found */ } position++; } return (0); /* the string was not found */ } /* searchfor */ -- -------------------------------------------------------------------------------- "If it ain't broke, don't fix it," and certainly don't blame me. UUCP: {pitt,scooter,hal,cwjcc,aablue}!neoucom!sam INTERNET: sam@neoucom.UUCP Scott A. Mason, Coordinator of Systems Operations, NEOUCOM
henry@chinet.chi.il.us (Henry C. Schmitt) (09/28/89)
In article <1728@neoucom.UUCP> sam@neoucom.UUCP (Scott A. Mason) writes: >In article <4345@internal.Apple.COM> athos@apple.com (Rick Eames) writes: >>I am a relatively new programmer and I am searching for a good >>algorithim for finding one pattern in a source string... >> >It may not be a hashing function, but it does find a substring in a given >string. Kind of a quick and dirty type utility one writes when needed. >----------cut here----------- >/* searchfor returns 1 if substring is a substring of string. It >returns 0 otherwise. The theory behind searchfor is to search for >the first character of the substring in the source string, and then >compare from there. This helps speed of the process reducing the >calls to strncmp. */ > [algorithm deleted!] I don't mean to be rude but "Ack! Thpppt!" Creeping along looking for the first character is just about the worst way to find a string! (It does, however have the advantage of being simple!) Since I am currently in a graduate course in algorithms and we are currently doing (you guessed it) string searching, I recommend the Boyer-Moore algorithm. It has the interesting property of working faster the longer the substring is! Here it is (directly from my notes and untested) in pseudo-code: BEGIN B-M Tstart := 1 For i := 1 to max-char Skip[i] = len(sub) For i := 1 to len(sub) Skip[sub[i]] := i OUT-LOOP: DO s := len(sub) t := Tstart + len(sub) - 1 IN-LOOP: While sub[s] = str[t] and s > 0 s := s - 1 t := t - 1 END IN-LOOP IF s = 0 then return Tstart slide := Skip[str[t]] - len(sub) + s If slide < 0 then slide := 1 Tstart := Tstart + slide While Tstart < len(str) - len(sub) END OUT-LOOP Return 0 /* Not found */ END B-M I highly recommend the book _Algorithms_ by Robert Sedgewick for all the gory details of Boyer-Moore. In fact this book has over 40 chapters of exellent algorithms for doing just about anything. Any good Comp. Sci. bookstore will have it (I've seen it in B.Dalton's Software Etc.) Hope this helps! -- H3nry C. Schmitt | CompuServe: 72275,1456 (Rarely) | GEnie: H.Schmitt (Occasionally) Royal Inn of Yoruba | UUCP: Henry@chinet.chi.il.us (Best Bet)
dowdy@apple.com (Tom Dowdy) (09/28/89)
In article <9690@chinet.chi.il.us> henry@chinet.chi.il.us (Henry C. Schmitt) writes: > Since I am currently in a graduate course in algorithms and we are > currently doing (you guessed it) string searching, I recommend the > Boyer-Moore algorithm. It has the interesting property of working > faster the longer the substring is! > [ Interesting Runga-Kutta Hashing (it's a *joke*) algorithm deleted ] Comment on algorithm: Sure seems to call the len() function quite a bit, which is a rather expensive operation on C strings, esp when searching long blocks. (This analysis may not be correct) > [H3nry then recommends a good book for us to read ] I'd like to add in here that when adapting algorithms such as this from CS books, Knuth or what have you, that one should use the Script Manager where applicable. Also, be wary of "clever" algorithms that take advantage of assuming ASCII, or those that by their cleverness introduce overhead when combined with the Script Manager. Let us examine this algorithm in this light: For example, string search and other similar operations should always use ParseTable() or CharByte() when comparing and dividing characters. It's not a good idea to assume one byte per char. In fact, one can't really assume that a random run of bytes being in the middle of a block of bytes even is aligned on "character" boundries without calling CharByte() on the first one to make sure. I believe a final sanity check after you "think" you've found the substring you want with a call to CharByte() is the only addition needed for this particular algorithm. Of course, one would need to implement it in a way such that if the sanity check failed, one would continue the search. I'll add that since CharByte() can end up doing some scanning (depends on the particular Script System which is active) of it's own, you might wish to just use ParseTable() or CharByte()from the start and convert to a more "boring" algorithm. Left as an exercise for the reader. (Gosh, I've always wanted to say that!) I'd recommend that people pick up the latest Script Manager chapter, which I believe was distributed with the Tech Notes some months back and is probably available from APDA. (Mark? Is this true?) More interesting and useful routines. Also, for all of you people who've got it in your head to make the next great compiler or scripting language or what have you, be sure to take a look at the IntlTokenize() routine, which does much of the hard work for you, in a Script Manager compatible way. A fun routines to call, everyone should run out and call it today. p.s. No, I'm not in the International Group here. Tom Dowdy Internet: dowdy@apple.COM Apple Computer MS:27AJ UUCP: {sun,voder,amdahl,decwrl}!apple!dowdy 20525 Mariani Ave AppleLink: DOWDY1 Cupertino, CA 95014 "The 'Ooh-Ah' Bird is so called because it lays square eggs."
pfluegerm@valley.UUCP (Mike Pflueger) (09/29/89)
In article <1728@neoucom.UUCP>, sam@neoucom.UUCP (Scott A. Mason) writes: > ----------cut here----------- > /* searchfor returns 1 if substring is a substring of string. It returns 0 > otherwise. The theory behind searchfor is to search for the first character > of the substring in the source string, and then compare from there. This > helps speed of the process reducing the calls to strncmp. */ > > searchfor (substring, string) > char substring[], string[]; > { > int sublen, len, position; > char *index, firstchar; > > len = strlen (string); > sublen = strlen (substring); > /* only process substrings with the right first character */ > firstchar = substring[0]; > position = 0; > while (position < len) { > if (string[position] != firstchar) { > position++; > continue; > } > index = &(string[position]); > if (! (strncmp (index, substring, sublen)) ) { > return (1); /* the string was found */ > } > position++; > } > return (0); /* the string was not found */ > } /* searchfor */ A simple variation on this is often highly useful - if the substr is found, return POSITION so the calling routine can get at the substr easily. You can still check for 0 (or NOT zero) to find out if the substr is contained. -- Mike Pflueger @ AG Communication Systems (formerly GTE Comm. Sys.), Phoenix, AZ UUCP: {...!ames!ncar!noao!asuvax | uunet!zardoz!hrc | att}!gtephx!pfluegerm Work: 602-582-7049 FAX: 602-581-4850 Packet: WD8KPZ @ W1FJI
lins@Apple.COM (Chuck Lins) (09/29/89)
In article <4434@internal.Apple.COM> dowdy@apple.com (Tom Dowdy) writes: >In article <9690@chinet.chi.il.us> henry@chinet.chi.il.us (Henry C. >Also, for all of you people who've got it in your head to make the next >great compiler or scripting language or what have you, be sure to take a >look at the IntlTokenize() routine, which does much of the hard work for >you, in a Script Manager compatible way. A fun routines to call, everyone >should run out and call it today. You mean, that I can set up a tokenizing scheme where all the keywords and operators are translated into 'tokens', read the whole source file into a big buffer, pass the buffer to the tokenizer, and get back a stream of tokens/ identifiers (for the stuff that wasn't tokenized)? If this is so, I may very well have to write a routine tonight to do just that! Now all I have to do is understand the Script Manager documentation ;-) (I can see those flames coming now!) -- Chuck Lins | "Exit left to funway." Apple Computer, Inc. | Internet: lins@apple.com 20525 Mariani Avenue | AppleLink: LINS Mail Stop 41-K | Cupertino, CA 95014 | "Self-proclaimed Object Oberon Evangelist" I speak for myself and no one else.
sam@neoucom.UUCP (Scott A. Mason) (09/29/89)
In article <9690@chinet.chi.il.us> henry@chinet.chi.il.us (Henry C. Schmitt) writes: >I don't mean to be rude but "Ack! Thpppt!" Creeping along looking >for the first character is just about the worst way to find a >string! (It does, however have the advantage of being simple!) Well, Excuuuuuuse Me! I only offered A solution to the query, stating that it was not a very pretty routine, but it worked and was easy to understand. (It also took a very short time to write) Maybe next time I'll think twice about posting something that works, and post a fancy algorithm that I haven't tested (or coded in a real language). -- -------------------------------------------------------------------------------- "If it ain't broke, don't fix it," and certainly don't blame me. UUCP: {pitt,scooter,hal,cwjcc,aablue}!neoucom!sam INTERNET: sam@neoucom.UUCP Scott A. Mason, Coordinator of Systems Operations, NEOUCOM
jimc@iscuva.ISCS.COM (Jim Cathey) (10/03/89)
In article <1728@neoucom.UUCP> sam@neoucom.UUCP (Scott A. Mason) writes: >In article <4345@internal.Apple.COM> athos@apple.com (Rick Eames) writes: >>I am a relatively new programmer and I am searching for a good algorithim >>for finding one pattern in a source string... >It may not be a hashing function, but it does find a substring in a given >string. Kind of a quick and dirty type utility one writes when needed... For a simple byte searcher that ignores international issues consider the Munger Toolbox call. +----------------+ ! II CCCCCC ! Jim Cathey ! II SSSSCC ! ISC-Bunker Ramo ! II CC ! TAF-C8; Spokane, WA 99220 ! IISSSS CC ! UUCP: uunet!iscuva!jimc (jimc@iscuva.iscs.com) ! II CCCCCC ! (509) 927-5757 +----------------+ "With excitement like this, who is needing enemas?"
mjm@eleazar.dartmouth.edu (Michael McClennen) (10/04/89)
In article <4345@internal.Apple.COM> athos@apple.com (Rick Eames) writes: >I am a relatively new programmer and I am searching for a good algorithim >for finding one pattern in a source string... > How about using the toolbox trap Munger? It's documented in Inside Macintosh I-468, and does string find and replace. Michael McClennen (mjm@dartmouth.edu) Dartmouth College Academic Computing
athos@apple.com (Rick Eames) (10/04/89)
In article <15915@dartvax.Dartmouth.EDU> mjm@eleazar.dartmouth.edu (Michael McClennen) writes: > How about using the toolbox trap Munger? It's documented in Inside Macintosh > I-468, and does string find and replace. I was looking for a solution that would encompass other machines as well.... Rick Eames