tchrist@convexe.uucp (Tom Christiansen) (12/20/89)
I'm looking for an algorithm that would allow me to determine whether two strings were similar. Thus "abcde" !~ "xyzzy" but "this old man can read" =~ "that old man can't read" I'd like to be able to specify a fuzziness threshhold, so I envision a function something like like this: boolean strfzcmp(string1,string2,correlation) char *string1, *string2; float correlation; or perhaps just float strfzcmp(string1,string2) and then I'd check the return value for the desired threshhold. Anyone have any ideas? Please mail me any suggestions that you may post, as I'm about to go on Christmas vacation, and we expire news too quickly here for it to still be here when I get back. thanks, --tom Tom Christiansen {uunet,uiucdcs,sun}!convex!tchrist Convex Computer Corporation tchrist@convex.COM "EMACS belongs in <sys/errno.h>: Editor too big!"
istvan@hhb.UUCP (Istvan Mohos) (12/22/89)
tchrist@convexe.uucp (Tom Christiansen @ Convex Computer) writes: >I'm looking for an algorithm that would allow me to determine >whether two strings were similar. Thus > > "abcde" !~ "xyzzy" > "this old man can read" =~ "that old man can't read" > >... perhaps just > float strfzcmp(string1,string2) I must confess, my first reaction was: thank God, Tom 's finally found a problem he can't solve in Perl. :-) You may want to try running the *diff* algorithm along the individual characters of the two strings (rather than applying it to successive lines of two files); the ratio of the number of failed chars to the byte count of the two strings is a dandy float in the range 0.---1. Thus, strfzcmp("abcde","xyzzy") --> 1. strfzcmp("this old man can read","that old man can't read") --> .136363.. -- Istvan Mohos ...uunet!pyrdc!pyrnj!hhb!istvan HHB Systems 1000 Wyckoff Ave. Mahwah NJ 07430 201-848-8000 ====================================================================
jeff@cdp.UUCP (12/25/89)
Here's a fuzzy string matching program based on an algorithm originally published in Dr. Dobbs. The code has been tested in an MS-DOS environment, but there shouldn't be any problems running it on Unix. Jeff Dean uunet!pyramid!cdp!jeff ---------------------------------------------------------------------- /* Ratcliff/Obershelp pattern matching * * Approximate word matching: takes two words and returns a * similarity score based on co-occurrence of subpatterns. * * This code appeared in a letter to the editor in DDJ, 11/88. * The original article on the pattern matching, "Pattern Matching * by Gestalt" by John Ratcliff, appeared in the July 1988 * issue (#181) but the algorithm was presented in assembly. * * The 11/88 issue also contained another verison in C which was * a more faithful translation of the original; it has the * advantage of not being recursive. * * The algorithm seems to work nicely for a variety of test cases. * The main drawback of the algorithm is the cost of the pairwise * comparisons. It is significantly more expensive than stemming, * soundex, and the like. Might consider using this as a second * phase... */ int simil(s1, s2) char *s1, *s2; { short l1, l2; l1 = strlen(s1); l2 = strlen(s2); /* exact match end-case */ if( l1 == 1 && l2 == 1 && *s1 == *s2 ) return(100); return(200 * GCsubstr(s1, s1 + l1, s2, s2 + l2) / (l1 + l2)); } int GCsubstr(st1, end1, st2, end2) char *st1, *end1, *st2, *end2; { register char *a1, *a2; char *b1, *s1, *b2, *s2; short max, i; if( end1 <= st1 || end2 <= st2 ) return(0); if( end1 == st1 + 1 && end2 == st2 + 1 ) return(0); max = 0; b1 = end1; b2 = end2; for( a1 = st1; a1 < b1; a1++ ) { for( a2 = st2; a2 < b2; a2++ ) { if( *a1 == *a2 ) { /* determine length of common substring */ for( i = 1; a1[i] && (a1[i] == a2[i]); i++ ) ; if( i > max ) { max = i; s1 = a1; s2 = a2; b1 = end1 - max; b2 = end2 - max; } } } } if( !max ) return(0); max += GCsubstr(s1 + max, end1, s2 + max, end2); /* rhs */ max += GCsubstr(st1, s1, st2, s2); /* lhs */ return(max); } /* test program */ #include <stdio.h> char *strtok(); main() { char *first, *second; char buf[128]; for(;;) { printf("Words: "); gets(buf); if( buf[0] == '\0' ) break; first = strtok(buf, " "); second = strtok(NULL, " "); printf("Score for %s : %s = %d\n", first, second, simil(first, second)); } }
mitchell@cbmvax.commodore.com (Fred Mitchell - PA) (01/09/90)
In article <297@hhb.UUCP> istvan@hhb.UUCP (Istvan Mohos) writes: >tchrist@convexe.uucp (Tom Christiansen @ Convex Computer) writes: >>I'm looking for an algorithm that would allow me to determine >>whether two strings were similar. Thus >> >> "abcde" !~ "xyzzy" >> "this old man can read" =~ "that old man can't read" >> >>... perhaps just >> float strfzcmp(string1,string2) I will give a general description of an algorithm that can accomplish what you ask. NOTE: I am doing this from the 'top of my head'. Some refinement may be in order. You will do a weighted comparision based on two factors: a) The number of characters each string has in common b) The number of matches the strings have in sequence Let's take two arbitrary strings: "This old man can't read" string alpha "That silly man can't read" string beta Notice that they are of different lengths. Also, there is an alignment shift. The following algorithm _should_ properly handle this: a) Count the number of occurences of each character in each string. Compare the count of each character in alpha to that in beta in the following way: For each different character from alpha & beta: Normalize the counts so that the greater is normalized to 1. Multiply the two normalized values toghether. Add this product to a running total. Normalize this total by the the length of the longest string Multiply that by weight w1 yeilding (we shall call it) yw1 b) Compare the two strings byte for byte as follows: Start at the beginning of alpha & beta (let's use i and j as indexes) initialize k to 0 Until we hit the end of either string: if (alpha[i] == beta[j]) ++k, ++i, ++j else we scan forward on beta to find a byte that matchess our current location on alpha. If so, adjust j to index it. If nothing was found, do vice-versa for alpha. When the above loop is completed, Normalize k against the size of the longest string and multiply it by weight w2 yeilding (we shall call it) yw2. return (yw1 + yw2) / (w1 + w2) This should produce a good evaluation of how closely matched alpha is to beta, taking into account mis-alignment. The value returned will be between 0 and 1 inclusive. Weight w1 should be less than w2. Perhaps w1 = w2 / 3. Experiment. As I said, this is off the top of my head, so some refinement is in order, as should be evident during the implementation/testing phase. If anyone knows of a better way to accomplish this, let's hear it! -Mitchell mitchell@cbmvax.UUCP "The shortening of the Way."