[comp.lang.c] fuzzy strcmp

tchrist@convexe.uucp (Tom Christiansen) (12/20/89)

I'm looking for an algorithm that would allow me to determine
whether two strings were similar.  Thus 

	"abcde" !~ "xyzzy"
but
	"this old man can read" =~ "that old man can't read"

I'd like to be able to specify a fuzziness threshhold, so I 
envision a function something like like this:

    boolean strfzcmp(string1,string2,correlation)
		     char *string1, *string2;
		     float correlation;
or perhaps just

    float   strfzcmp(string1,string2)

and then I'd check the return value for the desired threshhold.

Anyone have any ideas?  Please mail me any suggestions that you may
post, as I'm about to go on Christmas vacation, and we expire 
news too quickly here for it to still be here when I get back.

thanks,

--tom

    Tom Christiansen                       {uunet,uiucdcs,sun}!convex!tchrist 
    Convex Computer Corporation                            tchrist@convex.COM
		 "EMACS belongs in <sys/errno.h>: Editor too big!"

istvan@hhb.UUCP (Istvan Mohos) (12/22/89)

tchrist@convexe.uucp (Tom Christiansen @ Convex Computer) writes:
>I'm looking for an algorithm that would allow me to determine
>whether two strings were similar.  Thus 
>
>	"abcde" !~ "xyzzy"
>	"this old man can read" =~ "that old man can't read"
>
>... perhaps just
>    float   strfzcmp(string1,string2)

I must confess, my first reaction was: thank God, Tom 's finally found
a problem he can't solve in Perl.  :-)

You may want to try running the *diff* algorithm along the individual
characters of the two strings (rather than applying it to successive
lines of two files); the ratio of the number of failed chars to the
byte count of the two strings is a dandy float in the range 0.---1.
Thus,
    strfzcmp("abcde","xyzzy") --> 1.
    strfzcmp("this old man can read","that old man can't read") --> .136363..

-- 
        Istvan Mohos
        ...uunet!pyrdc!pyrnj!hhb!istvan
        HHB Systems 1000 Wyckoff Ave. Mahwah NJ 07430 201-848-8000
====================================================================

mitchell@cbmvax.commodore.com (Fred Mitchell - PA) (01/09/90)

In article <297@hhb.UUCP> istvan@hhb.UUCP (Istvan Mohos) writes:
>tchrist@convexe.uucp (Tom Christiansen @ Convex Computer) writes:
>>I'm looking for an algorithm that would allow me to determine
>>whether two strings were similar.  Thus 
>>
>>	"abcde" !~ "xyzzy"
>>	"this old man can read" =~ "that old man can't read"
>>
>>... perhaps just
>>    float   strfzcmp(string1,string2)

I will give a general description of an algorithm that can accomplish
what you ask.

NOTE: I am doing this from the 'top of my head'. Some refinement may
be in order.

You will do a weighted comparision based on two factors:
a) The number of characters each string has in common
b) The number of matches the strings have in sequence

Let's take two arbitrary strings:
	"This old man can't read"    string alpha
	"That silly man can't read"  string beta

Notice that they are of different lengths. Also, there is an alignment
shift. The following algorithm _should_ properly handle this:

a) Count the number of occurences of each character in each string.
   Compare the count of each character in alpha to that in beta in the
   following way:
     For each different character from alpha & beta:
	Normalize the counts so that the greater is normalized to 1.
	Multiply the two normalized values toghether.
	Add this product to a running total.
     Normalize this total by the the length of the longest string
     Multiply that by weight w1 yeilding (we shall call it) yw1

b) Compare the two strings byte for byte as follows:
     Start at the beginning of alpha & beta (let's use i and j as indexes)
     initialize k to 0
     Until we hit the end of either string:
	if (alpha[i] == beta[j])
		++k, ++i, ++j
	else 
		we scan forward on beta to find a byte that matchess
		our current location on alpha. If so, adjust j to index
		it. If nothing was found, do vice-versa for alpha.
     When the above loop is completed,
     Normalize k against the size of the longest string
     and multiply it by weight w2 yeilding (we shall call it) yw2.

     return (yw1 + yw2) / (w1 + w2)


This should produce a good evaluation of how closely matched alpha
is to beta, taking into account mis-alignment. The value returned
will be between 0 and 1 inclusive. Weight w1 should be less than w2.
Perhaps w1 = w2 / 3. Experiment.

As I said, this is off the top of my head, so some refinement is in
order, as should be evident during the implementation/testing phase.

If anyone knows of a better way to accomplish this, let's hear it!

	-Mitchell
	 mitchell@cbmvax.UUCP
	 "The shortening of the Way."