ejk@uxh.cso.uiuc.edu (Ed Kubaitis) (04/07/90)
My application involves finding the beginning of articles matching articles listed in the table of contents in machine-readable issues of a publication intended for human readers, not programs. (The problem is that these documents are apparently hand entered or edited so titles in the table-of-contents do not always exactly match the titles in the text: misspellings, abreviations in one place but not the other, different punctuation, etc.) Here's a fuzzy match that has (so far) served well. The algorithm is to calculate the ratio of two letter substrings in string A occuring anywhere in string B to the total number of two letter substrings in the shorter of the two strings. A ratio of 0.8 or greater empirically seems to do what I want. You may want to change the way of dealing with differences in lengths of the two strings, or try 3 character substrings, etc. It may not be the best, but it's small, easy to understand, and fairly fast. Ed Kubaitis Computing Services Office University of Illinois ------------------------------------------------------------------------------ sub fmatch { local($A, $B) = @_; local($l) = (length($A) < length($B)) ? length($A) : length($B); local($m) = 0; local($w) = 2; local($k); $A eq $B && return(1.0); $l > $w || return(0.0); for $k(0..$l-$w) { local($s) = substr($A, $k, $w); #---escape r.e. characters in string A--- $s =~ s/([()[\]*|?.{}\\])/\\$1/; $B =~ $s && $m++; } ($m/($l-$w) > 0.80) ? 1 : 0; }