chris@mimsy.UUCP (Chris Torek) (07/12/88)
[I have deleted groups comp.theory and comp.ai since Soundex has little to do with these] In article <12520@sunybcs.UUCP> stewart@sunybcs.uucp (Norman R. Stewart) writes: >2: Apply the following rules to produce a code of one letter and > three numbers. > A: The first letter of the word becomes the initial character > in the code. > B: When two or more letters from the same group occur together > only the first is coded. > C: If two letters from the same group are seperated by an H or > a W, code only the first. > D: Group 7 letters are never coded (this does not include the > first letter in the word, which is always coded). [I thought Soundex codes were usually fixed at four symbols.] What if more than two letters from the same group are separated by H or W? For instance: FDHTWTHTWL. Is this encoded as F334 or as F34? The table has L=4, R=6; I find this surprising, as both R and L are semivowels and they are easily confused by those who did not grow up with the distinction (e.g., some Orientals). -- In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163) Domain: chris@mimsy.umd.edu Path: uunet!mimsy!chris
fredex@cg-atla.UUCP (Fred Smith) (01/25/90)
Please pardon the cross-posting--a while back someone was asking about algorithms such as soundex, for finding the "likeness" of two strings (but I don't recall which group that was in, hence the cross-posting.). Below is a program, typed in from published listings in Dr. Dobbs Journal November 1987 issue which implements a different algorithm for comparing strings. See the magazine article for explanation. Hope that whoever the person was who made the inquiry finds this and finds it useful! Fred Smith fredex@cg-atla.agfa.com uunet!samsung!cg-atla!fredex ----------------------------------------------------------------------- /* * bickel.c * * from Dr. Dobbs Journal, Nov 87 * * A C implementation of Bickel's name comparison * algorithm, CACM 30/3 (March 1987), p. 244. * This is generic C code and should work on any * compiler with no modification. * Jim Howell, March 1987 * * This code is placed in the Public domain. You are * free to use it in any way you see fit. */ #include <ctype.h> #include <stdio.h> unsigned char MaskArray [52] = {0, 0x40, /* a */ 3, 0x80, /* b */ 2, 0x20, /* c */ 1, 0x10, /* d */ 0, 0x20, /* e */ 2, 0x10, /* f */ 2, 0x08, /* g */ 1, 0x08, /* h */ 0, 0x10, /* i */ 3, 0x08, /* j */ 3, 0x20, /* k */ 1, 0x04, /* l */ 2, 0x04, /* m */ 0, 0x08, /* n */ 0, 0x04, /* o */ 2, 0x02, /* p */ 3, 0x10, /* q */ 1, 0x02, /* r */ 0, 0x02, /* s */ 0, 0x01, /* t */ 1, 0x01, /* u */ 3, 0x40, /* v */ 2, 0x01, /* w */ 3, 0x04, /* x */ 3, 0x02, /* y */ 3, 0x01}; /* z */ /* * Demonstrate the algorithm with some short examples. */ main () { unsigned char LetterSet0 [4]; unsigned char LetterSet1 [4]; MakeLetterSet ("ecdysiast", LetterSet0); MakeLetterSet ("ecstasy", LetterSet1); printf ("The likeness of 'ecdysiast' and 'ecstasy' is %d.\n", CompareSets (LetterSet0, LetterSet1)); MakeLetterSet ("ectoplasm", LetterSet1); printf ("The likeness of 'ecdysiast' and 'ectoplasm' is %d.\n", CompareSets (LetterSet0, LetterSet1)); MakeLetterSet ("edcysiast", LetterSet1); printf ("The likeness of 'ecdysiast' and 'edcysiast' is %d.\n", CompareSets (LetterSet0, LetterSet1)); MakeLetterSet ("ecdysiast", LetterSet1); printf ("The likeness of 'ecdysiast' and 'ecdysiast' is %d.\n", CompareSets (LetterSet0, LetterSet1)); } /* * Make the letter set by going through the string * one character at a time. */ MakeLetterSet (Name, Mask) char *Name; unsigned char *Mask; { char *pC; unsigned char *pM; int I; int Lk; pC = Name; for (I = 0 ; I < 4 ; ++I) { Mask[I] = 0; } while (*pC) { /* * Use letters only, convert upper to lower case */ if (isalpha (*pC)) { pM = &MaskArray [2 * (tolower (*pC) - 'a')]; Mask [*pM] |= *(pM + 1); } ++pC; } } /* * This particular version constructs a mask by using * a logical AND of the relevant bytes of the two sets. * The rightmost bit is checked to see if the likeness * score needs to be increased by the appropriate * weight, and the mask is shifted right one bit. An * alternative would be to use a bit mask and to * shift it instead of the name mask. The two bytes * of the least common letters are checked for content * to eliminate any unnecessary calculations. */ CompareSets (Set1, Set2) unsigned char *Set1; unsigned char *Set2; { unsigned char Mask; int I; int Lk; Lk = 0; /* * For the first byte */ Mask = Set1[0] & Set2[0]; for (I = 0 ; I < 7 ; ++I) { if (Mask & 0x01) { Lk += 3; } Mask >>= 1; } /* * For the second byte */ Mask = Set1[1] & Set2[1]; for (I = 0 ; I < 5 ; ++I) { if (Mask & 0x01) { Lk += 4; } Mask >>= 1; } /* * For the third byte */ Mask = Set1[2] & Set2[2]; for (I = 0 ; I < 6 ; ++I) { if (Mask & 0x01) { Lk += 5; } Mask >>= 1; } /* * The last byte is more complicated */ Mask = Set1[3] & Set2[3]; if (!Mask) return (Lk); if (Mask & 0x01) Lk += 9; /* z */ Mask >>= 1; if (Mask & 0x01) Lk += 8; /* y */ Mask >>= 1; if (Mask & 0x01) Lk += 8; /* x */ Mask >>= 1; if (Mask & 0x01) Lk += 8; /* j */ Mask >>= 1; if (Mask & 0x01) Lk += 7; /* q */ Mask >>= 1; if (Mask & 0x01) Lk += 7; /* k */ Mask >>= 1; if (Mask & 0x01) Lk += 6; /* v */ Mask >>= 1; if (Mask & 0x01) Lk += 6; /* b */ Mask >>= 1; return (Lk); }