readdm@walt.cc.utexas.edu (David M. Read) (10/13/89)
I have need of the SOUNDEX routine which represents words as sounds in order to do comparisons and look-ups based on how the words sound. Does any one know where I can get a reference on the algorithm, or does any one know how it works? Unless you think other people will be interested, just send your replies to me via E-Mail, thanks. Thank you in advance for your help. ---------------------------------------------------------------------------- David M. Read best -=> readdm@walt.cc.utexas.edu all-else-fails -=> read@physics.utexas.edu "...[he's] stupid and he's ignorant but he's got guts...and guts is enough!" ----------------------------------------------------------------------------
walter@hpclwjm.HP.COM (Walter Murray) (10/14/89)
> I have need of the SOUNDEX routine.
See Knuth, The Art of Computer Programming, Volume 3, pp. 391-392,
for the "classic" Soundex algorithm.
Walter Murray
----
ing@hades.OZ (Ian Gold) (12/12/89)
I am looking for a 'soundex' routine in C (or C++). That is a routine capable of finding a substring of a target string that sounds like a given string. The call would look something like this. char *soundex(char *target, char *given); P.S. If the routine you have is NOT in C that's fine. I can always convert it.
wew@naucse.UUCP (Bill Wilson) (12/13/89)
From article <488@hades.OZ>, by ing@hades.OZ (Ian Gold): > > I am looking for a 'soundex' routine in C (or C++). That is a routine > capable of finding a substring of a target string that sounds like a given > string. The call would look something like this. > > char *soundex(char *target, char *given); > I would be interested in the code as well... -- Let sleeping dragons lie........ | The Bit Chaser ---------------------------------------------------------------- Bill Wilson (Bitnet: ucc2wew@nauvm | wilson@nauvax) Northern AZ Univ Flagstaff, AZ 86011
wgh@ubbpc.UUCP (William G. Hutchison) (12/13/89)
In article <488@hades.OZ>, ing@hades.OZ (Ian Gold) writes: > > I am looking for a 'soundex' routine in C (or C++). /* soundex.c - Generate a soundex code from a character string */ /* Odell, Margaret K. and Russell, Robert C. */ /* U. S. Patents 1261167 (1918) and 1435663 (1922) */ /* Implemented in C by William Hutchison, Data Processing Consultant */ /* Unisys Corporation */ /* Copyright 1985 William G. Hutchison, Jr. */ /* see also: */ /* Knuth, Donald E. "The Art of Computer Programming", vol 3. */ /* Searching and Sorting, p. 391-92, Addison-Wesley, Reading, MA, 1973 */ /* to compile: cc -o soundex -DMAIN soundex.c */ #include <stdio.h> #include <string.h> #include <assert.h> char *progname; char* soundex(s, code) char *s; /* input character string */ char *code; /* output soundex code string */ { #ifndef lint static char Vers[] = "@(#)soundex.c 1.2 Compiled: 14:25:54 12/12/89 Delta Date: 14:25:49 12/12/89"; #endif char* p = code; int chars_needed = 3; static char code_table[] = "B\1F\1P\1V\1C\2G\2J\2K\2Q\2S\2X\2Z\2D\3T\3L\4M\5N\5R\6"; char* x; #define encode(c) (x=strchr(code_table,(c)),(x == NULL)?'\0':*++x + '0') char last_code = encode(*s); assert(s != NULL); assert(code != NULL); assert(strlen(code) >= 4); *p++ = toupper(*s++); /* copy first character as is */ while (*s && chars_needed) { char c = toupper(*s++); char new_code; if ((new_code = encode(c)) && new_code != last_code) { last_code = *p++ = new_code; chars_needed--; } else last_code = ' '; } while (chars_needed-- > 0) *p++ = '0'; *p = '\0'; return code; } /* end soundex() */ #ifdef MAIN main(argc, argv) int argc; char *argv[]; { char string[BUFSIZ]; static char code[] = "K123"; progname = argv[0]; while (gets(string)) { puts(soundex(string, code)); } return 0; } /* end main */ /* Sample input test stream: Euler Ellery Gauss Ghosh Hilbert Heilbronn Hutcherson Hutchison Hutchinson Knuth Kant Lloyd Ladd Lukasiewicz Lissajous Sample output from the above input: E460 E460 G200 G200 H416 H416 H326 H322 H325 K530 K530 L300 L300 L222 L222 */ #endif /* EOF soundex.c */ -- Bill Hutchison, DP Consultant rutgers!cbmvax!burdvax!ubbpc!wgh Unisys UNIX Portation Center "Unless you are very rich and very eccentric, P.O. Box 500, M.S. B121 you will not enjoy the luxury of a computer Blue Bell, PA 19424 in your own home", Edward Yourdon, 1975.
rick_jones@f616.n713.fido.oz (Rick Jones) (12/14/89)
Original to: ing@hades.oz G'day. Rather than give you the code, here's the algorithm (it's a lot simpler to do than most people think): A soundex code is a four character representation based on the way a name sounds rather than the way it is spelled. Theoretically, using this system, you should be able to index a name so that it can be found no matter how it was spelled. The system was developed by Margaret K. Odell and Robert C. Russell (see U.S. Patents 1261167 [1918] and 1435663 [1922]). Every soundex code consists of a letter and three numbers, such as B525. The letter is always the first letter of the surname. The numbers are assigned this way: 1 = b,p,f,v 2 = c,s,k,g,j,q,x,z 3 = d,t 4 = l 5 = m,n 6 = r disregard - a,e,i,o,u,w,y,h To figure out a surname's code, do this: JOHNSON - Eliminate any a,e,i,o,u,w,y,h JNSN - Write the first letter, as is, followed by the codes found in the table above JNSN = J525 No matter how long or short the surname is, the soundex code is always the first letter of the name followed by three numbers. If you have coded the first letter and three numbers but still have more letters in the name, ignore them. If you have run out of letters in the name before you have three numbers, then add zeroes to the code: WASHINGTON = WSNGTN = W252 (ignore the ending TN) KUHNE = KN = K500 (add zeroes to the end) YE = Y = Y000 (add zeroes to the end) Any double letters side by side should be treated as one letter. For example LLOYD is coded as if it were spelled LOYD. GUTIERREZ is coded as if it were GUTIEREZ. You may have different letters side by side that have the same code value. For example PFISTER (P & F are both 1), JACKSON (CKS are all 2). These letters should be treated as one letter. PFISTER is coded as PSTR (P236) and JACKSON is coded as JCN (J250). Thus, variations in spellings or mispellings should produce the same code number. This material based on "Beginning Your Genealogical Research in the National Archives," courtesy ROOTS-BBS, CA, Brian Mavrogeorge, sysop. If you have any trouble coding the above (hardly likely, I'd imagine), let me know and I'll write you a piece of code compatible with SVID. Unfortunately, my routine uses lower-level routines proprietary to my library and it would be useless to give it to you without several other support routines. Hope this is of some help. Rick Jones. --- * Origin: /\/\onitor \/\/orld (~~Sydney Australia~~) (Opus 3:713/616)
john@riddle.UUCP (Jonathan Leffler) (12/16/89)
In article <1842@naucse.UUCP> wew@naucse.UUCP (Bill Wilson) writes: >From article <488@hades.OZ>, by ing@hades.OZ (Ian Gold): >> I am looking for a 'soundex' routine in C (or C++). >> >> char *soundex(char *target, char *given); >> >I would be interested in the code as well... Will this do? : "@(#)shar2.c 1.5" #!/bin/sh # shar: Shell Archiver (v1.22) # # This is a shell archive. # Remove everything above this line and run sh on the resulting file # If this archive is complete, you will see this message at the end # "All files extracted" # # Created: Fri Dec 15 21:33:37 1989 by john at Sphinx Ltd. # Files archived in this archive: # soundex.c # if test -f soundex.c; then echo "File soundex.c exists"; else echo "x - soundex.c" sed 's/^X//' << 'SHAR_EOF' > soundex.c && X/* X** SOUNDEX CODING X** X** Rules: X** 1. Retain the first letter; ignore non-alphabetic characters. X** 2. Replace second and subsequent characters by a group code. X** Group Letters X** 1 BFPV X** 2 CGJKSXZ X** 3 DT X** 4 L X** 5 MN X** 6 R X** 3. Do not repeat digits X** 4. Truncate or ser-pad to 4-character result. X** X** Originally formatted with tabstops set at 4 spaces -- you were warned! X** X** Code by: Jonathan Leffler (john@sphinx.co.uk) X** This code is shareware -- I wrote it; you can have it for free X** if you supply it to anyone else who wants it for free. X** X** BUGS: Assumes ASCII X*/ X X#include <ctype.h> Xstatic char lookup[] = { X '0', /* A */ X '1', /* B */ X '2', /* C */ X '3', /* D */ X '0', /* E */ X '1', /* F */ X '2', /* G */ X '0', /* H */ X '0', /* I */ X '2', /* J */ X '2', /* K */ X '4', /* L */ X '5', /* M */ X '5', /* N */ X '0', /* O */ X '1', /* P */ X '0', /* Q */ X '6', /* R */ X '2', /* S */ X '3', /* T */ X '0', /* U */ X '1', /* V */ X '0', /* W */ X '2', /* X */ X '0', /* Y */ X '2', /* Z */ X}; X X/* X** Soundex for arbitrary number of characters of information X*/ Xchar *nsoundex(str, n) Xchar *str; /* In: String to be converted */ Xint n; /* In: Number of characters in result string */ X{ X static char buff[10]; X register char *s; X register char *t; X char c; X char l; X X if (n <= 0) X n = 4; /* Default */ X if (n > sizeof(buff) - 1) X n = sizeof(buff) - 1; X t = &buff[0]; X X for (s = str; ((c = *s) != '\0') && t < &buff[n]; s++) X { X if (!isascii(c)) X continue; X if (!isalpha(c)) X continue; X c = toupper(c); X if (t == &buff[0]) X { X l = *t++ = c; X continue; X } X c = lookup[c-'A']; X if (c != '0' && c != l) X l = *t++ = c; X } X while (t < &buff[n]) X *t++ = '0'; X *t = '\0'; X return(&buff[0]); X} X X/* Normal external interface */ Xchar *soundex(str) Xchar *str; X{ X return(nsoundex(str, 4)); X} X X/* X** Alternative interface: X** void soundex(given, gets) X** char *given; X** char *gets; X** { X** strcpy(gets, nsoundex(given, 4)); X** } X*/ X X X#ifdef TEST X#include <stdio.h> Xmain() X{ X char buff[30]; X X while (fgets(buff, sizeof(buff), stdin) != (char *)0) X printf("Given: %s Soundex produces %s\n", buff, soundex(buff)); X} X#endif SHAR_EOF chmod 0640 soundex.c || echo "$0: failed to restore soundex.c" fi echo All files extracted exit 0