[comp.lang.c] Soundex

readdm@walt.cc.utexas.edu (David M. Read) (10/13/89)

I have need of the SOUNDEX routine which represents words as sounds in order to 
do comparisons and look-ups based on how the words sound.  Does any one know 
where I can get a reference on the algorithm, or does any one know how it works?

Unless you think other people will be interested, just send your replies to me
via E-Mail, thanks.      

Thank you in advance for your help.


----------------------------------------------------------------------------
David M. Read                        best -=>  readdm@walt.cc.utexas.edu
                           all-else-fails -=>  read@physics.utexas.edu

"...[he's] stupid and he's ignorant but he's got guts...and guts is enough!"
----------------------------------------------------------------------------

walter@hpclwjm.HP.COM (Walter Murray) (10/14/89)

> I have need of the SOUNDEX routine.

See Knuth, The Art of Computer Programming, Volume 3, pp. 391-392,
for the "classic" Soundex algorithm.  

Walter Murray
----

ing@hades.OZ (Ian Gold) (12/12/89)

I am looking for a 'soundex' routine in C (or C++).  That is a routine 
capable of finding a substring of a target string that sounds like a given 
string.  The call would look something like this.

		char *soundex(char *target, char *given);


P.S. If the routine you have is NOT in C that's fine.  I can always convert it.

wew@naucse.UUCP (Bill Wilson) (12/13/89)

From article <488@hades.OZ>, by ing@hades.OZ (Ian Gold):
> 
> I am looking for a 'soundex' routine in C (or C++).  That is a routine 
> capable of finding a substring of a target string that sounds like a given 
> string.  The call would look something like this.
> 
> 		char *soundex(char *target, char *given);
>
I would be interested in the code as well...
 
-- 
Let sleeping dragons lie........               | The Bit Chaser
----------------------------------------------------------------
Bill Wilson             (Bitnet: ucc2wew@nauvm | wilson@nauvax)
Northern AZ Univ  Flagstaff, AZ 86011

wgh@ubbpc.UUCP (William G. Hutchison) (12/13/89)

In article <488@hades.OZ>, ing@hades.OZ (Ian Gold) writes:
> 
> I am looking for a 'soundex' routine in C (or C++).  

/* soundex.c - Generate a soundex code from a character string		*/

/*	Odell, Margaret K. and Russell, Robert C.			*/
/*	U. S. Patents 1261167 (1918) and 1435663 (1922)			*/

/* Implemented in C by William Hutchison, Data Processing Consultant	*/
/* 					  Unisys Corporation		*/
/* Copyright 1985 William G. Hutchison, Jr.				*/

/* see also:								*/
/*	Knuth, Donald E. "The Art of Computer Programming", vol 3.	*/
/* Searching and Sorting, p. 391-92, Addison-Wesley, Reading, MA, 1973	*/

/* to compile: cc -o soundex -DMAIN soundex.c				*/


#include <stdio.h>
#include <string.h>
#include <assert.h>

char	*progname;


char*
soundex(s, code)
char	*s;				/* input character string	*/
char	*code;				/* output soundex code string	*/
{
#ifndef lint
	static char	Vers[] = 
	"@(#)soundex.c	1.2  Compiled: 14:25:54 12/12/89  Delta Date: 14:25:49 12/12/89";
#endif
	char* 	p = code;
	int	chars_needed = 3;
	static char
	code_table[] = "B\1F\1P\1V\1C\2G\2J\2K\2Q\2S\2X\2Z\2D\3T\3L\4M\5N\5R\6";
	char*	x;
#define	encode(c) (x=strchr(code_table,(c)),(x == NULL)?'\0':*++x + '0')
	char	last_code = encode(*s);

	assert(s != NULL);
	assert(code != NULL);
	assert(strlen(code) >= 4);

	*p++ = toupper(*s++);		/* copy first character as is	*/

	while (*s && chars_needed) {
		char	c = toupper(*s++);
		char	new_code;

		if ((new_code = encode(c)) && new_code != last_code) {
			last_code = *p++ = new_code;
			chars_needed--;
		} else
			last_code = ' ';
	}

	while (chars_needed-- > 0)
		*p++ = '0';
	*p = '\0';

	return code;
}					/* end soundex() */


#ifdef MAIN

main(argc, argv)
int	argc;
char	*argv[];
{
	char	string[BUFSIZ];
	static char	code[] = "K123";

	progname = argv[0];

	while (gets(string)) {
		puts(soundex(string, code));
	}

	return 0;
}				/* end main */

/* 
 	Sample input test stream:

Euler
Ellery
Gauss
Ghosh
Hilbert
Heilbronn
Hutcherson
Hutchison
Hutchinson
Knuth
Kant
Lloyd
Ladd
Lukasiewicz
Lissajous

	Sample output from the above input:

E460
E460
G200
G200
H416
H416
H326
H322
H325
K530
K530
L300
L300
L222
L222

 */

#endif

/* EOF soundex.c */
-- 
Bill Hutchison, DP Consultant	rutgers!cbmvax!burdvax!ubbpc!wgh
Unisys UNIX Portation Center	"Unless you are very rich and very eccentric,
P.O. Box 500, M.S. B121         you will not enjoy the luxury of a computer
Blue Bell, PA 19424		in your own home", Edward Yourdon, 1975.

rick_jones@f616.n713.fido.oz (Rick Jones) (12/14/89)

Original to: ing@hades.oz
G'day.

Rather than give you the code, here's the algorithm (it's a lot simpler
to do than most people think):

A soundex code is a four character representation based on the way a name 
sounds rather than the way it is spelled. Theoretically, using this system, 
you should be able to index a name so that it can be found no matter how it 
was spelled. The system was developed by Margaret K. Odell and Robert C. 
Russell (see U.S. Patents 1261167 [1918] and 1435663 [1922]). 

Every soundex code consists of a letter and three numbers, such as B525. The 
letter is always the first letter of the surname. The numbers are assigned 
this way: 

  1  =  b,p,f,v
  2  =  c,s,k,g,j,q,x,z
  3  =  d,t
  4  =  l
  5  =  m,n
  6  =  r
  disregard  -  a,e,i,o,u,w,y,h

To figure out a surname's code, do this:           JOHNSON
   - Eliminate any a,e,i,o,u,w,y,h                 JNSN
   - Write the first letter, as is, followed
     by the codes found in the table above         JNSN = J525

No matter how long or short the surname is, the soundex code is always the 
first letter of the name followed by three numbers. If you have coded the 
first letter and three numbers but still have more letters in the name, 
ignore them. If you have run out of letters in the name before you have three 
numbers, then add zeroes to the code: 

   WASHINGTON = WSNGTN = W252 (ignore the ending TN)
   KUHNE      = KN     = K500 (add zeroes to the end)
   YE         = Y      = Y000 (add zeroes to the end)

Any double letters side by side should be treated as one letter. For example 
LLOYD is coded as if it were spelled LOYD. GUTIERREZ is coded as if it were 
GUTIEREZ. 

You may have different letters side by side that have the same code value. 
For example PFISTER (P & F are both 1), JACKSON (CKS are all 2). These 
letters should be treated as one letter.  PFISTER is coded as PSTR (P236) and 
JACKSON is coded as JCN (J250). 

Thus, variations in spellings or mispellings should produce the same code 
number.

This material based on "Beginning Your Genealogical Research in the
National Archives," courtesy ROOTS-BBS, CA, Brian Mavrogeorge, sysop.

If you have any trouble coding the above (hardly likely, I'd imagine),
let me know and I'll write you a piece of code compatible with SVID.
Unfortunately, my routine uses lower-level routines proprietary to my
library and it would be useless to give it to you without several other
support routines.

Hope this is of some help.

Rick Jones.



---
 * Origin: /\/\onitor \/\/orld (~~Sydney Australia~~) (Opus 3:713/616)

john@riddle.UUCP (Jonathan Leffler) (12/16/89)

In article <1842@naucse.UUCP> wew@naucse.UUCP (Bill Wilson) writes:
>From article <488@hades.OZ>, by ing@hades.OZ (Ian Gold):
>> I am looking for a 'soundex' routine in C (or C++).
>> 
>> 		char *soundex(char *target, char *given);
>>
>I would be interested in the code as well...

Will this do?

:	"@(#)shar2.c	1.5"
#!/bin/sh
# shar:	Shell Archiver (v1.22)
#
#	This is a shell archive.
#	Remove everything above this line and run sh on the resulting file
#	If this archive is complete, you will see this message at the end
#	"All files extracted"
#
#	Created: Fri Dec 15 21:33:37 1989 by john at Sphinx Ltd.
#	Files archived in this archive:
#	  soundex.c
#
if test -f soundex.c; then echo "File soundex.c exists"; else
echo "x - soundex.c"
sed 's/^X//' << 'SHAR_EOF' > soundex.c &&
X/*
X**	SOUNDEX CODING
X**
X**	Rules:
X**	1.	Retain the first letter; ignore non-alphabetic characters.
X**	2.	Replace second and subsequent characters by a group code.
X**		Group	Letters
X**		1		BFPV
X**		2		CGJKSXZ
X**		3		DT
X**		4		L
X**		5		MN
X**		6		R
X**	3.	Do not repeat digits
X**	4.	Truncate or ser-pad to 4-character result.
X**
X**	Originally formatted with tabstops set at 4 spaces -- you were warned!
X**
X**	Code by: Jonathan Leffler (john@sphinx.co.uk)
X**	This code is shareware -- I wrote it; you can have it for free
X**	if you supply it to anyone else who wants it for free.
X**
X**	BUGS: Assumes ASCII
X*/
X
X#include <ctype.h>
Xstatic char	lookup[] = {
X	'0',	/* A */
X	'1',	/* B */
X	'2',	/* C */
X	'3',	/* D */
X	'0',	/* E */
X	'1',	/* F */
X	'2',	/* G */
X	'0',	/* H */
X	'0',	/* I */
X	'2',	/* J */
X	'2',	/* K */
X	'4',	/* L */
X	'5',	/* M */
X	'5',	/* N */
X	'0',	/* O */
X	'1',	/* P */
X	'0',	/* Q */
X	'6',	/* R */
X	'2',	/* S */
X	'3',	/* T */
X	'0',	/* U */
X	'1',	/* V */
X	'0',	/* W */
X	'2',	/* X */
X	'0',	/* Y */
X	'2',	/* Z */
X};
X
X/*
X**	Soundex for arbitrary number of characters of information
X*/
Xchar	*nsoundex(str, n)
Xchar	*str;	/* In: String to be converted */
Xint		 n;		/* In: Number of characters in result string */
X{
X	static	char	buff[10];
X	register char	*s;
X	register char	*t;
X	char	c;
X	char	l;
X
X	if (n <= 0)
X		n = 4;	/* Default */
X	if (n > sizeof(buff) - 1)
X		n = sizeof(buff) - 1;
X	t = &buff[0];
X
X	for (s = str; ((c = *s) != '\0') && t < &buff[n]; s++)
X	{
X		if (!isascii(c))
X			continue;
X		if (!isalpha(c))
X			continue;
X		c = toupper(c);
X		if (t == &buff[0])
X		{
X			l = *t++ = c;
X			continue;
X		}
X		c = lookup[c-'A'];
X		if (c != '0' && c != l)
X			l = *t++ = c;
X	}
X	while (t < &buff[n])
X		*t++ = '0';
X	*t = '\0';
X	return(&buff[0]);
X}
X
X/* Normal external interface */
Xchar	*soundex(str)
Xchar	*str;
X{
X	return(nsoundex(str, 4));
X}
X
X/*
X**	Alternative interface:
X**	void	soundex(given, gets)
X**	char	*given;
X**	char	*gets;
X**	{
X**		strcpy(gets, nsoundex(given, 4));
X**	}
X*/
X
X
X#ifdef TEST
X#include <stdio.h>
Xmain()
X{
X	char	buff[30];
X
X	while (fgets(buff, sizeof(buff), stdin) != (char *)0)
X		printf("Given: %s Soundex produces %s\n", buff, soundex(buff));
X}
X#endif
SHAR_EOF
chmod 0640 soundex.c || echo "$0: failed to restore soundex.c"
fi
echo All files extracted
exit 0