[net.sources] soundex algorithm wanted

jwagner@booboo.UUCP (Joel Wagner) (08/22/86)

I would like any info pertaining to soundex search algorithms
(phonetic grep).  Source to a nifty, efficient algorithm would
be great, but I'll take anything.  Thanx in advance.

Joel Wagner
Gould Inc., Information Systems
Computer Systems Division
{pur-ee,brl-bmd,sun}!gould!jwagner

jeffm@chinet.UUCP (Jeff Marraccini) (08/26/86)

Joel,

I have a routine written in BASIC on the Apple many moons ago that uses
Soundex coding. I'll see if I can find it and mail it to you.

A good source is your city library. There are often sample routines in
articles pertaining to soundex. I found one in "Soundex Explained" which
was an abstract by Joe A. Donovan. If I can find that too I'll mail it.


-- 
Jeff Marraccini		...ihnp4!{chinet!jeffm, mb2c!edsdrd!ahxenix!jeff}
5582 Dvorak St.
Clarkston, MI 48016
313/623-7115 voice	313/623-6309 modem

rlk@chinet.UUCP (Richard Klappal) (08/27/86)

See Knuth's Art of Computer Programming (Vol 1, I think)

-- 
---
UUCP: ..!ihnp4!chinet!uklpl!rlk || MCIMail: rklappal || Compuserve: 74106,1021
      ..!ihnp4!ihu1h!rlk
---

blob@bnrmtv.UUCP (Brian Bechtel) (08/27/86)

> 
> I would like any info pertaining to soundex search algorithms
> (phonetic grep).  Source to a nifty, efficient algorithm would
> be great, but I'll take anything.  Thanx in advance.
> 

 /*********************************************************\
 * This program exemplifies the soundex algorithm.	  *
 *							  *
 * You type in a word and it spits out the soundex string  *
 * that was produced for that word.			  *
 \*********************************************************/
 
 #include <stdio.h>
 
 char table[] = {
     '0',		    /* A */
     '1',		    /* B */
     '2',		    /* C */
     '3',		    /* D */
     '0',		    /* E */
     '1',		    /* F */
     '2',		    /* G */
     '0',		    /* H */
     '0',		    /* I */
     '2',		    /* J */
     '2',		    /* K */
     '4',		    /* L */
     '5',		    /* M */
     '5',		    /* N */
     '0',		    /* O */
     '1',		    /* P */
     '2',		    /* Q */
     '6',		    /* R */
     '2',		    /* S */
     '3',		    /* T */
     '0',		    /* U */
     '1',		    /* V */
     '0',		    /* W */
     '2',		    /* X */
     '0',		    /* Y */
     '2'			    /* Z */
 };
 
 main()
 {
     char line[81], reduced[25];
     
     printf("type quit to terminate\n");
     do
     {
 	gets(line);
 	if(strcmp(line,"quit")==0)
 	    break;
 	soundex(reduced,line);
 	printf("%s = %s\n",line,reduced);
     } while(1);
     printf("done\n");
 }
 
 soundex(d,s)
 char *d, *s;
 {
     char last_char='#';
     int c;
     char *dorig, *ptr, *ptr2;
     
     dorig = ptr = d;
     
     /* pick up the first char in the string */
     
     *d++ = *s++;
 
     /* for the rest of characters in the string */
     while(*s)
     {
 	/* throw away nonalphabetic characters */
 	if(isalpha(*s)==0)
 	    continue;
 
 	/* convert to upper case */
 	*s = toupper(*s);
     
 	/* convert to group code and place into destination string */
 	
 	c = (int) ( *s++ - 'A') ;
 	*d++ = table[c];
     }
 
     *d = 0;
     
     do
     {
 	/* if character is '0' or character is same as last one */
 	if(*ptr=='0' || *ptr==last_char)
 	{
 	    /* get rid of the character */
 	    ptr2 = ptr;
 	    while(*ptr2=*(ptr2+1)) ptr2++;
 	}
 	else
 	{
 	    /* set last character seen */
 	    last_char = *ptr++;
 	}
     } while(*ptr);    /* while still a character left in the string */
     
     /* make sure the string isn't more than 4 characters */
     
     *(dorig+4) = 0;
      
 }

mike@whuxl.UUCP (BALDWIN) (09/04/86)

> > I would like any info pertaining to soundex search algorithms
> > (phonetic grep).  Source to a nifty, efficient algorithm would
> > be great, but I'll take anything.  Thanx in advance.
> > 
> 
>  /*********************************************************\
>  * This program exemplifies the soundex algorithm.	  *
>  *							  *
>  * You type in a word and it spits out the soundex string  *
>  * that was produced for that word.			  *
>  \*********************************************************/

Unfortunately, it doesn't generate correct Soundex codes.
The algorithm is actually pretty tricky, and I've seen
lots that don't handle names like Lloyd and Manning
properly.  Here's one that I believe is correct:
-----

#include <ctype.h>

#define	SDXLEN	4

char *
soundex(name)
char	*name;
{
	static char	buf[SDXLEN+1];
	register char	c, lc, prev = '0';
	register int	i;

	strcpy(buf, "a000");

	for (i = 0; *name && i < SDXLEN; name++)
		if (isalpha(*name)) {
			lc = tolower(*name);
			c = "01230120022455012623010202" [lc-'a'];
			if (i == 0 || (c != '0' && c != prev)) {
				buf[i] = i ? c : lc;
				i++;
			}
			prev = c;
		}

	return buf;
}

-----
And a little driver for it:
-----

main()
{
	char	line[64];

	while (gets(line))
		puts(soundex(line));
	return 0;
}
-- 
						Michael Baldwin
			(not the opinions of)	AT&T Bell Laboratories
						{at&t}!whuxl!mike

chris@umcp-cs.UUCP (Chris Torek) (09/04/86)

In article <1239@whuxl.UUCP> mike@whuxl.UUCP (BALDWIN) writes:
>	register char	c, lc, prev = '0';

`register int' generates better code on my compiler, and still works.

>		if (isalpha(*name)) {

First you should test isascii(*name) (a nit).

>			lc = tolower(*name);

Watch out!  Some tolower()s fail miserably if !isupper(c).

Anyway, assuming that the basic algorithm is ... sound, I would
change the driver routine, so:

#include <ctype.h>

#define	SDXLEN	4

char *
soundex(name)
	register char *name;
{
	static char buf[SDXLEN+1];
	static char codes[] = "01230120022455012623010202";
	register int c, i = 0, prev;
	char *strcpy();

#ifdef lint
	/* lint cannot tell that prev is set before used */
	prev = 0;
#endif
	(void) strcpy(buf, "a000");
	while ((c = *name++) != 0 && i < SDXLEN) {
		/*
		 * Throw out non-alphabetics, and convert upper case
		 * to lower.
		 */
		if (!isascii(c) || !isalpha(c))
			continue;
		if (isupper(c))
			c = tolower(c);
		/*
		 * Non-first characters must translate to non-zero codes
		 * that are different from the previous code; throw out
		 * those that translate to zero or to prev.
		 */
		if (i > 0 && ((c = codes[c - 'a']) == '0' || c == prev))
			continue;
		buf[i++] = prev = c;
	}
	return (buf);
}
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 1516)
UUCP:	seismo!umcp-cs!chris
CSNet:	chris@umcp-cs		ARPA:	chris@mimsy.umd.edu

falk@sun.uucp (Ed Falk) (09/05/86)

> 
> Unfortunately, it doesn't generate correct Soundex codes.
> The algorithm is actually pretty tricky, and I've seen
> lots that don't handle names like Lloyd and Manning
> properly.  Here's one that I believe is correct:
> -----

Both this version and the previous version had the same bug, to wit:

	lc = tolower(*name);

The 'tolower' function only works if the input was already upper case.
(at least in bsd 4.2 that's the way it is).  The line should read

	lc = isupper(*name) ? tolower(*name) : *name ;


Also, this version left trailing "0"'s on the output.  Is this right?

Perhaps someone could just post the plain english description of
soundex...

		    thanx..
-- 
		-ed falk, sun microsystems
			falk@sun.com
			sun!falk

garyp@cognos.UUCP (Gary Puckering) (09/09/86)

> > 
> Perhaps someone could just post the plain english description of
> soundex...
> 


Sure!  Be glad too...
   The Soundex function creates a phoenetic code based on an input

   key value according to the following rules:

	1) Always retain the first letter of the key.

	2) Drop out A, E, I, O, U, Y, W, and H.

	3) Assign the following numbers to the remaining similar-sounding
	   sets of letters:

		B, F, P, V = 1
		C, G, J, K, Q, S, X, Z = 2
		D, T =3
		L = 4
		M, N = 5
		R = 6

	4) If there are insufficient letters in the key to complete the
	   code, fill with zeroes.

	5) Drop adjacent equivalent letters (e.g. KELLY = K400).

	6) Drop out adjacent equivalents of the first letter (e.g. 
	   LLOYD = L300).  

-- 

-- 
Gary Puckering
COGNOS Incorporated
3755 Riverside Dr.
Ottawa, Ontario
Canada  K1G 3N3

Telephone: (613) 738-1440
Telex: 053-3341