[net.sources] CORRECT soundex AGAIN!

mike@whuxl.UUCP (BALDWIN) (09/08/86)

Chris, I'm surprised.  You made some gratuitous changes to my
code, but you also broke it!  Please don't publish code unless
you're reasonably sure it works.  I've tested mine against the
examples in Knuth's Vol 3.

For everyone's edification, here is the text from p. 392 for
the Soundex algorithm:

1. Retain the first letter of the name, and drop all occurrences of
   a, e, h, i, o, u, w, y in other positions.

2. Assign the following numbers to the remaining letters after the first:
	b, f, p, v -> 1
	c, g, j, k, q, s, x, z -> 2
	d, t -> 3
	l -> 4
	m, n -> 5
	r -> 6

3. If two or more letters with the same code were adjacent in the original
   name (before step 1), omit all but the first.

4. Convert to the form ``letter, digit, digit, digit'' by adding trailing
   zeros (if there are less than three digits), or by dropping rightmost
   digits (if there are more than three).

The examples given in the book are:

	Euler, Ellery		E460
	Gauss, Ghosh		G200
	Hilbert, Heilbronn	H416
	Knuth, Kant		K530
	Lloyd, Ladd		L300
	Lukasiewicz, Lissajous	L222
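
As a sanity check, the table above can be turned into a small self-test.
This is only a sketch in ANSI C, not the code from this posting: the names
soundex_r() and soundex_is() are made up for illustration, the first letter
is upper-cased to match the book's table, and ASCII letter ordering is
assumed.

```c
#include <ctype.h>
#include <string.h>

/* Hypothetical helper: writes the 4-character code for name into out. */
void soundex_r(const char *name, char out[5])
{
    /* Step 2 digits for a..z; '0' marks the letters dropped in step 1. */
    static const char code[] = "01230120022455012623010202";
    char prev = '0';
    int i = 0;

    strcpy(out, "A000");    /* trailing zeros already in place (step 4) */
    for (; *name != '\0' && i < 4; name++) {
        int lc = tolower((unsigned char)*name);
        char c;

        if (lc < 'a' || lc > 'z')
            continue;       /* ignore anything that is not a letter */
        c = code[lc - 'a'];
        if (i == 0)
            out[i++] = (char)toupper(lc);   /* step 1: keep first letter */
        else if (c != '0' && c != prev)
            out[i++] = c;                   /* step 3: new code only */
        prev = c;           /* tracks the original name, vowels included */
    }
}

/* Hypothetical helper: returns 1 if name encodes to want, else 0. */
int soundex_is(const char *name, const char *want)
{
    char got[5];

    soundex_r(name, got);
    return strcmp(got, want) == 0;
}
```

With this, soundex_is("Lloyd", "L300") and the other eleven pairs in the
table should all return 1.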

Most implementations fail in two ways:
 1. they test for adjacent letters with the same code AFTER step 1 has
    dropped letters, not before.
 2. they do not omit letters at the beginning of the name that have the
    same code as the first letter.

I.e., most will fail on Lloyd (the second way) and on Lukasiewicz and
Lissajous (the first way).
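
To make the two failure modes concrete, here is a deliberately WRONG
variant.  It is a made-up illustration (soundex_naive() is a hypothetical
name, not anyone's actual posting), written in ANSI C and assuming ASCII
letter ordering; the two flaws are marked in the comments.

```c
#include <ctype.h>
#include <string.h>

/* Deliberately flawed, for illustration only.  Do not use. */
void soundex_naive(const char *name, char out[5])
{
    static const char code[] = "01230120022455012623010202";
    char prev = '0';
    int i = 0;

    strcpy(out, "A000");
    for (; *name != '\0' && i < 4; name++) {
        int lc = tolower((unsigned char)*name);
        char c;

        if (lc < 'a' || lc > 'z')
            continue;
        if (i == 0) {
            /* flaw 2: the first letter's own code is never remembered */
            out[i++] = (char)toupper(lc);
            continue;
        }
        c = code[lc - 'a'];
        if (c == '0')
            continue;   /* flaw 1: dropped letters do not reset prev */
        if (c != prev)
            out[i++] = c;
        prev = c;
    }
}
```

It turns Lloyd into L430 and both Lukasiewicz and Lissajous into L200,
where the book gives L300 and L222.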

Some comments on your comments on my code:

> >	register char	c, lc, prev = '0';
> `register int' generates better code on my compiler, and still works.

Those variables are used as characters, not integers.
I'm sorry that your compiler is deficient, but I like
to declare variables the way they are used.

> >		if (isalpha(*name)) {
> First you should test isascii(*name) (a nit).

Um, isalpha() returns false for all non-ASCII characters already.

> >			lc = tolower(*name);
> Watch out!  Some tolower()s fail miserably if !isupper(c).

I should have said that my code conforms to the SVID; according to
it and all other standards, tolower(c) will always work correctly.

The only changes I would make are to include <string.h> and to
cast the result of strcpy() to (void).  Those are the only things
my lint complains about.

> #ifdef lint
> 	/* lint cannot tell that prev is set before used */
> 	prev = 0;
> #endif

Mine can.

Here is the CORRECT code again:
-----

#include <ctype.h>
#include <string.h>

#define	SDXLEN	4

char *
soundex(name)
char	*name;
{
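	static char	buf[SDXLEN+1];	/* overwritten by each call */
	register char	c, lc, prev = '0';
	register int	i;

	/* start with the trailing zeros in place (step 4) */
	(void) strcpy(buf, "a000");

	for (i = 0; *name && i < SDXLEN; name++)
		if (isalpha(*name)) {
			lc = tolower(*name);
			/* step 2 digit for a..z; '0' = dropped in step 1 */
			c = "01230120022455012623010202" [lc-'a'];
			/* keep the first letter, then only a new nonzero code */
			if (i == 0 || (c != '0' && c != prev)) {
				buf[i] = i ? c : lc;
				i++;
			}
			/* remember every letter's code, dropped ones too (step 3) */
			prev = c;
		}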

	return buf;
}

-----
One caveat: your tolower() may need an isupper() test in front of it,
and you may have <strings.h> instead of <string.h>, or no such header
at all.
Please don't change it unless you're sure the new code still works!
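
For libraries whose tolower() misbehaves on anything but an upper-case
letter, the isupper() guard might look like the sketch below.
safe_tolower() is a hypothetical name; the cast to unsigned char is my
addition, to keep negative char values out of the <ctype.h> calls.

```c
#include <ctype.h>

/* Hypothetical helper: lower-cases c only if it is an upper-case letter. */
int safe_tolower(int c)
{
    unsigned char uc = (unsigned char)c;

    return isupper(uc) ? tolower(uc) : c;
}
```

In soundex() this would replace the bare tolower(*name) call.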
-- 
						Michael Baldwin
			(not the opinions of)	AT&T Bell Laboratories
						{at&t}!whuxl!mike