[comp.misc] Soundex algorithm

chris@mimsy.UUCP (Chris Torek) (07/12/88)

[I have deleted groups comp.theory and comp.ai since Soundex has little
to do with these]

In article <12520@sunybcs.UUCP> stewart@sunybcs.uucp (Norman R. Stewart)
writes:
>2: Apply the following rules to produce a code of one letter and
>   three numbers.
>   A: The first letter of the word becomes the initial character
>      in the code.
>   B: When two or more letters from the same group occur together
>      only the first is coded.
>   C: If two letters from the same group are separated by an H or
>      a W, code only the first.
>   D: Group 7 letters are never coded (this does not include the
>      first letter in the word, which is always coded).

[I thought Soundex codes were usually fixed at four symbols.]

What if more than two letters from the same group are separated by H
or W?  For instance: FDHTWTHTWL.  Is this encoded as F334 or as F34?
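
The two readings of rule C can be made concrete.  What follows is a
minimal Python sketch, not the article's algorithm: the digit table is
the commonly published Soundex grouping and is assumed here, since this
thread quotes only L=4 and R=6, and `soundex_raw` and its
`chain_through_hw` flag are names invented for illustration.

```python
# Assumed letter groups (the commonly published Soundex table); the
# article's own table is not quoted in this thread beyond L=4 and R=6.
GROUPS = {"BFPV": "1", "CGJKQSXZ": "2", "DT": "3", "L": "4", "MN": "5", "R": "6"}
CODE = {ch: digit for letters, digit in GROUPS.items() for ch in letters}


def soundex_raw(word, chain_through_hw):
    """Return the first letter plus every digit produced, unpadded.

    chain_through_hw=True  : a whole run of same-group letters separated
                             only by H or W yields a single digit.
    chain_through_hw=False : rule C is applied to pairs, so a letter that
                             has already suppressed one neighbour across
                             H/W does not suppress a second one.
    Assumes a non-empty word.
    """
    word = word.upper()
    out = [word[0]]
    prev = CODE.get(word[0])  # group of the most recent consonant
    gap = False               # an H or W has been seen since that consonant
    used = False              # prev has already suppressed a letter via rule C
    for ch in word[1:]:
        if ch in "HW":
            gap = True
            continue
        digit = CODE.get(ch)
        if digit is None:     # vowel or Y: never coded, and ends any run
            prev, gap, used = None, False, False
            continue
        if digit == prev and (not gap or chain_through_hw or not used):
            used = used or gap    # dropped by rule B (adjacent) or rule C
        else:
            out.append(digit)
            prev, used = digit, False
        gap = False
    return "".join(out)


print(soundex_raw("FDHTWTHTWL", chain_through_hw=True))   # F34
print(soundex_raw("FDHTWTHTWL", chain_through_hw=False))  # F334
```

Neither result is padded or truncated; under the usual four-symbol
convention F34 would be written F340.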

The table has L=4, R=6; I find this surprising, as both R and L are
semivowels and they are easily confused by those who did not grow up
with the distinction (e.g., some Orientals).
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163)
Domain:	chris@mimsy.umd.edu	Path:	uunet!mimsy!chris

wunder@hp-sde.SDE.HP.COM (Walter Underwood) (07/12/88)

> The table has L=4, R=6; I find this surprising, as both R and L are
> semivowels and they are easily confused by those who did not grow up
> with the distinction (e.g., some Orientals).
>
> In-Real-Life: Chris Torek

Phonetic algorithms are closely tied to the language.  You really need
a different algorithm for each language, and even for variants of the
language (Canadian French, Cajun French, and French French, for example).
How about an American user phonetically spelling a French name?  Now
we have the user language and the name language.  Yikes.

Things get even worse when you translate Chinese names into English
and look them up with Soundex.  You get every Lee, Li, etc., in the
book, because English does not include phonetic distinctions that
exist in Chinese.
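
To make the collision concrete, here is a small Python sketch of a
four-symbol Soundex.  The letter groups are the commonly published ones
(assumed, not taken from the article), and the function name is
arbitrary.  Vowels are never coded, so a name made of L plus vowels has
nothing left to encode.

```python
# Assumed letter groups: the commonly published Soundex table.
CODE = {
    ch: digit
    for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                           ("L", "4"), ("MN", "5"), ("R", "6"))
    for ch in letters
}


def soundex(name):
    """Four-symbol code: first letter, then up to three digits, zero-padded.

    Assumes a non-empty name.
    """
    name = name.upper()
    digits = []
    prev = CODE.get(name[0])
    for ch in name[1:]:
        if ch in "HW":
            continue              # H and W do not end a run of one group
        digit = CODE.get(ch)
        if digit is not None and digit != prev:
            digits.append(digit)
        prev = digit              # a vowel sets prev to None and ends the run
    return (name[0] + "".join(digits) + "000")[:4]


# Lee and Li come from the post; Lai and Liu are extra names picked
# purely for illustration.
for name in ("Lee", "Li", "Lai", "Liu"):
    print(name, soundex(name))    # each line ends in L000
```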

So, Soundex is a quick hack, and we probably should live with the
limitations.  A better solution is probably much more complex.

wunder

lee@uhccux.uhcc.hawaii.edu (Greg Lee) (07/15/88)

From article <460001@hp-sde.SDE.HP.COM>, by wunder@hp-sde.SDE.HP.COM (Walter Underwood):
" > The table has L=4, R=6; I find this surprising, as both R and L are
" > semivowels and they are easily confused by those who did not grow up
" > with the distinction (e.g., some Orientals).
" >
" > In-Real-Life: Chris Torek
" 
" Things get even worse when you translate Chinese names into English
" and look them up with Soundex.  You get every Lee, Li, etc., in the
" book, because English does not include phonetic distinctions that
" exist in Chinese.

It's a plausible idea that such differences in spellings of names might
be due to distinctions in the original language.  It's not so, however.

" So, Soundex is a quick hack, and we probably should live with the
" limitations.  A better solution is probably much more complex.

A better solution to what, I wonder.  If Soundex is used to collect
names in a bibliography, say, which might represent alternate spellings
of the same pronunciation, we'd like to treat r and l as the same letter
when some of the people who made up those spellings speak languages
that merge r/l.  This is how I've used it.
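
One way to get that behaviour is to put L and R in the same group.  The
sketch below is Python written for illustration, not the code actually
used: the table layout, the function names, and the test names Kalen and
Karen are all assumptions.  It takes the group table as a parameter so
the standard and merged tables can be compared.

```python
def make_codes(groups):
    """Map each letter to its group digit."""
    return {ch: digit for digit, letters in groups.items() for ch in letters}


# Commonly published grouping (assumed), in which L=4 and R=6.
STANDARD = make_codes({"1": "BFPV", "2": "CGJKQSXZ", "3": "DT",
                       "4": "L", "5": "MN", "6": "R"})
# Hypothetical variant: R joins L in group 4, leaving group 6 unused.
MERGED_LR = make_codes({"1": "BFPV", "2": "CGJKQSXZ", "3": "DT",
                        "4": "LR", "5": "MN"})


def soundex(name, codes):
    """Four-symbol code: first letter, then up to three digits, zero-padded.

    Assumes a non-empty name.
    """
    name = name.upper()
    digits = []
    prev = codes.get(name[0])
    for ch in name[1:]:
        if ch in "HW":
            continue              # H and W do not end a run of one group
        digit = codes.get(ch)
        if digit is not None and digit != prev:
            digits.append(digit)
        prev = digit              # a vowel sets prev to None and ends the run
    return (name[0] + "".join(digits) + "000")[:4]


print(soundex("Kalen", STANDARD), soundex("Karen", STANDARD))    # K450 K650
print(soundex("Kalen", MERGED_LR), soundex("Karen", MERGED_LR))  # K450 K450
print(soundex("Lee", MERGED_LR), soundex("Ree", MERGED_LR))      # L000 R000
```

The last line shows a limit of this approach: the first letter is kept
verbatim, so merging the groups does nothing for names that differ in
their initial L or R.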

shorne@citron (Scott Horne) (07/16/88)

From article <2050@uhccux.uhcc.hawaii.edu>, by lee@uhccux.uhcc.hawaii.edu (Greg Lee):
> From article <460001@hp-sde.SDE.HP.COM>, by wunder@hp-sde.SDE.HP.COM (Walter Underwood):
> " > The table has L=4, R=6; I find this surprising, as both R and L are
> " > semivowels and they are easily confused by those who did not grow up
> " > with the distinction (e.g., some Orientals).
> " >
> " > In-Real-Life: Chris Torek
> " 
> " Things get even worse when you translate Chinese names into English
> " and look them up with Soundex.  You get every Lee, Li, etc., in the
> " book, because English does not include phonetic distinctions that
> " exist in Chinese.
> 
> It's a plausible idea that such differences in spellings of names might
> be due to distinctions in the original language.  It's not so, however.
> 

First, sincere thanks to all who have helped so much with references and
comments.

It *is* so that there are distinctions in Chinese between words transcribed
as Lee or Li in English.  In Chinese, tone is essential to the meaning of
a word.  Mandarin uses four tones: a high-pitched, level one; a sharply
rising one; a falling-then-rising one; and a sharply falling one.  Li
pronounced with the second tone, for example, is readily distinguished from
Li in the other three tones.  Soundex, which codes only letters and so
ignores tone, is therefore poorly suited to Chinese.

As a matter of fact, the main reason for my request for Soundex info is to
possibly implement it (or, more likely, a greatly modified version thereof)
in a Chinese word-processing program (The Duke Chinese Typist, developed by
Dr. Richard Kunst at Duke University, 2111 Campus Dr., Durham, NC   27706)!

Again, thanks, everyone--and keep the info coming.

				--Scott Horne

BITNET:		PHORNE@CLEMSON (not working; please use another address)
uucp:		....!gatech!hubcap!scarle!{hazel,citron,amber}!shorne
		(If that doesn't work, send to cchang@hubcap.clemson.edu)
SnailMail:	Scott Horne
		812 Eleanor Dr.
		Florence, SC   29501
VoiceNet:	803 667-9848