chris@mimsy.UUCP (Chris Torek) (07/12/88)
[I have deleted groups comp.theory and comp.ai since Soundex has little
to do with these]

In article <12520@sunybcs.UUCP> stewart@sunybcs.uucp (Norman R. Stewart) writes:
>2: Apply the following rules to produce a code of one letter and
>   three numbers.
>   A: The first letter of the word becomes the initial character
>      in the code.
>   B: When two or more letters from the same group occur together
>      only the first is coded.
>   C: If two letters from the same group are separated by an H or
>      a W, code only the first.
>   D: Group 7 letters are never coded (this does not include the
>      first letter in the word, which is always coded).

[I thought Soundex codes were usually fixed at four symbols.]

What if more than two letters from the same group are separated by H
or W? For instance: FDHTWTHTWL. Is this encoded as F334 or as F34?

The table has L=4, R=6; I find this surprising, as both R and L are
semivowels and they are easily confused by those who did not grow up
with the distinction (e.g., some Orientals).
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163)
Domain: chris@mimsy.umd.edu    Path: uunet!mimsy!chris
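As an illustration of rules A through D quoted above (not code from the thread), here is a minimal Python sketch. Several things are assumptions: the thread only confirms L=4 and R=6, so the rest of the letter table is the commonly published Soundex grouping; vowels, Y, H and W are treated as the uncoded "group 7"; short codes are padded with zeros to four symbols; and rule C is read transitively, so H and W never break a run of same-group letters. The names `GROUPS`, `CODE`, `SEPARATORS` and `soundex` are made up for this sketch.

```python
# Assumed letter table: only L=4 and R=6 are confirmed by the thread.
GROUPS = {
    "BFPV": "1",
    "CGJKQSXZ": "2",
    "DT": "3",
    "L": "4",
    "MN": "5",
    "R": "6",
}
CODE = {ch: digit for letters, digit in GROUPS.items() for ch in letters}

# Rule C (as read here): H and W do not interrupt a run of same-group letters.
SEPARATORS = "HW"


def soundex(word, length=4):
    """Return a Soundex-style code: first letter plus digits, zero-padded."""
    letters = [c for c in word.upper() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0]             # rule A: keep the first letter as-is
    prev = CODE.get(letters[0])     # group of the first letter, or None
    for ch in letters[1:]:
        if ch in SEPARATORS:
            continue                # rule C: leave the previous group untouched
        digit = CODE.get(ch)        # None for uncoded letters (rule D)
        if digit is not None and digit != prev:
            result += digit         # rule B: code only the first of a run
        prev = digit                # a vowel resets the run
    return (result + "0" * length)[:length]


print(soundex("FDHTWTHTWL"))              # F340
print(soundex("Robert"), soundex("Rupert"))  # R163 R163
```

Under this transitive reading, FDHTWTHTWL comes out as F340, that is, the "F34" answer padded to four symbols. A strictly pairwise reading of rule C would need different logic and could give a different code; the sketch does not settle which one the posted rules intend.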
wunder@hp-sde.SDE.HP.COM (Walter Underwood) (07/12/88)
> The table has L=4, R=6; I find this surprising, as both R and L are
> semivowels and they are easily confused by those who did not grow up
> with the distinction (e.g., some Orientals).
>
> In-Real-Life: Chris Torek

Phonetic algorithms are closely tied to the language. You really need
a different algorithm for each language, and even for variants of the
language (Canadian French, Cajun French, and French French, for
example).

How about an American user phonetically spelling a French name? Now
we have the user language and the name language. Yikes.

Things get even worse when you translate Chinese names into English
and look them up with Soundex. You get every Lee, Li, etc., in the
book, because English does not include phonetic distinctions that
exist in Chinese.

So, Soundex is a quick hack, and we probably should live with the
limitations. A better solution is probably much more complex.

wunder
lee@uhccux.uhcc.hawaii.edu (Greg Lee) (07/15/88)
From article <460001@hp-sde.SDE.HP.COM>, by wunder@hp-sde.SDE.HP.COM (Walter Underwood):
" > The table has L=4, R=6; I find this surprising, as both R and L are
" > semivowels and they are easily confused by those who did not grow up
" > with the distinction (e.g., some Orientals).
" >
" > In-Real-Life: Chris Torek
"
" Things get even worse when you translate Chinese names into English
" and look them up with Soundex. You get every Lee, Li, etc., in the
" book, because English does not include phonetic distinctions that
" exist in Chinese.

It's a plausible idea that such differences in spellings of names might
be due to distinctions in the original language. It's not so, however.

" So, Soundex is a quick hack, and we probably should live with the
" limitations. A better solution is probably much more complex.

A better solution to what, I wonder. If Soundex is used to collect
names in a bibliography, say, which might represent alternate spellings
of the same pronunciation, we'd like to identify r/l when some of the
people who made up those spellings speak languages that merge r/l.
This is how I've used it.
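To make the r/l point concrete, here is a hedged Python sketch of a Soundex-style coder whose table can put L and R in one group. It is not Greg Lee's actual setup, which the thread does not describe. The function name `soundex_lr`, the `merge_lr` flag, the zero padding, and the letter table (apart from L=4 and R=6) are assumptions made for illustration, and "Halada" is only a hypothetical variant spelling.

```python
def soundex_lr(word, merge_lr=True, length=4):
    """Soundex-style code; with merge_lr=True, L and R share one group."""
    # Assumed table: the thread only confirms L=4 and R=6 for the usual one.
    if merge_lr:
        groups = ["BFPV", "CGJKQSXZ", "DT", "LR", "MN"]      # hypothetical variant
    else:
        groups = ["BFPV", "CGJKQSXZ", "DT", "L", "MN", "R"]  # usual grouping
    code = {ch: str(i) for i, letters in enumerate(groups, 1) for ch in letters}

    letters = [c for c in word.upper() if c.isalpha()]
    if not letters:
        return ""
    out, prev = letters[0], code.get(letters[0])
    for ch in letters[1:]:
        if ch in "HW":
            continue                  # H and W do not break a run
        digit = code.get(ch)          # None for uncoded letters
        if digit is not None and digit != prev:
            out += digit
        prev = digit
    return (out + "0" * length)[:length]


print(soundex_lr("Harada", merge_lr=False), soundex_lr("Halada", merge_lr=False))  # H630 H430
print(soundex_lr("Harada"), soundex_lr("Halada"))                                  # H430 H430
print(soundex_lr("Lee"), soundex_lr("Ree"))                                        # L000 R000
```

Merging the groups only helps for non-initial letters. Because the first letter is kept literally, a pair such as Lee and Ree still gets different codes, so identifying an initial r/l would take a further change to the scheme.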
shorne@citron (Scott Horne) (07/16/88)
From article <2050@uhccux.uhcc.hawaii.edu>, by lee@uhccux.uhcc.hawaii.edu (Greg Lee):
> From article <460001@hp-sde.SDE.HP.COM>, by wunder@hp-sde.SDE.HP.COM (Walter Underwood):
> " > The table has L=4, R=6; I find this surprising, as both R and L are
> " > semivowels and they are easily confused by those who did not grow up
> " > with the distinction (e.g., some Orientals).
> " >
> " > In-Real-Life: Chris Torek
> "
> " Things get even worse when you translate Chinese names into English
> " and look them up with Soundex. You get every Lee, Li, etc., in the
> " book, because English does not include phonetic distinctions that
> " exist in Chinese.
>
> It's a plausible idea that such differences in spellings of names might
> be due to distinctions in the original language. It's not so, however.
>

First, sincere thanks to all who have helped so much with references and
comments.

It *is* so that there are distinctions in Chinese between words
transcribed as Lee or Li in English. In Chinese, the tone is very
important in the meaning of words. Chinese uses four tones--a
high-pitched, level one; a sharply rising one; a falling-then-rising
one; and a sharply falling one. Li pronounced with the second tone,
for example, is readily distinguished from Li in the other three tones.
This makes Soundex not so suitable for Chinese.

As a matter of fact, the main reason for my request for Soundex info is
to possibly implement it (or, more likely, a greatly modified version
thereof) in a Chinese word-processing program (The Duke Chinese Typist,
developed by Dr. Richard Kunst at Duke University, 2111 Campus Dr.,
Durham, NC 27706)!

Again, thanks, everyone--and keep the info coming.

--Scott Horne

BITNET:    PHORNE@CLEMSON (not working; please use another address)
uucp:      ....!gatech!hubcap!scarle!{hazel,citron,amber}!shorne
           (If that doesn't work, send to cchang@hubcap.clemson.edu)
SnailMail: Scott Horne
           812 Eleanor Dr.
           Florence, SC 29501
VoiceNet:  803 667-9848