[comp.sources.wanted] SOUNDEX routines wanted

ron@iconsys.UUCP (Ron Holt) (06/24/88)

I am considering writing an interactive spell checker/corrector for
Unix similar to that implemented in WordPerfect.  I would like to try
using Soundex for the spell corrector portion.  Does any one know
where I can get source code to any Soundex routines?
-- 
Ron Holt                     UUCP: {uunet,caeco}!iconsys!ron
Software Development Manager ARPANET: icon%byuadam.bitnet@cunyvm.cuny.edu
Icon International, Inc.     BITNET: icon%byuadam.bitnet
Orem, Utah 84058             PHONE: (801) 225-6888

leo@philmds.UUCP (Leo de Wit) (06/29/88)

In article <250@iconsys.UUCP> ron@iconsys.UUCP (Ron Holt) writes:
>
>I am considering writing an interactive spell checker/corrector for
>Unix similar to that implemented in WordPerfect.  I would like to try
>using Soundex for the spell corrector portion.  Does any one know
>where I can get source code to any Soundex routines?

Soundex is in fact so easy you should write it yourself. Here's what I
read in an old Pascal exercise book (in Dutch, so I translated for you):

---------------- S T A R T    Q U O T A T I O N ------------------

All characters belong to a group, as follows (ignoring case)

0:  a,e,i,o,u,h,w,y, <all non-alpha characters>
1:  b,f,p,v
2:  c,g,j,k,q,s,x,z
3:  d,t
4:  l
5:  m,n
6:  r

1) First replace each character of the string to be encoded by its
group number. So 'This is a testcase' becomes '300200200030232020'.

2) Then replace each run of repeated digits by a single occurrence. So the
example becomes '30202030232020'.

3) Finally remove the '0's. So the example becomes '32232322'.

---------------- E N D   Q U O T A T I O N ------------------
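
The three steps quoted above can be sketched directly in C. This is only an
illustration of the quoted description; the names soundex_group() and
soundex_digits() and the destsize parameter are my own assumptions, not part
of the exercise book. It writes the code as readable digit characters and
does steps 2 and 3 in a single pass.

```c
#include <ctype.h>
#include <stddef.h>

/* Step 1: group number of one character, per the table quoted above.
   (soundex_group is a hypothetical name chosen for this sketch.) */
static int soundex_group(int ch)
{
    switch (tolower((unsigned char)ch)) {
    case 'b': case 'f': case 'p': case 'v':
        return 1;
    case 'c': case 'g': case 'j': case 'k':
    case 'q': case 's': case 'x': case 'z':
        return 2;
    case 'd': case 't':
        return 3;
    case 'l':
        return 4;
    case 'm': case 'n':
        return 5;
    case 'r':
        return 6;
    default:
        return 0;   /* vowels, h, w, y and all non-alpha characters */
    }
}

/* Encode src as a string of digit characters in dest.
   destsize is the size of dest; the output is truncated if it does not fit. */
void soundex_digits(const char *src, char *dest, size_t destsize)
{
    int last = 0;
    size_t n = 0;

    if (destsize == 0)
        return;
    for ( ; *src != '\0'; src++) {
        int g = soundex_group(*src);
        if (g != last) {                      /* step 2: collapse repetitions */
            last = g;
            if (g != 0 && n + 1 < destsize)   /* step 3: drop the 0 group */
                dest[n++] = (char)('0' + g);
        }
    }
    dest[n] = '\0';
}
```

Called with "This is a testcase" and a large enough buffer, this yields
"32232322", the same result as the worked example in the quotation.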

Because there are 7 groups (of which only 6 survive step 3), a nibble (4
bits) is enough to encode a group number. If your strcmp() does not ignore bit
7, you can use it to compare encoded soundex strings; otherwise use
memcmp(). Below is an implementation that should work (I haven't tested
it). The class[] char array should contain 0 in most positions, except
    class['b'] = class['f'] = class['p'] = class['v'] = 
    class['B'] = class['F'] = class['P'] = class['V'] = 1;
etc. for the other non-0 classes.
The encoded string is null-terminated so that the standard str...() functions
can be used on it.

static char class[256] = {
  0   /* put the correct 256 initializers here; omitted entries default to 0 */
};

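void soundex(const char *src, char *dest)
{
    char lastclass = 0, newclass;
    int even = 1;   /* 1: next code goes in the high nibble; 0: the low nibble */

    for ( ; *src != '\0'; src++) {
        /* cast so that chars with bit 7 set cannot index class[] negatively */
        if ((newclass = class[(unsigned char)*src]) != lastclass) {
            lastclass = newclass;
            if (newclass != 0) {
                if (even) {
                    *dest = newclass << 4;
                } else {
                    *dest++ |= newclass;
                }
                even = !even;
            }
        }
    }
    if (!even) dest++;   /* keep a half-filled last byte */
    *dest = '\0';
}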
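
Writing out 256 initializers by hand is error-prone, so the table could
instead be filled once at run time. The following is a minimal sketch of
that idea; set_group() and init_class() are hypothetical helper names, and
the array is declared here without initializers so that it starts out all
zero.

```c
#include <ctype.h>

static char class[256];   /* static storage: every entry starts as group 0 */

/* Put each letter of `letters`, in both cases, into `group`.
   (set_group is a hypothetical helper for this sketch.) */
static void set_group(const char *letters, char group)
{
    for ( ; *letters != '\0'; letters++) {
        unsigned char lc = (unsigned char)*letters;
        class[lc] = group;
        class[(unsigned char)toupper(lc)] = group;
    }
}

/* Fill class[] according to the group table in the quotation.
   Must be called once before the table is used. */
void init_class(void)
{
    set_group("bfpv", 1);
    set_group("cgjkqsxz", 2);
    set_group("dt", 3);
    set_group("l", 4);
    set_group("mn", 5);
    set_group("r", 6);
}
```

After init_class() has run, class['b'] and class['B'] are both 1, and any
character not listed (vowels, digits, blanks) stays in group 0.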

Hope this satisfies your need. Good luck!

    Leo.

jackm@devvax.JPL.NASA.GOV (Jack Morrison) (07/01/88)

In article <538@philmds.UUCP> leo@philmds.UUCP (L.J.M. de Wit) writes:
>In article <250@iconsys.UUCP> ron@iconsys.UUCP (Ron Holt) writes:
>>
>>I am considering writing an interactive spell checker/corrector for
>>Unix similar to that implemented in WordPerfect.  I would like to try
>>using Soundex for the spell corrector portion.  Does any one know [...]

>Soundex is in fact so easy you should write it yourself. Here's what I
>read in an old Pascal exercise book (in Dutch, so I translated for you):
>

[algorithm and implementation deleted...]

I wonder, if you had space, whether a better version could be written
using basic text-to-speech pattern matching. It would try, for example,
to determine whether a 'c' meant an 'S' sound or a 'K' sound based on
letter context. Build up the 'soundex+' string based on a larger set
of classes roughly equivalent to phonemes. For example, see Ciarcia's
speech synthesizer project from BYTE magazine about two years back.
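
To make the 'c' example concrete, here is a rough sketch of one such context
rule. It relies on the common rule of thumb that 'c' sounds like 'S' before
e, i or y and like 'K' elsewhere; that rule has plenty of exceptions ("ch",
for one), and the enum and function names are invented for this
illustration only.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical sound classes for this sketch; a real 'soundex+' would
   need many more, roughly one per phoneme. */
enum c_sound { SOUND_NONE, SOUND_S, SOUND_K };

/* Guess the sound of the letter at word[i], looking one letter ahead.
   Only 'c' is handled; every other letter yields SOUND_NONE. */
enum c_sound guess_c_sound(const char *word, size_t i)
{
    int cur = tolower((unsigned char)word[i]);
    int next;

    if (cur != 'c')
        return SOUND_NONE;
    /* word[i] is 'c', so word[i + 1] is at worst the terminating '\0' */
    next = tolower((unsigned char)word[i + 1]);
    if (next == 'e' || next == 'i' || next == 'y')
        return SOUND_S;
    return SOUND_K;
}
```

With this rule the 'c' in "city" is classed as an S sound, while the 'c' in
"cat" and the final 'c' in "music" are classed as K sounds.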

Just a thought...
-- 
Jack C. Morrison	Jet Propulsion Laboratory
(818)354-1431		jackm@jpl-devvax.jpl.nasa.gov
"The paycheck is part government property, but the opinions are all mine."