dickow@ui3.UUCP (01/31/90)
>/ ui3:comp.lang.forth / ForthNet@willett.UUCP (ForthNet articles from GEnie) / 4:52 pm Jan 25, 1990 /
>W.BADEN1 [Wil] at 18:55 PST
>Forth enlightenment: The principal input method in Forth is not KEY or
>EXPECT, but INTERPRET.
Yeah, that's the ideal, but quite a few Forth systems cannot be
distributed freely without disabling the interpreter, scrambling the
dictionary names, etc. Often you then have to write a pseudo-interpreter
to call your application words.
Bob Dickow (...egg-id!ui3!dickow) (rdickow@groucho.mrc.uidaho.edu) (dickow@idui1.bitnet)
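A pseudo-interpreter of the kind mentioned here can be as small as a word that compares a command string against a fixed list of names. The sketch below is only an illustration in standard (ANS-style) Forth; GREET, TOTAL and RUN-COMMAND are hypothetical names, and the comparison is case-sensitive.

```forth
\ Sketch only: hypothetical application words and dispatcher.
: GREET ( -- )  ." hello " ;
: TOTAL ( -- )  2 3 + . ;

\ Run the application word whose name matches the string;
\ anything else is echoed back with a question mark.
: RUN-COMMAND ( c-addr u -- )
   2DUP S" GREET" COMPARE 0= IF  2DROP GREET EXIT  THEN
   2DUP S" TOTAL" COMPARE 0= IF  2DROP TOTAL EXIT  THEN
   TYPE ."  ?" ;

: DEMO ( -- )
   S" GREET" RUN-COMMAND      \ prints hello
   S" TOTAL" RUN-COMMAND ;    \ prints 5
DEMO
```

In a turnkey application the string would typically come from a line read with ACCEPT (or EXPECT on older systems) rather than from literals.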
ForthNet@willett.pgh.pa.us (ForthNet articles from GEnie) (09/12/90)
Date: 09-09-90 (21:56) Number: 3743 (Echo)
To: ALL Refer#: NONE
From: ZAFAR ESSAK Read: (N/A)
Subj: SOUNDEX Status: PUBLIC MESSAGE
I have been experimenting with the utility SOUNDEX described by Ron
Braithwaite in FD X/3 & 4 in 1988. I modified it slightly for use
without a string stack and to be compatible with F-PC as follows:
\ SOUNDEX.TXT Ron Braithwaite "Using A String Stack" FD X/3 p.15 (1988)
((
The whole idea of SOUNDEX dates back to the 1894 U.S. census when they
wanted to be able to find names that sounded alike. The algorithm for
$SOUNDEX came from Guy Kelly.
))
ONLY FORTH ALSO DEFINITIONS
DECIMAL
: C>SNDX ( ascii--char2)
DUP 96 > IF 32 - THEN \ convert to uppercase ('a' is 97, so test > 96)
65 - 0 MAX 26 MIN
( ABCDEFGHIJKLMNOPQRSTUVWXYZ )
" 012301200224550126230102020" DROP + C@ ;
CREATE sndx.buf ( --$adr) ," 0000"
: >SOUNDEX ( adr1,n--$adr2) \ 0000 <= $adr2 <= Z999
0 sndx.buf C! sndx.buf 1+ 4 ASCII 0 FILL
?DUP
IF OVER C@
DUP 96 > IF 32 - THEN \ convert to uppercase ('a' is 97, so test > 96)
DUP sndx.buf 1+ C! \ store first character
1 sndx.buf C+! \ as start of $soundex
C>SNDX -ROT \ earlier character's sndx
BOUNDS 1+
?DO I C@ C>SNDX \ old,new
TUCK =
OVER ASCII 0 = OR 0=
IF DUP sndx.buf COUNT + C! 1 sndx.buf C+!
THEN sndx.buf C@ 4 = ?LEAVE
LOOP
THEN DROP
sndx.buf 4 OVER C! ;
: $SOUNDEX ( $adr1--$adr2) \ 0000 <= $adr2 <= Z999
COUNT >SOUNDEX ;
CR .( cr pad dup 20 expect cr span @ ) CR CR
CR .( >SOUNDEX cr count type space ) CR
======================================================
Now I am wondering if anyone can tell me if I have inadvertently
introduced any errors in this translation?
Assuming I have not, I have taken the above code and applied it to 2,000
names from an existing database and have been examining the results.
At the moment I am not sure exactly how this function can be useful.
It does group names which at times seem close:
e.g. SCHMIDT, SMITH, SMYTH are all S530
But other times names such as:
ACTON, ASHDOWN, AUSTIN are grouped as A235.
I have wondered if the ethnic origin of names might affect the
weighting used in the definitions above. Any comments would be
welcomed.
Zafar.
---
* Via Qwikmail 2.01
NET/Mail : British Columbia Forth Board - Burnaby BC - (604)434-5886
-----
This message came from GEnie via willett through a semi-automated process.
Report problems to: uunet!willett!dwp or dwp@willett.pgh.pa.us
ForthNet@willett.pgh.pa.us (ForthNet articles from GEnie) (09/14/90)
Date: 09-11-90 (13:10) Number: 3758 (Echo)
To: ZAFAR ESSAK Refer#: 3743
From: JACK BROWN Read: NO
Subj: SOUNDEX Status: PUBLIC MESSAGE
ZE>I have been experimenting with the utility SOUNDEX described by Ron
ZE>Braithwaite in FD X/3 & 4 in 1988. I modified it slightly for use
ZE>without a string stack and to be compatible with F-PC as follows:
Ralph Dean had a Forth implementation of SOUNDEX in Dr Dobbs #50
You can get his complete implementation in the file BSTRING.SEQ that
can be found in L6.ZIP of Jack Brown's F-PC 3.5 Tutorial.
[ Lessons 1 - 7 are on wsmr-simtel20.army.mil and wuarchive.wustl.edu.
The file is called fpcl1-7.zip. -dwp ]
Below is the last section of this file. You could use Ralph's
implementation to check your own. You will need to get the
file BSTRING.SEQ from L6.ZIP to compile the code below.
\ Ralph Dean's FORTH implementation of SOUNDEX program that
\ originally appeared in the May 1980 Byte Magazine.
\
\ Executing SOUND will cause a prompt for the name.
\ The name is terminated after 30 characters or <enter>.
\ The soundex code is then computed and typed out.
\ The string variable S$ contains the code produced.
\ For more information on Soundex codes see the original
\ Byte article.
FORTH DEFINITIONS DECIMAL
30 STRING N$ \ Input string whose soundex code is to be found.
4 STRING S$ \ Output string containing soundex code.
1 STRING K$ 1 STRING L$
: NAME ( -- ) \ Prompt for input of last name.
CR ." Last Name? " N$ $IN ;
: FIRST1 ( -- ) \ Move first character to S$
1 N$ LEFT$ S$ S! ;
: ITH ( n m -- k )
N$ MID$ DROP C@ 64 - ;
: KTH ( k -- )
DUP " 01230120022455012623010202"
MID$ K$ S! ;
: BLS ( -- )
S$ K$ S+ S$ S! ;
: TEST ( -- flag )
K$ L$ S= K$ " 0" S= OR 0= ;
: IST ( k -- k flag ) \ flag is true when k is a letter index, 1..26
DUP 1 < OVER 26 > OR 0= ;
\ Compute soundex code
: COMP ( -- )
N$ LEN 1+ 2
DO I I ITH IST
IF KTH TEST IF BLS THEN
ELSE DROP
THEN
K$ L$ S!
LOOP ;
\ This is the Program. BROWN , BRUN , BRAWN all give B650
: SOUNDEX ( -- )
NAME FIRST1 N$ LEN 2 >
IF COMP THEN S$ " 0000" S+ S$ S!
CR ." Soundex Code = " S$ TYPE CR ;
---
* QDeLuxe 1.01 #260s Are you a member of FIG? Why not join today!
NET/Mail : British Columbia Forth Board - Burnaby BC - (604)434-5886
ForthNet@willett.pgh.pa.us (ForthNet articles from GEnie) (09/14/90)
Date: 09-12-90 (00:46) Number: 3761 (Echo)
To: ZAFAR ESSAK Refer#: 3743
From: KENNETH O'HESKIN Read: NO
Subj: SOUNDEX Status: PUBLIC MESSAGE
ZE>Assuming I have not I have taken the above code and applied it to 2,0
ZE>names from an existing database and have been examining the results.
ZE>At the moment I am not sure exactly how this function can be useful.
I haven't yet applied Soundex to any serious use, since its
utility seems to be contingent on two preconditions...
(1) very large databases of proper nouns, and
(2) unreliable methods of data entry, especially systems
prone to misspellings due to operator error.
Both conditions are more likely to occur in the corporate-
governmental mainframe environments rather than on single-user
microcomputers. Most of us probably have had the experience
of our name being misspelled, say on a magazine label, and
as this "sucker list" is sold to other databases, the error
gets cloned and we start getting junk mail from all and sundry
with the identical error.
The data in that kind of environment may have been gathered
over the phone, or taken from forms with little boxes far too
small to print legibly in, and often the operator may be some
underpaid drudge who has no motivation to do accurate work.
Since an exact match may not yield a successful search, a
Soundex type of pattern matching might get you in the ballpark.
---
~ EZ 1.26 ~
NET/Mail : British Columbia Forth Board - Burnaby BC - (604)434-5886
ForthNet@willett.pgh.pa.us (ForthNet articles from GEnie) (09/15/90)
Date: 09-13-90 (09:45) Number: 3768 (Echo)
To: KENNETH O'HESKIN Refer#: 3761
From: STEVE PALINCSAR Read: NO
Subj: SOUNDEX Status: PUBLIC MESSAGE
If you ever come to the National Archives to do any genealogical
research you get to use SOUNDEX when you use the indexes to the old
Censuses. It's useful in consolidating the various attempts that were
made at spelling foreign names.
ForthNet@willett.pgh.pa.us (ForthNet articles from GEnie) (09/15/90)
Date: 09-13-90 (10:57) Number: 3769 (Echo)
To: ZAFAR ESSAK Refer#: 3743
From: GENE LEFAVE Read: NO
Subj: SOUNDEX Status: PUBLIC MESSAGE
ZE>At the moment I am not sure exactly how this function can be useful.
ZE>It does group names which at times seems close:
ZE> e.g. SCHMIDT, SMITH, SMYTH are all S530
Although I don't pretend to be a SOUNDEX expert I have some experience
using it. First, the state of Illinois uses it to generate driver
license numbers. A license number is the SOUNDEX code for your last
name, a first name code, ( I don't know where that comes from), and a
coded birth date.
I used to use SOUNDEX code to retrieve entries in a database. Using
SOUNDEX made the program very tolerant of spelling errors. I seem
to recall that certain database programs had this function built in.
However, English has so many short words that I found that in many cases
I was essentially searching on the first character. So I went to
a string search.
As to the basic algorithm, the idea is to use the first letter, then
drop all vowels, then group the remaining consonants into 6 sound alike
classes. These classes are English specific, not necessarily ethnic.
Adjacent duplicates are dropped.
SCHMIDT = S530 because
S first character.
C dropped because it's in the same class as S and adjacent.
H always dropped
M class 5
I dropped vowel
D class 3
T dropped, adjacent class 3
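The steps above can be sketched in standard (ANS-style) Forth roughly as follows. This is an illustration, not code from the thread: F-PC spells several of these words differently, and UPCHAR, >CLASS, SXBUF, SX+ and SOUNDEX are names made up for the example. It follows the simple rule described here, where vowels, H, W and Y all reset the "previous class"; some Soundex variants treat H and W differently, so codes can differ for a few names.

```forth
\ Sketch only: standard Forth, all word names hypothetical.
: UPCHAR ( c -- C )                    \ fold a-z to A-Z
   DUP [CHAR] a [CHAR] z 1+ WITHIN IF 32 - THEN ;

: >CLASS ( c -- digit )                \ class as a character '0'..'6'
   UPCHAR [CHAR] A -  DUP 0 26 WITHIN
   IF    S" 01230120022455012623010202" DROP + C@
   ELSE  DROP [CHAR] 0                 \ non-letters count as dropped
   THEN ;

CREATE SXBUF 4 ALLOT                   \ result buffer, e.g. S530
VARIABLE #SX                           \ characters stored so far
VARIABLE PREV                          \ class of the previous letter

: SX+ ( c -- )                         \ append c unless 4 are stored
   #SX @ 4 < IF  SXBUF #SX @ + C!  1 #SX +!  ELSE  DROP  THEN ;

: SOUNDEX ( c-addr u -- code-addr 4 )
   SXBUF 4 [CHAR] 0 FILL  0 #SX !      \ start from 0000
   DUP IF
      OVER C@ UPCHAR DUP SX+           \ keep the first letter itself
      >CLASS PREV !                    \ but remember its class
      1 /STRING  OVER + SWAP           ( end start )
      ?DO
         I C@ >CLASS                   ( new )
         DUP PREV @ <>  OVER [CHAR] 0 <> AND
         IF  DUP SX+  THEN             \ new class and not a dropped letter
         PREV !
      LOOP
   ELSE  2DROP  THEN
   SXBUF 4 ;

: DEMO ( -- )
   S" SCHMIDT" SOUNDEX TYPE SPACE      \ S530
   S" Smyth"   SOUNDEX TYPE ;          \ S530
DEMO
```

The result lives in a single shared buffer, so a caller that needs two codes at once has to copy the first one before computing the second.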
You can easily work out the other names. It's useful for names because
most last names are long enough to generate a meaningful code. Assuming
a list of 1,000,000 names SOUNDEX hashes to 5616 codes, for 180 average
collisions, which would not be difficult to resolve with a first name
and birthdate, or some other type of qualifier. You have to remember
that it was originally set up for manual searching.
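The 5616 figure matches 26 first letters times 6 classes in each of the 3 digit positions (26 x 6 x 6 x 6), which ignores zero-padded codes and is therefore only an estimate. A small check of that arithmetic, with hypothetical word names and assuming cells of at least 32 bits:

```forth
\ Sketch only: reproduces the estimate quoted above.
\ 1000000 does not fit in a 16-bit cell, so this assumes a 32-bit system.
: #CODES   ( -- n )        26 6 * 6 * 6 * ;   \ 26 letters x 6 x 6 x 6
: PER-CODE ( names -- n )  #CODES / ;         \ average per code, truncated

#CODES .              \ 5616
1000000 PER-CODE .    \ 178
```

The truncated quotient of 178 is consistent with the "180 average collisions" quoted above.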
---
~ EZ-Reader 1.13 ~
ForthNet@willett.pgh.pa.us (ForthNet articles from GEnie) (09/18/90)
Date: 09-15-90 (12:46) Number: 3786 (Echo)
To: STEVE PALINCSAR Refer#: 3768
From: KENNETH O'HESKIN Read: NO
Subj: SOUNDEX Status: PUBLIC MESSAGE
SP>research you get to use SOUNDEX when you use the indexes to the old
SP>Censuses. It's useful in consolidating the various attempts that wer
SP>made at spelling foreign names.
Agreed, SOUNDEX is a powerful and impressive tool for uses
such as this. But like Zafar, when implementing databases I
_wanted_ to find a legitimate application for it but couldn't.
In a small operation where 3 or 4 people have a variety of
tasks to do on a computer (i.e. aren't bored to death doing
mindless data entry for 8 hours every working day), and the
databases aren't likely to be larger than a few thousand records
anyway, there is not much need for this kind of tool. Errors
are less likely to occur, and are usually caught and corrected
by someone else.
---
~ EZ 1.26 ~
rueter_a@wums2.wustl.edu (09/24/90)
> Agreed, SOUNDEX is a powerful and impressive tool for uses
> such as this. But like Zafar, when implementing databases I
> _wanted_ to find a legitimate application for it but couldn't.
>
> In a small operation where 3 or 4 people have a variety of
> tasks to do on a computer (i.e. aren't bored to death doing
> mindless data entry for 8 hours every working day), and the
> databases aren't likely to be larger than a few thousand records
> anyway, there is not much need for this kind of tool. Errors
> are less likely to occur, and are usually caught and corrected
> by someone else.
I agree, but SOUNDEX combined with a person's birthdate (in
month/day/year order) is very powerful. We keep track of 5 years' worth
of patients (400,000) and have a mismatch rate of 1/50 versus 1/10 for
other groups at the medical school clinics.
Allen Rueter
Mallinckrodt Institute of Radiology
allen@cisco!wugate.wustl.edu <- I think, Cisco is decnet, wugate is a unix rtr
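One way to read the SOUNDEX-plus-birthdate idea is as a single match key built by appending the date to the code. The sketch below assumes standard Forth, a six-digit MMDDYY date string, and hypothetical names (KEY-BUF, KEY+, MATCH-KEY); the sample values are invented, and the thread does not say how the key was actually stored.

```forth
\ Sketch only: hypothetical names and sample data.
CREATE KEY-BUF 16 ALLOT                \ counted string: count + 15 chars

: KEY+ ( c-addr u -- )                 \ append a string (no overflow check)
   >R  KEY-BUF COUNT +  R@ MOVE
   R> KEY-BUF C@ +  KEY-BUF C! ;

: MATCH-KEY ( code u1 date u2 -- c-addr u )
   0 KEY-BUF C!  2SWAP KEY+  KEY+  KEY-BUF COUNT ;

: DEMO ( -- )  S" S530" S" 031452" MATCH-KEY TYPE ;   \ S530031452
DEMO
```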
ForthNet@willett.pgh.pa.us (ForthNet articles from GEnie) (10/16/90)
Date: 10-07-90 (23:29) Number: 3996 (Echo)
To: GENE LEFAVE Refer#: 3769
From: ZAFAR ESSAK Read: 10-11-90 (11:14)
Subj: SOUNDEX Status: PUBLIC MESSAGE
Gene thanks for the synopsis of the SOUNDEX algorithm and your
comments.
GL> I used to use SOUNDEX code to retrieve entries in a database.
GL> Using SOUNDEX made the program very tolerant of spelling errors.
GL> I seem to recall that certain database programs had this function
GL> built in.
This is exactly what I had in mind. i.e. Using the SOUNDEX code as the
index into the list of names, and then displaying appropriate names for
selection by the User in an alphabetic listing. Hoping that this
would be tolerant of spelling errors. However, with the experimental
run I have done and the range of names that may have the same codes I
am wondering what to do?
For example:
Code B200 includes: BAUGH, BEGGS, BOSSE, BOYCE, BUKSH
Code B520 includes: BAINS, BANJI, BINGA
Code K400 includes: KALE, KELLY, KUHL, KYLE
Now if every name in the range is displayed, little if anything is
gained. Should the selection only display the other names with
matching codes, Ignoring others with spelling that may be
alphabetically adjacent? Or ...
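The "display only the names with matching codes" option can be sketched as a filter over an index of code and name pairs. This is an illustration in standard Forth with hypothetical names (WANTED, ?SHOW, NAME-LIST, LOOKUP); the index is a hard-coded stand-in for a real database, using names from the examples above.

```forth
\ Sketch only: hypothetical names; the "index" is hard-coded sample data.
2VARIABLE WANTED                       \ code being searched for ( c-addr u )

: ?SHOW ( code u1 name u2 -- )         \ print name if its code matches
   2SWAP WANTED 2@ COMPARE 0= IF  TYPE SPACE  ELSE  2DROP  THEN ;

: NAME-LIST ( -- )                     \ alphabetical by name
   S" B520" S" BAINS" ?SHOW
   S" B200" S" BAUGH" ?SHOW
   S" K400" S" KALE"  ?SHOW
   S" K400" S" KELLY" ?SHOW
   S" K400" S" KYLE"  ?SHOW ;

: LOOKUP ( code u -- )  WANTED 2!  NAME-LIST ;

: DEMO ( -- )  S" K400" LOOKUP ;       \ KALE KELLY KYLE
DEMO
```

Because the list is walked in alphabetical order, the hits come out alphabetically while names with other codes are skipped.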
I have experience with using a straight string comparison but would
like to find something that might provide additional tolerance.
Thanks for your comments. Zafar.
---
* Via Qwikmail 2.01
NET/Mail : British Columbia Forth Board - Burnaby BC - (604)434-5886
ForthNet@willett.pgh.pa.us (ForthNet articles from GEnie) (10/16/90)
Date: 10-07-90 (23:29) Number: 3998 (Echo)
To: KENNETH O'HESKIN Refer#: 3786
From: ZAFAR ESSAK Read: NO
Subj: SOUNDEX Status: PUBLIC MESSAGE
KO'H> In a small operation where 3 or 4 people have a variety of
KO'H> tasks to do on a computer (i.e. aren't bored to death doing
KO'H> mindless data entry for 8 hours every working day), and the
KO'H> databases aren't likely to be larger than a few thousand records
KO'H> anyway, there is not much need for this kind of tool. Errors
KO'H> are less likely to occur, and are usually caught and corrected
KO'H> by someone else.
Sorry, Kenneth, I cannot agree with you. Life's not like that. It's
real easy to type a name incorrectly, maybe only once in a day if you're
real good. So, is it possible for the machine to be forgiving and lend
a hand? That is the issue I would like to address with a routine like
SOUNDEX.
---
* Via Qwikmail 2.01
NET/Mail : British Columbia Forth Board - Burnaby BC - (604)434-5886
ForthNet@willett.pgh.pa.us (ForthNet articles from GEnie) (10/21/90)
Date: 10-11-90 (12:10) Number: 4034 (Echo)
To: ZAFAR ESSAK Refer#: 3996
From: GENE LEFAVE Read: NO
Subj: SOUNDEX Status: PUBLIC MESSAGE
I think the proper approach depends on the size of your list. On our
systems we rarely have more than 3000 names. I'm using a straight
string match on an alpha list. I actually use the built-in block editor
from polyFORTH, then convert the located strings back into record
numbers. I've never had a user complaint this way, and they can usually
find names with just three characters of the name.
When I was using soundex I was always answering questions about why a
totally unrelated name would come up. And they always had to get the
first letter right. Using the string search also lets them search on
first names if it's unusual.
I would only recommend soundex if your database is very large (>20,000)
and then I would just display the hits. It's pretty unlikely that a
name with a close spelling does not hit.
Another experiment I tried involved hashing but the results were so
weird I'm still trying to think up a use for it.
---
~ EZ-Reader 1.13 ~
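Finding a name from its first few characters can be sketched as a prefix test. This is not the polyFORTH block-editor search described above, just a minimal illustration in standard Forth; PREFIX? is a hypothetical name, and the comparison is case-sensitive.

```forth
\ Sketch only: hypothetical word name.
: PREFIX? ( name u1 key u2 -- flag )   \ true if name begins with key
   2SWAP  2 PICK <                     ( key u2 name too-short? )
   IF  DROP 2DROP FALSE EXIT  THEN
   OVER COMPARE 0= ;                   \ compare key with first u2 chars

: DEMO ( -- )
   S" KELLY" S" KEL" PREFIX? .         \ -1 (true)
   S" KYLE"  S" KEL" PREFIX? . ;       \ 0  (false)
DEMO
```

Unlike a Soundex lookup, this test separates KELLY from KYLE, but it offers no tolerance for a misspelled prefix.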