[comp.misc] PF474 chips and variant lists

rmorey@orion.cf.uci.edu (Robert Morey) (05/13/89)

     I am looking for information on the PF474 chip and/or information on
   how to produce variants of a word.  For instance, if I have a word, how
   can I produce a series of other words which are semantically, phonetically,
   or structurally similar to the first, with probabilities?  The work
   involved is in trying to find variations of XV and XVI century words
   across various European languages.

     Any help whatsoever in this mystery would be very much appreciated!

                                           Robert J. Morey

bill@twwells.uucp (T. William Wells) (05/19/89)

Followups have been directed to comp.misc.

In article <1914@orion.cf.uci.edu> rmorey@orion.cf.uci.edu (Robert Morey) writes:
:      I am looking for information on the PF474 chip

Hi, you've found the right person. The chip was created by and is sold
by Proximity Technology Inc., the company I work for. I'd have
responded sooner but my feed has been in its usual yo-yo mode and I'm
now about five days behind.

The PF474 is a string comparator: it compares two strings and returns
a real number indicating the degree of similarity; zero for no
features in common and one for identity. Its normal operating mode is
to have one string in an internal RAM while other strings are passed
into it via DMA. It maintains a list of the 16 closest matches to the
string in the RAM; you query it for those matches when you've passed
all the data in. There is a board for the IBM-PC that has the PF474
on it. The PF474, and the algorithm it implements, are patented by
Proximity.
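
A minimal sketch of that operating mode in software: one query string is
held fixed, candidates are streamed past it, and only the 16 best scores
are kept. The scoring function here is Python's `difflib.SequenceMatcher`,
used purely as a stand-in that also returns a number between zero and one;
it is not the PF474 algorithm, and the names `similarity` and
`closest_matches` are hypothetical.

```python
import heapq
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Stand-in score in [0, 1]: 0.0 for nothing in common, 1.0 for identity.

    Assumption: difflib's ratio is used for illustration only; it is NOT
    the patented PF474 comparison.
    """
    return SequenceMatcher(None, a, b).ratio()


def closest_matches(query, candidates, keep=16):
    """Stream candidates past a fixed query and keep the `keep` best scores."""
    scored = ((similarity(query, c), c) for c in candidates)
    # nlargest holds at most `keep` items at a time, like the chip's match list.
    return heapq.nlargest(keep, scored)


if __name__ == "__main__":
    words = ["color", "collar", "dolor", "xyz"]
    for score, word in closest_matches("colour", words, keep=2):
        print(f"{score:.3f}  {word}")
```

The result list is queried only after every candidate has been passed in,
which mirrors the "query it for those matches when you've passed all the
data in" step described above.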

You can do all kinds of matching with this chip: we've used it for
speech and image recognition experiments, Chinese character
recognition experiments, and classifying heartbeats. The heart of our
business is spelling checking and correction; the algorithm in the
PF474 is the base of our spelling corrector.

However, we have discovered that, for many applications, the PF474 is
not necessary. E.g., we sell a product, Friendly Finder (unless
they've changed the name again), that searches dBASE files for near
matches and uses the algorithm that is implemented in the
PF474; we discovered that, by judicious pruning of the data going
into the algorithm, the software is as fast as the hardware.
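
One way to picture this kind of pruning, as a sketch under assumptions: run
cheap tests first and call the expensive comparison only on candidates that
survive them. The example uses `difflib.SequenceMatcher` as a stand-in
comparator; its `real_quick_ratio()` and `quick_ratio()` methods are
documented upper bounds on `ratio()`, so a candidate that fails either one
cannot reach the threshold. This is not Proximity's pruning method, and the
name `near_matches` and the 0.8 threshold are hypothetical.

```python
from difflib import SequenceMatcher


def near_matches(query, candidates, threshold=0.8):
    """Return (score, candidate) pairs scoring at least `threshold`.

    Cheap upper bounds discard most candidates before the full comparison.
    """
    matcher = SequenceMatcher()
    matcher.set_seq2(query)  # difflib caches its analysis of the second sequence
    hits = []
    for candidate in candidates:
        matcher.set_seq1(candidate)
        # Bound based on lengths only: very cheap.
        if matcher.real_quick_ratio() < threshold:
            continue
        # Bound based on shared character counts: still cheap.
        if matcher.quick_ratio() < threshold:
            continue
        score = matcher.ratio()  # full, expensive comparison
        if score >= threshold:
            hits.append((score, candidate))
    return sorted(hits, reverse=True)


if __name__ == "__main__":
    for score, word in near_matches("colour", ["color", "collar", "xyz", "colours"]):
        print(f"{score:.3f}  {word}")
```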

There have been a number of articles on the PF474 and the algorithm it
implements: they have all described a version of the algorithm that
is quadratic or worse in the length of the query strings. The PF474
algorithm is linear; this makes a big difference in software versions
of the algorithm.

If your purpose is academic, we can probably get you code that
simulates the algorithm and I'd be willing to help you get some use
of it. If your purpose is commercial, we will certainly sell you
either the chip or the code, but you should take it up with our sales
people. They can be reached at 305-566-3511.

:                                                     and/or information on
:    how to produce variants of a word.  For instance, if I have a word, how
:    can I produce a series of other words which are semantically, phonetically,
:    or structurally similar to the first, with probabilities?  The work
:    involved is in trying to find variations of XV and XVI century words
:    across various European languages.

Which way you do this depends on how many words you want to find
variants of. If you are doing this a word at a time, you can pass
your database through the PF or the emulation, one pass per word. You
don't pass the raw database through, however; you encode the data in
some way and then pass the encoded data. I can't get too much into
that, as it strays into proprietary data, but here are some hints:

   For phonetic similarity, you can group letters together, and pass
   the code for the group into the PF. You can, of course, do more
   complex encodings. Since this is the meat of our spelling
   correction, I'll say no more, other than that you should take it
   up with Proximity to see what they'll let you look at.

   For other kinds of similarity, you can construct strings that
   contain properly selected features of the input placed in an
   appropriate order.
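
The letter-grouping hint can be sketched as follows. The groups are an
assumption, loosely modelled on the well-known Soundex consonant classes
with all vowel-like letters folded into one class; they are not Proximity's
encoding, and `GROUPS` and `encode` are hypothetical names. The encoded
strings, not the raw words, are what would be passed to the comparator.

```python
# Assumed letter groups (Soundex-like); NOT the proprietary encoding.
GROUPS = {
    "0": "aeiouyhw",   # vowels and weak consonants share one code
    "1": "bfpv",
    "2": "cgjkqsxz",
    "3": "dt",
    "4": "l",
    "5": "mn",
    "6": "r",
}

# Invert the table: letter -> group code.
CODE = {letter: code for code, letters in GROUPS.items() for letter in letters}


def encode(word: str) -> str:
    """Replace each letter with its group code; leave other characters as-is."""
    return "".join(CODE.get(ch, ch) for ch in word.lower())


if __name__ == "__main__":
    for word in ("time", "tyme", "Smith", "Smyth"):
        print(word, encode(word))
```

With these groups, spelling variants that differ only within a group, such
as "time" and "tyme", encode to the same string, so any string comparator
will treat them as identical.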

If you want to take every word in the database and match it against
every other word, you have to use a much more complex algorithm unless
you have access to a Cray or better. :-)
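
For scale: comparing every word against every other word takes n(n-1)/2
comparisons, roughly five billion for a 100,000-word list. One common way
to cut this down, offered only as an illustration and not as the algorithm
alluded to above, is to bucket words by a coarse key and compare only
within buckets. The name `candidate_pairs` and the first-letter key are
hypothetical; a real key would have to be chosen so that true variants
rarely land in different buckets.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(words, key):
    """Yield pairs of words that share the same bucket key.

    Pairs whose keys differ are never compared, which is where the
    savings (and the risk of missed matches) come from.
    """
    buckets = defaultdict(list)
    for word in words:
        buckets[key(word)].append(word)
    for group in buckets.values():
        yield from combinations(group, 2)


if __name__ == "__main__":
    words = ["time", "tyme", "cat", "kat"]
    # Assumed key: first letter. Note it misses "cat"/"kat".
    for a, b in candidate_pairs(words, key=lambda w: w[0]):
        print(a, b)
```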

If you want more information, feel free to send me e-mail.

---
Bill                            { uunet | novavax } !twwells!bill