rmorey@orion.cf.uci.edu (Robert Morey) (05/13/89)
I am looking for information on the PF474 chip and/or information on how to produce variants of a word. For instance, if I have a word, how can I produce a series of other words which are semantically, phonetically, or structurally similar to the first, with probabilities? The work involved is in trying to find variations of XV and XVI century words across various European languages.

Any help whatsoever in this mystery would be very much appreciated!

Robert J. Morey
bill@twwells.uucp (T. William Wells) (05/19/89)
Followups have been directed to comp.misc.

In article <1914@orion.cf.uci.edu> rmorey@orion.cf.uci.edu (Robert Morey) writes:
: I am looking for information on the PF474 chip

Hi, you've found the right person. The chip was created by and is sold by Proximity Technology Inc., the company I work for. I'd have responded sooner but my feed has been in its usual yo-yo mode and I'm now about five days behind.

The PF474 is a string comparator: it compares two strings and returns a real number indicating the degree of similarity; zero for no features in common and one for identity. Its normal operating mode is to have one string in an internal RAM while other strings are passed into it via DMA. It maintains a list of the 16 closest matches to the string in the RAM; you query it for those matches when you've passed all the data in. There is a board for the IBM-PC that has the PF474 on it.

The PF474, and the algorithm it implements, are patented by Proximity.

You can do all kinds of matching with this chip: we've used it for speech and image recognition experiments, Chinese character recognition experiments, and classifying heartbeats. The heart of our business is spelling checking and correction; the algorithm in the PF474 is the base of our spelling corrector.

However, we have discovered that, for many applications, the PF474 is not necessary. E.g., we sell a product, Friendly Finder (unless they've changed the name again) that searches DBase files for near matches and which uses the algorithm that is implemented in the PF474; we discovered that, by judicious pruning of the data going into the algorithm, the software is as fast as the hardware.

There have been a number of articles on the PF474 and the algorithm it implements: they have all described a version of the algorithm that is quadratic or worse in the length of the query strings. The PF474 algorithm is linear; this makes a big difference in software versions of the algorithm.
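The operating mode described above (one query string held fixed, candidates streamed past it, the 16 best kept) can be sketched in software roughly as follows. This is only an illustration: the names `similarity` and `closest_matches` are made up for the example, and the score comes from Python's `difflib.SequenceMatcher`, which is a stand-in and not the patented PF474 algorithm. The stand-in also makes no attempt at the linear running time mentioned above.

```python
import heapq
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Stand-in score in [0, 1]: 0.0 means nothing in common, 1.0 means identical.

    Assumption: difflib's ratio is used purely for illustration; it is
    NOT the PF474 algorithm, which is not described in this thread.
    """
    return SequenceMatcher(None, a, b).ratio()


def closest_matches(query, candidates, keep=16):
    """Stream candidates past a fixed query and keep the `keep` best scores."""
    best = []  # min-heap of (score, -arrival_order, candidate)
    for order, cand in enumerate(candidates):
        # -order makes the earlier candidate win when two scores tie.
        item = (similarity(query, cand), -order, cand)
        if len(best) < keep:
            heapq.heappush(best, item)
        elif item > best[0]:
            # Evict the current worst match in favour of the better one.
            heapq.heapreplace(best, item)
    # Report the survivors best-first, as (candidate, score) pairs.
    return [(cand, score) for score, _, cand in sorted(best, reverse=True)]
```

Calling `closest_matches("color", word_list)` scans `word_list` once and returns at most 16 `(word, score)` pairs, best first. The bounded heap plays the role of the chip's fixed-size match list, so memory stays constant however large the database is.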
If your purpose is academic, we can probably get you code that simulates the algorithm and I'd be willing to help you get some use of it. If your purpose is commercial, we will certainly sell you either the chip or the code, but you should take it up with our sales people. They can be reached at 305-566-3511.

: and/or information on
: how to produce variants of a word. For instance, if I have a word, how
: can I produce a series of other words which are semantically, phonetically,
: or structurally similar to the first, with probabilities? The work
: involved is in trying to find variations of XV and XVI century words
: across various European languages.

Which way you do this depends on how many words you want to find variants of.

If you are doing this a word at a time, you can pass your database through the PF or the emulation, one pass per word. You don't pass the raw database through, however; you encode the data in some way and then pass the encoded data. I can't get too much into that, as it strays into proprietary data, but here are some hints:

For phonetic similarity, you can group letters together, and pass the code for the group into the PF. You can, of course, do more complex encodings. Since this is the meat of our spelling correction, I'll not say more other than you should take it up with Proximity to see what they'll let you look at.

For other kinds of similarity, you can construct strings that contain properly selected features of the input placed in an appropriate order.

If you want to take every word in the database and match it against every other word, you have to do a much more complex algorithm unless you have access to a Cray or better. :-)

If you want more information, feel free to send me e-mail.

---
Bill
{ uunet | novavax } !twwells!bill

Newsgroups: comp.misc,comp.sys.ibm.pc,misc.wanted,ca.wanted
Subject: Re: PF474 chips and variant lists
References: <1914@orion.cf.uci.edu>
Reply-To: bill@twwells.UUCP (T. William Wells)
Followup-To: comp.misc
Organization: None, Ft. Lauderdale
Keywords: PF474 chips
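The letter-grouping hint for phonetic similarity can be illustrated with a minimal sketch. The groups below are the classic Soundex consonant classes, chosen here only as an assumption for the example; Proximity's actual encodings are proprietary and are not given in this thread. The names `GROUPS`, `CODE`, and `encode` are likewise hypothetical.

```python
# Assumption: Soundex-style consonant classes, used only for illustration.
GROUPS = {
    "1": "bfpv",
    "2": "cgjkqsxz",
    "3": "dt",
    "4": "l",
    "5": "mn",
    "6": "r",
}

# Invert the table: letter -> group code.
CODE = {letter: code for code, letters in GROUPS.items() for letter in letters}


def encode(word: str) -> str:
    """Replace every letter by its group code.

    Letters outside the groups (vowels, h, w, y, accented letters)
    become "0"; non-letters are dropped. Unlike full Soundex, nothing
    is collapsed or truncated, so letter positions are preserved.
    """
    return "".join(CODE.get(ch, "0") for ch in word.lower() if ch.isalpha())
```

With these groups, `encode("Smith")` and `encode("Smyth")` both give `"25030"`, so the two spellings become identical before any comparison is made. The encoded strings, rather than the raw words, would then be what is passed to the comparator, which is the "encode, then pass the encoded data" step described above.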