djb@wjh12.harvard.edu (David J. Birnbaum) (11/01/88)
After a careful survey of desktop (<$2000, including software) optical scanners capable of Optical Character Recognition (OCR) and capable of being trained to read new typefaces, styles, and alphabets, I purchased a Datacopy 730 flatbed scanner. This comes with Datacopy OCR software, which enables the scanner to read about 17 pretrained typefaces. It also comes with Datacopy "OCRplus" software. The latter is a training program: the user can train the system to recognize new typefaces and can then convert this information into a new typestyle that can be loaded into the regular OCR program. Datacopy specifically states that this is useful for foreign languages.

What Datacopy doesn't tell you is that their OCR program is fundamentally incapable of dealing with new alphabets. Specifically, there is context checking built into OCR that enables it to distinguish letters that look alike in English. Thus, the pattern recognition may not be able to tell '1' from 'l' or '5' from 'S' in all cases, but the context checking corrects many potential errors. This is an excellent feature for reading English text. Unfortunately, it cannot be disabled.

I trained OCRplus to read a Cyrillic typeface in which the characters I had mapped to the 'l' and '1' keys looked nothing like each other, and nothing like English 'l' and '1'. I then converted the information into a typestyle, loaded it into OCR, and was dismayed at the rampant confusion of 'l' and '1'.

Worse still, you can't avoid the problem by working around the offending letters. There is context checking for 'h' and 'b' built in, and I designed my typestyle with nothing mapped to 'h'. The letter simply wasn't part of the alphabet, so whatever the pattern recognition thought it was seeing, it could not have been 'h'. Yet the resulting text is studded with 'h', all misreadings for 'b'.
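
To make the failure mode concrete, here is a minimal sketch of how an English-oriented context checker might behave. This is an illustration only, not Datacopy's actual algorithm, which is not documented here. The function names, the neighbour-counting rule, and the key-to-Cyrillic mapping are all hypothetical.

```python
# Hypothetical sketch of English-tuned context checking.
# NOT Datacopy's real algorithm; the rule and the names are assumptions.

# Look-alike pairs named in the post: (digit form, letter form).
CONFUSABLE = {"1": ("1", "l"), "l": ("1", "l"), "5": ("5", "S"), "S": ("5", "S")}

# Assumed user mapping from typestyle keys to Cyrillic letters (illustrative).
KEY_TO_CYRILLIC = {"a": "а", "b": "б", "1": "ж", "l": "ф"}


def context_check(text):
    """Resolve digit/letter look-alikes by looking at the adjacent characters."""
    out = []
    for i, ch in enumerate(text):
        if ch not in CONFUSABLE:
            out.append(ch)
            continue
        digit, letter = CONFUSABLE[ch]
        # The character before and the character after, where they exist.
        neighbours = text[max(i - 1, 0):i] + text[i + 1:i + 2]
        letters_near = sum(n.isalpha() for n in neighbours)
        digits_near = sum(n.isdigit() for n in neighbours)
        if digits_near > letters_near:
            out.append(digit)
        elif letters_near > digits_near:
            out.append(letter)
        else:
            out.append(ch)  # no evidence either way: keep the recognizer's guess
    return "".join(out)


def decode(keys):
    """Translate typestyle keys back to the alphabet the user trained."""
    return "".join(KEY_TO_CYRILLIC.get(k, k) for k in keys)


if __name__ == "__main__":
    # Helpful on English text:
    print(context_check("he1lo"))  # hello
    print(context_check("19l8"))   # 1918
    # Harmful on a remapped alphabet: the recognizer was right, the checker is not.
    recognized = "a1b"
    print(decode(recognized))                 # ажб
    print(decode(context_check(recognized)))  # афб
```

In the last case the recognizer reported the correct key, but because '1' sits between letter keys the checker rewrites it to 'l', which decodes to a different Cyrillic letter. That is the kind of error described above: the correction is applied to the key codes, with no knowledge of what the keys stand for.
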
Paradoxically, the trainability of OCRplus is excellent, and when I use it in an 'autotraining mode', in which it guesses at the letters and lets me intervene to correct it when necessary, it has a completely acceptable recognition rate. Unfortunately, it is impractical to read long documents through the training mode of OCRplus, since it pauses on all the visual 'noise' on the page and you have to sit over the keyboard telling it to skip such material. If the information that OCRplus extracts could be converted to a typestyle for OCR that avoids the context checking, the system would be ideal.

So ... is there anyone doing OCR with non-Latin alphabet languages who has found a solution to this problem? Please don't tell me to try a Kurzweil; I believe it's a better system, but the base price for the least expensive sheet-feed Kurzweil is $10,000, putting it out of the reach of many institutions, let alone small desktop users.

Most scanner/OCR companies have abandoned the foreign language user (and therefore much of the academic community): their products come pretrained for a variety of standard typefaces, with no possibility for expansion. This suits banks, insurance companies, and other customers with standard business reports and correspondence. Are there any foreign language users out there who have come up with reasonable solutions to OCR problems?

Please email replies; I will post a summary.

David J. Birnbaum
djb@wjh12.harvard.edu [Internet]
djb@wjh12.uucp [UUCP]
djb@harvunxw.bitnet [Bitnet]

Bitnet warning: please address Bitnet mail only to djb@harvunxw.bitnet. No other Bitnet address is reliable. Please do not trust your mail program to supply an address.