[comp.periphs] Multilingual OCR

djb@wjh12.harvard.edu (David J. Birnbaum) (11/01/88)
After a careful survey of desktop (<$2000, including software) optical
scanners capable of Optical Character Recognition (OCR) and capable of
being trained to read new typefaces, styles, and alphabets, I purchased a
Datacopy 730 flatbed scanner.  This comes with Datacopy OCR software, which
enables the scanner to read about 17 pretrained typefaces.  Additionally,
it comes with Datacopy "OCRplus" software.  The latter is a training
program; the user can train the system to recognize new typefaces and can
then convert this information into a new typestyle that can be loaded into
the regular OCR program.  Datacopy specifically states that this is useful
for foreign languages.

What Datacopy doesn't tell you is that their OCR program is fundamentally
incapable of dealing with new alphabets.  Specifically, there is context
checking built into OCR that enables it to distinguish letters that look
alike in English.  Thus, the pattern recognition may not be able to tell
'1' from 'l' or '5' from 'S' in all cases, but the context checking
corrects many potential errors.  This is an excellent feature for reading
English text.
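
To make the idea concrete, here is a minimal sketch of how neighbor-based
context checking of this kind might work.  It is only a guess at the
general technique, not Datacopy's actual algorithm; the function name
`context_check` and the `CONFUSABLE` table are invented for the example.

```python
# Hypothetical sketch of neighbor-based context checking; not Datacopy's code.
# Each ambiguous shape maps to (reading as a letter, reading as a digit).
CONFUSABLE = {"1": ("l", "1"), "l": ("l", "1"), "5": ("S", "5"), "S": ("S", "5")}


def context_check(raw: str) -> str:
    """Resolve look-alike characters by looking at their immediate neighbors."""
    out = []
    for i, ch in enumerate(raw):
        if ch not in CONFUSABLE:
            out.append(ch)
            continue
        letter, digit = CONFUSABLE[ch]
        # Characters directly before and after the ambiguous one.
        neighbors = raw[max(i - 1, 0):i] + raw[i + 1:i + 2]
        # Neighbors that are themselves ambiguous give no evidence.
        votes = [n for n in neighbors if n not in CONFUSABLE]
        if any(n.isalpha() for n in votes):
            out.append(letter)   # sits next to a letter: read it as a letter
        elif any(n.isdigit() for n in votes):
            out.append(digit)    # sits next to a digit: read it as a digit
        else:
            out.append(ch)       # no evidence either way: keep the raw guess
    return "".join(out)
```

With these assumed rules, `context_check("he1lo")` gives "hello" and
`context_check("19l8")` gives "1918".  The rules encode assumptions about
English text, which is why they help there.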

Unfortunately, it cannot be disabled.  After training OCRplus to read a
Cyrillic typeface, where I had mapped characters that looked nothing like
each other or like English 'l' and '1' to the 'l' and '1' keys, I converted
the information into a typestyle, loaded it into OCR, and was dismayed at
the rampant confusion of 'l' and '1'.  Worse still, you can't avoid the
problem by working around the offending letters.  There is also context
checking built in for 'h' and 'b', so I designed my typestyle with nothing
mapped to 'h'.  The letter simply wasn't part of the alphabet, so no matter
what the system thought it was seeing, it couldn't think it was seeing 'h'.
Yet the resulting text is studded with 'h', all misreadings of 'b'.
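
One way this could happen is if the context checking runs as a separate
pass after pattern matching and rewrites the key codes without consulting
the typestyle.  The sketch below illustrates that hypothesis only; I do not
know how Datacopy OCR is actually structured, and the 'b'-to-'h' rule, the
function names, and the `context_checking` switch are all invented.

```python
# Hypothetical sketch: the rule below is invented, not taken from Datacopy OCR.
def english_context_pass(keys: str) -> str:
    """Post-pass that 'corrects' b to h after t, s, or c (made-up English rule)."""
    out = []
    for i, ch in enumerate(keys):
        prev = keys[i - 1] if i > 0 else ""
        out.append("h" if ch == "b" and prev in ("t", "s", "c") else ch)
    return "".join(out)


def read_page(keys: str, alphabet: set, context_checking: bool = True) -> str:
    """keys: key codes chosen by the pattern matcher, all drawn from the typestyle."""
    if not set(keys) <= alphabet:
        raise ValueError("matcher emitted a key outside the typestyle")
    # The post-pass never looks at `alphabet`, so it can emit keys outside it.
    return english_context_pass(keys) if context_checking else keys


# A typestyle whose glyphs are mapped to Latin keys, with nothing mapped to 'h'.
alphabet = set("abcdefgijklmnopqrstuvwxyz1 ")
matched = "tbat sbip"

with_checking = read_page(matched, alphabet)
without_checking = read_page(matched, alphabet, context_checking=False)
```

Here `with_checking` comes out as "that ship", which contains 'h' even
though no glyph was mapped to it, while `without_checking` is the matcher's
output unchanged.  A switch like the assumed `context_checking` flag is
exactly what the real program lacks.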

Paradoxically, the trainability of OCRplus is excellent, and when I use it
in an 'autotraining mode', in which it guesses at the letters and lets me
intervene to correct it when necessary, it has a completely acceptable
recognition rate.  Unfortunately, it is impractical to read long documents
through the training mode of OCRplus, since it pauses on all the visual
'noise' on the page and you have to sit over the keyboard telling it to
skip such material.  If the information that OCRplus extracts could be
converted to a typestyle for OCR that avoids the context checking, the
system would be ideal.

So ... is there anyone doing OCR with non-Latin alphabet languages
who has found a solution to this problem?  Please don't tell me to try a
Kurzweil; I believe it's a better system, but the base price for the least
expensive sheet-fed Kurzweil is $10,000, putting it out of the reach of
many institutions, let alone small desktop users.  Most scanner/OCR
companies have abandoned the foreign-language user (and therefore much of
the academic community); their products come pretrained for a variety of
standard typefaces with no possibility of expansion.  This suits banks and
insurance companies and other customers with standard business reports and
correspondence.  Are there any foreign-language users out there who have
come up with reasonable solutions to OCR problems?

Please email replies; I will post a summary.

David J. Birnbaum
djb@wjh12.harvard.edu [Internet]       djb@wjh12.uucp [UUCP]
djb@harvunxw.bitnet [Bitnet]

Bitnet warning: please address Bitnet mail only to djb@harvunxw.bitnet.
No other bitnet address is reliable.  Please do not trust your mail
program to supply an address.