[comp.fonts] OCR for Non-Latin texts - SUMMARY

ruslan@uncecs.edu (Robin C. LaPasha) (01/11/90)

Replies to my request for scanners and software that can handle Cyrillic text:

[My comments -as here- will be in brackets and with my initials;
replies are separated by dashed lines - RL]
----------------------------------
Date: Thu, 04 Jan 90 16:27:41
From: Malcolm Brown <mbb@jessica.Stanford.EDU>

Robin,
here at Stanford, we've had some luck scanning Cyrillic using
a Kurzweil Model 4000.  Kurzweil doesn't make this anymore; it
was the last of their stand-alone models.  Anything less than
a trainable Kurzweil system would have a chance, I'd think.
[_would_ or _would NOT_? - RL]

Incidentally, we even got the Kurzweil to scan a Russian/English
dictionary.  Here we had to contend not only with different
alphabets but different fonts as well.  It was slow going, but
it seemed a bit more efficient than keyboarding.

Hence I'd recommend you check out the new, top-of-the-line
Kurzweil systems (the 5100, I believe).

good luck
Malcolm Brown
Stanford University
---------------------------------
Date: 4 Jan 90 16:42:58 PST (Thu)
From: gatech!mailrus!ames!elroy!gryphon.COM!richard@mcnc.org (Richard Sexton)

Hi Robin;

I think you need to talk to Flagstaff Engineering in Flagstaff
Arizona. They claim to be able to scan this stuff. I don't
have an address, but they advertise regularly in Byte.

How the hell are ya anyway ?
[Fine, Richard.  How're the Drosophilae? ;^) - RL]
-----------------------------------------
Date:         Fri, 05 Jan 90 06:39:20 EST
From: "Donald Parsons  dfp10@uacsc2.albany.edu"
 <DFP10@ALBNYVM1.UUCP>

Contact Dimitri Vulis at dlv@cunyvms1 - he saw a fairly successful
demonstration of a particular OCR program working with a standard
flat bed scanner.  He is owner of the RUSTEX-l@ubvm list which
concentrates on TeX Cyrillic fonts (and PC fonts and telecom to
USSR).  
--------------------------------
Date:     Fri, 5 Jan 90 15:28 EDT
From: <DLV%CUNYVMS1.BITNET@CUNYVM.CUNY.EDU>

A few months ago I spent a few hours playing with a flat bed HP scanner
connected to an MS-DOS computer running Spot OCR software from Flagstaff
Engineering (602-779-3341). The software was very very good at reading
Russian text produced on an HP LJ using TeX and WNCYR fonts (upright and
italic) with some English words mixed in. (I came across it unexpectedly
at a trade show and I just happened to have a business letter that I was
about to mail to Moscow; had I been prepared, I would have tried other
kinds of text). It's trainable and it will remember the letters the next
time you use it (it remembers several typefaces). The output is saved in
a file and a little programming would be needed to convert the file into
whatever scheme you use to store Cyrillic (e.g. the software is not smart
enough to understand that 'A' is sometimes Latin A and sometimes Russian A).
The only 2 problems that I recall were 1) it sometimes (not most of the
time) choked on multi-part characters (i kratkoe, y), 2) it sometimes was
confused by kerning (which WNCYR has more of than most typefaces).
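
[A rough sketch of the "little programming" mentioned above, not taken
from the original mail. It assumes the OCR output has already been read
in as text in which look-alike letters (A, E, O, ...) always come out as
Latin, while letters with no Latin twin come out as Cyrillic. The rule
used here, "a word containing a Cyrillic letter and no unmistakably Latin
letter is a Russian word", is an illustrative heuristic, and the names
LATIN_TO_CYR, fix_word and fix_text are made up for this example.]

```python
import re

# Latin letters that have a look-alike in the Cyrillic alphabet,
# mapped to the Cyrillic code point (written as escapes because the
# glyphs are visually indistinguishable).
LATIN_TO_CYR = {
    "A": "\u0410", "B": "\u0412", "E": "\u0415", "K": "\u041a",
    "M": "\u041c", "H": "\u041d", "O": "\u041e", "P": "\u0420",
    "C": "\u0421", "T": "\u0422", "X": "\u0425",
    "a": "\u0430", "e": "\u0435", "o": "\u043e", "p": "\u0440",
    "c": "\u0441", "y": "\u0443", "x": "\u0445",
}
TABLE = str.maketrans(LATIN_TO_CYR)

# A "word" is a run of letters in any script (no digits, no underscore).
WORD = re.compile(r"[^\W\d_]+")


def is_cyrillic(ch):
    """True if ch lies in the basic Cyrillic block."""
    return "\u0400" <= ch <= "\u04ff"


def fix_word(word):
    """Convert Latin look-alikes to Cyrillic inside a Russian word.

    A word is treated as Russian only if it contains at least one
    Cyrillic letter and no Latin letter without a Cyrillic twin.
    Words made up purely of look-alikes (e.g. "TOM") are ambiguous
    and are left untouched.
    """
    has_cyrillic = any(is_cyrillic(c) for c in word)
    has_plain_latin = any(c.isascii() and c not in LATIN_TO_CYR for c in word)
    if has_cyrillic and not has_plain_latin:
        return word.translate(TABLE)
    return word


def fix_text(text):
    """Apply fix_word to every word, leaving spaces and punctuation alone."""
    return WORD.sub(lambda m: fix_word(m.group()), text)


if __name__ == "__main__":
    # Latin "M" and "a" around Cyrillic letters, plus an English word.
    sample = "M\u043e\u0441\u043a\u0432a and TeX"
    print(fix_text(sample))
```

[Ambiguous words would still need a dictionary lookup or a human
decision; the sketch deliberately leaves them as they are.]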

Unfortunately, I was asked $850 for the software (and the list price is
even more), which is a bit more than I can afford. I tried to convince my
school to pay for it, but we're really short of funds (this is not an
election year and the financial situation is even worse than it was last
year, although it's not publicized as much).

They said for OCR applications you need a flat bed scanner; hand scanners
just aren't adequate.

If it does become likely that we'll be able to afford OCR software (we
already have a scanner, but no OCR software for it), I'll also test CAT
software and a few others that I've heard are good before choosing one.

I'd appreciate hearing about your experiences and/or replies to your
question.

[This next piece is from later mail by the same correspondent - RL]

If Duke folks can afford it, then a HP flatbed scanner + SPOT software look
like a very good choice; but advise them not to get a hand scanner or one
that pulls a sheet thru, like a fax machine. CAT software also sounds very
very good from the reviews, but I've never tried it.

Dimitri Vulis
Department of Mathematics
City University of New York Graduate Center
------------------------------------------
Date: Sat, 6 Jan 90 15:34:21 gmt
From: frisk@rhi.hi.is (Fridrik Skulason)
Organization: University of Iceland (RHI)

I have seen a Hungarian product called 'Recognita' that should be able to
handle this. It runs on an AT-class machine with a 300 (or more) dpi scanner.

I can find the address of the manufacturer if you want.
-frisk
----------------------------------------
Date: Sat, 6 Jan 90 20:38:00 -0600
From: David Boyes <dboyes@brazos.rice.edu>

I use AccuText on the Mac and OCR Plus on the PC (both done by
Xerox) to do classical Greek and Anglo-Saxon text. AccuText has a
Cyrillic configuration, but I've never used it. We scan the
stuff, create MS Word files on both machines and then feed them
into our structured archive system. We get about 90% accuracy on
the Greek texts (thank god -- they're a hassle to proof) and
92-95% accuracy on the A-S texts.
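
[To put those percentages in proofreading terms, here is a small
worked calculation. It is an illustration only: it assumes the figures
are per-character accuracy and uses a hypothetical page of 2000
characters, neither of which is stated in the mail.]

```python
def expected_errors(chars_per_page, accuracy):
    """Expected number of misread characters on one page.

    accuracy is the fraction of characters recognized correctly,
    e.g. 0.90 for 90%.
    """
    return round(chars_per_page * (1.0 - accuracy))


if __name__ == "__main__":
    page = 2000  # hypothetical page size in characters
    for label, accuracy in [("Greek, 90%", 0.90),
                            ("A-S, 92%", 0.92),
                            ("A-S, 95%", 0.95)]:
        print(label, expected_errors(page, accuracy))
```

[Under those assumptions even the best case leaves on the order of a
hundred corrections per page, which is why proofing remains a hassle.]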

These packages are by the same programmers that do the Kurzweil
firmware. It's nice stuff.

[I import the texts into] MS Word on the PC, [use] the Adobe downloadable
fonts for Greek, and a PostScript printer. Works great for me.
-----------------------------------
From @sunic.sunet.se:ath@prosys Sun Jan  7 04:14:22 1990
From: ath@prosys.UUCP (Anders Thulin)
Organization: Programsystem AB, Teknikringen 2A, S-583 30 Linkoping, Sweden

The Kurzweil OCR system seems to have a rather good reputation in
general.  Its main drawback is the price.

Calera Recognition Systems (sorry - have no address) have produced
something of a Kurzweil killer - at least as good as K. but cheaper.

In the PC world there are at least two packages that may be usable
here: Xerox Accutext and Recognita.

Accutext is claimed to be based on some kind of AI techniques. It is
trainable, although I cannot swear that it can deal with non-Latin
character sets. Some people have had trouble scanning program listings
with it - ordinary text seems OK.

Recognita is the only non-US OCR software I know of for the PC. It is
from a Hungarian company (Szyk?) and seems to be very good at accents
- reports indicate that it is able to handle both Swedish and
Icelandic better than most other comparable systems. I wouldn't at all
be surprised to learn that it is good at Russian.

I don't know what methods Accutext uses for character recognition -
Recognita, however, uses some form of contour-based method. It should
be possible to make it learn Russian.

There is one other European OCR package, but it is for the Mac:
Textpert.  It is trainable, but some reviews seem to indicate that it
is somewhat complicated to train. In one review (in Personal Computer
World - a British magazine) Textpert was matched against Omnipage.
Omnipage 'won'. Again I don't know what method of recognition it uses.

I'm sorry that I don't have any hands-on experience from these - I am
hoping my dealer can give me a demo of Recognita soon, as I am trying
to find software that would permit me to scan Swedish texts.

Hope this is of some help,
-- 
Anders Thulin		ath@prosys.se   {uunet,mcsun}!sunic!prosys!ath
Telelogic Programsystem AB, Teknikringen 2B, S-583 30 Linkoping, Sweden
-----------------------------------------
Date: Mon, 8 Jan 90 13:56:05 EST
From: doering@Kodak.COM (Paul Doering)

Hi, Robin. I saw your inquiry in comp.periphs. Here is a bit of
info paraphrased from a recent Electronic Engineering Times --

Hungarian state-owned company SzKI offers its Recognita Plus
software for PC's: OCR on any Roman- or Greek-based alphabet 
without preliminary training. Handles material from typesetters,
typewriters; daisywheel, NLQ dot-matrix, or laser printers. 
Uses a feature-analysis recognition algorithm for all popular fonts.
Training procedure is available if needed.

I wish I could tell you how to contact SzKI. If I were interested
enough, I'd call the Hungarian embassy in Washington. SzKI claims
to be the world's leader in OCR. I'd feel better about that claim
if I had ever heard of them before. Still, here's a lead, and you
can handle it as you wish....

Good luck
Paul Doering; Kodak Research, Rochester NY.  doering@kodak.com
-----------------------------------
Date: Sun, 7 Jan 90 10:29:10 EST
From: djb@wjh12.harvard.edu (David J. Birnbaum)
[Note - I'm combining several email messages from this correspondent - RL]

Robin,
I am preparing an article for the Bits & Bytes Review on
OCR for non-Latin texts.  As a Slavist who works with a
variety of modern and medieval materials (he alliterated),
I am particularly interested in Cyrillic.
++++++++++++++++++++++++++++
[Quoting previous email]
>I have gotten a lot more response than I would have expected, so
>I will be supplying a summary.
I'll look forward to it.  I have considerable practical experience
with Datacopy and SPOT and I'll be looking into Textpert
(soon to be released for MS-DOS) and Kurzweil for my article in
the Bits & Bytes Review.  If there is anything else in your summary
that looks promising, I'll try to include it as well.
+++++++++++++++++++++++
        Still patiently looking forward to news of OCR for Cyrillic.
I got a call yesterday from someone in Kansas (whose name I
promptly misplaced), who said that he's about to post a summary
of similar information to Humanist. [Excerpt from an earlier message:
"the Humanist Listserv run (until recently) by Willard McCarty at
UToronto.  If you don't currently subscribe to Humanist, you might
find it interesting; their mission is to provide a forum for computing
in the humanities." - RL]  He has access to USENET and I told him to
join comp.fonts and watch for your posting.
Cheers,
David
-----------------------------

That's all for now!

Thanks for all the input.  I'm amazed at what's out there; we have
both hardware and software choices that just didn't exist 5 years ago
(the last time I asked anyone about it.)  If I get more details about
these products or info about products not mentioned, I'll post later
with a followup. - RL

Summary - 

	Hardware: if you're not going with a Kurzweil (turnkey?) system,
hook a flatbed scanner up to a Mac or PC.

	Software:
		PC:
			SPOT (Flagstaff Engineering)
			CAT
			Recognita [Plus] (SzKI)
			Datacopy
			OCR Plus (Xerox)
			(something by) (Calera Recognition Systems)
			[in development: Textpert]
			Kurzweil program...?
		Mac:
			Accutext (Xerox)
			Textpert
			OmniPage
-- 
=-=-=-=-=-=-=-
Robin LaPasha               |Deep-Six your
ruslan@ecsvax.uncecs.edu    |files with VI! ;^) ;^) ;^)