[comp.graphics] Optical Character Recognition software?

camille@msn006.misemi (Camille Goudeseune) (06/08/88)

[]

We've just got a 300 DPI scanner that can read in 8.5 by 11" pages, and
store the bitmap (in Sun rasterfile format) on disk.  People here would save
quite some time if they could use it to read in text, instead of typing it
in.  Has anyone in Netland heard of, or used, programs to convert bitmaps
to text?  This wheel doesn't bear re-inventing.

Please reply to
		... !uunet!helios!spock!camille
If I get enough "me too" requests (and at least one reference :-) ), I'll
post a summary.
			Thank you,
				Camille Goudeseune.
------------------------------------------------------------------------------
 "Screw the Prime Directive.  Let's kill something." -- Jean-Luc Picard

kens@hpldola.HP.COM (Ken Shrum) (06/08/88)

The problem of converting bitmaps to text is quite difficult due to
problems with multiple fonts, misaligned characters, etc.  The
additional problem is that the error rate has to be *extremely* low in
order to have the output be useful.  (One page of source has perhaps
3.5K character positions...)
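
A rough sketch of why the rate matters, assuming about 3,500 character
positions per page (the figure above) and errors that are independent of
one another.  The function names and the sample accuracies are
illustrative only, not measurements of any product:

```python
def expected_errors_per_page(accuracy, chars_per_page=3500):
    """Expected number of misread characters on one page."""
    return (1.0 - accuracy) * chars_per_page


def clean_page_probability(accuracy, chars_per_page=3500):
    """Chance that a whole page comes out with zero errors.

    Assumes every character is recognized independently, which is a
    simplification (real errors cluster on bad fonts and smudges).
    """
    return accuracy ** chars_per_page


if __name__ == "__main__":
    # Hypothetical per-character accuracies, chosen only for illustration.
    for acc in (0.99, 0.999, 0.9999):
        print(acc,
              round(expected_errors_per_page(acc), 2),
              round(clean_page_probability(acc), 4))
```

At 99% per-character accuracy this works out to about 35 corrections on
such a page.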

There are companies which sell products that do this job - it's
probably worth contacting one of them.

	Ken Shrum
	hpldola!kens

eugene@pioneer.arpa (Eugene N. Miya) (06/09/88)

I've been looking at several OCRs.  DEST, Kurzweil and others.
I've developed a small "benchmark" [dirty word] to test these systems.
Unfortunately, I can't post it as it varies point sizes (6-24 point
sort of like an eye chart), font types (Times and Courier for instance),
character spacing (variable and constant).  This problem is tough: you have
to distinguish between 0 and O <zero and oh, or did I type oh and zero?>
and between l and 1 (ell and one).
You can easily think one up: just take a bunch of text and numbers,
special characters, and see how well the thing reads them in.  Oh,
commas and periods are also trouble.  The ell and one problem is
particularly annoying because we have older secretaries who started
using the ell as 1.  If you see some sci.space forwardings, there's
a secretary at HQ who does this.  Believe me, OCR has a long way to go.

I've not developed quantitative estimates of how much these systems will
save, but their utility is currently marginal.
It depends on your application (what you are reading in).

Another gross generalization from

--eugene miya, NASA Ames Research Center, eugene@aurora.arc.nasa.gov
  resident cynic at the Rock of Ages Home for Retired Hackers:
  "Mailers?! HA!", "If my mail does not reach you, please accept my apology."
  {uunet,hplabs,ncar,ihnp4,decwrl,allegra,tektronix}!ames!aurora!eugene
  "Send mail, avoid follow-ups.  If enough, I'll summarize."

jru@etn-rad.UUCP (John Unekis) (06/09/88)

In article <367@msn006.misemi> camille@spock.UUCP (Camille Goudeseune) writes:
>We've just got a 300 DPI scanner that can read in 8.5 by 11" pages, and
> ...
>in.  Has anyone in Netland heard of, or using, programs to convert bitmaps
>to text?  This wheel doesn't bear re-inventing.
... I haven't heard of any stand-alone software to do this, but DEST of 
  San Jose makes a 300 dpi scanner that reads the text (process is called
  Optical Character Recognition or OCR) on the fly.

eugene@pioneer.arpa (Eugene N. Miya) (06/11/88)

This is a brief summary of my experiences in the OCR field up to today
when I had yet another Demo of OCR systems.  I've been asked to look at
one more and will report on it some time.

But first: My background: I've worked off and on for about 10 years
in digital image processing for remote sensing and computer graphics.
No, no significant papers, but I understand the problems of IP, and I
know how hard OCR is.  So far I have tried the following
systems:
	DEST/DataCopy (first recommended by one of our Cray people, in fact,
		one of those friend of a friend things)
	DataCopy
	1215 Terra Bella
	Mountain View, CA 94043
	
	Kurzweil (A Xerox Corporation, best known for all kinds of
		interesting equipment)
	185 Albany St.
	Cambridge, MA 02139
and today:
	Transimage
	910 Benicia Ave
	Sunnyvale, CA 94086
	(408)-733-411

You normally can't just contact these companies; they refer you to
places where you can demo units (like Western Office Supply).
They range in price from $3,000 for H/W and S/W to $20,000.
The first two are sheet readers and the Transimage is hand held.
The first two are 300 dot per inch resolution.  The DEST/Datacopy
uses fixed font information which keeps it cheap.  The recognition
miss rate on say variable space laser printer stuff is not helpful.
It's designed for reading fixed size typewriter faces.

The Kurzweil is a system which requires training.  It takes several
passes.  It can handle variable spaced text.  I felt neither was
worth the cost; I think the Kurzweil was something like $7,000
and getting cheaper.  Neither is super fast, but it was the error
rate which turned me off.  You have to deal with false positives
as well as false negatives (i.e., is that a blot or a period?).
I already mentioned the ells and 1s, zeroes and ohs.

Context is a very difficult thing to deal with.

Oh, I should say something about CPUs: all run on IBM PC clones
(ATs or XTs), with no PS/2 or OS/2 support, and the DEST is the only one
connectable to a Mac.  Well, as of today I was most impressed with the
Transimage, but I suspect this is because it has 1,000 dots per inch
res.  Hand held is initially novel, but sucks in the long run.
(PC based, BTW).  It did my test sheet fairly well (including
special characters).

Now the reason why I posted this detailed explanation.

So I can read text in with some rate of errors; yes, it's something of a
pain, but it's only $3K.  The question is how to quantify the recognition
rate.  If I have to change (or edit) 1 character for every 100, is this
too high (about one per line)?  Well, the DEST/Kur* were 4-7 errors
per line on my sample text, much too high.  Why, my typo rate is
one per line, and this is about what the TI1000 is.  Now a page is about
60 lines.  Should the document just be handed to the secretary
("here, type this"), or should we go OCR?  I have to figure out the error
rate per 1000 or so characters, and in the case of the TI1000 factor in the
time to scan as well.  Comments on error rate measurements?
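
One possible way to put a number on it, sketched here under assumptions:
keep a hand-checked reference copy of the test sheet, and count the
minimum number of single-character insertions, deletions, and
substitutions needed to turn the OCR output into it (the Levenshtein
edit distance).  This is a swapped-in, standard technique, not something
any of the vendors above is known to report; the function names are
made up for the sketch:

```python
def edit_distance(reference, ocr_output):
    """Levenshtein distance between two strings (row-by-row DP)."""
    prev = list(range(len(ocr_output) + 1))
    for i, ref_ch in enumerate(reference, start=1):
        curr = [i]
        for j, out_ch in enumerate(ocr_output, start=1):
            cost = 0 if ref_ch == out_ch else 1
            curr.append(min(prev[j] + 1,          # character dropped by OCR
                            curr[j - 1] + 1,      # spurious character added
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def errors_per_1000(reference, ocr_output):
    """Corrections needed per 1000 characters of the reference text."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return 1000.0 * edit_distance(reference, ocr_output) / len(reference)


if __name__ == "__main__":
    # Toy strings only; a real test would use a whole scanned page.
    print(errors_per_1000("ell 1 and zero 0", "e11 1 and zero O"))
```

On this scale, one wrong character in every 100 is 10 errors per 1000.
Scan time would still have to be accounted for separately.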

Note: only a bureaucratic Agency or company really needs one of these
things.  I think WE can blow away $3K of taxpayers' money to try it.
Like 90 page documents, etc.

Next Pallentair(sp).  $30K Tak-eye!  [sorry can't spell Japanese
phonetically well].

Another gross generalization from

--eugene miya, NASA Ames Research Center, eugene@aurora.arc.nasa.gov
  resident cynic at the Rock of Ages Home for Retired Hackers:
  "Mailers?! HA!", "If my mail does not reach you, please accept my apology."
  {uunet,hplabs,ncar,decwrl,allegra,tektronix}!ames!aurora!eugene
  "Send mail, avoid follow-ups.  If enough, I'll summarize."

jbn@glacier.STANFORD.EDU (John B. Nagle) (06/11/88)

      Datacopy and Dest are not the same company; they are competitors.
Datacopy's OCR technology was developed internally at Datacopy, and
rather recently, years after Dest had working OCR for typewritten material.
Xerox is proposing a takeover of Datacopy at $6 per share.  Xerox also owns
Kurzweil.

					John Nagle

truett@cup.portal.com (06/11/88)

Everybody is doing a good job of not answering the question that was originally
asked.  The question was:  Given the image of an 8.5 x 11" page in Sun
rasterfile format (presumably on a Sun), is there software that will perform
OCR on that file, producing a text file as output?

Many respondents mention the Dest products.  I am aware of the PC oriented
products from Dest, but do not know if they sell a version that interfaces to
a Sun.  Even at that, it is easy to choke a Dest scanner, though they do fine
for clean typewritten copy.

Also, at 300 dpi, you have a useful limit of about 6 pt for type size (even
though the human eye can read clean 4 pt with a 300 dpi image).  Basically, the
OCR algorithm runs into a two-dimensional Nyquist problem.
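
For a sense of scale, a small sketch of how many pixels a line of type
gets at a given resolution.  It assumes 1 point = 1/72 inch (the
desktop-publishing point; the traditional printer's point is slightly
smaller) and uses the nominal point size, which is the full body height
and so overstates the height of individual lowercase letters.  It only
shows how little raster there is to work with; it does not model the
sampling limit itself.  The function name is invented for the example:

```python
def type_height_pixels(point_size, dpi=300):
    """Nominal body height of type, in pixels, at the given resolution.

    Assumes 72 points per inch.
    """
    return point_size * dpi / 72.0


if __name__ == "__main__":
    for pts in (4, 6, 8, 10, 12):
        print(pts, "pt ->", round(type_height_pixels(pts), 1), "pixels")
```

Under those assumptions 6 pt type is 25 pixels tall at 300 dpi, and
individual strokes are only a few pixels wide.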

If the file can be converted into something like a TIFF file on an MS-DOS
machine, then there are several standalone software packages that can do the
job requested, though with varying degrees of accuracy and speed.  The companies
selling such software include a company in Boca Raton, FL (I keep thinking of
the name Solution Technology but I may be confusing that with something else),
OCR Systems of Bensalem, PA, and another company called TNTI in Fremont, CA
(hard to find).  If there is interest, I can get the exact data out of my files
at work on Monday and post them.  The packages range in price from about $400
to $750.

An interesting solution if you have lots of OCR to do is the OCR processor from
Palantir which can be connected to a network.  At $15K it is expensive but very
powerful, on a par with the Kurzweil machines.

Truett Lee Smith, Sunnyvale, CA
UUCP:  truett@cup.portal.com

hjp@bambam.UUCP (Howard J. Postley) (06/14/88)

We at On Word have been doing OCR in-house and building systems for
various people for about five years.  I think we have as near to
at least one of everything as anyone (hardware and software).  The
only software system that we have found (which happens to run on
Suns) is the Kurzweil OCR Toolbox, which is the software that they
use in their PC-Board product (Discover 7320).  This is good for
about 30-60 CPS recognition and is reasonably flexible; however it
has drawbacks.  The biggest drawback for us is the methods for
error correction (no OCR is even close to perfect), or lack thereof.
The biggest problem for most everyone else is the price: About $50K.
The price is designed for an OEM to buy it, integrate it with something
else, then sell the whole package - license for resale is an additional
$1K/copy.

I know that someone is going to remind us that there are PC and Mac software
products that do OCR of a bit-map, for $500-$1000.  These are only useful
if the source is *always* Courier 10.  I don't care what the ads say.

The solution that we usually go with is connecting a Palantir CDP or
Recognition Server (Rec. Server is CDP without scanner) to the ethernet
that the Sun is on and use it for OCR.  This runs about $12K on the low
end and $30K on the high end but these things really work and take maximum
advantage of the Sun's graphics for error correction.  There are two models:  3000
and 9000.  3000 is 50-100 CPS, 9000 is 200-300 CPS.

Both Palantir and Kurzweil are omni-font (will read just about everything
you throw at them) but the Palantir is somewhat more so.  There are currently
no other machines that compete with these two.  The Palantir will work
with PC as well as other things.  It has RS-232, SCSI, and optional Ethernet
(TCP/IP) interfaces.  Software is currently available only for PC (and clones)
and Sun 2 & 3's.  Kurzweil is PC only (hardware is PC board).

We have piles of info on this stuff so if anyone wants anything specific,
just ask - it's hard to cover it all, there are a lot of really subtle
differences.

For those of you who are interested, the Palantir 9000 has about 15 MIPS
of compute power to get to 300 CPS, the Kurzweil has about 3 MIPS to do
60 CPS.  Keep this in mind when you think about a software only solution,
these machines are optimized for OCR.
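
Taking those figures at face value, a back-of-the-envelope sketch of the
compute budget per character.  The page size of 3,500 characters is an
assumption borrowed from an earlier post in this thread, and the
function names are invented for the example:

```python
def instructions_per_char(mips, cps):
    """Instructions available per recognized character."""
    return mips * 1_000_000 / cps


def seconds_per_page(cps, chars_per_page=3500):
    """Time to recognize one page at a given characters-per-second rate."""
    return chars_per_page / cps


if __name__ == "__main__":
    # MIPS and CPS figures as quoted above.
    for name, mips, cps in (("Palantir 9000", 15, 300), ("Kurzweil", 3, 60)):
        print(name,
              instructions_per_char(mips, cps),
              round(seconds_per_page(cps), 1))
```

Both machines come out at about 50,000 instructions per character, so a
software-only package on a slower general-purpose CPU would need either
a much cheaper algorithm or a lot more time per page.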


// Howard

elt@entire.UUCP (Edward L. Taychert) (06/14/88)

In article <10160@ames.arc.nasa.gov>, eugene@pioneer.arpa (Eugene N. Miya) writes:
> This is a brief summary of my experiences in the OCR field up to today
> when I had yet another Demo of OCR systems.  I've been asked to look at
> one more and will report on it some time.
> 
(stuff gone)
>
> The Kurzweil is a system wich requires training.  It takes several

My Kurzweil Discover does not require training. Indeed, this is pushed
as their advantage over other systems. No training is required. You
put the document in the scanner and it converts it.

> .....
> too high (able one per line?  Well the DEST/Kur* were 4-7 errors
> per line on my sample text, much too high.  Why, my typo rate is

It all depends.  When I run typewriter Courier through the Kurzweil,
I typically get 100% recognition accuracy. (That's ZERO errors, page
after page.) If I run 5x7 dot matrix through it, I typically get 100%
gobbledygook. Small print (under 8 points) increases the error rate too.
My Kurzweil Summary:
typeset pages - perfect or near perfect (nothing's really perfect...)
laser writer print (300dpi) - two or three errors per page
many-pin (24) dot matrix - untested
5x9 dot matrix - useless
> ....
> Another gross generalization from
> 
Yes, I think this might be the case for me too... :-)


-- 

____________________________________________________________________________

Ed Taychert				Phone: USA (716) 381-7500
Entire Inc.				UUCP: rochester!rocksanne!entire!elt
445 E. Commercial Street
East Rochester, N.Y. 14445 
_____________________________________________________________________________