[comp.text.desktop] Multi-font OCR scanners

chuq@plaid.UUCP (05/15/87)

Date: Thu, 14 May 87 16:39:00 PDT
From: dick@ccb.ucsf.edu (Dick Karpinski)

I believe that some but not all of the scanners discussed in the last couple
of weeks on desktop have software which tries to generate ASCII text from the
scanned image of the input page.  I believe that I would be most delighted
with one that constructed the PostScript which would generate "approximately"
the page that was scanned, but with the text in ASCII, not bitmap.  

Much too much to ask for, right?

I did take a page of dot-matrix print in to one outfit selling a $2k 300 dpi
scanner with OCR software for the IBM-PC line, but there were several problems:
   1) The spacing on my sample input overwhelmed the fixed spacing OCR
      software available then.
   2) Neither an interface nor software was available for the Macintosh.
   3) Operation seemed awkward and not at all intuitive.
   4) Software costs took the package price up to around $3k.
   5) The best recognition rates seemed to be only 95-98% correct.
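
[To put item 5 in perspective, a character accuracy rate can be turned into
an expected number of errors per page.  This is only a rough sketch: the
2,000 characters per page is an assumed figure, not one from the post, and
the function name is made up for illustration.]

```python
def expected_errors_per_page(accuracy, chars_per_page=2000):
    """Expected number of misread characters on one page.

    accuracy is the fraction of characters read correctly (0.0 to 1.0).
    chars_per_page is an assumed page size, not a measured one.
    """
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be between 0 and 1")
    return (1.0 - accuracy) * chars_per_page


# The two ends of the 95-98% range quoted for the $2k scanner.
for rate in (0.95, 0.98):
    errors = expected_errors_per_page(rate)
    print(f"{rate:.0%} correct -> about {errors:.0f} errors per page")
```

[Under that assumption, 95-98% correct works out to roughly 40 to 100 wrong
characters on every page, each of which has to be found and fixed by hand.]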

I am told that a $36k Kurzweil multi-font scanner will do just about
everything I want.  (Not sure about a Macintosh interface.)  But I
will never be able to afford that.  Should I wait a few years, or is
one of these current products really capable of reading most of the
submissions to my newsletter so that I can convert them all to some
pleasant consistent font?  That would make my newsletter look more
like a magazine and less like a piece of patchwork.

Dick

Dick Karpinski  Manager of Unix Services, UCSF Computer Center
UUCP:  ...!ucbvax!ucsfcgl!cca.ucsf!dick        (415) 476-4529 (11-7)
BITNET:  dick@ucsfcca or dick@ucsfvm           Compuserve: 70215,1277  
USPS:  U-76 UCSF, San Francisco, CA 94143-0704   Telemail: RKarpinski   
Domain: dick@cca.ucsf.edu  Home (415) 658-6803  Ans 658-3797

----------------------------------------
Submissions to:   desktop%plaid@sun.com -OR- sun!plaid!desktop
Administrivia to: desktop-request%plaid@sun.com -OR- sun!plaid!desktop-request
Paths:  {ihnp4,decwrl,hplabs,seismo,ucbvax}!sun
Chuq Von Rospach	chuq@sun.COM		[I don't read flames]

There is no statute of limitations on stupidity

chuq%plaid@Sun.COM (Chuq Von Rospach) (05/18/87)

From: rbl@nitrex.UUCP ( Dr. Robin Lake )
Date: 18 May 87 13:02:49 GMT
Distribution: comp
Organization: The Standard Oil Co., Cleveland

>I am told that a $36k Kurzweil multi-font scanner will do just about
>everything I want.  (Not sure about a Macintosh interface.)  But I
>will never be able to afford that.  Should I wait a few years, or is
>one of these current products really capable of reading most of the
>submissions to my newsletter so that I can convert them all to some
>pleasant consistent font?  That would make my newsletter look more
>like a magazine and less like a piece of patchwork.

We have a 4 year old Kurzweil.  It has not been able to handle dot
matrix well, but we have not tried to fiddle with the threshold settings,
etc. to make it do so.  We are looking at a new system made by Palantir
this Thursday.  We'll give dot matrix a try and let you know.

Rob Lake


chuq%plaid@Sun.COM (Chuq Von Rospach) (05/19/87)

From: hoptoad!gnu@ucbvax.Berkeley.EDU (John Gilmore)
Date: 18 May 87 07:57:42 GMT
Organization: Nebula Consultants in San Francisco

From: dick@ccb.ucsf.edu (Dick Karpinski)
>                                   I believe that I would be most delighted
> with one that constructed the PostScript which would generate "approximately"
> the page that was scanned, but with the text in ASCII, not bitmap.  

Good luck.  The scanner manufacturers have tried to jump on the
coattails of the PostScript bandwagon by inventing a scanner language
and calling it PreScript but from what I've seen it is no relation at
all and wasn't worth mentioning except for the cute name.  So far the
scanner industry seems to be taking a severe split -- the guys who give
you bits and, incidentally, here's a floppy for your FeeCees that might
turn that into ascii, sort of; and the folks who are really working on
read-anything scanners.  Don't expect anything in the way of real
character recognition from the cheapies.

> I did take a page of dot-matrix print in to one outfit selling a $2k 300 dpi
> scanner with OCR software for the IBM-PC line, but there were several problems
>    1) The spacing on my sample input overwhelmed the fixed spacing OCR
>       software available then.
>    2) No interface nor software was available for the Macintosh.
>    3) Operation seemed awkward and not at all intuitive.
>    4) Software costs took the package price up to around $3k.
>    5) The best recognition rates seemed to be only 95-98% correct.
> 
> I am told that a $36k Kurzweil multi-font scanner will do just about
> everything I want.

Since you're in San Francisco you can easily find out.  Go downtown to
the Krishna Copy Center and buy an hour or two's time on the Kurzweil.
They can give you the data on mac disks, IBM disks, or by modem.

I tried to scan in the draft ANSI C Standard a few months ago on that
machine, and while I am not an experienced operator, it had too many
troubles to be useful.  It made 10-20 mistakes per page even on the best
pages (multi-font typeset text, probably offset printed), and in
many cases it would totally garble a line for no reason, while reading
the preceding and following lines without trouble.  As it is, it's faster
to just type the page yourself (or hire somebody who types 90-100 wpm
to do it) than to try to find and fix all the mistakes the scanner makes.
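
[The comparison above can be made concrete as a break-even calculation: how
many seconds may be spent on each scanner mistake before retyping the page
wins?  A rough sketch only.  The 500 words per page is an assumption, and
the function names are hypothetical; the typing speeds and error counts are
the ones quoted in the post.]

```python
def typing_minutes(words_per_page, wpm):
    """Minutes needed to type one page from scratch."""
    return words_per_page / wpm


def break_even_seconds_per_error(words_per_page, wpm, errors_per_page):
    """Time budget per scanner error before retyping becomes faster.

    The budget has to cover proofreading to find the error as well as
    the time spent fixing it.
    """
    if errors_per_page <= 0:
        raise ValueError("errors_per_page must be positive")
    return typing_minutes(words_per_page, wpm) * 60.0 / errors_per_page


WORDS_PER_PAGE = 500  # assumption: a dense typeset page

for wpm in (90, 100):
    for errors in (10, 20):
        budget = break_even_seconds_per_error(WORDS_PER_PAGE, wpm, errors)
        print(f"{wpm} wpm, {errors} errors/page: {budget:.0f} s per error")
```

[With these assumed numbers the scanner only comes out ahead if each mistake
can be located and corrected in roughly 15 to 33 seconds, proofreading
included.]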
-- 
Copyright 1987 John Gilmore; you may redistribute only if your recipients may.
(This is an effort to bend Stargate to work with Usenet, not against it.)
{sun,ptsfa,lll-crg,ihnp4,ucbvax}!hoptoad!gnu	       gnu@ingres.berkeley.edu


chuq%plaid@Sun.COM (Chuq Von Rospach) (05/19/87)

Date: Tue, 19 May 87 09:53:50 CDT
From: James Peterson <peterson@MCC.COM>

> From: hoptoad!gnu@ucbvax.Berkeley.EDU (John Gilmore)
> I tried to scan in the draft ANSI C Standard a few months ago on that
> machine, and while I am not an experienced operator, it had too many
> troubles to be useful. 

We have a Palantir that I have been using for several months to
see how well it works.  It is a 300 dpi scanner with built-in
ASCII conversion.  It comes complete with software to run on a
SUN, but I found it easier to write my own programs to interpret
the scanner output than to use theirs (personal taste -- their programs
run under SunWindows, and I don't).

Overall, I've found that they do a pretty good job on most input, but
the input that I want to scan is close to the margin of acceptable
input -- tables tend to be too small or on poor-contrast paper or ...
I can scan, for example, the Zip code directory for Austin in about
an hour, but it then takes me two weeks of evening work to format
it and correct the scanning errors.

So far I have only scanned material with built-in redundancy
that I can check by program.  For example, with the Zip codes, every
scanned zip code should fall in a small range of legal values, and the
street names should all be alphabetic and in sorted order.  And so on.
This allows me to catch a lot of scanner errors without having to
read and compare every entry.  It also tends to expose errors in the
printed input -- no multi-page reference table that I have scanned
has been without printed errors.
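
[The kind of redundancy check described above might look something like the
sketch below.  It is not the poster's program.  The record layout (street
name plus zip code), the 78701-78799 range, the street-name pattern, and the
sample entries are all assumptions made up for illustration.]

```python
import re

# Assumption: legal Austin zip codes lie in 78701-78799; adjust as needed.
ZIP_MIN, ZIP_MAX = 78701, 78799
ZIP_RE = re.compile(r"[0-9]{5}")
# Letters, spaces and a little punctuation only; digits signal a misread.
STREET_RE = re.compile(r"[A-Za-z][A-Za-z .'-]*")


def check_entries(entries):
    """Return a list of (index, message) for suspicious scanned entries.

    entries is a list of (street_name, zip_text) pairs in scanned order.
    """
    problems = []
    previous = None
    for i, (street, zip_text) in enumerate(entries):
        if not (ZIP_RE.fullmatch(zip_text)
                and ZIP_MIN <= int(zip_text) <= ZIP_MAX):
            problems.append((i, f"zip out of range: {zip_text!r}"))
        if not STREET_RE.fullmatch(street):
            problems.append((i, f"non-alphabetic street name: {street!r}"))
        if previous is not None and street.lower() < previous.lower():
            problems.append((i, f"out of order after {previous!r}"))
        previous = street
    return problems


# Made-up sample entries with typical misreads planted in them.
sample = [
    ("Anderson Lane", "78757"),
    ("Barton Springs Road", "7B704"),   # B read in place of 8
    ("Ca5well Avenue", "78705"),        # 5 read in place of s
    ("Burnet Road", "78756"),           # out of alphabetical order
]
for index, message in check_entries(sample):
    print(index, message)
```

[Only the flagged entries then need to be compared against the printed page.
A flagged entry is not always a scanner error; it may be a mistake in the
printed original.]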


chuq%plaid@Sun.COM (Chuq Von Rospach) (05/29/87)

From: seismo!sun!cwruecmp!rbl%nitrex.uucp@RUTGERS.EDU ( Dr. Robin Lake )
Date: 26 May 87 13:04:38 GMT
Organization: The Standard Oil Co., Cleveland

>From: rbl@nitrex.UUCP ( Dr. Robin Lake )
>Date: 18 May 87 13:02:49 GMT

>We have a 4 year old Kurzweil.  It has not been able to handle dot
>matrix well, but we have not tried to fiddle with the threshold settings,
>etc. to make it do so.  We are looking at a new system made by Palantir
>this Thursday.  We'll give dot matrix a try and let you know.

We did look at the Palantir, but did not test dot matrix, as one of our
"clients" had an 87 page copy of a typewritten document they wanted scanned.
Palantir ran about 30 seconds per page.  With no tune-up ("showroom stock")
it picked up every mark on the page, missed some blended letters, 
read 0 as o and completely satisfied the "client".  I plan to run the same
87 pages thru the Kurzweil, with and without tuning.  It may take 2 - 3 weeks,
so stay tuned!

THIS IS NOT AN ENDORSEMENT OR CRITICISM OF ANY PRODUCT!!  "One Test is Worth
a Thousand Expert Opinions"  The Riehle Axiom.

Rob Lake
