camille@msn006.misemi (Camille Goudeseune) (06/08/88)
We've just got a 300 DPI scanner that can read in 8.5 by 11" pages and store the bitmap (in Sun rasterfile format) on disk. People here would save quite some time if they could use it to read in text instead of typing it in. Has anyone in Netland heard of, or is anyone using, programs to convert bitmaps to text? This wheel doesn't bear re-inventing.

Please reply to ... !uunet!helios!spock!camille

If I get enough "me too" requests (and at least one reference :-) ), I'll post a summary.

Thank you,
Camille Goudeseune.
------------------------------------------------------------------------------
"Screw the Prime Directive. Let's kill something." -- Jean-Luc Picard
kens@hpldola.HP.COM (Ken Shrum) (06/08/88)
The problem of converting bitmaps to text is quite difficult due to problems with multiple fonts, misaligned characters, etc. The additional problem is that the error rate has to be *extremely* low in order to have the output be useful. (One page of source has perhaps 3.5K character positions...) There are companies which sell products that do this job - it's probably worth contacting one of them. Ken Shrum hpldola!kens
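The point about error rates can be made concrete with a little arithmetic. The sketch below takes the 3.5K character positions per page from the post and assumes, purely for illustration, that every position is misread independently with the same probability; the function names are hypothetical.

```python
# Rough error budget for one page of OCR output.
# Assumption (not from the post): each character position is misread
# independently, with the same probability everywhere on the page.

CHARS_PER_PAGE = 3500  # "perhaps 3.5K character positions" per page


def expected_errors(accuracy, chars=CHARS_PER_PAGE):
    """Expected number of wrong characters on a page."""
    return (1.0 - accuracy) * chars


def clean_page_probability(accuracy, chars=CHARS_PER_PAGE):
    """Probability that a whole page comes out with no errors at all."""
    return accuracy ** chars


if __name__ == "__main__":
    for accuracy in (0.99, 0.999, 0.9999):
        print(f"{accuracy:.2%} per-character accuracy: "
              f"{expected_errors(accuracy):5.1f} expected errors/page, "
              f"P(error-free page) = {clean_page_probability(accuracy):.3f}")
```

Under these assumptions, even a recognizer that gets 99 characters in 100 right leaves dozens of corrections on every page, which is why the per-character rate has to be so low before the output is useful.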
eugene@pioneer.arpa (Eugene N. Miya) (06/09/88)
I've been looking at several OCRs: DEST, Kurzweil and others. I've developed a small "benchmark" [dirty word] to test these systems. Unfortunately, I can't post it, as it varies point sizes (6-24 point, sort of like an eye chart), font types (Times and Courier, for instance), and character spacing (variable and constant).

This problem is tough: you have to distinguish between 0 and O <Oh and zero, or did I type zero and oh?> and between l and 1 (ell and one). You can easily make up a benchmark of your own: just take a bunch of text, numbers, and special characters, and see how well the thing reads them in. Oh, commas and periods are also trouble. The ell and one problem is particularly annoying because we have older secretaries who started out using the ell as 1. If you see some sci.space forwardings, there's a secretary at HQ who does this.

Believe me, OCR has a long way to go. I've not developed quantitative estimates of how much these systems will save, but their utility is currently marginal. It depends on your application (what you are reading in).

Another gross generalization from

--eugene miya, NASA Ames Research Center, eugene@aurora.arc.nasa.gov
  resident cynic at the Rock of Ages Home for Retired Hackers:
  "Mailers?! HA!", "If my mail does not reach you, please accept my apology."
  {uunet,hplabs,ncar,ihnp4,decwrl,allegra,tektronix}!ames!aurora!eugene
  "Send mail, avoid follow-ups. If enough, I'll summarize."
jru@etn-rad.UUCP (John Unekis) (06/09/88)
In article <367@msn006.misemi> camille@spock.UUCP (Camille Goudeseune) writes:
>We've just got a 300 DPI scanner that can read in 8.5 by 11" pages, and
> ...
>in. Has anyone in Netland heard of, or using, programs to convert bitmaps
>to text? This wheel doesn't bear re-inventing.
...

I haven't heard of any stand-alone software to do this, but DEST of San Jose makes a 300 dpi scanner that reads the text (process is called Optical Character Recognition or OCR) on the fly.
eugene@pioneer.arpa (Eugene N. Miya) (06/11/88)
This is a brief summary of my experiences in the OCR field up to today, when I had yet another Demo of OCR systems. I've been asked to look at one more and will report on it some time. But first:

My background: I've worked off and on for about 10 years in digital image processing for remote sensing and computer graphics. No, no significant paper, but I understand the problems of IP, and I know how hard OCR is.

So far I have tried the following systems:

DEST/DataCopy (first recommended by one of our Cray people, in fact, one of those friend of a friend things)
	DataCopy
	1215 Terra Bella
	Mountain View, CA 94043

Kurzweil (A Xerox Corporation, best known for all kinds of interesting equipment)
	185 Albany St.
	Cambridge, MA 02139

and today:

Transimage
	910 Benicia Ave
	Sunnyvale, CA 94086
	(408)-733-411

You normally can't just contact these companies; they refer you to places where you can demo units (like Western Office Supply). They range in price from $3,000 for H/W and S/W to $20,000.

The first two are sheet readers and the TransImage is hand held. The first two are 300 dot per inch resolution. The DEST/Datacopy uses fixed font information, which keeps it cheap. Its recognition miss rate on, say, variable space laser printer stuff is not helpful. It's designed for reading fixed size typewriter faces. The Kurzweil is a system which requires training. It takes several passes. It can handle variable spaced text. Both, I felt, were not worth the cost; I think the Kurzweil was something like $7,000 and getting cheaper.

Neither is super fast, but it was the error rate which turned me off. You have to deal with false positives as well as false negatives (i.e., is that a blot or a period?). I already mentioned the ells and 1s, zeroes and ohs. Context is a very difficult thing to deal with.

Oh, I should say something about CPUs: all run using IBM PC clones (ATs or XTs), no PS2/OS2 support, and the DEST is the only one that is Mac connectable.
Well, as of today I was most impressed with the TransImage, but I suspect this is because it has 1,000 dots per inch res. Hand held is initially novel, but sucks in the long run. (PC based, BTW.) It did my test sheet fairly well (including special characters).

Now the reason why I posted this detailed explanation. So I can read text in with some rate of errors; yes, it is something of a pain, but it's only $3K. The question is how to quantify the recognition rate. If I have to change (or edit) 1 character for every 100, is this too high (about one per line)? Well, the DEST/Kur* were 4-7 errors per line on my sample text, much too high. Why, my typo rate is one per line, and this is about what the TI1000 is. Now a page is about 60 lines. Should the document just be handed to the secretary ("here, type this")? Or should we go OCR? I have to figure out the error rate per 1000 or so characters, and in the case of the TI1000 factor in time to scan as well. Comments on error rate measurements?

Note: only a bureaucratic Agency or company really needs one of these things. I think WE can blow away $3K of taxpayers' money to try it. Like 90 page documents, etc.

Next: Pallentair(sp). $30K. Tak-eye! [sorry, can't spell Japanese phonetically well].

Another gross generalization from

--eugene miya, NASA Ames Research Center, eugene@aurora.arc.nasa.gov
  resident cynic at the Rock of Ages Home for Retired Hackers:
  "Mailers?! HA!", "If my mail does not reach you, please accept my apology."
  {uunet,hplabs,ncar,decwrl,allegra,tektronix}!ames!aurora!eugene
  "Send mail, avoid follow-ups. If enough, I'll summarize."
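One common way to answer the "errors per 1000 characters" question is to type the test sheet in once by hand as a reference and count the minimum number of single-character edits needed to turn it into what the OCR produced. The sketch below uses Levenshtein edit distance for that; it is only an illustration, not a method any of the vendors above is known to use, and the function names and sample strings are made up.

```python
# Character error rate from a hand-typed reference and the OCR output.
# The metric (Levenshtein edit distance) is a swapped-in, generic choice,
# not something taken from the post.


def edit_distance(reference, recognized):
    """Minimum number of insertions, deletions and substitutions
    needed to turn `reference` into `recognized`."""
    previous = list(range(len(recognized) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, rec_char in enumerate(recognized, start=1):
            cost = 0 if ref_char == rec_char else 1
            current.append(min(previous[j] + 1,          # reference char dropped
                               current[j - 1] + 1,       # extra char inserted
                               previous[j - 1] + cost))  # match or substitution
        previous = current
    return previous[-1]


def errors_per_thousand(reference, recognized):
    """Edit distance scaled to errors per 1000 reference characters."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return 1000.0 * edit_distance(reference, recognized) / len(reference)


if __name__ == "__main__":
    # Made-up sample with the classic ell/one, oh/zero and comma/period slips.
    reference = "Total: 1000 items, all 10 lots."
    recognized = "Total: lOOO items. all 1O lots"
    print(edit_distance(reference, recognized), "edits,",
          round(errors_per_thousand(reference, recognized), 1),
          "errors per 1000 characters")
```

The same figure computed for a typist's copy of the page gives a like-for-like comparison, after which scanning and correction time can be weighed separately.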
jbn@glacier.STANFORD.EDU (John B. Nagle) (06/11/88)
Datacopy and Dest are not the same company; they are competitors. Datacopy's OCR technology was developed internally at Datacopy, and rather recently, years after Dest had working OCR for typewritten material. Xerox is proposing a takeover of Datacopy at $6 per share. Xerox also owns Kurzweil.

John Nagle
truett@cup.portal.com (06/11/88)
Everybody is doing a good job of not answering the question that was originally asked. The question was: given the image of an 8.5 x 11" page in Sun rasterfile format (presumably on a Sun), is there software that will perform OCR on that file, producing a text file as output?

Many respondents mention the Dest products. I am aware of the PC oriented products from Dest, but do not know if they sell a version that interfaces to a Sun. Even at that, it is easy to choke a Dest scanner, though they do fine for clean typewritten copy. Also, at 300 dpi, you have a useful limit of about 6 pt for type size (even though the human eye can read clean 4 pt with a 300 dpi image). Basically, the OCR algorithm runs into a two-dimensional Nyquist problem.

If the file can be converted into something like a TIFF file on an MS-DOS machine, then there are several standalone software packages that can do the job requested, though with varying degrees of accuracy and speed. The companies selling such software include a company in Boca Raton, FL (I keep thinking of the name Solution Technology but I may be confusing that with something else), OCR Systems of Bensalem, PA, and another company called TNTI in Fremont, CA (hard to find). If there is interest, I can get the exact data out of my files at work on Monday and post them. The packages range in price from about $400 to $750.

An interesting solution if you have lots of OCR to do is the OCR processor from Palantir, which can be connected to a network. At $15K it is expensive but very powerful, on a par with the Kurzweil machines.

Truett Lee Smith, Sunnyvale, CA
UUCP: truett@cup.portal.com
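The type-size limit at 300 dpi comes down to how few samples land on a small glyph. A rough sketch, assuming 72 points per inch and a hypothetical thin-stroke width of 8% of the type size (both figures are illustrative assumptions, not from the post):

```python
# How many scanner samples land on a glyph of a given point size.
# Assumptions (not from the post): 72 points per inch, and a thin stroke
# that is about 8% of the type size. Real faces vary widely.

POINTS_PER_INCH = 72.0


def em_height_pixels(point_size, dpi=300):
    """Height of the type body in scanner pixels."""
    return point_size / POINTS_PER_INCH * dpi


def samples_across_stroke(point_size, dpi=300, stroke_fraction=0.08):
    """Approximate number of samples across one thin stroke."""
    return em_height_pixels(point_size, dpi) * stroke_fraction


if __name__ == "__main__":
    for point_size in (4, 6, 10, 12):
        print(f"{point_size:2d} pt at 300 dpi: "
              f"{em_height_pixels(point_size):5.1f} px tall, "
              f"about {samples_across_stroke(point_size):.1f} samples per thin stroke")
```

The sampling argument is that a feature needs roughly two samples across it to be captured reliably; under the stroke-width assumption above, type much smaller than 6 pt falls below that at 300 dpi.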
hjp@bambam.UUCP (Howard J. Postley) (06/14/88)
We at On Word have been doing OCR in-house and building systems for various people for about five years. I think we have as near to at least one of everything as anyone (hardware and software).

The only software system that we have found (which happens to run on Suns) is the Kurzweil OCR Toolbox, which is the software that they use in their PC-Board product (Discover 7320). This is good for about 30-60 CPS recognition and is reasonably flexible; however, it has drawbacks. The biggest drawback for us is the methods for error correction (no OCR is even close to perfect), or lack thereof. The biggest problem for most everyone else is the price: about $50K. The price is designed for an OEM to buy it, integrate it with something else, then sell the whole package - license for resale is an additional $1K/copy.

I know that someone is going to remind us that there are PC and Mac software products that do OCR of a bit-map, for $500-$1000. These are only useful if the source is *always* Courier 10. I don't care what the ads say.

The solution that we usually go with is connecting a Palantir CDP or Recognition Server (Rec. Server is CDP without scanner) to the ethernet that the Sun is on and using it for OCR. This runs about $12K on the low end and $30K on the high end, but these things really work and take maximum advantage of the Sun's graphics for error correction. There are two models: 3000 and 9000. The 3000 is 50-100 CPS, the 9000 is 200-300 CPS.

Both Palantir and Kurzweil are omni-font (will read just about everything you throw at them), but the Palantir is somewhat more so. There are currently no other machines that compete with these two.

The Palantir will work with a PC as well as other things. It has RS-232, SCSI, and optional Ethernet (TCP/IP) interfaces. Software is currently available only for PC (and clones) and Sun 2 & 3's. Kurzweil is PC only (hardware is PC board).
We have piles of info on this stuff so if anyone wants anything specific, just ask - it's hard to cover it all, there are a lot of really subtle differences. For those of you who are interested, the Palantir 9000 has about 15 MIPS of compute power to get to 300 CPS, the Kurzweil has about 3 MIPS to do 60 CPS. Keep this in mind when you think about a software only solution, these machines are optimized for OCR. // Howard
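The MIPS and CPS figures above can be turned into a rough cost per recognized character, which is a handy yardstick when thinking about a software-only solution. A minimal sketch; the 3,500 characters per page and the host speed passed to `seconds_per_page` are illustrative assumptions, not figures from this post.

```python
# Instructions spent per recognized character, from the quoted MIPS and CPS
# figures, and what that implies for a general-purpose host.
# Assumptions (not from this post): about 3,500 characters per page, and
# whatever host speed the caller chooses to plug in.


def instructions_per_character(mips, chars_per_second):
    """Instructions executed per recognized character."""
    return mips * 1_000_000 / chars_per_second


def seconds_per_page(host_mips, instr_per_char, chars_per_page=3500):
    """Time for a host of the given speed to recognize one page."""
    return chars_per_page * instr_per_char / (host_mips * 1_000_000)


if __name__ == "__main__":
    machines = {"Palantir 9000": (15, 300), "Kurzweil": (3, 60)}
    for name, (mips, cps) in machines.items():
        cost = instructions_per_character(mips, cps)
        print(f"{name}: {cost:,.0f} instructions per character")
    # Hypothetical 1 MIPS host running equally efficient code in software.
    print(f"1 MIPS host: {seconds_per_page(1, 50_000):.0f} seconds per page")
```

Both machines work out to the same cost per character on these numbers, so the speed difference between them tracks raw compute power; a software-only port would additionally lose whatever the dedicated hardware gains from being optimized for OCR.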
elt@entire.UUCP (Edward L. Taychert) (06/14/88)
In article <10160@ames.arc.nasa.gov>, eugene@pioneer.arpa (Eugene N. Miya) writes:
> This is a brief summary of my experiences in the OCR field up to today
> when I had yet another Demo of OCR systems. I've been asked to look at
> one more and will report on it some time.
> (stuff gone)
>
> The Kurzweil is a system which requires training. It takes several

My Kurzweil Discover does not require training. Indeed, this is pushed as their advantage over other systems. No training is required. You put the document in the scanner and it converts it.

> .....
> too high (about one per line? Well the DEST/Kur* were 4-7 errors
> per line on my sample text, much too high. Why, my typo rate is

It all depends. When I run typewriter Courier through the Kurzweil, I typically get 100% recognition accuracy. (That's ZERO errors, page after page.) If I run 5x7 dot matrix through it, I typically get 100% gobbledygook. Small print (< 8 points) increases the error rate too.

My Kurzweil Summary:

	typeset pages - perfect or near perfect (nothing's really perfect...)
	laser writer print (300dpi) - two or three errors per page
	many pin (24) dot matrix - untested
	5x9 dot matrix - useless

> ....
> Another gross generalization from
>

Yes, I think this might be the case for me too... :-)

--
____________________________________________________________________________
Ed Taychert                      Phone: USA (716) 381-7500
Entire Inc.                      UUCP: rochester!rocksanne!entire!elt
445 E. Commercial Street
East Rochester, N.Y. 14445
_____________________________________________________________________________