tim@hoptoad.uucp (Tim Maroney) (10/30/88)
In article <3447@pt.cs.cmu.edu> ns@cat.cmu.edu (Nicholas Spies) writes: >In article <5772@hoptoad.uucp> tim@hoptoad.UUCP (Tim Maroney) writes: >>... >>And according to this estimate, a Next disk will hold 671 books at 256M. > >At $40/book that's $26,840.00 + $50.00 for the disc itself. Just the >author's royalties, figured at 15%, would make the disc cost $4,026 (after >all, why should the authors take a loss?). Therein lies the problem of very >dense media. Yep. All I was talking about was how many would fit. Whether it could ever be economically feasible to publish such a disk is another matter entirely. Even with public domain books, the costs of scanning and character-recognizing are pretty large. I made some estimates a few months ago, but I don't know where they've gotten to now, I'm afraid. Let's see, if it takes about two minutes to scan and convert a page, and the average book has 250 pages, then that's 500 minutes or over 8 hours per book -- let's say ten hours to be conservative. So it would take 6710 hours or about three and a third work years to scan in 671 books. And I think my two minutes a page estimate may be optimistic, not to mention extra costs for indexing and mastering. Not a basement project, I'm afraid. -- Tim Maroney, Consultant, Eclectic Software, sun!hoptoad!tim This message does represent the views of Eclectic Software.
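The scanning estimate in the post above can be checked with a few lines of Python. This is only a sketch: the page count, minutes per page, and book count are the figures from the post, while the 2,000-hour work year is an assumption used to convert hours into work years.

```python
# Figures quoted in the post; WORK_HOURS_PER_YEAR is an assumption
# (50 weeks x 40 hours), not something the post states.
BOOKS_PER_DISK = 671
PAGES_PER_BOOK = 250
MINUTES_PER_PAGE = 2
WORK_HOURS_PER_YEAR = 2000


def scan_hours_per_book(pages=PAGES_PER_BOOK, minutes_per_page=MINUTES_PER_PAGE):
    """Raw scan-and-convert time for one book, in hours."""
    return pages * minutes_per_page / 60


def total_work_years(hours_per_book, books=BOOKS_PER_DISK,
                     hours_per_year=WORK_HOURS_PER_YEAR):
    """Work years needed to process every book on the disk."""
    return hours_per_book * books / hours_per_year


if __name__ == "__main__":
    raw = scan_hours_per_book()
    print(f"raw estimate: {raw:.2f} hours per book")
    # The post pads the raw figure up to ten hours per book.
    print(f"padded total: {total_work_years(10):.2f} work years")
```

With the padded ten hours per book this reproduces the "three and a third work years" figure; with the raw 8.33 hours the total is closer to 2.8 work years.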
bill@bilver.UUCP (Bill Vermillion) (10/31/88)
In article <5790@hoptoad.uucp> tim@hoptoad.UUCP (Tim Maroney) writes:
>In article <3447@pt.cs.cmu.edu> ns@cat.cmu.edu (Nicholas Spies) writes:
>>In article <5772@hoptoad.uucp> tim@hoptoad.UUCP (Tim Maroney) writes:
>>>...
>>>And according to this estimate, a Next disk will hold 671 books at 256M.
>>
>>At $40/book that's $26,840.00 + $50.00 for the disc itself. Just the
>>author's royalties, figured at 15%, would make the disc cost $4,026 (after
>>all, why should the authors take a loss?). Therein lies the problem of very
>>dense media.
> .... stuff deleted ....
>entirely. Even with public domain books, the costs of scanning and
>character-recognizing are pretty large. ....... ........
>Let's see, if it takes about two minutes to scan and convert a page,
>and the average book has 250 pages, then that's 500 minutes or over 8
>hours per book -- let's say ten hours to be conservative. So it would
>take 6710 hours or about three and a third work years to scan in 671
>books. And I think my two minutes a page estimate may be optimistic,
>not to mention extra costs for indexing and mastering. Not a basement
>project, I'm afraid.
>--

Let's take a step back and look at this again. If a book is on disk we don't necessarily need to be able to read it on a character basis. The idea is to be able to READ Shakespeare, not to re-edit, re-create, re-print, etc. I would suspect that it would be a bit difficult to get publishers to agree to that form of distribution.

However - if we go to image storage we can still see the book on the screen, we could have images from the book, we would be able to search through the book (providing it was indexed - more in a later paragraph), we would be able to do almost anything except re-edit, re-(etc.)....

So from 8 hours per book at 2 minutes per page, we can go to 12.5 minutes per book at 3 seconds per page.

Now before you say that can't be done - let me tell you I saw it. I forgot the company that makes it, but the system was a document storage and retrieval system using high speed scanners, fast photo-copy type printers, and 12" laser disk media. One of the options was a 12 video juke box. I don't recall the exact capacity, but it was large.

Let's just map this onto existing video technology. In CAV mode a 12" disk can store approximately 55,000 frames per side. When these disks are used for data they hold about 1.2 gigs. That is about 4.7 times more than the 256meg disk. That means we should be able to get about 11,750 (rounded) pages per 256Meg disk. Or 47 books per disk. Media cost then is approximately $1.00 per book, which puts it just above paper-back printing costs, but below hard-bound. And I would estimate it would cost you under $2.00 to ship the disk first class, as opposed to $$$$ to ship 47 books that way.

The document storage/retrieval system also had software so that you would index the document as you stored it. Then anytime you needed the document you would go to the index and get it. On a large juke-box that could take 20 to 30 seconds to find the disk, place it, search and then display. But on a large juke-box that was finding 1 document out of FIVE MILLION. Then at a touch of a button you had a full hard copy of the original, and the company had information on the legal acceptability of such documents. Quite impressive.

So instead of 671 books taking 3 years, we get 50 books taking 10 hours. This seems a more reasonable route.

An aside - that relates to the above. Before Sony and Philips cross-licensed their CD technology, Sony had developed a "digital audio disk". They could see no market for the disk. Why? Well, they had this disk, 12" in diameter, and they could not conceive of being able to market a record that played for 20 HOURS per side. Philips had a 4" (approx) disk. Playing time was under 1 hour. One of the favorite works of a Sony exec. was 73 minutes long, so the disk was designed for that. That is where the 12 cm disk came from.

It is probably better to waste space and have a marketable item than to achieve maximum capability and have no market at all. Who - except a library - would want 671 books on a volume? And what about accessibility to the 670 other books when someone has the disk to read 1 volume?
--
Bill Vermillion - UUCP: {uiucuxc,hoptoad,petsd}!peora!rtmvax!bilver!bill : bill@bilver.UUCP
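The capacity arithmetic in the post above can be sketched as follows. The frame count, disk sizes, page count, and seconds per page are the figures quoted in the thread; treating one video frame as one page image is the assumption the post's estimate appears to rest on, and it is labeled as such in the code.

```python
# Figures from the thread. Assumption: one CAV video frame holds one page
# image, so capacity scales linearly with disk size.
FRAMES_PER_SIDE = 55_000
LASER_DISK_MB = 1200       # "about 1.2 gigs"
TARGET_DISK_MB = 256       # the 256M disk under discussion
PAGES_PER_BOOK = 250
SECONDS_PER_PAGE = 3


def pages_on_target_disk():
    """Scale the laser-disk frame count down to the smaller disk."""
    ratio = LASER_DISK_MB / TARGET_DISK_MB
    return FRAMES_PER_SIDE / ratio


def books_on_target_disk():
    return pages_on_target_disk() / PAGES_PER_BOOK


def scan_hours(books):
    """Image-only scanning time for a number of books, in hours."""
    return books * PAGES_PER_BOOK * SECONDS_PER_PAGE / 3600


if __name__ == "__main__":
    print(f"pages per disk: {pages_on_target_disk():.0f}")
    print(f"books per disk: {books_on_target_disk():.1f}")
    print(f"hours to scan 47 books: {scan_hours(47):.1f}")
```

The result lands near 11,700 pages and 47 books per disk, with a little under ten hours of image-only scanning; none of this accounts for page handling, indexing, or mastering.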
tim@hoptoad.uucp (Tim Maroney) (11/01/88)
In article <5790@hoptoad.uucp> tim@hoptoad.UUCP (Tim Maroney) writes:
>And I think my two minutes a page estimate may be optimistic,
>not to mention extra costs for indexing and mastering.

In article <557@metapsy.UUCP> sarge@metapsy.UUCP (Sarge Gerbode) writes:
>There are fairly decent full-text retrieval and indexing programs
>that would make a normal index obsolete.

I was referring to an automatically generated inverted index, not an ordinary book index, which would be silly on a high-density optical medium. It would still require human checking in any case, just as optical character recognition does, so the time would be noticeable. Because of the slow seeks and large amounts of data, it is necessary to set up an index on an optical read-only medium at publication time; run-time search algorithms are way too slow.
--
Tim Maroney, Consultant, Eclectic Software, sun!hoptoad!tim
"What's bad? What's the use of turning? In Hell I'll be there a-burning! Meanwhile, think of what I'm earning! All on account of my name." - Bill Sykes, "Oliver"
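For readers unfamiliar with the term, an inverted index maps each word to the documents that contain it, so a lookup never has to scan the text itself. The sketch below is a minimal in-memory illustration of that idea, built once "at publication time"; the function names and the sample texts are made up for the example, and a real read-only disk index would also need an on-disk layout that the post does not describe.

```python
import re
from collections import defaultdict


def build_inverted_index(documents):
    """Map each lowercased word to the sorted list of document ids using it.

    `documents` is a dict of {doc_id: text}; this is done once, up front.
    """
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in re.findall(r"[a-z']+", text.lower()):
            index[word].add(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}


def lookup(index, *words):
    """Return the ids of documents containing every one of `words`."""
    if not words:
        return []
    hits = [set(index.get(word.lower(), ())) for word in words]
    return sorted(set.intersection(*hits))


if __name__ == "__main__":
    # Hypothetical two-document "library".
    library = {
        "hamlet": "To be, or not to be, that is the question",
        "macbeth": "Tomorrow, and tomorrow, and tomorrow",
    }
    index = build_inverted_index(library)
    print(lookup(index, "tomorrow"))
    print(lookup(index, "to", "be"))
```

The point of the design is that all the expensive work happens while building the index; answering a query is a handful of dictionary lookups and a set intersection.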
tim@hoptoad.uucp (Tim Maroney) (11/01/88)
In article <282@bilver.UUCP> bill@bilver.UUCP (Bill Vermillion) writes:
>If a book is on disk we don't
>necessarily need to be able to read it on a character basis. The idea is to
>be able to READ Shakespeare, not to re-edit, re-create, re-print, etc.

Wrong. The idea is to be able to read Shakespeare, to copy and paste relevant sections for critical essays, to print sections for reading at leisure when away from the computer, to do word-frequency analyses, to follow cross-reference chains among related keywords and topics, and so on. Computers are a terrible medium for leisure reading -- less text shows on a screen than on a printed page, and the screen luminescence leads to eye fatigue, not to mention the lack of physical portability. If all you can do is read, what you have is far worse than a printed book.

And I have yet to see a stage show where the director didn't do some editing of the script!

>However - if we go to image storage we can still see the book on the screen,
>we could have images from the book, we would be able to search through the
>book (providing it was indexed - more in a later paragraph), we would be able
>to do almost anything except re-edit, re-(etc.)....

Almost anything; except everything you would expect to be able to do with computer text, such as copy and paste it, do keyword searches, etc. You'd be able to read it and print it out. What an awesome improvement over the printed page.

>So from 8 hours per book at 2 minutes per page, we can go to 12.5 minutes per
>book at 3 seconds per page.

3 seconds a page? Is that using clairvoyance or what? Visualize the process of positioning a book on a flat-bed scanner for a moment. It takes anywhere from five to twenty seconds. Now add the scanning time, which is at the minimum 3 seconds a page.

>Now before you say that can't be done - let me tell you I saw it. I forgot
>the company that makes it, but the system was a document storage and retrieval
>system using high speed scanners, fast photo-copy type printers, and 12" laser
>disk media. One of the options was a 12 video juke box. I don't recall the
>exact capacity, but it was large.

Perhaps you're referring to the Wang system that has gotten so much publicity. I don't see how it is well suited to mass distribution of books; it is meant for keeping copies of receipts and so forth.

>The document storage/retrieval system also had software so that you would
>index the document as you stored it. Then anytime you needed the document you
>would go to the index and get it. On a large juke-box that could take 20 to
>30 seconds to find the disk, place it, search and then display. But on a
>large juke-box that was finding 1 document out of FIVE MILLION.

That's a great approach for receipts. For books, you're talking at least two extra minutes per page, with a high error rate and an extremely inconvenient interface requiring that you "lasso" the words being indexed. You also have to type them out.

>Then at a touch of a button you had a full hard copy of the original, and the
>company had information on the legal acceptability of such documents. Quite
>impressive.

And quite irrelevant.

>So instead of 671 books taking 3 years, we get 50 books taking 10 hours.
>This seems a more reasonable route.

How about a trillion books for no money at all? That's much more attractive. Coming soon to your Isuzu dealer.
--
Tim Maroney, Consultant, Eclectic Software, sun!hoptoad!tim
"Because there is something in you that I respect, and that makes me desire to have you for my enemy." "Thats well said. On those terms, sir, I will accept your enmity or any man's." - Shaw, "The Devil's Disciple"
nujohnso@ndsuvax.UUCP (Ceej) (11/01/88)
In article <5790@hoptoad.uucp> tim@hoptoad.UUCP (Tim Maroney) writes: > >Let's see, [...] > So it would >take 6710 hours or about three and a third work years to scan in 671 >books. And I think my two minutes a page estimate may be optimistic, >not to mention extra costs for indexing and mastering. I would say that if you automated the process, it would cut that time down to around 2500 hours. By automating, I mean setting the process up so that the pages are fed into the process continually and ~24 hours a day. Note that this estimate is certainly not conservative, and the time required to set up this system is not included. Actual requirements may vary. Please consult your CD-ROM handbook for details. -- nujohnso@ndsuvax.bitnet nujohnso@plains.NoDak.edu ...!uunet!ndsuvax!nujohnso i want a shoehorn with teeth
cramer@optilink.UUCP (Clayton Cramer) (11/02/88)
In article <5800@hoptoad.uucp>, tim@hoptoad.uucp (Tim Maroney) writes:
> In article <282@bilver.UUCP> bill@bilver.UUCP (Bill Vermillion) writes:
> Wrong. The idea is to be able to read Shakespeare, to copy and paste
> relevant sections for critical essays, to print sections for reading at
> leisure when away from the computer, to do word-frequency analyses, to
> follow cross-reference chains among related keywords and topics, and so
> on. Computers are a terrible medium for leisure reading -- less text
> shows on a screen than on a printed page, and the screen luminescence
> leads to eye fatigue, not to mention the lack of physical portability.
> If all you can do is read, what you have is far worse than a printed book.

Isaac Asimov wrote a marvelous parody of _The_Double_Helix_ about these wild, womanizing scientists at Oxford, a century or two from now, reinventing the book for exactly these reasons.

If you doubt it, consider how many people curl up with a good machine-readable book and a computer at the end of a long, busy day. Also, the number of people who bring along a laptop to sit in an open field and read for the pleasure of it.

Anyone that wants to spend more time reading in front of a computer, instead of a printed page, isn't working hard enough!
--
Clayton E. Cramer ..!ames!pyramid!kontron!optilin!cramer
geb@cadre.dsl.PITTSBURGH.EDU (Gordon E. Banks) (11/02/88)
In article <5790@hoptoad.uucp> tim@hoptoad.UUCP (Tim Maroney) writes:
>
>Yep. All I was talking about was how many would fit. Whether it could
>ever be economically feasible to publish such a disk is another matter
>entirely. Even with public domain books, the costs of scanning and
>character-recognizing are pretty large.

Not really. You can estimate it by the cost of getting books from University Microfilms. They have to photocopy each page. A normal sized book is around $50. This covers retrieving the book from whatever library has it, and the labor of copying it. Of course, they expect to sell more copies of the microfilm later, but this would apply in spades to optical disk versions. OCR programs will soon be sophisticated enough that it won't add much to the cost of simply photocopying the book.

Compared to conventional publication (typesetting) this cost is trivial. If all books worth reading in the public domain were done, it would be a wonderful thing. I suspect people will start doing this as soon as the market is large enough. The real hang up is going to be with current books where royalties will have to be paid.
vnend@ms.uky.edu (D. W. James -- Staff Account) (11/03/88)
In article <282@bilver.UUCP> bill@bilver.UUCP (Bill Vermillion) writes:
)Now before you say that can't be done - let me tell you I saw it. I forgot
)the company that makes it, but the system was a document storage and retrieval
)system using high speed scanners, fast photo-copy type printers, and 12" laser
)disk media. One of the options was a 12 video juke box. I don't recall the
)exact capacity, but it was large.
)Bill Vermillion - UUCP: {uiucuxc,hoptoad,petsd}!peora!rtmvax!bilver!bill
If it was the same system that I saw written up in (I think) PC_WEEK
last year, its capacity at the limit was 1.2 TERABYTES. Not a trivial
amount of storage...
--
Vnend, posting from his other account, on a machine about 100 yards
horizontally, and 40 yards vertically, from the other one.
vnend@ms.uky.edu or vnend@ukma.bitnet or vnend@engr.uky.edu
"A few days later, I got a letter... advising me to forsake my sordid lifestyle and give all my hickies to the living Terim." The Countess, CEREBUS #54
tim@hoptoad.uucp (Tim Maroney) (11/03/88)
In article <5790@hoptoad.uucp> tim@hoptoad.UUCP (Tim Maroney) wrote:
>Even with public domain books, the costs of scanning and
>character-recognizing are pretty large.

In article <1676@cadre.dsl.PITTSBURGH.EDU> geb@cadre.dsl.pittsburgh.edu (Gordon E. Banks) has been writing:
>Not really. You can estimate it by the cost of getting books from
>University Microfilms. They have to photocopy each page. A normal
>sized book is around $50. This covers retrieving the book from
>whatever library has it, and the labor of copying it.

So $50*671 = $33,550. Not a trivial investment. This is the cost to the publisher of making the book, though it would be spread out among the individual copies. And that's still not factoring in the OCR running and proofreading, not to mention pre-mastering and mastering and duplication. And promotion and....

>OCR programs will
>soon be sophisticated enough that it won't add much to the cost
>of simply photocopying the book.

Disagree. It'll always take proofreading, and for 671 books that's quite a lot of skilled labor to pay for.

>Compared to conventional publication (typesetting) this cost is trivial.

Agree provisionally; per book it's relatively trivial; for hundreds of books it far exceeds the production cost of a single typeset book.

>If all books worth reading in
>the public domain were done, it would be a wonderful thing. I suspect
>people will start doing this as soon as the market is large enough.
>The real hang up is going to be with current books where royalties
>will have to be paid.

Completely agree! I hope it happens, but as someone who did a minor feasibility study on doing it himself, I have to say it seems a long way off. The barriers are formidable.
--
Tim Maroney, Consultant, Eclectic Software, sun!hoptoad!tim
"The time is gone, the song is over. Thought I'd something more to say." - Roger Waters, Time
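The photocopying cost and the way it spreads over copies sold can be sketched in a few lines. The $50 per book and 671 books come from the thread; the number of copies sold is a purely hypothetical parameter, since nobody in the thread offers a sales figure, and the sketch leaves out OCR, proofreading, mastering, and promotion.

```python
# Figures from the thread; copies_sold is a hypothetical input.
COST_PER_BOOK_USD = 50
BOOKS_PER_DISK = 671


def acquisition_cost(books=BOOKS_PER_DISK, cost_per_book=COST_PER_BOOK_USD):
    """Up-front cost of photocopying every book on the disk."""
    return books * cost_per_book


def cost_per_copy(copies_sold):
    """Share of the up-front cost carried by each disk sold."""
    return acquisition_cost() / copies_sold


if __name__ == "__main__":
    print(f"up-front cost: ${acquisition_cost():,}")
    for copies in (100, 1_000, 10_000):  # hypothetical sales volumes
        print(f"{copies:>6} copies: ${cost_per_copy(copies):,.2f} per disk")
```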
dmocsny@uceng.UC.EDU (daniel mocsny) (11/04/88)
...then let's build a world our machines can work in.

100 years ago people were trying to replace the horse with internal combustion engine-driven vehicles. Now the obvious approach would have been to build some sort of mechanical analog of the horse, strap an engine on it, and keep everything transparent to the users. Since that was not possible, the next easiest thing was to change the world to accommodate the strength/weakness mix of the best way to run engines: on wheeled chassis. So we put $ billions into paving over some of the best real estate in the country. Now we have a world that accommodates motor vehicles, to some extent.

In article <5821@hoptoad.uucp>, tim@hoptoad.uucp (Tim Maroney) writes:
> So $50*671 = $33,550. Not a trivial investment. This is the cost to the
> publisher of making the book, though it would be spread out among the
> individual copies. And that's still not factoring in the OCR running and
> proofreading, not to mention pre-mastering and mastering and duplication.
> And promotion and....
>
> It'll always take proofreading, and for 671 books that's quite
> a lot of skilled labor to pay for.

Let's not forget that virtually every book that makes it into print these days passes through a computer at some stage in its production. Most authors use word processors (either directly or through secretaries), most publishers use electronic typesetting, and some of us authors dabble in both. So most of the work the CD-ROM publishers have to do has already been done somewhere.

Printing books degrades the utility that was present when that information was originally in electronic form. From the standpoint of the CD-ROM vendors and potential users, publishers and authors who release information in printed form exclusively are destroying wealth. By refusing to establish and adhere to electronic document standards, we are reducing the amount of information we can exploit and pass on to our progeny. In other words, we are shooting ourselves in the foot.

A world optimized for horses was no good for automobiles. The latter was useless until a new world was built. Similarly, a world optimized for paper is no good for computers. To get the most benefit out of our new technology, we need to change the way we do things.

Obviously the existing stock of printed information will not benefit from re-designing our world to match the strengths and weaknesses of computers. But I would hesitate to say that OCR will _always_ require proofreading. OCR is a hard problem, but certainly not an impossible problem. It is only a mapping from the (very large) vector space of possible letter bitmaps to the smaller space of letter codes and font descriptions. The structure of that mapping is complex, but not infinitely so, else we could not read. Connectionist approaches to OCR are already showing great promise. In ten years it might be essentially a solved problem.

A harder problem will be to have a computer make sense of arbitrary figures and diagrams. But that won't be necessary; the OCR machine can simply vectorize or bitmap anything it can't otherwise interpret.

Given a smart OCR device, we could ``mine'' libraries for their information content. Just load the hopper with books, press the button, and take the information out of those mouldering tombs and put it in the hands of people who can go out and create wealth with it.

Dan Mocsny
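The description of OCR as a mapping from letter bitmaps to letter codes can be made concrete with a toy example. The sketch below is deliberately not a connectionist system: it swaps in plain nearest-template matching by Hamming distance, and the 3x3 glyphs are invented for illustration. It only shows what "bitmap in, letter code out" means and why a slightly damaged glyph can still map to the right letter.

```python
# Toy 3x3 glyph templates, invented for this illustration (row-major,
# 1 = ink). Real OCR works on far larger bitmaps and many fonts.
TEMPLATES = {
    "I": (0, 1, 0,
          0, 1, 0,
          0, 1, 0),
    "L": (1, 0, 0,
          1, 0, 0,
          1, 1, 1),
    "T": (1, 1, 1,
          0, 1, 0,
          0, 1, 0),
}


def hamming(a, b):
    """Number of pixel positions where two bitmaps differ."""
    return sum(x != y for x, y in zip(a, b))


def classify(bitmap, templates=TEMPLATES):
    """Return the letter whose template is closest to the bitmap."""
    return min(templates, key=lambda letter: hamming(bitmap, templates[letter]))


if __name__ == "__main__":
    # A "T" with its bottom pixel dropped, as a scanner might produce.
    noisy_t = (1, 1, 1,
               0, 1, 0,
               0, 0, 0)
    print(classify(noisy_t))
```

A scheme this simple breaks down quickly with real fonts and noise, which is one reason proofreading stays in the picture for now.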
geb@cadre.dsl.PITTSBURGH.EDU (Gordon E. Banks) (11/05/88)
In article <5821@hoptoad.uucp> tim@hoptoad.UUCP (Tim Maroney) writes:
>Completely agree! I hope it happens, but as someone who did a minor
>feasibility study on doing it himself, I have to say it seems a long
>way off. The barriers are formidable.
>--

I think you will find that libraries, including the Library of Congress, will be doing this for us. Book preservation is very expensive, and putting them all on CD while the actual copies get stored in CO2 or such is one answer to this problem. It may be a lot cheaper for a library to give you electronic access to its collection than actual access. The only thing is, I like to read in bed, and even a laptop gets heavy on my chest.
desnoyer@Apple.COM (Peter Desnoyers) (11/05/88)
>In article <5790@hoptoad.uucp> tim@hoptoad.UUCP (Tim Maroney) writes: >> So it would >>take 6710 hours or about three and a third work years to scan in 671 >>books. And I think my two minutes a page estimate may be optimistic, >>not to mention extra costs for indexing and mastering. Unbind the book, first, then put it through a sheet feeder. I'm sure there's a high-tech way to unbind a book, but zipping the binding off on a good circular saw works fine. (I've seen it done to Inside Mac, to loose-leaf bind it.) Should be ~5min per book, plus <5 sec. per page for per-sheet paper handling. (Use the guts of a good copy machine.) Peter Desnoyers
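Plugging the sheet-feeder figures into the same 671-book job gives a rough comparison. The 5 minutes per book and 5 seconds per page are the upper bounds from the post above, the 250 pages and 671 books come from earlier in the thread, and the sketch covers paper handling only: OCR, proofreading, and indexing time are assumed away.

```python
# Upper-bound figures from the post; page and book counts from the thread.
UNBIND_MINUTES_PER_BOOK = 5
HANDLING_SECONDS_PER_PAGE = 5
PAGES_PER_BOOK = 250
BOOKS = 671


def minutes_per_book(pages=PAGES_PER_BOOK):
    """Unbinding time plus per-page paper handling, in minutes."""
    return UNBIND_MINUTES_PER_BOOK + pages * HANDLING_SECONDS_PER_PAGE / 60


def total_hours(books=BOOKS):
    """Paper-handling time for the whole job, in hours."""
    return books * minutes_per_book() / 60


if __name__ == "__main__":
    print(f"per book: {minutes_per_book():.1f} minutes")
    print(f"all {BOOKS} books: {total_hours():.0f} hours")
```

That works out to roughly 26 minutes per book and under 300 hours in total, against the 6,710 hours of the original two-minutes-a-page estimate.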
wetter@cit-vax.Caltech.Edu (Pierce T. Wetter) (11/08/88)
> I think you will find that libraries, including the Library of
> Congress will be doing this for us. Book preservation is very
> expensive and putting them all on CD while the actual copies get
> stored in CO2 or such is one answer to this problem. It may be
> a lot cheaper for a library to give you electronic access to
> its collection than actual access. The only thing is, I like
> to read in bed, and even a laptop gets heavy on my chest.

The last time I was in the Library of Congress, they were scanning the books at 300dpi and displaying them on special terminals. Clearly not the most efficient way of doing this. What really needs to be done is to make a standard for electronic books.

Here's my quick draft of a storage method:

Every book is composed of a series of records. Each record consists of a header followed by some data. There are three major types of records: formatting, text and pictures. A format record contains formatting information for a following record of text or pictures. (Formatting codes could be either TeX or Rich Text Format or PostScript or something special.) Pictures are stored in PostScript, GIF or TIFF format depending on their origin (line art or pictures).

Pierce
____________________________________________________________________________
You can flame or laud me at: wetter@tybalt.caltech.edu or wetter@csvax.caltech.edu or pwetter@caltech.bitnet
Caution: All my postings are 100% accurate from my point of view. However, my point of view rarely translates into english. Therefore any errors in my posting are your fault for not interpreting it correctly.
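The record scheme drafted above could look something like the following. This is a sketch under stated assumptions, not the format the post had in mind: the post only says each record has a header followed by data, so the header layout here (a 1-byte record type plus a 4-byte big-endian length) and all names are hypothetical.

```python
import struct
from dataclasses import dataclass
from enum import Enum


class RecordType(Enum):
    """The three record types named in the draft."""
    FORMAT = 1
    TEXT = 2
    PICTURE = 3


# Hypothetical header: 1-byte type code, 4-byte big-endian data length.
HEADER = struct.Struct(">BI")


@dataclass
class Record:
    kind: RecordType
    data: bytes

    def pack(self) -> bytes:
        """Serialize as header followed by the raw data."""
        return HEADER.pack(self.kind.value, len(self.data)) + self.data


def unpack_records(blob: bytes):
    """Walk a byte string and return the records it contains, in order."""
    records = []
    offset = 0
    while offset < len(blob):
        kind, length = HEADER.unpack_from(blob, offset)
        offset += HEADER.size
        records.append(Record(RecordType(kind), blob[offset:offset + length]))
        offset += length
    return records


if __name__ == "__main__":
    # A format record applies to the text record that follows it.
    book = [
        Record(RecordType.FORMAT, b"\\centerline"),
        Record(RecordType.TEXT, b"Call me Ishmael."),
    ]
    blob = b"".join(record.pack() for record in book)
    print(unpack_records(blob) == book)
```

Putting the length in the header lets a reader skip over record types it does not understand, such as a picture format it cannot display.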