[rec.arts.books] Hundreds of books on an optical disk

tim@hoptoad.uucp (Tim Maroney) (10/30/88)

In article <3447@pt.cs.cmu.edu> ns@cat.cmu.edu (Nicholas Spies) writes:
>In article <5772@hoptoad.uucp> tim@hoptoad.UUCP (Tim Maroney) writes:
>>...
>>And according to this estimate, a Next disk will hold 671 books at 256M.
>
>At $40/book that's $26,840.00 + $50.00 for the disc itself. Just the
>author's royalties, figured at 15%, would make the disc cost $4,026 (after
>all, why should the authors take a loss?). Therein lies the problem of very
>dense media.

Yep.  All I was talking about was how many would fit.  Whether it could
ever be economically feasible to publish such a disk is another matter
entirely.  Even with public domain books, the costs of scanning and
character-recognizing are pretty large.  I made some estimates a few
months ago, but I don't know where they've gotten to now, I'm afraid.
Let's see, if it takes about two minutes to scan and convert a page,
and the average book has 250 pages, then that's 500 minutes or over 8
hours per book -- let's say ten hours to be conservative.  So it would
take 6710 hours or about three and a third work years to scan in 671
books.  And I think my two minutes a page estimate may be optimistic,
not to mention extra costs for indexing and mastering.  Not a basement
project, I'm afraid.
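
For anyone who wants to fiddle with the assumptions, the arithmetic above fits
in a few lines of C.  The figures are just the rough guesses already stated
(two minutes a page, 250 pages, 671 books, padded to ten hours a book); the
2000-hour work year is an assumption implied by the "three and a third work
years" figure.

#include <stdio.h>

int main(void)
{
    double min_per_page   = 2.0;    /* rough guess: scan + OCR time per page */
    double pages_per_book = 250.0;  /* "average book" from above             */
    double books          = 671.0;  /* what fits on a 256M disk, per above   */
    double work_year_hrs  = 2000.0; /* assumed: 50 weeks x 40 hours          */

    double raw_hrs_per_book = min_per_page * pages_per_book / 60.0;  /* ~8.3 */
    double padded_hrs       = 10.0;  /* "ten hours to be conservative"       */
    double total_hrs        = padded_hrs * books;                    /* 6710 */

    printf("%.1f raw hours per book, %.0f hours for the set, %.1f work years\n",
           raw_hrs_per_book, total_hrs, total_hrs / work_year_hrs);
    return 0;
}
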
-- 
Tim Maroney, Consultant, Eclectic Software, sun!hoptoad!tim
This message does represent the views of Eclectic Software.

bill@bilver.UUCP (Bill Vermillion) (10/31/88)

In article <5790@hoptoad.uucp> tim@hoptoad.UUCP (Tim Maroney) writes:
>In article <3447@pt.cs.cmu.edu> ns@cat.cmu.edu (Nicholas Spies) writes:
>>In article <5772@hoptoad.uucp> tim@hoptoad.UUCP (Tim Maroney) writes:
>>>...
>>>And according to this estimate, a Next disk will hold 671 books at 256M.
>>
>>At $40/book that's $26,840.00 + $50.00 for the disc itself. Just the
>>author's royalties, figured at 15%, would make the disc cost $4,026 (after
>>all, why should the authors take a loss?). Therein lies the problem of very
>>dense media.
>
.... stuff deleted ....
>entirely.  Even with public domain books, the costs of scanning and
>character-recognizing are pretty large.  .......
........
>Let's see, if it takes about two minutes to scan and convert a page,
>and the average book has 250 pages, then that's 500 minutes or over 8
>hours per book -- let's say ten hours to be conservative.  So it would
>take 6710 hours or about three and a third work years to scan in 671
>books.  And I think my two minutes a page estimate may be optimistic,
>not to mention extra costs for indexing and mastering.  Not a basement
>project, I'm afraid.
>-- 

Let's take a step back and look at this again.   If a book is on disk we don't
necessarily need to be able to read it on a character basis.  The idea is to
be able to READ Shakespeare, not to re-edit, re-create, re-print, etc.

I would suspect that it would be a bit difficult to get publishers to agree to
that form of distribution.

However - if we go to image storage we can still see the book on the screen,
we could have images from the book, we would be able to search through the
book (providing it was indexed - more in a later paragraph), we would be able
to do almost anything except re-edit, re-(etc.)....

So from 8 hours per book at 2 minutes per page, we can go to 12.5 minutes per
book at 3 seconds per page.

Now before you say that can't be done - let me tell you I saw it.  I forget
the company that makes it, but the system was a document storage and retrieval
system using high-speed scanners, fast photocopier-type printers, and 12" laser
disk media.  One of the options was a 12" videodisc jukebox.  I don't recall the
exact capacity, but it was large.

Let's just map this onto existing video technology.  In CAV mode, a 12" disk
can store approximately 55,000 frames per side.  When these disks are used for
data they hold about 1.2 gigabytes -- roughly 4.5 times the capacity of the
256-meg disk.  That means we should be able to get about 11,750 (rounded) pages
per 256-meg disk, or 47 books per disk.  Media cost then is approximately $1.00
per book, which puts it just above paperback printing costs but below
hardbound.  And I would estimate it would cost you under $2.00 to ship the
disk first class, as opposed to $$$$ to ship 47 books that way.
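
Here is that arithmetic as a small C program, so the assumptions are easy to
change.  The frame count, the 1.2-gig figure, the 250-page book and the $50
disk price are the round numbers already floating around in this thread; none
of them are measurements.

#include <stdio.h>

int main(void)
{
    double frames_per_side = 55000.0; /* 12" CAV videodisc, one page per frame */
    double side_bytes      = 1.2e9;   /* same disk used for data: ~1.2 gigs    */
    double target_bytes    = 256e6;   /* the 256-meg disk under discussion     */
    double pages_per_book  = 250.0;
    double disk_price      = 50.0;    /* "$50.00 for the disc itself", above   */

    double bytes_per_page = side_bytes / frames_per_side;   /* ~22K per image  */
    double pages_on_disk  = target_bytes / bytes_per_page;  /* ~11,700         */
    double books_on_disk  = pages_on_disk / pages_per_book; /* ~47             */

    printf("%.0f bytes per page image\n", bytes_per_page);
    printf("%.0f pages, or %.0f books, per 256-meg disk\n",
           pages_on_disk, books_on_disk);
    printf("media cost roughly $%.2f per book\n", disk_price / books_on_disk);
    return 0;
}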

The document storage/retrieval system also had software so that you would
index the document as you stored it.  Then anytime you needed the document you
would go to the index and get it.  On a large juke-box that could take 20 to
30 seconds to find the disk, place it, search and then display.  But on a
large juke-box that was finding 1 document out of FIVE MILLION.

Then at the touch of a button you had a full hard copy of the original, and the
company had information on the legal acceptability of such documents.  Quite
impressive.

So instead of 671 books taking over three years, we get 50 books taking 10
hours.  This seems a more reasonable route.

An aside that relates to the above: before Sony and Philips cross-licensed
their CD technology, Sony had developed a "digital audio disk".  They could
see no market for the disk.  Why?  Well, they had this disk, 12" in diameter,
and they could not conceive of being able to market a record that played for
20 HOURS per side.  Philips had a 4" (approx.) disk.  Playing time was under
1 hour.  One of the favorite works of a Sony exec was 73 minutes long, so the
disk was designed for that.  That is where the 12 cm disk came from.

It is probably better to waste space and have a marketable item than to
achieve maximum capacity and have no market at all.  Who, except a library,
would want 671 volumes on one disk?  And what about access to the other
670 books while someone has the disk out to read one volume?

-- 
Bill Vermillion - UUCP: {uiucuxc,hoptoad,petsd}!peora!rtmvax!bilver!bill
                      : bill@bilver.UUCP

tim@hoptoad.uucp (Tim Maroney) (11/01/88)

In article <5790@hoptoad.uucp> tim@hoptoad.UUCP (Tim Maroney) writes:
>And I think my two minutes a page estimate may be optimistic,
>not to mention extra costs for indexing and mastering.

In article <557@metapsy.UUCP> sarge@metapsy.UUCP (Sarge Gerbode) writes:
>There are fairly decent full-text retrieval and indexing programs
>that would make a normal index obsolete.

I was referring to an automatically generated inverted index, not an
ordinary book index, which would be silly on a high-density optical
medium.  It would still require human checking in any case, just as
optical character recognition does, so the time would be noticeable.

Because of the slow seeks and large amounts of data, it is necessary
to set up an index on an optical read-only medium at publication time;
run-time search algorithms are way too slow.
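
For concreteness, here is a toy of the idea in C: build the inverted index
once, at "publication time," sort it, and the reader's search becomes a lookup
in the sorted table instead of a scan of the text.  A real index for a couple
hundred megabytes of text would be built with external sorts and stored in
compressed form on the disk, so treat this only as the shape of the thing; the
sample pages are obviously stand-ins.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <ctype.h>

/* One posting: a word and the page it occurs on. */
struct posting { char word[32]; int page; };

static struct posting idx[1024];
static int nidx;

static int cmp(const void *a, const void *b)
{
    const struct posting *pa = a, *pb = b;
    int c = strcmp(pa->word, pb->word);
    return c ? c : pa->page - pb->page;
}

/* Split a page of text into lowercase words and add one posting each. */
static void index_page(const char *text, int page)
{
    char buf[32];            /* buf fits in posting.word[] */
    int n = 0;

    for (;; text++) {
        if (isalpha((unsigned char)*text) && n < (int)sizeof buf - 1) {
            buf[n++] = (char)tolower((unsigned char)*text);
        } else if (n > 0) {
            buf[n] = '\0';
            if (nidx < 1024) {
                strcpy(idx[nidx].word, buf);
                idx[nidx].page = page;
                nidx++;
            }
            n = 0;
        }
        if (*text == '\0')
            break;
    }
}

int main(void)
{
    const char *pages[] = {                  /* stand-ins for the scanned text */
        "Now is the winter of our discontent",
        "To be or not to be that is the question",
        "Once more unto the breach dear friends"
    };
    int i, last = 0;

    /* "Publication time": index everything, sort once, press it onto the disk. */
    for (i = 0; i < 3; i++)
        index_page(pages[i], i + 1);
    qsort(idx, (size_t)nidx, sizeof idx[0], cmp);

    /* "Run time": the reader's query is a cheap lookup in the sorted table
       (a binary search on the real disk), not a scan of the full text. */
    for (i = 0; i < nidx; i++)
        if (strcmp(idx[i].word, "be") == 0 && idx[i].page != last) {
            printf("\"be\" occurs on page %d\n", idx[i].page);
            last = idx[i].page;
        }
    return 0;
}
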
-- 
Tim Maroney, Consultant, Eclectic Software, sun!hoptoad!tim
"What's bad? What's the use of turning?
 In Hell I'll be there a-burning!
 Meanwhile, think of what I'm earning!
 All on account of my name." - Bill Sykes, "Oliver"

tim@hoptoad.uucp (Tim Maroney) (11/01/88)

In article <282@bilver.UUCP> bill@bilver.UUCP (Bill Vermillion) writes:
>If a book is on disk we don't
>necessarily need to be able to read it on a character basis.  The idea is to
>be able to READ Shakespeare, not to re-edit, re-create, re-print, etc.

Wrong.  The idea is to be able to read Shakespeare, to copy and paste
relevant sections for critical essays, to print sections for reading at
leisure when away from the computer, to do word-frequency analyses, to
follow cross-reference chains among related keywords and topics, and so
on.  Computers are a terrible medium for leisure reading -- less text
shows on a screen than on a printed page, and the screen luminescence
leads to eye fatigue, not to mention the lack of physical portability.
If all you can do is read, what you have is far worse than a printed book.

And I have yet to see a stage show where the director didn't do some
editing of the script!

>However - if we go to image storage we can still see the book on the screen,
>we could have images from the book, we would be able to search through the
>book (providing it was indexed - more in a later paragraph), we would be able
>to do almost anything except re-edit, re-(etc.)....

Almost anything -- except everything you would expect to be able to do with
computer text, such as copy and paste it, do keyword searches, etc.  You'd
be able to read it and print it out.  What an awesome improvement over the
printed page.

>So from 8 hours per book at 2 minutes per page, we can go to 12.5 minutes per
>book at 3 seconds per page.

3 seconds a page?  Is that using clairvoyance or what?  Visualize the
process of positioning a book on a flat-bed scanner for a moment.  It
takes anywhere from five to twenty seconds.  Now add the scanning time,
which is at a minimum 3 seconds a page.

>Now before you say that can't be done - let me tell you I saw it.  I forget
>the company that makes it, but the system was a document storage and retrieval
>system using high-speed scanners, fast photocopier-type printers, and 12" laser
>disk media.  One of the options was a 12" videodisc jukebox.  I don't recall the
>exact capacity, but it was large.

Perhaps you're referring to the Wang system that has gotten so much
publicity.  I don't see how it is well suited to mass distribution of
books; it is meant for keeping copies of receipts and so forth.

>The document storage/retrieval system also had software so that you would
>index the document as you stored it.  Then anytime you needed the document you
>would go to the index and get it.  On a large juke-box that could take 20 to
>30 seconds to find the disk, place it, search and then display.  But on a
>large juke-box that was finding 1 document out of FIVE MILLION.

That's a great approach for receipts.  For books, you're talking at least
two extra minutes per page, with a high error rate and an extremely
inconvenient interface requiring that you "lasso" the words being indexed.
You also have to type them out.

>THen at a touch of a button you had a full hard copy of the original, and the
>company had information on the legal acceptability of such documents.  Quite
>impressive.

And quite irrelevant.

>So instead of 671 books taking over three years, we get 50 books taking 10 hours.
>This seems a more reasonable route.

How about a trillion books for no money at all?  That's much more
attractive.  Coming soon to your Isuzu dealer.
-- 
Tim Maroney, Consultant, Eclectic Software, sun!hoptoad!tim
"Because there is something in you that I respect, and that makes me desire
 to have you for my enemy."
"Thats well said.  On those terms, sir, I will accept your enmity or any
 man's."
    - Shaw, "The Devil's Disciple"

nujohnso@ndsuvax.UUCP (Ceej) (11/01/88)

In article <5790@hoptoad.uucp> tim@hoptoad.UUCP (Tim Maroney) writes:
>
>Let's see, [...]
> So it would
>take 6710 hours or about three and a third work years to scan in 671
>books.  And I think my two minutes a page estimate may be optimistic,
>not to mention extra costs for indexing and mastering.

I would say that if you automated the process, it would cut that time down
to around 2500 hours.  By automating, I mean setting the process up so that
the pages are fed into the process continuously, ~24 hours a day.  Note that
this estimate is certainly not conservative, and the time required to set up
this system is not included.  Actual requirements may vary.  Please consult
your CD-ROM handbook for details.

--
nujohnso@ndsuvax.bitnet   nujohnso@plains.NoDak.edu   ...!uunet!ndsuvax!nujohnso
                        i want a shoehorn with teeth

cramer@optilink.UUCP (Clayton Cramer) (11/02/88)

In article <5800@hoptoad.uucp>, tim@hoptoad.uucp (Tim Maroney) writes:
> In article <282@bilver.UUCP> bill@bilver.UUCP (Bill Vermillion) writes:
> Wrong.  The idea is to be able to read Shakespeare, to copy and paste
> relevant sections for critical essays, to print sections for reading at
> leisure when away from the computer, to do word-frequency analyses, to
> follow cross-reference chains among related keywords and topics, and so
> on.  Computers are a terrible medium for leisure reading -- less text
> shows on a screen than on a printed page, and the screen luminescence
> leads to eye fatigue, not to mention the lack of physical portability.
> If all you can do is read, what you have is far worse than a printed book.

Isaac Asimov wrote a marvelous parody of _The_Double_Helix_ about these
wild, womanizing scientists at Oxford, a century or two from now, 
reinventing the book for exactly these reasons.

If you doubt it, consider how many people curl up with a good machine-
readable book and a computer at the end of a long, busy day.  Or consider
the number of people who bring along a laptop to sit in an open field and
read for the pleasure of it.

Anyone who wants to spend more time reading in front of a computer,
instead of a printed page, isn't working hard enough!
-- 
Clayton E. Cramer
..!ames!pyramid!kontron!optilin!cramer

geb@cadre.dsl.PITTSBURGH.EDU (Gordon E. Banks) (11/02/88)

In article <5790@hoptoad.uucp> tim@hoptoad.UUCP (Tim Maroney) writes:
>
>Yep.  All I was talking about was how many would fit.  Whether it could
>ever be economically feasible to publish such a disk is another matter
>entirely.  Even with public domain books, the costs of scanning and
>character-recognizing are pretty large.

Not really.  You can estimate it by the cost of getting books from
University Microfilms.  They have to photocopy each page.  A normal
sized book is around $50.  This covers retrieving the book from
whatever library has it, and the labor of copying it.  Of course,
they expect to sell more copies of the microfilm later, but this
would apply in spades to optical disk versions.  OCR programs will
soon be sophisticated enough that it won't add much to the cost
of simply photocopying the book.  Compared to conventional publication
(typesetting), this cost is trivial.  If all books worth reading in
the public domain were done, it would be a wonderful thing.  I suspect
people will start doing this as soon as the market is large enough.
The real hang up is going to be with current books where royalties
will have to be paid.

vnend@ms.uky.edu (D. W. James -- Staff Account) (11/03/88)

In article <282@bilver.UUCP> bill@bilver.UUCP (Bill Vermillion) writes:
)Now before you say that can't be done - let me tell you I saw it.  I forgot
)the company that makes it, but the system was a document storage and retreival
)system using high speed scanners, fast photo-copy type printers, and 12" laser
)disk media.  One of the options was a 12 video juke box.  I don't recall the
)exact capacity, but it was large.
)Bill Vermillion - UUCP: {uiucuxc,hoptoad,petsd}!peora!rtmvax!bilver!bill

	If it was the same system that I saw written up in (I think) PC_WEEK
last year, its capacity at the limit was 1.2 TERABYTES.  Not a trivial
amount of storage...

	
-- 
Vnend, posting from his other account, on a machine about 100 yards
horizontally, and 40 yards vertically, from the other one.
vnend@ms.uky.edu or vnend@ukma.bitnet or vnend@engr.uky.edu                          
"A few days later, I got a letter... advising me to forsake my sordid lifestyle and give all my hickies to the living Terim."  The Countess, CEREBUS #54

tim@hoptoad.uucp (Tim Maroney) (11/03/88)

In article <5790@hoptoad.uucp> tim@hoptoad.UUCP (Tim Maroney) wrote:
>Even with public domain books, the costs of scanning and
>character-recognizing are pretty large.

In article <1676@cadre.dsl.PITTSBURGH.EDU> geb@cadre.dsl.pittsburgh.edu
(Gordon E. Banks) has been writing:
>Not really.  You can estimate it by the cost of getting books from
>University Microfilms.  They have to photocopy each page.  A normal
>sized book is around $50.  This covers retrieving the book from
>whatever library has it, and the labor of copying it.

So $50*671 = $33,550.  Not a trivial investment.  This is the cost to the
publisher of making the book, though it would be spread out among the
individual copies.  And that's still not factoring in the OCR running and
proofreading, not to mention pre-mastering and mastering and duplication.
And promotion and....
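
To put rough numbers on "spread out among the individual copies": the $50
acquisition figure is the one quoted above, but the per-book OCR/proofreading
figure, the mastering charge and the press-run sizes below are pure guesses,
there only to show how the fixed cost amortizes.

#include <stdio.h>

int main(void)
{
    double books        = 671.0;
    double acq_per_book = 50.0;     /* University Microfilms figure, above        */
    double ocr_per_book = 100.0;    /* GUESS: OCR run plus proofreading, per book */
    double mastering    = 5000.0;   /* GUESS: pre-mastering and mastering, flat   */
    double fixed_cost   = books * (acq_per_book + ocr_per_book) + mastering;
    int    runs[] = { 500, 2000, 10000 };   /* illustrative press runs */
    int    i;

    printf("fixed cost before promotion: $%.0f\n", fixed_cost);
    for (i = 0; i < 3; i++)
        printf("%6d copies pressed: $%.2f of fixed cost per copy\n",
               runs[i], fixed_cost / runs[i]);
    return 0;
}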

>OCR programs will
>soon be sophisticated enough that it won't add much to the cost
>of simply photocopying the book.

Disagree.  It'll always take proofreading, and for 671 books that's quite
a lot of skilled labor to pay for.

>Compared to conventional publication (typesetting) this cost is trivial.

Agree provisionally: per book it's relatively trivial, but for hundreds of
books it far exceeds the production cost of a single typeset book.

>If all books worth reading in
>the public domain were done, it would be a wonderful thing.  I suspect
>people will start doing this as soon as the market is large enough.
>The real hang up is going to be with current books where royalties
>will have to be paid.

Completely agree!  I hope it happens, but as someone who did a minor
feasibility study on doing it himself, I have to say it seems a long
way off.  The barriers are formidable.
-- 
Tim Maroney, Consultant, Eclectic Software, sun!hoptoad!tim
"The time is gone, the song is over.
 Thought I'd something more to say." - Roger Waters, Time

dmocsny@uceng.UC.EDU (daniel mocsny) (11/04/88)

...then let's build a world our machines can work in. 100 years ago people
were trying to replace the horse with internal combustion engine-driven
vehicles. Now the obvious approach would have been to build some sort of
mechanical analog of the horse, strap an engine on it, and keep everything
transparent to the users. Since that was not possible, the next easiest
thing was to change the world to accommodate the strength/weakness mix
of the best way to run engines: on wheeled chassis. So we put $ billions
into paving over some of the best real estate in the country. Now we have
a world that accommodates motor vehicles, to some extent.

In article <5821@hoptoad.uucp>, tim@hoptoad.uucp (Tim Maroney) writes:
> So $50*671 = $33,500.  Not a trivial investment.  This is the cost to the
> publisher of making the book, though it would be spread out among the
> individual copies.  And that's still not factoring in the OCR running and
> proofreading, not to mention pre-mastering and mastering and duplication.
> And promotion and....
> 
> It'll always take proofreading, and for 671 books that's quite
> a lot of skilled labor to pay for.

Let's not forget that virtually every book that makes it into print these
days passes through a computer at some stage in its production. Most
authors use word processors (either directly or through secretaries),
most publishers use electronic typesetting, and some of us authors
dabble in both. So most of the work the CD-ROM publishers have to do
has already been done somewhere.  Printing books degrades the utility that
was present when that information was originally in electronic form.  From
the standpoint of the CD-ROM vendors and potential users, publishers and
authors who
release information in printed form exclusively are destroying wealth.
By refusing to establish and adhere to electronic document standards,
we are reducing the amount of information we can exploit and pass on
to our progeny. In other words, we are shooting ourselves in the foot.

A world optimized for horses was no good for automobiles. The latter
was useless until a new world was built. Similarly, a world optimized for
paper is no good for computers. To get the most benefit out of our
new technology, we need to change the way we do things.

Obviously the existing stock of printed information will not benefit 
from re-designing our world to match the strengths and weaknesses of
computers. But I would hesitate to say that OCR will _always_ require
proofreading. OCR is a hard problem, but certainly not an impossible
problem. It is only a mapping from the (very large) vector space of
possible letter bitmaps to the smaller space of letter codes and
font descriptions. The structure of that mapping is complex, but
not infinitely so, else we could not read. Connectionist approaches to
OCR are already showing great promise. In ten years it might be 
essentially a solved problem. A harder problem will be to have a
computer make sense of arbitrary figures and diagrams. But that
won't be necessary; the OCR machine can simply vectorize or
bitmap anything it can't otherwise interpret.
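
As a drastically simplified picture of that mapping, here is a toy recognizer
in C that takes a 5x7 bitmap to a letter code by nearest-template match.  A
connectionist system would learn its decision boundaries from examples rather
than carrying hard-coded templates, but the input/output relation -- bitmap
in, letter code out -- is the same.  The bitmaps are invented for the example.

#include <stdio.h>

#define ROWS 7
#define COLS 5

/* Toy 5x7 glyphs -- invented for illustration, not a real font. */
struct glyph { char letter; const char *rows[ROWS]; };

static const struct glyph font[] = {
    { 'T', { "#####", "..#..", "..#..", "..#..", "..#..", "..#..", "..#.." } },
    { 'L', { "#....", "#....", "#....", "#....", "#....", "#....", "#####" } },
    { 'O', { ".###.", "#...#", "#...#", "#...#", "#...#", "#...#", ".###." } },
};

/* Count mismatched pixels between a scanned bitmap and a template. */
static int distance(const char *scan[ROWS], const struct glyph *g)
{
    int r, c, d = 0;
    for (r = 0; r < ROWS; r++)
        for (c = 0; c < COLS; c++)
            if (scan[r][c] != g->rows[r][c])
                d++;
    return d;
}

/* The whole "mapping": bitmap in, letter code out. */
static char classify(const char *scan[ROWS])
{
    int i, best = 0, bestd = ROWS * COLS + 1;
    for (i = 0; i < (int)(sizeof font / sizeof font[0]); i++) {
        int d = distance(scan, &font[i]);
        if (d < bestd) { bestd = d; best = i; }
    }
    return font[best].letter;
}

int main(void)
{
    /* A "scan" of a T with one pixel of noise flipped. */
    const char *scan[ROWS] =
        { "#####", "..#..", "..#..", ".##..", "..#..", "..#..", "..#.." };

    printf("recognized as '%c'\n", classify(scan));
    return 0;
}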

Given a smart OCR device, we could ``mine'' libraries for their
information content.  Just load the hopper with books, press the
button, and take the information out of those mouldering tomes and
put it in the hands of people who can go out and create wealth
with it.

Dan Mocsny

geb@cadre.dsl.PITTSBURGH.EDU (Gordon E. Banks) (11/05/88)

In article <5821@hoptoad.uucp> tim@hoptoad.UUCP (Tim Maroney) writes:
>Completely agree!  I hope it happens, but as someone who did a minor
>feasibility study on doing it himself, I have to say it seems a long
>way off.  The barriers are formidable.
>-- 

I think you will find that libraries, including the Library of
Congress, will be doing this for us.  Book preservation is very
expensive and putting them all on CD while the actual copies get
stored in CO2 or such is one answer to this problem.  It may be
a lot cheaper for a library to give you electronic access to
its collection than actual access.  The only thing is, I like
to read in bed, and even a laptop gets heavy on my chest.

desnoyer@Apple.COM (Peter Desnoyers) (11/05/88)

>In article <5790@hoptoad.uucp> tim@hoptoad.UUCP (Tim Maroney) writes:
>> So it would
>>take 6710 hours or about three and a third work years to scan in 671
>>books.  And I think my two minutes a page estimate may be optimistic,
>>not to mention extra costs for indexing and mastering.

Unbind the book first, then put it through a sheet feeder.  I'm sure
there's a high-tech way to unbind a book, but zipping the binding off
on a good circular saw works fine.  (I've seen it done to Inside Mac,
to loose-leaf bind it.)  Should be ~5 min per book, plus <5 sec per page
for per-sheet paper handling.  (Use the guts of a good copy machine.)

				Peter Desnoyers

wetter@cit-vax.Caltech.Edu (Pierce T. Wetter) (11/08/88)

> I think you will find that libraries, including the Library of
> Congress will be doing this for us.  Book preservation is very
> expensive and putting them all on CD while the actual copies get
> stored in CO2 or such is one answer to this problem.  It may be
> a lot cheaper for a library to give you electronic access to
> its collection than actual access.  The only thing is, I like
> to read in bed, and even a laptop gets heavy on my chest.
   
  The last time I was in the Library of Congress, they were scanning the
books at 300 dpi and displaying them on special terminals.  Clearly not the
most efficient way of doing this.  What really needs to be done is to make a
standard for electronic books.  Here's my quick draft of a storage method:

Every book is composed of a series of records.  Each record consists of a
header followed by some data.  There are three major types of records:
formatting, text, and pictures.  A format record contains formatting
information for the text or picture record that follows it.  (Formatting
codes could be TeX or Rich Text Format or PostScript or something special.)
Pictures are stored in PostScript, GIF, or TIFF format depending on their
origin (line art versus scanned images).
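
One way to read that draft as code follows; the field names, sizes and type
codes are invented for illustration rather than taken from any standard, and
a real format would also have to pin down byte order and padding.

#include <stdio.h>
#include <stdint.h>

/* Record types: a format record modifies the text or picture record
   that follows it. */
enum rec_type {
    REC_FORMAT  = 1,        /* body holds formatting codes  */
    REC_TEXT    = 2,        /* body holds the text itself   */
    REC_PICTURE = 3         /* body holds image data        */
};

/* How the body bytes should be interpreted. */
enum body_encoding {
    ENC_PLAIN = 0, ENC_TEX = 1, ENC_RTF = 2,
    ENC_POSTSCRIPT = 3, ENC_GIF = 4, ENC_TIFF = 5
};

/* A fixed-size header in front of every record; the body follows it. */
struct rec_header {
    uint32_t type;          /* one of enum rec_type         */
    uint32_t encoding;      /* one of enum body_encoding    */
    uint32_t length;        /* number of body bytes to come */
};

/* Walk a book file record by record, skipping over the bodies. */
static void list_records(FILE *f)
{
    struct rec_header h;

    while (fread(&h, sizeof h, 1, f) == 1) {
        printf("type %u, encoding %u, %u body bytes\n",
               (unsigned)h.type, (unsigned)h.encoding, (unsigned)h.length);
        if (fseek(f, (long)h.length, SEEK_CUR) != 0)
            break;
    }
}

int main(int argc, char **argv)
{
    FILE *f = (argc > 1) ? fopen(argv[1], "rb") : NULL;

    if (f == NULL) {
        fprintf(stderr, "usage: %s bookfile\n", argv[0]);
        return 1;
    }
    list_records(f);
    fclose(f);
    return 0;
}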

Pierce
____________________________________________________________________________
You can flame or laud me at:
wetter@tybalt.caltech.edu or wetter@csvax.caltech.edu or pwetter@caltech.bitnet

Caution: All my postings are 100% accurate from my point of view. However, my
point of view rarely translates into English. Therefore any errors in my
posting are your fault for not interpreting it correctly.