[comp.arch] Big files, and lots of 'em: 32 bits is not enough

tbray@watsol.waterloo.edu (Tim Bray) (08/09/90)

mash@mips.COM (John Mashey) writes:
>jesup@cbmvax (Randell Jesup) writes:
>>Few machines
>>(percentage-wise) even have 4 GB of storage, let alone files larger than 4GB
>>(I've never even seen a file larger than 100MB, even on mainframes).

>However, I'd STRONGLY disagree with the idea that 64-bit machines will
>remain confined to the super- & minisuper world for 10-20 more years.
>So, here's a thought to stimulate discussion:
>	What applications (outside the scientific / MCAD ones that
>	can obviously consume the space) would benefit from 64-bit
>	machines?

An example: text database.  In a textbase, you must have addressability to the
byte, not to the record.  Also, it is very very convenient to regard all the
text in your universe as being in one linear address space.  32 bits worth of
text is not very much text in real-world terms.  Here is some 'ls' output from
a directory containing the electronic Oxford English Dictionary, Second
Edition, and some supporting files.

-r--r-----  1 tbray    572728830 Sep  7  1989 oed-2e
-r--r-----  1 tbray    179728816 Sep  7  1989 oed-2e.struct
-r--r-----  1 tbray    475589360 Sep  8  1989 oed-2e.tree

About 28 bits worth right there.  But I want a database with the OED and the
complete Shakespeare and Chemical Abstracts and the complete Library of
Congress Catalogue and a couple decades' worth of AP wire service; that's
almost enough text to be really useful.  But seriously folks, there's lots of
insurance companies and research institutions and government departments with
*lots* more than 4 Gb sitting around...
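
As a rough check on the "bits worth" arithmetic, here is a minimal sketch in C
that counts how many bits a byte offset needs for a file of a given size. The
sizes are copied from the 'ls' listing above; the names offset_bits,
total_size and OED_SIZES are made up for this illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* File sizes in bytes, copied from the 'ls' listing:
   oed-2e, oed-2e.struct, oed-2e.tree. */
static const uint64_t OED_SIZES[3] = {
    572728830ULL, 179728816ULL, 475589360ULL
};

/* Bits needed to represent every byte offset 0 .. size-1. */
unsigned offset_bits(uint64_t size)
{
    uint64_t max_offset = (size > 0) ? size - 1 : 0;
    unsigned bits = 0;

    while (max_offset > 0) {
        bits++;
        max_offset >>= 1;
    }
    return bits;
}

/* Total size if the files are laid end to end in one linear space. */
uint64_t total_size(const uint64_t *sizes, size_t n)
{
    uint64_t total = 0;
    size_t i;

    for (i = 0; i < n; i++)
        total += sizes[i];
    return total;
}
```

offset_bits(572728830) gives 30 for the text file alone, and the three files
laid end to end (1228047006 bytes) need 31. By this count the figure comes out
nearer 30 or 31 bits than 28, which only strengthens the point about how
little headroom 32 bits leaves.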

And I think it's a *bad* idea, as some have proposed, to create a new datatype
for file offsets as opposed to addresses as opposed to integers.  As Henry
Spencer and others have repeatedly pointed out, the VAX made us all sloppy by
allowing us to interchange pointers, integers, and offsets promiscuously.  But
too late, we're stuck with it; there's not enough programmer-years in the
lifetime of the universe to fix all the useful software that does this.  And
y'know, in my heart of hearts, I'm not sure it's a bad thing; it certainly
allows the use of some extremely elegant and rigorously simple programming
paradigms.

Cheers, Tim Bray, Open Text Systems, Waterloo, Ont.

davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr) (08/09/90)

In article <1990Aug8.222644.23683@watdragon.waterloo.edu> tbray@watsol.waterloo.edu (Tim Bray) writes:

| And I think it's a *bad* idea, as some have proposed, to create a new datatype
| for file offsets as opposed to addresses as opposed to integers.

  However, X3J11 didn't agree with that idea, and there is a type for
offset in a file, and pointer types are not the same as integers.
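
  The type in question is presumably fpos_t, which ANSI C pairs with
fgetpos() and fsetpos(). A minimal sketch of using it without ever treating
a file position as an integer (peek_at is a name made up for this example):

```c
#include <stdio.h>

/* Read one byte at a previously saved position, then put the stream
   back where it was.  Returns the byte read, or EOF on any failure.
   The position is an opaque fpos_t throughout; no integer arithmetic
   is done on it. */
int peek_at(FILE *fp, const fpos_t *where)
{
    fpos_t here;
    int c;

    if (fgetpos(fp, &here) != 0)    /* remember the current position */
        return EOF;
    if (fsetpos(fp, where) != 0)    /* jump to the saved position */
        return EOF;
    c = fgetc(fp);
    if (fsetpos(fp, &here) != 0)    /* restore the original position */
        return EOF;
    return c;
}
```

  Because fpos_t is opaque, an implementation is free to make it wider than
long, which is the property that matters for files past 32 bits. The cost is
that you cannot compute a position, only return to one you have visited.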

|                                                                   As Henry
| Spencer and others have repeatedly pointed out, the VAX made us all sloppy by
| allowing us to interchange pointers, integers, and offsets promiscuously.  But
| too late, we're stuck with it; there's not enough programmer-years in the
| lifetime of the universe to fix all the useful software that does this.

  Actually, since there are a lot of machines which have hardware which
functions using a different paradigm than the VAX, a lot of old software
has been upgraded, and most new compilers generate warnings which
encourage programmers to write portable code. Commercial software is
being written more portably to allow use in more markets.

|                                                                         And
| y'know, in my heart of hearts, I'm not sure it's a bad thing; it certainly
| allows the use of some extremely elegant and rigorously simple programming
| paradigms.

  Absolutely no comment.
-- 
bill davidsen	(davidsen@crdos1.crd.GE.COM -or- uunet!crdgw1!crdos1!davidsen)
            "Stupidity, like virtue, is its own reward" -me

zenith-steven@cs.yale.edu (Steven Ericsson Zenith) (08/10/90)

In article <1990Aug8.222644.23683@watdragon.waterloo.edu>, tbray@watsol.waterloo.edu (Tim Bray) writes:
|> An example: text database.  In a textbase, you must have addressability to the
|> byte, not to the record.  Also, it is very very convenient to regard all the
|> text in your universe as being in one linear address space.  32 bits worth of
|> text is not very much text in real-world terms.  Here is some 'ls' output from
|> a directory containing the electronic Oxford English Dictionary, Second
|> Edition, and some supporting files.
|> 
|> -r--r-----  1 tbray    572728830 Sep  7  1989 oed-2e
|> -r--r-----  1 tbray    179728816 Sep  7  1989 oed-2e.struct
|> -r--r-----  1 tbray    475589360 Sep  8  1989 oed-2e.tree

Can you explain to us what these files contain and how the data in them is
structured/stored/encoded?
 
|> About 28 bits worth right there.  But I want a database with the OED and the
|> complete Shakespeare and Chemical Abstracts and the complete Library of
|> Congress Catalogue and a couple decades' worth of AP wire service; that's
|> almost enough text to be really useful.  But seriously folks, there's lots of
|> insurance companies and research institutions and government departments with
|> *lots* more than 4 Gb sitting around...

Isn't there a preferable relative means to address your data? Surely it's
more extensible, and thus you don't have to worry about the limits of word
size. Not that I'm arguing for small words, but is linear addressing of data
really the burning issue? How do you manage distributed data? I know .. decode
the address into smaller components .. so why do you want long words? Why not
use several smaller words to construct an address in the first place? I
address these comments referring to your particular data set: text. Do you
*really* want a means to linearly address the documents you describe? What
particular advantage does this give you over the natural decomposition of the
data? When we get to spaces this size, would some paging mechanism be
preferable?
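
One way to picture the "several smaller words" alternative is a (document,
offset) pair of 32-bit words, with a table of starting positions to translate
to and from a single linear address. The sketch below is only an illustration
under that assumption; seg_addr, seg_to_linear, linear_to_seg and the starts[]
table are hypothetical names, not anything described in the thread.

```c
#include <stddef.h>
#include <stdint.h>

/* A two-word address: which document, and the byte offset within it. */
typedef struct {
    uint32_t doc;
    uint32_t offset;
} seg_addr;

/* starts[i] is the linear position where document i begins;
   starts[ndocs] is the total size of the collection. */
uint64_t seg_to_linear(const uint64_t *starts, seg_addr a)
{
    return starts[a.doc] + a.offset;
}

/* Translate a linear position back into (doc, offset) by binary search.
   Returns 0 on success, -1 if the position is out of range or the
   offset within the document would not fit in 32 bits. */
int linear_to_seg(const uint64_t *starts, size_t ndocs,
                  uint64_t linear, seg_addr *out)
{
    size_t lo = 0, hi = ndocs;
    uint64_t off;

    if (ndocs == 0 || linear >= starts[ndocs])
        return -1;
    while (hi - lo > 1) {               /* invariant: starts[lo] <= linear */
        size_t mid = lo + (hi - lo) / 2;
        if (starts[mid] <= linear)
            lo = mid;
        else
            hi = mid;
    }
    off = linear - starts[lo];
    if (off > UINT32_MAX)
        return -1;
    out->doc = (uint32_t)lo;
    out->offset = (uint32_t)off;
    return 0;
}
```

With the three OED files as documents, starts[] would be {0, 572728830,
752457646, 1228047006}. The pair form never needs a word wider than 32 bits as
long as no single document exceeds 4 GB, but every cross-document comparison
or span then goes through the table, which is the convenience the linear view
avoids.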

--
Steven Ericsson Zenith              *            email: zenith@cs.yale.edu
Fax: (203) 466 2768                 |            voice: (203) 432 1278
"The tower should warn the people not to believe in it." - P.D.Ouspensky
Yale University Dept of Computer Science 51 Prospect St New Haven CT 06520 USA

zenith-steven@cs.yale.edu (Steven Ericsson Zenith) (08/10/90)

Good grief. My last posting ended up looking a real mess.
My apologies; I shall gripe at the XRN people.

--
Steven Ericsson Zenith              *            email: zenith@cs.yale.edu
Fax: (203) 466 2768                 |            voice: (203) 432 1278
"The tower should warn the people not to believe in it." - P.D.Ouspensky
Yale University Dept of Computer Science 51 Prospect St New Haven CT 06520 USA