[comp.arch] >32 bits: why?

tbray@watsol.waterloo.edu (Tim Bray) (10/26/90)

jpk@ingres.com (Jon Krueger) writes:
>From aglew@crhc.uiuc.edu (Andy Glew):
>> People already have databases that exceed 32 bits worth of address if
>> memory mapped.
>Just to play devil's advocate, why use the processor to address
>persistent, shared data objects?  The processor is designed to address
>objects that live in volatile store...
>Well?

The point is well-taken, in particular when your nonvolatile objects are 
not particularly homogeneous and pointer arithmetic on them is not particularly
interesting - the kind of situation a person from Ingres is used to.

One counter-example is text databases.  Given the blurring between semantics
and structure in text, and the need to support things like full text search,
it is often necessary that pointers have character granularity.  And indexes,
which need to be persistent, are just collections of pointers.  And text
databases larger than 2^32 characters are starting to become rather un-rare.

Of course, there are ways to work around this, just as there are ways to
more-or-less do real computation on 16 bit computers.  But there are many
wins from having your persistent pointers efficiently dealt with via 
word-oriented computer operations.  For example, a not-uncommon text database
operation might be: which of these 1.87 million citations (pointers starting
at 0x23fc8932 in the index) contain instances of both "Japan" (37,210
occurrences in the database) and "telephone" (41,881 occurrences)?  There
are algorithms that can do the pointer manipulations and give you answers
pretty quick, but they run a lot better if you can do a lot of the
comparisons and moves in hardware.
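
For concreteness, here is a rough sketch of the kind of compare-and-advance
loop such an operation rests on.  The 64-bit offset type and the flat-array
layout are purely illustrative, not anyone's real index format; the full
query (citations containing both words) involves more bookkeeping, but it
bottoms out in merges like this:

    /* Illustrative sketch: intersect two sorted lists of byte offsets
     * into the text database.  Once the database passes 2^32 characters,
     * each offset needs a 64-bit type, and the merge is nothing but
     * word-sized compares and moves - which is why it helps a lot if the
     * hardware can do those natively.                                   */
    #include <stddef.h>
    #include <stdint.h>

    typedef uint64_t dbptr;          /* byte offset into the database */

    size_t intersect(const dbptr *a, size_t na,
                     const dbptr *b, size_t nb, dbptr *out)
    {
        size_t i = 0, j = 0, n = 0;
        while (i < na && j < nb) {
            if (a[i] < b[j])
                i++;
            else if (b[j] < a[i])
                j++;
            else {
                out[n++] = a[i];     /* offset present in both lists */
                i++;
                j++;
            }
        }
        return n;                    /* number of common offsets */
    }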

Cheers, Tim Bray, Open Text Systems

cutting@parc.xerox.com (Doug Cutting) (10/26/90)

In article <1990Oct26.012815.11941@watdragon.waterloo.edu> tbray@watsol.waterloo.edu (Tim Bray) writes:

   One counter-example is text databases.  Given the blurring between semantics
   and structure in text, and the need to support things like full text search,
   it is often necessary that pointers have character granularity.  And indexes,
   which need to be persistent, are just collections of pointers.  And text
   databases larger than 2^32 characters are starting to become rather un-rare.

   Of course, there are ways to work around this, just as there are ways to
   more-or-less do real computation on 16 bit computers.  But there are many
   wins from having your persistent pointers efficiently dealt with via 
   word-oriented computer operations.  For example, a not-uncommon text database
   operation might be: which of these 1.87 million citations (pointers starting
   at 0x23fc8932 in the index) contain instances of both "Japan" (37,210
   occurrences in the database) and "telephone" (41,881 occurrences)?  There
   are algorithms that can do the pointer manipulations and give you answers
   pretty quick, but they run a lot better if you can do a lot of the
   comparisons and moves in hardware.

In the text indexing literature one usually speaks of addressing words
within documents, not characters within databases.  2^32 is large
enough to hold both the number of documents in today's large text
databases and the number of words (or characters) in the largest
individual documents.  Perhaps you regard this as a workaround, but it
too allows optimizations.  When intersecting document-major citation
lists one can skip all the occurrences within a document if that
document does not occur in the other citation list.
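
A rough sketch of that skip, assuming each citation list is sorted by
document number and carries a per-document occurrence count (the layout and
names here are only illustrative):

    /* Illustrative layout: a document-major citation list is a sorted
     * array of (document, occurrence count) entries; the word positions
     * themselves are stored elsewhere.  Intersecting two such lists lets
     * you discard whole documents - and all their occurrences - without
     * ever touching the positions.                                      */
    #include <stddef.h>

    struct doc_entry {
        unsigned doc;     /* document number, comfortably < 2^32      */
        unsigned nocc;    /* occurrences of the term in that document */
    };

    size_t intersect_docs(const struct doc_entry *a, size_t na,
                          const struct doc_entry *b, size_t nb,
                          unsigned *out)
    {
        size_t i = 0, j = 0, n = 0;
        while (i < na && j < nb) {
            if (a[i].doc < b[j].doc)
                i++;                 /* document absent from b: skip it */
            else if (b[j].doc < a[i].doc)
                j++;                 /* document absent from a: skip it */
            else {
                out[n++] = a[i].doc; /* both terms occur here; only now
                                        would per-occurrence work be done */
                i++;
                j++;
            }
        }
        return n;
    }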

	Doug

tbray@watsol.waterloo.edu (Tim Bray) (10/28/90)

jpk@ingres suggested that >32-bit addressing for persistent objects might not
make any sense, given that the number of *objects* one addresses is usually
<< 2^32.

I replied that in text databases, character addressing is often required and
that 2^32 characters is not hard to come by.

cutting@parc.xerox.com (Doug Cutting) writes:
>In the text indexing literature one usually speaks of addressing words
>within documents, not characters within databases.  2^32 is large
>enough to hold both the number of documents in todays large text
>databases and the number of words (or characters) in the largest
>individual documents.

(This should really be in comp.text.databases or comp.database.text, which
would be nice groups if they existed).

Cutting is right if you're content to address *words* in pre-cooked 
*documents*.  But

1. For many applications indexing words is counter-productive, and indexing
   arbitrary-length byte strings is much more useful.  The techniques for doing
   this pretty well require database byte granularity.
2. Assuming that you know what the "documents" are at index creation time
   is just plain wrong, given the nature of text.  Users should have the
   option of defining new "document" structures at arbitrary locations in
   the database at any time based on their own criteria.  This is why
   relational technology is sometimes a poor match for text applications.
3. Most databases contain many types of documents, which will typically nest
   within each other and overlap.  If a particular index point is contained 
   within instances of document type "article", "section", "subsection", 
   "paragraph", "page", and "cross reference", you're going to have trouble 
   building an index structure that efficiently supports this mapping without 
   byte granularity (see the sketch after this list).
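
To picture point 3, here is a small illustrative sketch in which "documents"
are just typed byte ranges over the database, possibly nested or overlapping,
and an index point is a plain 64-bit byte offset (the struct and names are
purely illustrative, not any particular system's):

    /* Illustrative sketch: typed byte ranges over the database, and a
     * containment test for a single index point.  A real index would be
     * organized so this doesn't take a linear scan, but the essential
     * operations are range comparisons on offsets wider than 32 bits.  */
    #include <stdint.h>
    #include <stdio.h>

    struct region {
        const char *type;       /* "article", "section", "page", ...  */
        uint64_t    start, end; /* byte offsets, end exclusive        */
    };

    /* Print every region, of whatever type, containing the index point. */
    void containing(const struct region *r, int nr, uint64_t point)
    {
        int i;
        for (i = 0; i < nr; i++)
            if (r[i].start <= point && point < r[i].end)
                printf("%s [%llu,%llu)\n", r[i].type,
                       (unsigned long long)r[i].start,
                       (unsigned long long)r[i].end);
    }

The point is not the loop, of course, but that once offsets pass 2^32 the
comparisons themselves want to be native word operations.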

So, there are important application classes where persistent pointers >32
bits are necessary.  

But I'm not objective - our company is based on selling software that deals
with texts using a model constrained by #3 above.  I'm really hoping there
are lots of other application types applying pressure to leapfrog the 
32-bit curve.

Cheers, Tim Bray, Open Text Systems