tbray@watsol.waterloo.edu (Tim Bray) (10/26/90)
jpk@ingres.com (Jon Krueger) writes:
>From aglew@crhc.uiuc.edu (Andy Glew):
>> People already have databases that exceed 32 bits worth of address if
>> memory mapped.
>Just to play devil's advocate, why use the processor to address
>persistent, shared data objects?  The processor is designed to address
>objects that live in volatile store...
>Well?

The point is well-taken, in particular when your nonvolatile objects are
not particularly homogeneous and pointer arithmetic on them is not
particularly interesting - the kind of situation a person from Ingres is
used to.

One counter-example is text databases.  Given the blurring between
semantics and structure in text, and the need to support things like full
text search, it is often necessary that pointers have character
granularity.  And indexes, which need to be persistent, are just
collections of pointers.  And text databases larger than 2^32 characters
are starting to become rather un-rare.

Of course, there are ways to work around this, just as there are ways to
more-or-less do real computation on 16-bit computers.  But there are many
wins from having your persistent pointers efficiently dealt with via
word-oriented computer operations.  For example, a not-uncommon text
database operation might be: which of these 1.87 million citations
(pointers starting at 0x23fc8932 in the index) contain instances of both
"Japan" (37,210 occurrences in the database) and "telephone" (41,881
occurrences)?  There are algorithms that can do the pointer manipulations
and give you answers pretty quickly, but they run a lot better if you can
do a lot of the comparisons and moves in hardware.

Cheers, Tim Bray, Open Text Systems
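A minimal sketch in C of the kind of pointer-list intersection Bray
describes, assuming each term's postings have already been gathered into
a sorted array of 64-bit persistent pointers.  All names are hypothetical
and this is illustrative only, not Open Text's actual scheme; the point
is that each step of the merge reduces to a single hardware-width
compare.

#include <stdint.h>
#include <stdio.h>

/* Intersect two sorted arrays of 64-bit persistent pointers,
 * writing the common values to out; returns how many were found. */
static size_t intersect64(const uint64_t *a, size_t na,
                          const uint64_t *b, size_t nb,
                          uint64_t *out)
{
    size_t i = 0, j = 0, k = 0;
    while (i < na && j < nb) {
        if (a[i] < b[j])
            i++;                       /* only in a: advance a */
        else if (a[i] > b[j])
            j++;                       /* only in b: advance b */
        else {
            out[k++] = a[i];           /* in both: keep it */
            i++; j++;
        }
    }
    return k;
}

int main(void)
{
    /* Toy postings: byte offsets into an imaginary database. */
    uint64_t japan[]     = { 0x23fc8932, 0x2400a110, 0x512fee01 };
    uint64_t telephone[] = { 0x1000ffff, 0x2400a110, 0x512fee01 };
    uint64_t both[3];
    size_t n = intersect64(japan, 3, telephone, 3, both);
    for (size_t i = 0; i < n; i++)
        printf("0x%llx\n", (unsigned long long)both[i]);
    return 0;
}

On a machine with native 64-bit compares the inner loop is a handful of
instructions per posting; on a 32-bit machine every compare splits into
two, which is the hardware win Bray is pointing at.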
cutting@parc.xerox.com (Doug Cutting) (10/26/90)
In article <1990Oct26.012815.11941@watdragon.waterloo.edu> tbray@watsol.waterloo.edu (Tim Bray) writes:
>One counter-example is text databases.  Given the blurring between
>semantics and structure in text, and the need to support things like full
>text search, it is often necessary that pointers have character
>granularity.  And indexes, which need to be persistent, are just
>collections of pointers.  And text databases larger than 2^32 characters
>are starting to become rather un-rare.
>
>Of course, there are ways to work around this, just as there are ways to
>more-or-less do real computation on 16-bit computers.  But there are many
>wins from having your persistent pointers efficiently dealt with via
>word-oriented computer operations.  For example, a not-uncommon text
>database operation might be: which of these 1.87 million citations
>(pointers starting at 0x23fc8932 in the index) contain instances of both
>"Japan" (37,210 occurrences in the database) and "telephone" (41,881
>occurrences)?  There are algorithms that can do the pointer manipulations
>and give you answers pretty quickly, but they run a lot better if you can
>do a lot of the comparisons and moves in hardware.
In the text indexing literature one usually speaks of addressing words
within documents, not characters within databases.  2^32 is large
enough to hold both the number of documents in today's large text
databases and the number of words (or characters) in the largest
individual documents.  Perhaps you regard this as a workaround, but it
too allows optimizations.  When intersecting document-major citation
lists one can skip all the occurrences within a document if that
document does not occur in the other citation list.
Doug
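A minimal sketch in C of the document-major skipping Cutting describes,
assuming postings are (document, word-position) pairs sorted by document.
The structure and names are illustrative, not taken from any real
indexing system.

#include <stdint.h>
#include <stdio.h>

typedef struct { uint32_t doc; uint32_t pos; } posting;

/* Advance past every posting for the document at index i. */
static size_t skip_doc(const posting *p, size_t n, size_t i)
{
    uint32_t d = p[i].doc;
    while (i < n && p[i].doc == d)
        i++;
    return i;
}

/* Print each document that appears in both citation lists,
 * skipping whole runs of occurrences when a document is absent
 * from the other list. */
static void intersect_docs(const posting *a, size_t na,
                           const posting *b, size_t nb)
{
    size_t i = 0, j = 0;
    while (i < na && j < nb) {
        if (a[i].doc < b[j].doc)
            i = skip_doc(a, na, i);    /* doc absent from b */
        else if (a[i].doc > b[j].doc)
            j = skip_doc(b, nb, j);    /* doc absent from a */
        else {
            printf("doc %u\n", a[i].doc);
            i = skip_doc(a, na, i);
            j = skip_doc(b, nb, j);
        }
    }
}

int main(void)
{
    posting japan[]     = { {1, 4}, {1, 90}, {3, 7} };
    posting telephone[] = { {2, 5}, {3, 12}, {3, 44} };
    intersect_docs(japan, 3, telephone, 3);   /* prints: doc 3 */
    return 0;
}

Note how both of doc 1's occurrences are skipped in one run of compares
once doc 1 fails to appear in the other list.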
tbray@watsol.waterloo.edu (Tim Bray) (10/28/90)
jpk@ingres suggested that >32-bit addressing for persistent objects might
not make any sense, given that the number of *objects* one addresses is
usually << 2^32.  I replied that in text databases, character addressing
is often required and that 2^32 characters is not hard to come by.

cutting@parc.xerox.com (Doug Cutting) writes:
>In the text indexing literature one usually speaks of addressing words
>within documents, not characters within databases.  2^32 is large
>enough to hold both the number of documents in today's large text
>databases and the number of words (or characters) in the largest
>individual documents.

(This should really be in comp.text.databases or comp.database.text,
which would be nice groups if they existed.)

Cutting is right if you're content to address *words* in pre-cooked
*documents*.  But:

1. For many applications, indexing words is counter-productive; indexing
arbitrary-length byte strings is much more useful.  The techniques for
doing this all but require database byte granularity.

2. Assuming that you know what the "documents" are at index creation time
is just plain wrong, given the nature of text.  Users should have the
option of defining new "document" structures at arbitrary locations in
the database at any time, based on their own criteria.  This is why
relational technology is sometimes a poor match for text applications.

3. Most databases contain many types of documents, which will typically
nest within each other and overlap.  If a particular index point is
contained within instances of document types "article", "section",
"subsection", "paragraph", "page", and "cross reference", you're going to
have trouble building an index structure that efficiently supports this
mapping without byte granularity.  (A sketch of this containment problem
follows this article.)

So, there are important application classes where persistent pointers
wider than 32 bits are necessary.  But I'm not objective - our company is
based on selling software that deals with texts using a model constrained
by #3 above.  I'm really hoping there are lots of other application types
putting pressure on to leapfrog the 32-bit curve.

Cheers, Tim Bray, Open Text Systems
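A minimal sketch in C of the containment problem in Bray's point 3,
assuming a "document" of any type reduces to a byte range over the
database, so that asking which structures contain a byte-granular index
point is a range test on 64-bit offsets.  The region type, the linear
scan, and all names are hypothetical; a real system would index the
regions rather than scan them.

#include <stdint.h>
#include <stdio.h>

typedef struct {
    const char *type;   /* "article", "section", "page", ...   */
    uint64_t start;     /* first byte of the region             */
    uint64_t end;       /* one past the last byte of the region */
} region;

/* Print every region that contains the given index point. */
static void containing(const region *r, size_t n, uint64_t point)
{
    for (size_t i = 0; i < n; i++)
        if (r[i].start <= point && point < r[i].end)
            printf("%s [%llu, %llu)\n", r[i].type,
                   (unsigned long long)r[i].start,
                   (unsigned long long)r[i].end);
}

int main(void)
{
    /* Regions nest ("section" inside "article") and overlap
     * ("page" cuts across both) - exactly the structures that
     * resist fixed document boundaries. */
    region r[] = {
        { "article", 0,    9000 },
        { "section", 2000, 5000 },
        { "page",    4500, 6500 },
    };
    containing(r, 3, 4700);   /* inside all three regions */
    return 0;
}

Because the structures nest and overlap freely, the byte offset is the
one coordinate they all share - which is the argument for byte
granularity in the persistent pointers.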