[net.database] full text database systems

jay@isis.UUCP (Jay) (03/04/86)

I was new to the net when I last posted this, and it could be that
I did it incorrectly, since I didn't get any answers.  On the other
hand, maybe nobody had anything to say.

Be that as it may, I'm trying again just to make sure.  Sorry in
advance if it's a bother to anybody.

	I am interested in the methods used to create full text retrieval
databases - e.g. select articles based on words in an article.  Specifically,
is there some general place I can go to get info on design/implementation of
such a system?  Detailed questions such as article storage vs. concordance
storage vs. query processing vs. on and on.

    /* elaboration on original posting */
	For example, Westlaw (TM?), the court case on-line searching
system, allows a query to be built interactively, and a (set of) database(s)
selected (being the state in question, or the federal court system), and
the system goes off and finds all cases which contain the words in the query.
Now the system must be maintaining some kind of a concordance of words,
with associated word placement information to handle the connectives that
the query language allows, and probably uses relational logic to handle
the query processing.  What I am interested in is more complete discussion
of the organization and processing theories of such systems, and whether
basic database functions/systems/?? are available commercially (or in the
PD, although doubtful), or whether the ones that are out there are built
from scratch.
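
To make the "concordance of words, with associated word placement
information" idea concrete, here is a minimal sketch in Python.  It is not
how Westlaw or any other commercial system is actually built; the names
build_concordance, within and all_of are hypothetical and the two sample
cases are invented.  Each word maps to the documents it occurs in and to its
word positions there, which is what a proximity connective needs.

```python
import re
from collections import defaultdict

def build_concordance(docs):
    """Map each word to {doc_id: [word positions]} (a positional concordance)."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, word in enumerate(re.findall(r"[a-z0-9]+", text.lower())):
            index[word].setdefault(doc_id, []).append(pos)
    return index

def all_of(index, *words):
    """AND connective: ids of documents containing every given word."""
    sets = [set(index.get(w, {})) for w in words]
    return set.intersection(*sets) if sets else set()

def within(index, a, b, n):
    """Proximity connective: ids where a and b occur within n words of each other."""
    pa, pb = index.get(a, {}), index.get(b, {})
    hits = set()
    for doc_id in pa.keys() & pb.keys():
        if any(abs(i - j) <= n for i in pa[doc_id] for j in pb[doc_id]):
            hits.add(doc_id)
    return hits

# Invented sample data, for illustration only.
docs = {
    "case1": "The landlord sued the tenant for unpaid rent.",
    "case2": "The landlord repaired the roof. Years later a tenant moved in.",
}
index = build_concordance(docs)
print(all_of(index, "landlord", "tenant"))     # both cases
print(within(index, "landlord", "tenant", 3))  # only case1
```

The article text itself can be stored separately; only the concordance is
consulted until a matching article has to be displayed.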

	Any help out there?

Jay Batson
{hao, nbires} isis!jay

johnl@ima.UUCP (John R. Levine) (03/06/86)

In article <362@isis.UUCP> jay@isis.UUCP (Jay) writes:
>	I am interested in the methods used to create full text retrieval
>databases - e.g. select articles based on words in an article.  Specifically,
>is there some general place I can go to get info on design/implementation of
>such a system?  Detailed questions such as article storage vs. concordance
>storage vs. query processing vs. on and on.

It is my impression that every full text data base yet built is a special
hand-crafted job.  The most popular one appears to be LEXIS/NEXIS, which keeps on
line the full text of legal decisions, newspapers, encyclopedias, and such.  
They keep complete indices of all of the words in every document, leaving out 
only words like "the" which are too common to have much indexing use.  The 
documents are organized into libraries, e.g.  Vermont superior court decisions 
for 1973, but they seem to swoop through the indices to do anything.  A Lexis 
search usually takes the better part of a minute (although they're clever 
about sending stuff to your screen to keep you distracted in the meantime.) 
This only works because updates to the data base are applied very infrequently 
relative to the number of searches, so they add new text and remake the 
indices in the middle of the night.  
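
The general shape of that arrangement (a word index per library, common
words left out, and a wholesale rebuild instead of incremental updates) can
be sketched in Python as follows.  This is an illustration under stated
assumptions, not a description of how LEXIS/NEXIS is implemented: the stop
word list, the library name and the functions rebuild_index and search are
all hypothetical.

```python
import re
from collections import defaultdict

# Assumed stop word list; a real system would choose its own.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in"}

def rebuild_index(libraries):
    """Batch-build {library: {word: set of doc ids}} from scratch."""
    index = {}
    for lib, docs in libraries.items():
        postings = defaultdict(set)
        for doc_id, text in docs.items():
            for word in re.findall(r"[a-z0-9]+", text.lower()):
                if word not in STOP_WORDS:
                    postings[word].add(doc_id)
        index[lib] = dict(postings)
    return index

def search(index, library, words):
    """Ids of documents in one library that contain every non-stop query word."""
    terms = [w.lower() for w in words if w.lower() not in STOP_WORDS]
    if not terms:
        return set()
    postings = index.get(library, {})
    result = None
    for term in terms:
        ids = postings.get(term, set())
        result = set(ids) if result is None else result & ids
    return result

# Invented sample data, for illustration only.
libraries = {
    "vt-superior-1973": {
        "d1": "The court denied the motion",
        "d2": "Motion granted in part",
    },
}
index = rebuild_index(libraries)

# The "middle of the night" update: add new text, then remake the index.
libraries["vt-superior-1973"]["d3"] = "The court granted a new trial"
index = rebuild_index(libraries)
```

Searches between rebuilds only read the index, which is why the scheme
depends on updates being rare relative to queries.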

There have also been some attempts at making hardware engines that stream data 
from a disk as fast as the disk can provide it with full-track reads, and scan 
the text as it goes by.  None of them seem to have been very successful, 
probably because reading a whole disk, even at full speed, takes a long time 
if the disk is at all large.  The Britton-Lee IDM has a similar device which 
is used to speed up relational queries; it seems to work well but only because 
it is embedded in a data base system which structures and organizes data so 
that the speed-up board is not looking at whole disks.  
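
A rough software analogue of the scan-as-it-streams approach, in Python, is
sketched below.  It says nothing about how any of the hardware engines work
internally; scan_stream and its chunk_size parameter are hypothetical.  The
point it illustrates is that every byte has to be read for every query, so
the cost grows with the size of the data and not with the number of matches.

```python
import io

def scan_stream(stream, pattern, chunk_size=65536):
    """Yield byte offsets of pattern, reading the stream in fixed-size chunks."""
    if not pattern:
        raise ValueError("pattern must not be empty")
    keep = len(pattern) - 1   # bytes carried over so boundary matches are seen
    tail = b""
    base = 0                  # stream offset of the first byte of tail
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        buf = tail + chunk
        start = 0
        while True:
            i = buf.find(pattern, start)
            if i < 0:
                break
            yield base + i
            start = i + 1
        tail = buf[-keep:] if keep else b""
        base += len(buf) - len(tail)

# Invented sample data; a tiny chunk size forces matches to span chunks.
data = b"xxgrapefruitxxgrapefruit"
offsets = list(scan_stream(io.BytesIO(data), b"grapefruit", chunk_size=5))
print(offsets)  # [2, 14]
```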

There are also systems that are hybrids between the Lexis approach and a 
conventional data base.  One I've seen from BRS divides each document into 
sections such as sender, recipient, and separate paragraphs.  This works well 
if your documents are fairly stylized, as business correspondence usually is, 
and lets you ask for "documents from Smith, to Jones, dated in 1978, 
containing a reference to 'grapefruit.'" 
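
A query of that kind can be sketched in Python as below.  The record layout
(sender, recipient, year, paragraphs) and the functions matches and find are
assumptions made for the example, not the BRS data model, and a real system
would presumably use per-field indices instead of checking every document.

```python
import re

def matches(doc, sender=None, recipient=None, year=None, word=None):
    """True if doc satisfies every criterion that was supplied."""
    if sender is not None and doc["sender"].lower() != sender.lower():
        return False
    if recipient is not None and doc["recipient"].lower() != recipient.lower():
        return False
    if year is not None and doc["year"] != year:
        return False
    if word is not None:
        target = word.lower()
        found = any(
            target in re.findall(r"[a-z0-9]+", para.lower())
            for para in doc["paragraphs"]
        )
        if not found:
            return False
    return True

def find(docs, **criteria):
    """Ids of all documents matching the given field criteria."""
    return [d["id"] for d in docs if matches(d, **criteria)]

# Invented sample correspondence, for illustration only.
letters = [
    {"id": 1, "sender": "Smith", "recipient": "Jones", "year": 1978,
     "paragraphs": ["Thank you for the grapefruit.", "Regards."]},
    {"id": 2, "sender": "Smith", "recipient": "Jones", "year": 1979,
     "paragraphs": ["More grapefruit arrived today."]},
    {"id": 3, "sender": "Jones", "recipient": "Smith", "year": 1978,
     "paragraphs": ["No fruit was mentioned here."]},
]
print(find(letters, sender="Smith", recipient="Jones", year=1978,
           word="grapefruit"))  # [1]
```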

I'd love to hear about more technically interesting text databases.  Note that 
technologies like CD-ROMs in a sense only make the problem worse, since they 
allow very large amounts of data with relatively slow access to any part of 
it.  There has to be some good way to organize it, and the problem will soon 
be upon us.  
-- 

John Levine, Javelin Software, Cambridge MA 617-494-1400
{ decvax | harvard | think | ihnp4 | cbosgd }!ima!johnl, Levine@YALE.ARPA

The opinions above are solely those of a 12 year old hacker who has broken
into my account, and not those of my employer or any other organization.

boughter@milano.UUCP (03/06/86)

I'm not sure if this is the info you're interested in but here goes.
I have been looking into databases that process ordinary numeric data
as well as text and have found that there are commercially available
packages that do indexing on large text files and/or build concordances.
This list is not necessarily complete or accurate.

INFO-TEXT                    DRS                      STATUS
Henco, Inc.                  Advanced Data Management CP International
100 Fifth Ave.               15 Main St.              210 South Street
Waltham, MA 02154            Kingston, N.J. 08528     New York, NY 10002
617-890-8670                 609-799-4600             212-815-8691


BASIS                                 BRS/SEARCH
Battelle Software Prods               BRS Information Technologies
505 King Ave.                         1200 Route 7
Columbus, OH 43201                    Latham, NY  12110
800-328-2648                          800-235-1209

Disclaimer: No warranties, express or implied, accompany the above data;
I have no interest whatever in any of the above companies.

dts@cullvax.UUCP (Daniel T Senie) (03/17/86)

Without going into hardware-specific systems as another respondent did
(CD-ROM info), there are a few products on the market.

Such commercial systems as BRS handle the type of retrieval you discuss.
The system is expensive to use. If the text you are searching has already
been entered by someone on their system, though, it's cheaper than
putting it all in yourself. BRS, I believe, will sell you their software
to run on your machine. This is attractive if you have REALLY HIGH VOLUME.

Another product, ZyIndex, does similar work on PCs and (I think) maybe
Unix systems (at least there was talk of such). The cost on this product
is LOW! It can work with data stored on most disk media, and would probably
be well suited for WORM (Write Once Read Many) laserdisks.

Good Luck.

My views do not reflect those of Cullinet ...

-- 
Daniel T. Senie			TEL.: (617) 329-7700 x3168
Cullinet Software, Inc.		UUCP: seismo!{ll-xn,harvard}!rclex!cullvax!dts
400 Blue Hill Drive		ARPA: rclex!cullvax!dts@ll-xn.ARPA
Westwood, MA 02090-2198