jay@isis.UUCP (Jay) (03/04/86)
I was new to the net when I last posted this, and it could be that I did it incorrectly, since I didn't get any answers. On the other hand, maybe nobody had anything to say. Be that as it may, I'm trying again just to make sure. Sorry in advance if it's a bother to anybody.

I am interested in the methods used to create full text retrieval databases - e.g. select articles based on words in an article. Specifically, is there some general place I can go to get info on design/implementation of such a system? Detailed questions such as article storage vs. concordance storage vs. query processing vs. on and on.

/* elaboration on original posting */

For example, Westlaw (TM?), the on-line court case searching system, lets you build a query interactively and select a (set of) database(s) (the state in question, or the federal court system); the system then goes off and finds all cases which contain the words in the query. Now the system must be maintaining some kind of a concordance of words, with associated word placement information to handle the connectives that the query language allows, and probably uses relational logic to handle the query processing.

What I am interested in is a more complete discussion of the organization and processing theories of such systems, and whether basic database functions/systems/?? are available commercially (or in the PD, although doubtful), or whether the ones that are out there are built from scratch.

Any help out there?

Jay Batson
{hao, nbires} isis!jay
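[Editor's sketch] The "concordance of words, with associated word placement information" guessed at above is what is now usually called a positional inverted index. The following is a minimal illustration in Python of how such a structure could answer AND and "within N words" connectives. It is not a description of how Westlaw or any other real system works; the function names and the tokenizing rule are assumptions made up for this example.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_concordance(docs):
    """Map each word to {doc_id: [word positions]} (hypothetical layout)."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, word in enumerate(tokenize(text)):
            index[word].setdefault(doc_id, []).append(pos)
    return index

def search_and(index, *words):
    """Return ids of documents that contain every one of the given words."""
    sets = [set(index.get(w.lower(), {})) for w in words]
    return set.intersection(*sets) if sets else set()

def search_near(index, a, b, distance):
    """Return ids of documents where a and b occur within `distance` words."""
    post_a = index.get(a.lower(), {})
    post_b = index.get(b.lower(), {})
    hits = set()
    # Only documents containing both words can match.
    for doc_id in post_a.keys() & post_b.keys():
        if any(abs(pa - pb) <= distance
               for pa in post_a[doc_id] for pb in post_b[doc_id]):
            hits.add(doc_id)
    return hits

if __name__ == "__main__":
    docs = {
        "case1": "The defendant breached the contract before delivery.",
        "case2": "No contract existed; the breach claim concerns a lease.",
    }
    idx = build_concordance(docs)
    print(search_and(idx, "contract", "delivery"))
    print(search_near(idx, "contract", "breach", 3))
```

Keeping word positions in the index is what lets proximity connectives be answered without rereading the article text. The cost is a larger index, which is one side of the article storage vs. concordance storage trade-off raised in the question.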
johnl@ima.UUCP (John R. Levine) (03/06/86)
In article <362@isis.UUCP> jay@isis.UUCP (Jay) writes:
> I am interested in the methods used to create full text retrieval
>databases - e.g. select articles based on words in an article. Specifically,
>is there some general place I can go to get info on design/implementation of
>such a system? Detailed questions such as article storage vs. concordance
>storage vs. query processing vs. on and on.

It is my impression that every full text data base yet built is a special hand crafted job. The most popular one appears to be LEXIS/NEXIS, which keeps on line the full text of legal decisions, newspapers, encyclopedias, and such. They keep complete indices of all of the words in every document, leaving out only words like "the" which are too common to have much indexing use. The documents are organized into libraries, e.g. Vermont superior court decisions for 1973, but they seem to swoop through the indices to do anything. A Lexis search usually takes the better part of a minute (although they're clever about sending stuff to your screen to keep you distracted in the meantime.) This only works because updates to the data base are applied very infrequently relative to the number of searches, so they add new text and remake the indices in the middle of the night.

There have also been some attempts at making hardware engines that stream data from a disk as fast as the disk can provide it with full-track reads, and scan the text as it goes by. None of them seem to have been very successful, probably because reading a whole disk, even at full speed, takes a long time if the disk is at all large. The Britton-Lee IDM has a similar device which is used to speed up relational queries; it seems to work well but only because it is embedded in a data base system which structures and organizes data so that the speed-up board is not looking at whole disks.

There are also systems that are hybrids between the Lexis approach and a conventional data base.
One I've seen from BRS divides each document into sections such as sender, recipient, and separate paragraphs. This works well if your documents are fairly stylized, as business correspondence usually is, and lets you ask for "documents from Smith, to Jones, dated in 1978, containing a reference to 'grapefruit.'"

I'd love to hear about more technically interesting text databases. Note that technologies like CD-ROMs in a sense only make the problem worse, since they allow very large amounts of data with relatively slow access to any part of it. There has to be some good way to organize it, and the problem will soon be upon us.

--
John Levine, Javelin Software, Cambridge MA 617-494-1400
{ decvax | harvard | think | ihnp4 | cbosgd }!ima!johnl, Levine@YALE.ARPA

The opinions above are solely those of a 12 year old hacker who has broken into my account, and not those of my employer or any other organization.
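[Editor's sketch] Two ideas from the post above, leaving very common words out of the index and dividing each document into sections, can be illustrated with a small fielded index in Python. This is only a sketch under stated assumptions: the stop list, the field names (`sender`, `recipient`, `date`, `body`), and the functions are invented for the example and say nothing about how LEXIS or BRS is actually built.

```python
import re
from collections import defaultdict

# Assumed stop list; a real system would tune this to its own collection.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in"}

def words(text):
    """Split text into lowercase words, dropping the stop words."""
    return [w for w in re.findall(r"[a-z0-9]+", text.lower())
            if w not in STOP_WORDS]

def build_field_index(docs):
    """Map (field, word) to the set of ids of documents matching it."""
    index = defaultdict(set)
    for doc_id, fields in docs.items():
        for field, text in fields.items():
            for w in words(text):
                index[(field, w)].add(doc_id)
    return index

def search(index, **criteria):
    """AND together field=word criteria, e.g. search(idx, sender="smith")."""
    result = None
    for field, word in criteria.items():
        matches = set(index.get((field, word.lower()), set()))
        result = matches if result is None else result & matches
    return result or set()

if __name__ == "__main__":
    docs = {
        "memo1": {"sender": "Smith", "recipient": "Jones",
                  "date": "1978-03-14",
                  "body": "The grapefruit shipment arrived late."},
        "memo2": {"sender": "Smith", "recipient": "Jones",
                  "date": "1979-01-02",
                  "body": "Please reorder grapefruit."},
    }
    idx = build_field_index(docs)
    print(search(idx, sender="smith", recipient="jones",
                 date="1978", body="grapefruit"))
```

The index here is rebuilt from scratch by `build_field_index`, which loosely mirrors the overnight batch rebuild described in the post; nothing in the sketch supports updating the index while searches are running.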
boughter@milano.UUCP (03/06/86)
I'm not sure if this is the info you're interested in, but here goes. I have been looking into databases that process ordinary numeric data as well as text, and have found that there are commercially available packages that do indexing on large text files and/or build concordances. This list is not necessarily complete or accurate.

INFO-TEXT              DRS                        STATUS
Henco, Inc.            Advanced Data Management   CP International
100 Fifth Ave.         15 Main St.                210 South Street
Waltham, MA 02154      Kingston, N.J. 08528       New York, NY 10002
617-890-8670           609-799-4600               212-815-8691

BASIS                      BRS/SEARCH
Battelle Software Prods    BRS Information Technologies
505 King Ave.              1200 Route 7
Columbus, OH 43201         Latham, NY 12110
800-328-2648               800-235-1209

Disclaimer: No warranties, express or implied, accompany the above data; I have no interest whatever in any of the above companies.
dts@cullvax.UUCP (Daniel T Senie) (03/17/86)
Without going into hardware-specific systems as another respondent did (CD-ROM info), there are a few products on the market.

Commercial systems such as BRS handle the type of retrieval you discuss. The system is expensive to use. If the text you are searching has already been entered by someone on their system, though, it's cheaper than putting it all in yourself. BRS, I believe, will sell you their software to run on your machine. This is attractive if you have REALLY HIGH VOLUME.

Another product, ZyIndex, does similar work on PCs and (I think) maybe Unix systems (at least there was talk of such). The cost of this product is LOW! It can work with data stored on most disk media, and would probably be well suited for WORM (Write Once Read Mostly) laserdisks.

Good Luck.

My views do not reflect those of Cullinet ...

--
Daniel T. Senie           TEL.: (617) 329-7700 x3168
Cullinet Software, Inc.   UUCP: seismo!{ll-xn,harvard}!rclex!cullvax!dts
400 Blue Hill Drive       ARPA: rclex!cullvax!dts@ll-xn.ARPA
Westwood, MA 02090-2198