pkr@media01.UUCP (Peter Kriens) (09/27/90)
We are currently looking for a very good text archival system for a newspaper, based on Unix. The archive should be able to handle free text, and retrieval should be possible from any word in the text. Of course, "huge" databases (of 4 to 5 gigabytes) should be possible.

Are there any products (commercial or public domain) out there, or is there any experience with products like this? Please reply to me; I will post a summary on the net afterwards.

Peter Kriens                 Tel. (31)-23-319075
Postbox 4932                 Fax. (31)-23-315210
2003 EX Haarlem Holland      pkr@media01.uucp
rotimi@accur8.UUCP (Rotimi Gbadamosi) (09/30/90)
In article <1411@media01.UUCP> pkr@media01.UUCP (Peter Kriens) writes:
>We are currently looking for a very good text archival system
>for newspaper based on Unix. The archive should be able to handle
>free text and retrieval should be possible from any word in the
>text. Ofcourse "huge" databases ( of 4 to 5 gigabyte) should be
>possible.

I will be interested too, including any information on "huge" databases in the MS/PC-DOS environment. We are interested in both success and horror stories. Thanks.

rotimi
201-754-7714
rotimi@accurate.com
lee@sq.sq.com (Liam R. E. Quin) (10/04/90)
>In article <1411@media01.UUCP> pkr@media01.UUCP (Peter Kriens) writes:
> We [need] a very good text archival system for newspaper based on Unix.
> [...] "huge" databases ( of 4 to 5 gigabyte) should be possible.

In article <295@accur8.UUCP> rotimi@accur8.UUCP (Rotimi Gbadamosi) writes:
> I will be interested too, including any information on "huge" databases in
> the MS/PC-DOS environment.  [...]

I _strongly_ suggest that you not use MS/DOS for this sort of application. If you have gigabytes of data that you care about, buy a computer that will support what you need. MS/DOS is far from ideal for this sort of application. (I can mail some reasons if it helps...)

There are several important factors, I think:
* how much you want to spend (this is the biggie!)
* how frequently the archive will be updated
* who the main users will be

When you have this much data, some important questions to ask of the packages you investigate might be:

* how are query results ranked?
  This is the largest difference between the packages at the top end, as far as I can tell, with the two extremes represented (as I see it) by PAT (no ranking at all) and TOPIC (extensive ranking based on clustering, and hierarchies of subject matter).

  For example, if I ask for all newspaper articles about Iraq, do I want to see them
  * in chronological order, starting with a Latin report of Roman activity in that area, and moving on to more recent articles
  * most recent first, going backwards
  * with ones that mention "Iraq" lots of times presented sooner than ones that only mention it once
  * with articles that contain many other words associated with "Iraq" before other articles?

  Of course, no-one wants to have to specify this explicitly each time. But clearly, if there are (say) several hundred thousand (or even millions) of occurrences of the word in the index, the order in which the results are presented is awfully important.
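[Ed.: the simplest of the orderings above -- "ones that mention 'Iraq' lots of times presented sooner" -- can be sketched in a few lines of Python. This is a toy in-memory illustration, not how PAT or TOPIC actually work; the article texts and function name are made up for the example.]

```python
from collections import Counter

def rank_by_term_frequency(articles, term):
    """Order matching articles so those mentioning `term` most often come first.

    `articles` maps an article id to its full text.  A real retrieval
    system would consult a precomputed index rather than scan the text.
    """
    term = term.lower()
    counts = {
        doc_id: Counter(text.lower().split())[term]
        for doc_id, text in articles.items()
    }
    hits = {doc_id: n for doc_id, n in counts.items() if n > 0}
    return sorted(hits, key=lambda doc_id: hits[doc_id], reverse=True)

articles = {
    "a1": "iraq iraq iraq border report",
    "a2": "a brief mention of iraq",
    "a3": "an unrelated story",
}
print(rank_by_term_frequency(articles, "iraq"))  # ['a1', 'a2']
```

The more interesting rankings (recency, associated vocabulary, clustering) all reduce to choosing a different sort key here, which is why the ranking policy is such a large differentiator between packages.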
* how big is the index with respect to the data -- if it's 3 times larger, you'll need a lot more disk space!

* can the index span multiple disks? Or do I have to buy a single disk large enough to hold fifteen gigabytes of index to my 5 GBytes of data? Where can I buy such a disk?!?
  (Note: there are limits on the maximum size of an individual file under Unix. This is usually either one or four gigabytes with current implementations. So a system like PAT that generates a single index file might well run into grief. Of course, PAT would be fine with less data, and looks wonderful with the OED!)
  OK, so you'll have to split the index. Does that mean that the user will have to use several browsing sessions/tools?

* speed will be an issue. Don't be fooled by the `Bible' demo -- the Bible is only five or six megabytes, so anything should be able to access it in well under a second, even on a toy--er, even under MS/DOS.

Human efficiency is _much_ more important than speed. If you get exactly the article you're looking for after a thirty-second wait, that's much better than getting two hundred articles in alphabetical order after no wait at all, because you'll then spend the best part of an hour deciding which is the right one...

I would certainly look at
* Topic, primarily because of its ranking
* Fulcrum's Ful/Text, which has an excellent user interface

I might also consider
* STATUS (from Harwell Computing Laboratories at Rutherford in England), although the interface was designed on an IBM mainframe in the 1960s and stinks
* Third Eye, if this isn't vapourware, whose signature-based system will generate a _much_ smaller index than the others. This can also handle a networked database, or so they say.
* BRS Search, because it's one of the ``market leaders'', although my experience is that this is one of the packages with a 300% index...
* There are also specialist newspaper systems, although I'm afraid that I don't know enough about them to comment further, sorry.
* I might look at PAT too, although I think you'd need to wait for their next version, which lets you add files without re-indexing everything. Then it might well be the fastest of all of these systems.

Lee
--
Liam R. E. Quin, lee@sq.com, SoftQuad Inc., Toronto, +1 (416) 963-8337
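[Ed.: the "any word in the text" retrieval Peter asked for, and the index-size question Liam raises above, both come down to an inverted index: a map from each word to the documents containing it. A minimal Python sketch follows -- a toy in-memory structure, nothing like the compact on-disk formats real products use, and the document texts are invented for illustration.]

```python
from collections import defaultdict

def build_inverted_index(documents):
    """Map each lowercased word to the set of document ids containing it.

    `documents` maps a document id to its full text.  Every word becomes
    a retrieval key, so lookup by any word is a single dictionary access;
    the price is that the index can grow comparable in size to the text.
    """
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

docs = {
    "d1": "free text retrieval for a newspaper archive",
    "d2": "text archival on unix",
}
idx = build_inverted_index(docs)
print(sorted(idx["text"]))  # ['d1', 'd2']
```

Splitting such an index across disks, as discussed above, amounts to partitioning this map (by word range, say) into several files -- which is exactly where a single-index-file design like PAT's runs into the Unix file-size limit.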
tim@brspyr1.BRS.Com (Tim Northrup) (10/17/90)
lee@sq.sq.com (Liam R. E. Quin) writes:
>>In article <1411@media01.UUCP> pkr@media01.UUCP (Peter Kriens) writes:
>> We [need] a very good text archival system for newspaper based on Unix.
>> [...] "huge" databases ( of 4 to 5 gigabyte) should be possible.

... Lots of good things to think about when shopping for a full-text information retrieval package ...

>I might also consider
>* BRS Search, because it's one of the ``market leaders'', although my
>  experience is that this is one of the packages with a 300% index...

ACKKK!!! This is certainly not our experience here, or with any of our current customers (that I know of, anyway). Our typical loaded/indexed database (with the 'C'-based version of the product, which is what I am involved with) is 120-150% of the original input text. In most cases, the original text can be discarded after loading. This usually results in a 20-50% overhead, nowhere near the 300% mentioned. (As a quick example, we have Grolier's AAE loaded: input file ~65 meg, indexed ~80 meg.)

Of course, your mileage may vary depending upon the data, but a 300% index is very, very, very, VERY RARE with BRS/Search. Now, if you're including keeping the original text around and counting that into the index size, that's another matter (and not quite fair, IMHO).

>Lee
>--
>Liam R. E. Quin, lee@sq.com, SoftQuad Inc., Toronto, +1 (416) 963-8337

--
Tim Northrup                        UUCP: uunet!crdgw1!brspyr1!tim
BRS Software Products, Inc.         ARPA: tim@brspyr1.BRS.Com
1200 Route 7, Latham NY 12110
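[Ed.: the two ways of counting an index size that Tim distinguishes -- total loaded size as a percentage of input, versus index-only overhead once the original text is discarded -- can be made concrete with his own Grolier's AAE figures. A quick sketch:]

```python
def index_overhead(input_mb, indexed_mb):
    """Return (loaded size as % of input, index-only overhead as % of input)."""
    total_pct = indexed_mb / input_mb * 100
    overhead_pct = (indexed_mb - input_mb) / input_mb * 100
    return round(total_pct), round(overhead_pct)

# Grolier's AAE per Tim's post: ~65 MB input, ~80 MB loaded/indexed.
print(index_overhead(65, 80))  # (123, 23)
```

So the same database is "123% of the input" by the first measure but only "23% overhead" by the second -- which is why counting the retained original text into the index size (the 300% reading) and counting only the index proper give such different impressions.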
lee@sq.sq.com (Liam R. E. Quin) (10/23/90)
In an article on text retrieval, I mistakenly wrote:
>> I might also consider
>> * BRS Search, because it's one of the ``market leaders'', although my
>>   experience is that this is one of the packages with a 300% index...

tim@brspyr1.BRS.Com (Tim Northrup) of BRS corrects me:
> Our typical loaded/indexed database (with the 'C' based version of the
> product, which is what I am involved with) is 120-150% of the original
> input text.

It seems that I had outdated or incorrect information. The experiments I saw led me to believe that there was a greater indexing overhead than this, and I apologise, as it seems I have given a wrong impression.

Lee
--
Liam R. E. Quin, lee@sq.com, SoftQuad Inc., Toronto, +1 (416) 963-8337