[comp.databases] Text archive for newspapers needed

pkr@media01.UUCP (Peter Kriens) (09/27/90)

We are currently looking for a very good text archival system
for a newspaper, based on Unix.  The archive should be able to handle
free text and retrieval should be possible from any word in the
text.  Of course, "huge" databases (4 to 5 gigabytes) should be
possible.

Are there any products (commercial or public domain) out there, or does
anyone have experience with products like this?  Please reply to me, and
I will post a summary on the net afterwards.

Peter Kriens			Tel. (31)-23-319075
Postbox 4932			Fax. (31)-23-315210
2003 EX Haarlem Holland		pkr@media01.uucp

rotimi@accur8.UUCP (Rotimi Gbadamosi) (09/30/90)

In article <1411@media01.UUCP> pkr@media01.UUCP (Peter Kriens) writes:
>We are currently looking for a very good text archival system
>for a newspaper, based on Unix.  The archive should be able to handle
>free text and retrieval should be possible from any word in the
>text.  Of course, "huge" databases (4 to 5 gigabytes) should be
>possible.
>


I will be interested too, including any information on "huge" databases in
the MS/PC-DOS environment.  We are interested in both success and horror
stories. 

Thanks.


rotimi
201-754-7714
rotimi@accurate.com

lee@sq.sq.com (Liam R. E. Quin) (10/04/90)

>In article <1411@media01.UUCP> pkr@media01.UUCP (Peter Kriens) writes:
> We [need] a very good text archival system for newspaper based on Unix.
> [...] "huge" databases ( of 4 to 5 gigabyte) should be possible.

In article <295@accur8.UUCP> rotimi@accur8.UUCP (Rotimi Gbadamosi) writes:
> I will be interested too, including any information on "huge" databases in
> the MS/PC-DOS environment.  [...]

I _strongly_ suggest that you not use MS/DOS for this sort of application.
If you have gigabytes of data that you care about, buy a computer that
will support what you need; MS/DOS is far from ideal here.  (I can mail
some reasons if it helps...)

There are several important factors, I think:
* how much you want to spend (this is the biggie!)
* how frequently the archive will be updated
* who the main users will be

When you have this much data, some important questions to ask of the
packages you investigate might be:
* how are query results ranked?
  As far as I can tell, this is the largest difference between the packages
  at the top end, with the two extremes represented by PAT (no ranking at
  all) and TOPIC (extensive ranking based on clustering and hierarchies of
  subject matter).

  For example, if I ask for all newspaper articles about Iraq, do I want
  to see them
  * in chronological order, starting with a Latin report of Roman activity
    in that area and moving toward the present
  * most recent first, going backwards
  * with ones that mention "Iraq" lots of times presented sooner than ones
    that only mention it once
  * with articles that contain many other words associated with "Iraq"
    before other articles?
  Of course, no-one wants to have to specify this explicitly each time.
  But clearly, if there are (say) several hundred thousand (or even millions)
  of occurrences of the word in the index, the order in which the results
  are presented is awfully important.  (A small sketch of simple
  frequency-based ranking follows this list.)

* how big is the index with respect to the data -- if it's 3 times larger
  you'll need a lot more disk space!

* can the index span multiple disks?  Or do I have to buy a single disk
  large enough to hold fifteen Gigabytes of index to my 5GBytes of data?
  Where can I buy such a disk?!?  (Note: there are limits on the maximum
  size of an individual file under Unix.  This is usually either one or
  four gigabytes with current implementations.  So a system like PAT that
  generates a single index file might well run into grief.  Of course, PAT
  would be fine with less data, and looks wonderful with the OED!)

  OK, so you'll have to split the index.  Does that mean that the user will
  have to use several browsing sessions/tools?

* Speed will be an issue.  Don't be fooled by the `Bible' demo -- the Bible
  is only five or six megabytes, so anything should be able to access it
  in well under a second even on a toy--er, even under MS/DOS.
  Human efficiency is _much_ more important than speed.  If you get exactly
  the article you're looking for after a thirty second wait, that's much
  better than getting two hundred articles in alphabetical order after
  no wait at all, because you'll then spend the best part of an hour
  deciding which is the right one...
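
  As a concrete (and very much simplified) illustration of the ranking
  point above, here is a toy sketch in Python.  It scores each article by
  how often the query word occurs and breaks ties by date, most recent
  first.  None of the real packages mentioned here works exactly this way
  (TOPIC in particular does far more: clustering, related terms, subject
  hierarchies), but even this crude ordering beats an unranked hit list
  when a word matches hundreds of thousands of documents.

    # Toy ranking sketch: order articles by how often the query term
    # occurs, breaking ties by date (most recent first).  Purely
    # illustrative; real retrieval packages use far more elaborate scoring.
    from collections import Counter

    def rank(articles, term):
        """articles: list of (date, text) pairs; returns best matches first."""
        term = term.lower()
        scored = []
        for date, text in articles:
            count = Counter(text.lower().split())[term]
            if count:                  # drop articles that never mention it
                scored.append((count, date, text))
        # Sort by occurrence count, then by date, both descending.
        scored.sort(key=lambda s: (s[0], s[1]), reverse=True)
        return scored

    articles = [
        ("1990-08-02", "Iraq invades Kuwait as Iraq masses troops"),
        ("0100-01-01", "A Latin report of Roman activity in Mesopotamia"),
        ("1990-10-01", "Oil prices rise as the Iraq crisis deepens"),
    ]
    for count, date, text in rank(articles, "Iraq"):
        print(count, date, text)

  The toy example prints the two articles that mention the word, the one
  mentioning it twice first, and silently drops the Roman one.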


I would certainly look at
* Topic, primarily because of its ranking 
* Fulcrum's Ful/Text, which has an excellent user interface
I might also consider
* STATUS (from Harwell Computing Laboratories at Rutherford in England),
  although the interface was designed on an IBM mainframe in the 1960s and
  stinks
* Third Eye, if this isn't vapourware, whose signature-based system will
  generate a _much_ smaller index than the others.  This can also handle
  a networked database, or so they say.
* BRS Search, because it's one of the ``market leaders'', although my
  experience is that this is one of the packages with a 300% index...

* There are also specialist newspaper systems, although I'm afraid that
  I don't know enough about them to comment further, sorry.

* I might look at PAT too, although I think you'd need to wait for their
  next version, which lets you add files without re-indexing everything.
  Then it might well be the fastest of all of these systems.

Lee
-- 
Liam R. E. Quin,  lee@sq.com, SoftQuad Inc., Toronto, +1 (416) 963-8337

tim@brspyr1.BRS.Com (Tim Northrup) (10/17/90)

lee@sq.sq.com (Liam R. E. Quin) writes:

>>In article <1411@media01.UUCP> pkr@media01.UUCP (Peter Kriens) writes:
>> We [need] a very good text archival system for newspaper based on Unix.
>> [...] "huge" databases ( of 4 to 5 gigabyte) should be possible.

	... Lots of good things to think about when shopping for
	    a full-text information retrieval package ...

>I might also consider
>* BRS Search, because it's one of the ``market leaders'', although my
>  experience is that this is one of the packages with a 300% index...

	ACKKK!!!
	
        This is certainly not our experience here, or with any of our
        current customers (that I know of, anyway).  Our typical
        loaded/indexed database (with the 'C' based version of the
        product, which is what I am involved with) is 120-150% of the
        original input text.   In most cases, the original text can be
        discarded after loading.   This results in a 20-50% overhead
        usually, nowhere near the 300% mentioned. (As a quick example,
	we have Grolier's AAE loaded: input file ~65meg, indexed ~80meg).
	Of course, your mileage may vary depending upon the data, but a
	300% index is very, very, very, VERY RARE with BRS/Search.

	Now, if you're keeping the original text around and counting
	that into the index size, that's another matter (and not quite
	fair, IMHO).
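
        (To spell out the arithmetic behind those percentages with the
        Grolier's figures quoted above -- a rough sketch only, using the
        numbers as given rather than any independent measurement:

            input_mb   = 65      # Grolier's AAE input file, as quoted (~65meg)
            indexed_mb = 80      # loaded/indexed size, as quoted (~80meg)
            print(indexed_mb / input_mb)    # ~1.23, i.e. loaded size ~123% of input
            print(indexed_mb - input_mb)    # ~15meg net extra, ~23% overhead,
                                            # once the input text is discarded

        which is indeed within the 120-150% loaded size and 20-50% net
        overhead described.)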

>Lee
>-- 
>Liam R. E. Quin,  lee@sq.com, SoftQuad Inc., Toronto, +1 (416) 963-8337
-- 
Tim Northrup      		  +------------------------------------------+
+---------------------------------+  BRS Software Products, Inc.             |
UUCP: uunet!crdgw1!brspyr1!tim    |  1200 Route 7, Latham NY   12110         |
ARPA: tim@brspyr1.BRS.Com	  +------------------------------------------+

lee@sq.sq.com (Liam R. E. Quin) (10/23/90)

In an article on text retrieval, I mistakenly wrote:
>> I might also consider
>> * BRS Search, because it's one of the ``market leaders'', although my
>>  experience is that this is one of the packages with a 300% index...

tim@brspyr1.BRS.Com (Tim Northrup) of BRS corrects me:
> Our typical loaded/indexed database (with the 'C' based version of the
> product, which is what I am involved with) is 120-150% of the original
> input text.

It seems that I had outdated or incorrect information.  The experiments I
saw led me to believe that there was a greater indexing overhead than this,
but it seems I have given a wrong impression, for which I apologise.

Lee

-- 
Liam R. E. Quin,  lee@sq.com, SoftQuad Inc., Toronto, +1 (416) 963-8337