information retrieval program wanted

emv@math.lsa.umich.edu (Edward Vielmetti) (07/26/90)

In article <1990Jul21.212945.15889@bbt.se> pgd@bbt.se (P.Garbha) writes:

   I am looking for some software for unix to make some kind of on-line
   literature reference search system. I have a >30MB database, with the
   full contents of a number of books.  For this I want a system that
   sets up an index to every word in the books, and that lets me search in
   this to locate references from the books.
   For example, if I put a search for "unix" and "pc", the program should
   come up with a list of all logical units (paragraphs, chapters, or
   something else) with these two words in (the same sentence, or near
   each other). From this list I can narrow in further, or look at the
   references, to finally get a printout.

The "Pat" software from Open Text Systems should be able to do what
you want.  First, you have to mark up the documents to delimit boundaries
like paragraphs, pages, or chapters; the information may be available
in the texts as you have them (in which case no explicit markup would
be needed) or you may need to construct it.  Pat likes to use a 
simplified form of SGML to tag things.  You then create an index to the
whole text (every word is easy), make a pass over it to find the
boundaries, and you're set.
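
As a rough illustration of the steps just described (index every word, then
make a pass to find the boundaries), here is a minimal Python sketch.  It is
not Pat's implementation: the function names are hypothetical, and it assumes
paragraphs are separated by blank lines rather than tagged with SGML.

```python
import bisect
import re

def build_index(text):
    """Map each lowercased word to the sorted character offsets where it starts."""
    index = {}
    for m in re.finditer(r"[A-Za-z0-9]+", text):
        index.setdefault(m.group().lower(), []).append(m.start())
    return index

def paragraph_starts(text):
    """Offsets where paragraphs begin (assumption: blank lines delimit them)."""
    starts = [0]
    for m in re.finditer(r"\n\s*\n", text):
        starts.append(m.end())
    return starts

def paragraphs_containing(word, index, starts):
    """Return the sorted 0-based paragraph numbers that contain the word."""
    hits = index.get(word.lower(), [])
    # bisect_right finds how many paragraph starts are <= the hit offset.
    return sorted({bisect.bisect_right(starts, pos) - 1 for pos in hits})

text = "Unix on a PC.\n\nNothing here.\n\nA pc running unix again."
index = build_index(text)
starts = paragraph_starts(text)
both = (set(paragraphs_containing("unix", index, starts))
        & set(paragraphs_containing("pc", index, starts)))
print(sorted(both))
```

Keeping the boundaries in a separate sorted list means the word index itself
never needs to know about paragraphs or chapters; a different unit only
needs a different list of start offsets.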

It supports queries like the ones you describe.  The command line
interface is kind of clunky, but there are some nice glossy X11 interfaces
like the one used to present the OED.  Best of all, once you build the
index (which takes a longish time and a lot of disk) the lookups are
amazingly fast, i.e. short enough that you can run them interactively
on a SPARCstation 1 and not have time to play hack in another window.

You might contact Tim Bray <tbray@watsol.waterloo.edu> for the
nitty-gritty, like pricing.

--Ed

Edward Vielmetti, U of Michigan math dept <emv@math.lsa.umich.edu>
comp.archives moderator

>> telebit
  3: 76 matches

>> modem
  4: 300 matches

>> 3 near 4
  5: 39 matches

>> docs article including 5
  6: 5 matches

>> pr.docs.header
  1824104, ..From comp.archives Wed Feb 14 20:20:55 EST 1990
Path: jarvis.csri.toronto.edu!mailrus!uwm.edu!zaphod.mps.ohio-state.edu!math.lsa
.umich.edu!emv
From: jeh@simpact.com
Newsgroups: comp.archives
Subject: [comp.os.vms] Re: is UUCP available for VMS?  (yes!  so is NEWS!)
....
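
The session above combines earlier result sets with "near" and "including".
A minimal sketch of how such operators could work on sorted lists of match
offsets follows.  The function names, the sample offsets, and the meaning of
"near" (within a fixed number of characters) are assumptions for
illustration; Pat's real rules may differ.

```python
import bisect

def near(a, b, window):
    """Return positions in a that lie within `window` characters of some
    position in b.  Both lists must be sorted."""
    result = []
    for pos in a:
        # First entry of b that is not to the left of the window.
        i = bisect.bisect_left(b, pos - window)
        if i < len(b) and b[i] <= pos + window:
            result.append(pos)
    return result

def including(regions, hits):
    """Return the (start, end) regions that contain at least one hit.
    `hits` must be sorted; `end` is exclusive."""
    selected = []
    for start, end in regions:
        i = bisect.bisect_left(hits, start)
        if i < len(hits) and hits[i] < end:
            selected.append((start, end))
    return selected

# Made-up offsets standing in for result sets 3 ("telebit") and 4 ("modem").
telebit = [120, 900, 4000]
modem = [100, 2500, 4050]
articles = [(0, 1000), (1000, 3000), (3000, 5000)]

close = near(telebit, modem, window=60)
print(close)
print(including(articles, close))
```

Each step only narrows a list of offsets, which is consistent with the way
the match counts shrink from one numbered result to the next in the
transcript.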

lee@sq.sq.com (Liam R. E. Quin) (07/29/90)

P.Garbha <pgd@bbt.se> asked:
> I am looking for some software for unix to make some kind of on-line
> literature reference search system.  [...] I want a system that
> sets up an index to every word in the books, and [lets] me search in
> this to locate references from the books.

This is a longer answer (perhaps) than you wanted...  any corrections and/or
updates would be welcome.  I'm enclosing details of several commercial and a
few public domain packages that might help.  Feel free to mail me for more
detail on any of this.


There are a *lot* of systems around to do this, ranging from free or very
cheap shareware for the PC through to half a million dollars' worth of
main-frame software.

If you want software running under Unix, you could look at this month's
(August 1990) Unix Review for a few pointers.

There are at least two free systems for Unix, one called texan and one
called lq-text (I wrote the latter).


Some thoughts on some of the commercial systems, and some addresses,
follow:

* PAT
  Uses Patricia Trees, described in Knuth[1].
  I was told (last Autumn) that there were problems adding to the database --
  adding a new document meant rebuilding the entire tree, which meant
  looking at the entire input data.  They did say they were working on
  this, though.

  Fairly competitive price (I was quoted about UK#20,000 as I recall, for
  a binary license on a Sequent Symmetry).

* STATUS
  Very old package from a company based near Oxford (next door to Rutherford
  Appleton Laboratories).  Designed on a big, bluish computer, and still
  has the user interface from the 1960s.   Fairly flexible, and one of the
  cheaper packages.  Seemed a little slow -- certainly no faster than
  lq-text on the King James Bible.... :-(

* TOPIC
  Expensive.  Sophisticated.  Intelligent clustering of terms and document
  relevance ordering.
  This was the quarter of a million pounds package, though....

* grep
  Don't laugh.  For 30MBytes it's too slow on older computers.  But try it.
  You'll be surprised.  Try the GNU grep, too.

* STAIRS
  Another product that needs a blue computer.

* Third Eye
  (actually the name of the company)
  This uses signatures, I understand, so there is always a chance that a
  document will be retrieved, checked, and later discarded.
  The index is said to be very small, and the package works over networks.
  I haven't seen this.

* BRS/Search
  In the UK at any rate, the market leader.
  Index at least as large again as the data.  Query language a little
  cryptic for anything other than very simple searches.
  Price comparable to Oracle.

* Fulcrum Ful/Text
  One of the fastest.  Uses a byte-index, so that every byte in the data
  is indexed (at least, that's what they told me....).  Good front end,
  and has been on Unix for at least three years.  Widely used by OEMs,
  e.g. HP for their manual, and Dallas Theological Whotsit [sorry] for
  their CD ROM of several versions of the Bible.
  Can understand SGML.
  If you are looking at commercial products, 3rd Eye and Ful/Text are
  absolute, absolute musts.  Or so say I.

* Owl
  A Hypertext/retrieval package based on Guide, that understands SGML.

* Microlytics
  Have ported their MS/DOS package to Unix.  I don't know if it still has
  limits (which tended to be 32767 of these, and 16K of those...), but I
  doubt it.  The DOS program is excellent value and includes a thesaurus,
  but indexing 30M is virtually impossible on old DOS systems, so I have
  no idea how it would cope.
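
On the PAT entry in the list above: the sketch below may help show why adding
a document can mean looking at the whole text again.  It does not use a
Patricia tree; it swaps in a plain sorted array of word-start offsets (a
simpler relative of the same idea), and every name in it is hypothetical.

```python
import re

def build_suffix_array(text):
    """Sort the offset of every word start by the text that follows it."""
    starts = [m.start() for m in re.finditer(r"\w+", text)]
    # Simple but memory-hungry: each comparison key is a whole suffix.
    return sorted(starts, key=lambda pos: text[pos:])

def find_prefix(text, suffixes, query):
    """Return sorted offsets where the text continues with `query`."""
    n = len(query)
    lo, hi = 0, len(suffixes)
    while lo < hi:  # binary search for the first suffix >= query
        mid = (lo + hi) // 2
        if text[suffixes[mid]:suffixes[mid] + n] < query:
            lo = mid + 1
        else:
            hi = mid
    matches = []
    while lo < len(suffixes) and text[suffixes[lo]:suffixes[lo] + n] == query:
        matches.append(suffixes[lo])
        lo += 1
    return sorted(matches)

text = "modem and telebit modem manual"
suffixes = build_suffix_array(text)
print(find_prefix(text, suffixes, "mod"))
```

Lookups are a binary search, so they stay fast on large texts.  New text,
though, adds suffixes that must be merged into the sorted order, and in this
naive version that means sorting everything again.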

There are a great many DOS packages.  For the most part, if you have a
lot of data, don't bother.

There are also some MACINTOSH packages, and at least one of these has made
it to Unix, but my feeling was that the port was fighting the Unix file-
system rather than working with it, and I wasn't happy enough with it to
name it in the company of any of the above...  Probably fine on the mac.


* Free packages....

  Bib is for bibliographic references, and essentially uses grep on the
  inverted index.  This is practical until you get to having megabytes,
  or hundreds of files.  The index is larger than the data.

  There is a commercial version of Bib called BibIX that is advertised
  from time to time on comp.text.  It is not very expensive (they tell me).

  There is something called "id" or "idgrep" that was designed for
  identifiers and has been hacked for mail folders.  It uses signatures
  and then checks for bad drops, so it has very small index files.
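
  A minimal sketch of the signature idea, for the curious.  This is not
  idgrep's (or anyone's) actual format: the 64-bit width, the use of CRC32
  as the hash, and all the names are assumptions made for illustration.

```python
import re
import zlib

SIGNATURE_BITS = 64  # assumed width; a real system would tune this

def word_bit(word):
    """Hash a word to a single bit position in the signature."""
    return 1 << (zlib.crc32(word.lower().encode("utf-8")) % SIGNATURE_BITS)

def signature(text):
    """OR together the bits of every word in the text."""
    sig = 0
    for word in re.findall(r"\w+", text):
        sig |= word_bit(word)
    return sig

def search(docs, signatures, words):
    """Pick candidates by signature, then scan each one to discard bad drops."""
    wanted = 0
    for w in words:
        wanted |= word_bit(w)
    results = []
    for name, sig in signatures.items():
        if sig & wanted == wanted:  # candidate: every query bit is set
            present = {w.lower() for w in re.findall(r"\w+", docs[name])}
            if all(w.lower() in present for w in words):  # verify in the text
                results.append(name)
    return sorted(results)

docs = {"a": "telebit modem setup", "b": "unix on a pc"}
signatures = {name: signature(text) for name, text in docs.items()}
print(search(docs, signatures, ["telebit", "modem"]))
```

  The index is one small integer per document, which is why it stays tiny.
  The price is that two different words can share a bit, so a candidate has
  to be checked against the text before it is reported.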

  I don't know anything about Texan, except I think it might have come from
  Texas (seriously).  If you know more, *please* tell me!

  lq-text is my own inverted-index package.  It indexes every word that
  is at least 3 characters long and does not appear in an optional
  CommonWords file.  For 30M of data, the index will probably be between
  10 and 20 MBytes, depending on the size of your CommonWords file, etc...

  Note: lq-text doesn't work with more than about 6 MBytes owing to a
	storage allocation bug, but I hope to have that fixed very soon now
	that (after several weeks trying off and on) it can be reproduced.
	If you have mailed me about it in the past, I'll send you an update
	as soon as the known serious bugs are fixed.  This seems to be an
	artifact of the port to SunOS... :-(
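
  To make the indexing rule concrete (words of at least 3 characters that
  are not in the CommonWords file), here is a minimal Python sketch.  It is
  not lq-text's code or file format; the function names and the in-memory
  dictionary are assumptions for illustration only.

```python
import re

def build_inverted_index(files, common_words=(), min_length=3):
    """Map each indexed word to a list of (file name, word number) postings."""
    skip = {w.lower() for w in common_words}
    index = {}
    for name, text in files.items():
        for number, word in enumerate(re.findall(r"[A-Za-z]+", text)):
            word = word.lower()
            # Short words and common words are never stored.
            if len(word) < min_length or word in skip:
                continue
            index.setdefault(word, []).append((name, number))
    return index

def files_with_all(index, words):
    """Return the sorted names of files that contain every given word."""
    sets = [{name for name, _ in index.get(w.lower(), [])} for w in words]
    return sorted(set.intersection(*sets)) if sets else []

files = {
    "ch1": "The unix modem and the telebit",
    "ch2": "The modem is on the desk",
}
index = build_inverted_index(files, common_words=["the", "and"])
print(files_with_all(index, ["unix", "modem"]))
```

  A longer CommonWords list gives a smaller index, which is where the spread
  in the size estimate comes from.  Note that under this rule a two-letter
  word such as "pc" would not be indexed at all.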
  
Finally, here are some addresses from Unix Review.  No, I have no connection
with any of them [thanks for the beer, though, Eric :-) :-)].
The systems listed above are ones I have come across previously.


Star		Sun only
  Cuadra Associates
  11835 W. Olympic Blvd. Ste. 855,
  Los Angeles,
  California CA 90064
  USA
  +1 213 478-0066

Ful/Text	3B, Aviion, Ultrix, HP 9000, Sun 3 & 4, 386/ix, SCO Xenix,
		VMS, DOS...
  Fulcrum Technologies
  560 Rochester Street,
  Ottawa,
  Ontario K1S 5K2
  Canada
  +1 613

Knowledge Retrieval System
		SysV, X, 10.4 and up [sic], BSD 4.2 and up
		Dos & Mac
  Knowledge Set
  888 Villa St.,
  Ste. 500,
  Mountain View,
  California  CA 94041
  USA
  +1 415 968-9888

PAT		Sun 4.X, Ultrix, Xenix, HP/Apollo, MIPS, Sequent
  Open Text Systems
  Waterloo Town Square,
  Ste. 622,
  Waterloo,
  Ontario  N2J 1P2
  Canada
  +1 519 746-8288

Elixir		Sun 386i, Sun4, 3B2, HP 9000, Ultrix
  Third Eye Software
  535 Middlefield Rd.,
  Ste. 170,
  Menlo Park,
  California  CA 94025
  +1 415 321-0967

Metamorph	UNIX [sic]; DOS; MVS
  Thunderstone
  535 Superior Viaduct NW,
  Cleveland,
  Ohio  OH 44113
  USA
  +1 216 771-7880

Topic		HP 9000, Ultrix, Sun, MIPS, Pyramid, SCO UNIX/XENIX
		DOS, MVS
  Verity
  1500 Plymouth,
  Mountain View,
  California  CA 94043
  +1 415 960-7600

I.Q.Text	OEM only
  +1 206 746-4424

Zyindex		SCO, Interactive [and DOS, I think]
  ZyLab
  3105-T N. Wilke Rd.,
  Arlington Heights,
  IL 60004
  USA
  +1 708 632-1100

Hope this helps.

Lee
-- 
Liam R. E. Quin,  lee@sq.com, {utai,utzoo}!sq!lee,  SoftQuad Inc., Toronto
``He left her a copy of his calculations [...]  Since she was a cystologist,
  she might have analysed the equations, but at the moment she was occupied
  with knitting a bootee.''  [John Boyd, Pollinators of Eden, 217]