emv@math.lsa.umich.edu (Edward Vielmetti) (07/26/90)
In article <1990Jul21.212945.15889@bbt.se> pgd@bbt.se (P.Garbha) writes:

> I am looking for some software for unix to make some kind of on-line
> literature reference search system. I have a >30MB database, with the
> full contents of a number of books. For this I want a system that sets
> up an index to every word in the books, and that lets me search in this
> to locate references from the books. For example, if I put in a search
> for "unix" and "pc", the program should come up with a list of all
> logical units (paragraphs, chapters, or something else) with these two
> words in (the same sentence, or near each other). From this list I can
> narrow in further, or look at the references, to finally get a printout.

The "Pat" software from Open Text Systems should be able to do what you
want.

First, you have to mark up the documents to delimit boundaries like
paragraphs, pages, or chapters; the information may be available in the
texts as you have them (in which case no explicit markup would be
needed) or you may need to construct it. Pat likes to use a simplified
form of SGML to tag things. You then create an index to the whole text
(every word is easy), make a pass over it to find the boundaries, and
you're set.

It supports queries like the ones you describe. The command line
interface is kind of clunky, but there are some nice glossy X11
interfaces like the one used to present the OED. Best of all, once you
build the index (which takes a longish time and a lot of disk) the
lookups are amazingly fast, i.e. short enough that you can run them
interactively on a SPARCstation 1 and not have time to play hack in
another window.

You might contact Tim Bray <tbray@watsol.waterloo.edu> for nitty gritty
like pricing.
--Ed

Edward Vielmetti, U of Michigan math dept <emv@math.lsa.umich.edu>
comp.archives moderator

>> telebit
   3: 76 matches
>> modem
   4: 300 matches
>> 3 near 4
   5: 39 matches
>> docs article including 5
   6: 5 matches
>> pr.docs.header
   1824104, ..From comp.archives Wed Feb 14 20:20:55 EST 1990
   Path: jarvis.csri.toronto.edu!mailrus!uwm.edu!zaphod.mps.ohio-state.edu!math.lsa.umich.edu!emv
   From: jeh@simpact.com
   Newsgroups: comp.archives
   Subject: [comp.os.vms] Re: is UUCP available for VMS? (yes! so is NEWS!)
   ....
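
The session above looks up two words and then asks which occurrences of
one lie "near" the other. A minimal sketch of that idea follows, assuming
a plain positional word index rather than whatever Pat really does
internally. The names build_index and near, the sample text, and the
window of 3 words are all made up for illustration.

```python
import re
from collections import defaultdict


def build_index(text):
    """Map each lowercased word to the sorted list of its word positions."""
    index = defaultdict(list)
    for pos, match in enumerate(re.finditer(r"[A-Za-z0-9]+", text)):
        index[match.group().lower()].append(pos)
    return index


def near(positions_a, positions_b, window=5):
    """Return the positions in positions_a that have some position in
    positions_b no more than `window` words away (both lists sorted)."""
    hits = []
    j = 0
    for a in positions_a:
        # Skip entries of positions_b that are too far to the left of a.
        while j < len(positions_b) and positions_b[j] < a - window:
            j += 1
        if j < len(positions_b) and positions_b[j] <= a + window:
            hits.append(a)
    return hits


# Hypothetical sample text, not taken from any real archive.
text = ("The Telebit modem handles UUCP well. "
        "Another vendor sells a cheaper modem. "
        "For setup notes on other products, contact Telebit.")
index = build_index(text)
telebit = index["telebit"]
modem = index["modem"]
print(len(telebit), "matches for telebit")
print(len(modem), "matches for modem")
print(len(near(telebit, modem, window=3)), "matches for telebit near modem")
```

Because both position lists are sorted, the "near" step is a single
merge-style pass, so its cost depends on the number of matches and not
on the size of the text.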
lee@sq.sq.com (Liam R. E. Quin) (07/29/90)
P.Garbha <pgd@bbt.se> asked:
> I am looking for some software for unix to make some kind of on-line
> literature reference search system. [...] I want a system that
> sets up an index to every word in the books, and [lets] me search in
> this to locate references from the books.

This is a longer answer (perhaps) than you wanted... any corrections
and/or updates would be welcome. I'm enclosing details of several
commercial and a few public domain packages that might help. Feel free
to mail me for more detail on any of this.

There are a *lot* of systems around to do this, ranging from free or
very cheap shareware for the PC through to half a million dollars' worth
of main-frame software.

If you want software running under Unix, you could look at this month's
(August 1990) Unix Review for a few pointers. There are at least two
free systems for Unix, one called texan and one called lq-text (I wrote
the latter).

Some thoughts on some of the commercial systems, and some addresses,
follow:

* PAT
  Uses Patricia Trees, described in Knuth[1]. I was told (last Autumn)
  that there were problems adding to the database -- adding a new
  document meant rebuilding the entire tree, which meant looking at the
  entire input data. They did say they were working on this, though.
  Fairly competitive price (I was quoted about UK#20,000 as I recall,
  for a binary license on a Sequent Symmetry).

* STATUS
  Very old package from a company based near Oxford (next door to
  Rutherford Appleton Laboratories). Designed on a big, bluish computer,
  and still has the user interface from the 1960s. Fairly flexible, and
  one of the cheaper packages. Seemed a little slow -- certainly no
  faster than lq-text on the King James Bible.... :-(

* TOPIC
  Expensive. Sophisticated. Intelligent clustering of terms and document
  relevance ordering. This was the quarter of a million pounds package,
  though....

* grep
  Don't laugh. For 30MBytes it's too slow on older computers. But try
  it. You'll be surprised.
  Try the GNU grep, too.

* STAIRS
  Another product that needs a blue computer.

* Third Eye (actually the name of the company)
  This uses signatures, I understand, so there is always a chance that a
  document will be retrieved, checked, and later discarded. The index is
  said to be very small, and the package works over networks. I haven't
  seen this.

* BRS/Search
  In the UK at any rate, the market leader. Index at least as large
  again as the data. Query language a little cryptic for anything other
  than very simple searches. Price comparable to Oracle.

* Fulcrum Ful/Text
  One of the fastest. Uses a byte-index, so that every byte in the data
  is indexed (at least, that's what they told me....). Good front end,
  and has been on Unix for at least three years. Widely used by OEMs,
  e.g. HP for their manual, and Dallas Theological Whotsit [sorry] for
  their CD ROM of several versions of the Bible. Can understand SGML.
  If you are looking at commercial products, 3rd Eye and Ful/Text are
  absolute, absolute musts. Or so say I.

* Owl
  A Hypertext/retrieval package based on Guide, that understands SGML.

* Microlytics
  Have ported their MS/DOS package to Unix. I don't know if it still has
  limits (which tended to be 32767 of these, and 16K of those...), but I
  doubt it. The DOS program is excellent value and includes a thesaurus,
  but indexing 30M is virtually impossible on old DOS systems, so I have
  no idea how it would cope.

There are a great many DOS packages. For the most part, if you have a
lot of data, don't bother.

There are also some Macintosh packages, and at least one of these has
made it to Unix, but my feeling was that the port was fighting the Unix
filesystem rather than working with it, and I wasn't happy enough with
it to name it in the company of any of the above... Probably fine on
the Mac.

* Free packages....
  Bib is for bibliographic references, and essentially uses grep on the
  inverted index. This is practical until you get to having megabytes,
  or hundreds of files.
  The index is larger than the data.

  There is a commercial version of Bib called BibIX that is advertised
  from time to time on comp.text. It is not very expensive (they tell
  me).

  There is something called "id" or "idgrep" that was designed for
  identifiers and has been hacked for mail folders. It uses signatures
  and then checks for bad drops, so it has very small index files.

  I don't know anything about Texan, except I think it might have come
  from Texas (seriously). If you know more, *please* tell me!

  lq-text is my own inverted-index package. It indexes every word that
  is at least 3 characters long and does not appear in an optional
  CommonWords file. For 30M of data, the index will probably be between
  10 and 20 MBytes, depending on the size of your CommonWords file,
  etc...

  Note: lq-text doesn't work with more than about 6 MBytes owing to a
  storage allocation bug, but I hope to have that fixed very soon now
  that (after several weeks trying off and on) it can be reproduced. If
  you have mailed me about it in the past, I'll send you an update as
  soon as the known serious bugs are fixed. This seems to be an artifact
  of the port to SunOS... :-(

Finally, here are some addresses from Unix Review. No, I have no
connection with any of them [thanks for the beer, though, Eric :-) :-)];
the systems listed above are ones I have come across previously.

Star
    Sun only
    Cuadra Associates
    11835 W. Olympic Blvd. Ste. 855, Los Angeles, California CA 90064 USA
    +1 213 478-0066

Ful/Text
    3B, Aviion, Ultrix, HP 9000, Sun 3 & 4, 386/ix, SCO Xenix, VMS, DOS...
    Fulcrum Technologies
    560 Rochester Street, Ottawa, Ontario K1S 5K2 Canada
    +1 613

Knowledge Retrieval System
    SysV, X, 10.4 and up [sic], BSD 4.2 and up Dos & Mac
    Knowledge Set
    888 Villa St., Ste. 500, Mountain View, California CA 94041 USA
    +1 415 968-9888

PAT
    Sun $.X, Ultrix, Xenix, HP/Apollo, MIPS, Sequent
    Open Text Systems
    Waterloo Town Square, Ste. 622 Waterloo, Ontario N2J 1P2 Canada
    +1 519 746-8288

Elixir
    Sun 386i, Sun4, 3B2, HP 9000, Ultrix
    Third Eye Software
    535 Middlefield Rd., Ste. 170, Menlo Park, California CA 94025
    +1 415 321-0967

Metamorph
    UNIX [sic]; DOS; MVS
    Thunderstone
    535 Superior Viaduct NW, Cleveland, Ohio OH 44113 USA
    +1 216 771-7880

Topic
    HP 9000, Ultrix, Sun, MIPS, Pyramid, SCO UNIX/XENIX DOS, MVS
    Verity
    1500 Plymouth, Mountain View, California CA 94043
    +1 415 960-7600

I.Q.Text
    OEM only
    +1 206 746-4424

Zyindex
    SCO, Interactive [and DOS, I think]
    ZyLab
    3105-T N. Wilke Rd., Arlington Heights, IL 60004 USA
    +1 708 632-1100

Hope this helps.

Lee

--
Liam R. E. Quin, lee@sq.com, {utai,utzoo}!sq!lee, SoftQuad Inc., Toronto
``He left her a copy of his calculations [...] Since she was a
cystologist, she might have analysed the equations, but at the moment
she was occupied with knitting a bootee.''
[John Boyd, Pollinators of Eden, 217]
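
The post above says the signature-based packages (Third Eye, id/idgrep)
keep small indexes at the price of sometimes retrieving a document that
then has to be checked and discarded. Here is a rough sketch of how such
a scheme might work. It is not the actual method of either package: the
64-bit signature width, the use of crc32 as the hash, and every function
name are assumptions made for illustration.

```python
import re
import zlib

SIG_BITS = 64  # assumed signature width; a real system would tune this


def words(text):
    """Return the set of lowercased words in text."""
    return {w.lower() for w in re.findall(r"[A-Za-z0-9]+", text)}


def word_bit(word):
    """Hash a word to a single bit of the signature (crc32 is just a
    convenient deterministic hash, not what any real package uses)."""
    return 1 << (zlib.crc32(word.encode("utf-8")) % SIG_BITS)


def signature(text):
    """OR together the bits of every word in the document."""
    sig = 0
    for w in words(text):
        sig |= word_bit(w)
    return sig


def search(docs, sigs, query_words):
    """Return (confirmed, false_drops) as lists of document numbers."""
    query_sig = 0
    for w in query_words:
        query_sig |= word_bit(w.lower())
    # A document can match only if it has every bit of the query set.
    candidates = [i for i, s in enumerate(sigs)
                  if (s & query_sig) == query_sig]
    # Different words can share a bit, so check the real text as well.
    confirmed = [i for i in candidates
                 if all(w.lower() in words(docs[i]) for w in query_words)]
    false_drops = [i for i in candidates if i not in confirmed]
    return confirmed, false_drops


# Hypothetical documents for illustration only.
docs = ["Telebit modem setup notes",
        "UUCP for VMS is available",
        "News software for VMS"]
sigs = [signature(d) for d in docs]
confirmed, false_drops = search(docs, sigs, ["uucp", "vms"])
print("confirmed:", confirmed, "false drops:", false_drops)
```

The index here is one integer per document, which is why signature
indexes can be so small; the second pass over the candidate texts is
the checking for bad drops that the post describes.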