[comp.databases] Text retrieval... shareware?

paul@csnz.nz (Paul Gillingwater) (08/02/88)

In the following article I am seeking software for my own personal
use; this has nothing to do with my business interests.

There is a large body of textual material that I would like to
make available to a wider audience for study purposes.  This 
material has already been printed in books, of course, but it
seemed like a nice idea to get it onto a CD-ROM so that we can
do searches on it.  The books are printed mostly in English, but
also have copious portions in Cyrillic, Hebrew, Sanskrit, Greek,
Latin, French, German, etc.  There are some diagrams, charts and
tables, and also extensive footnotes.  There are also several
thousand pages, scattered across over 20 volumes - by one author!

It is my intention to get this material onto CD-ROM for access.
The first problem is of course the scanning - what scanner can
process lots of different fonts, in different languages and scripts?

Second problem is the internal representation.  PC ASCII can handle
the European languages mostly, but the only way of handling the
non-Roman alphabets is with a context switch, e.g. some sort of
escape sequence embedded in the text.  I guess I could get away
with a 31-character alphabet (ignoring case), and use a special
sequence to switch alphabets when printing or displaying on-screen
(of course using an EGA with customised characters in graphics mode!).
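To make the context-switch idea concrete, here is a minimal sketch of
what I mean (written in modern C++ for clarity, though I'd target
Zortech under DOS; the escape byte and the alphabet codes are just
placeholders I made up, not any standard):

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Placeholder escape byte and alphabet IDs -- pure assumption for
// illustration.  Text runs must not themselves contain the ESC byte.
const unsigned char ESC = 0x1B;
enum Alphabet { LATIN = 0, GREEK = 1, CYRILLIC = 2, HEBREW = 3 };

// Append a run of text in a given alphabet, emitting an escape
// sequence only when the alphabet changes from the previous run.
void emit_run(std::vector<unsigned char>& out, Alphabet& current,
              Alphabet next, const std::string& text) {
    if (next != current) {
        out.push_back(ESC);
        out.push_back(static_cast<unsigned char>(next));
        current = next;
    }
    out.insert(out.end(), text.begin(), text.end());
}

// Decode the byte stream back into (alphabet, text-run) pairs, the
// way a display routine would before picking an EGA character set.
std::vector<std::pair<Alphabet, std::string>>
decode(const std::vector<unsigned char>& in) {
    std::vector<std::pair<Alphabet, std::string>> runs;
    Alphabet cur = LATIN;
    std::string buf;
    for (size_t i = 0; i < in.size(); ++i) {
        if (in[i] == ESC && i + 1 < in.size()) {
            if (!buf.empty()) { runs.push_back({cur, buf}); buf.clear(); }
            cur = static_cast<Alphabet>(in[++i]);
        } else {
            buf.push_back(static_cast<char>(in[i]));
        }
    }
    if (!buf.empty()) runs.push_back({cur, buf});
    return runs;
}
```

The point is that the switch costs only two bytes per run, so mostly-
English text with occasional Greek or Hebrew stays compact.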

The third problem is searching the stuff.  It is of arbitrary length,
so text-retrieval methods are indicated.  There is no numeric data
to be processed, so an inverted index would do the trick - but it
would be huge! - like twice the size of the text.  What I would like
would be a way of searching on any word, anywhere in the text.
Context searches and adjacency searches would be nice, but I could
live with the overhead of doing a word-level search with further
elimination of false matches.  If I could afford it I would use BRS/Search,
since that offers many of these features, but it's expensive.  
I have heard that a special tree structure is used for compressing 
English dictionaries that are RAM resident, e.g. the Borland Sprint 
package uses this I think.  Could this be applied to up to 100 Mb of 
data on a CD-ROM?  Would it be any faster, or unreliable in any way?  
Perhaps some scheme where 5 bits are used for building each twig on 
the tree (allowing 31 chars plus a null), so that the letter itself 
would help to point to the next twig?
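By an inverted index I mean roughly the following word-to-byte-offset
map (a minimal sketch in modern C++; the tokenisation rules here are
my own assumption - letters only, case folded):

```cpp
#include <cassert>
#include <cctype>
#include <map>
#include <string>
#include <vector>

// Build a word -> list-of-byte-offsets map over a text buffer.
// The offsets would let the retrieval software seek straight to
// each occurrence on the CD-ROM instead of scanning the text.
std::map<std::string, std::vector<size_t>>
build_index(const std::string& text) {
    std::map<std::string, std::vector<size_t>> index;
    size_t i = 0;
    while (i < text.size()) {
        // skip anything that is not a letter
        while (i < text.size() && !std::isalpha((unsigned char)text[i])) ++i;
        size_t start = i;
        std::string word;
        while (i < text.size() && std::isalpha((unsigned char)text[i]))
            word.push_back(std::tolower((unsigned char)text[i++]));
        if (!word.empty()) index[word].push_back(start);
    }
    return index;
}
```

The size problem is plain from this: every occurrence of every word
costs an offset entry, which is where the twice-the-text estimate
comes from, and why compressing the vocabulary matters.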
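The 5-bit packing itself is easy enough to sketch (this shows only
the bit-packing of letters, not the tree linkage; the code assignment
of 1-26 for A-Z with 0 as the null is my own invention, leaving codes
27-31 free for accented letters or alphabet-shift marks):

```cpp
#include <cassert>
#include <cctype>
#include <string>
#include <vector>

// Pack each letter of an A-Z word into 5 bits (1..26, 0 = null
// terminator), so 8 letters fit in 5 bytes instead of 8.
std::vector<unsigned char> pack5(const std::string& word) {
    std::vector<unsigned char> out;
    unsigned int bits = 0; int nbits = 0;
    auto put = [&](unsigned int code) {
        bits = (bits << 5) | (code & 0x1F);
        nbits += 5;
        while (nbits >= 8) {                 // flush whole bytes
            nbits -= 8;
            out.push_back((bits >> nbits) & 0xFF);
        }
    };
    for (char c : word)
        put(std::toupper(static_cast<unsigned char>(c)) - 'A' + 1);
    put(0);                                  // null terminator
    if (nbits > 0)                           // pad out the last byte
        out.push_back((bits << (8 - nbits)) & 0xFF);
    return out;
}

// Unpack until the 5-bit null code is seen.
std::string unpack5(const std::vector<unsigned char>& packed) {
    std::string word;
    unsigned int bits = 0; int nbits = 0;
    for (unsigned char b : packed) {
        bits = (bits << 8) | b;
        nbits += 8;
        while (nbits >= 5) {
            nbits -= 5;
            unsigned int code = (bits >> nbits) & 0x1F;
            if (code == 0) return word;      // hit the terminator
            word.push_back(static_cast<char>('A' + code - 1));
        }
    }
    return word;
}
```

That gives a 5/8 saving on the raw letters; whether chaining these
codes into tree twigs stays fast on a CD-ROM, with its slow seeks, is
exactly what I'm unsure about.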

Well, I would appreciate any assistance on this, or pointers to
texts that would elucidate this rather specialised application.
I plan to write software to do this, for zero profit, because I
happen to like the material, and I think more people should 
benefit from it.  I'll use Zortech's C++ compiler, and target it
to run under DOS, because most CD-ROM players only have drivers
for DOS.  Oh yes - in case anyone is interested, all of the
material is written by one woman - Helena Petrovna Blavatsky.
-- 
Paul Gillingwater, Computer Sciences	Call this BBS - Magic Tower (24 hours)
paul@csnz.nz  (vuwcomp!dsiramd!csnz)	NZ +64 4 753 561 8N1 TowerNet software
P.O.Box 929, Wellington, NEW ZEALAND	V21/V23/V22/V22bis/Bell 103/Bell 212A
Vox: +64 4 846194, Fax: +64 4 843924	"All things must parse"-ancient proverb