paul@csnz.nz (Paul Gillingwater) (08/02/88)
In the following article I am seeking software for my own personal use; this has nothing to do with my business interests.

There is a large body of textual material that I would like to make available to a wider audience for study purposes. This material has already been printed in books, of course, but it seemed like a nice idea to get it onto a CD-ROM so that we can do searches on it. The books are printed mostly in English, but also have copious portions in Cyrillic, Hebrew, Sanskrit, Greek, Latin, French, German, etc. There are some diagrams, charts and tables, and also extensive footnotes. There are several thousand pages in all, spread over more than 20 volumes - by one author! It is my intention to get this material onto CD-ROM for access.

The first problem is of course the scanning - what scanner can process lots of different fonts, in different languages and scripts?

The second problem is the internal representation. PC ASCII can handle most of the European languages, but the only way of handling the non-Roman alphabets is with a context switch, e.g. some sort of escape sequence embedded in the text. I guess I could get away with a 31-character alphabet (ignoring case), and use a special sequence to switch alphabets when printing or displaying on-screen (of course using an EGA with customised characters in graphics mode!).

The third problem is searching the stuff. It is of arbitrary length, so text-retrieval methods are indicated. There is no numeric data to be processed, so an inverted index would do the trick - but it would be huge, perhaps twice the size of the text. What I would like is a way of searching on any word, anywhere in the text. Context searches and adjacency searches would be nice, but I could live with the overhead of doing a word-level search followed by further passes to eliminate unwanted matches. If I could afford it I would use BRS/Search, since that offers many of these features, but it's expensive.
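As a rough illustration of the inverted index idea mentioned above, here is a minimal in-memory sketch in Python. It records the word positions for every word, which is enough to answer both "any word, anywhere" lookups and simple adjacency searches. The function names (tokenize, build_index, find_word, find_adjacent) and the sample sentence are made up for this example; it says nothing about how the index would be laid out on a CD-ROM or how BRS/Search works internally.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lower-case the text and split it into runs of letters (any script)."""
    return re.findall(r"[^\W\d_]+", text.lower())

def build_index(text):
    """Map each word to the ascending list of word positions where it occurs."""
    index = defaultdict(list)
    for pos, word in enumerate(tokenize(text)):
        index[word].append(pos)
    return index

def find_word(index, word):
    """Return every position of `word`, or an empty list if it never occurs."""
    return index.get(word.lower(), [])

def find_adjacent(index, first, second):
    """Return positions where `first` is immediately followed by `second`."""
    following = set(find_word(index, second))
    return [p for p in find_word(index, first) if p + 1 in following]

# Hypothetical sample text, only to exercise the functions.
sample = "The secret doctrine is the doctrine of the ages"
index = build_index(sample)

print(find_word(index, "doctrine"))             # [2, 5]
print(find_adjacent(index, "the", "doctrine"))  # [4]
```

Storing a position for every occurrence is what makes such an index large relative to the text; a word-only index (word to page or block number) would be smaller, at the cost of a second pass through the matching blocks to check adjacency.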
I have heard that a special tree structure is used for compressing English dictionaries that are RAM resident, e.g. the Borland Sprint package uses this, I think. Could this be applied to up to 100 Mb of data on a CD-ROM? Would it be any faster, or unreliable in any way? Perhaps some scheme where 5 bits are used for building each twig on the tree (allowing 31 chars plus a null), so that the letter itself would help to point to the next twig?

Well, I would appreciate any assistance on this, or pointers to texts that would elucidate this rather specialised application. I plan to write software to do this, for zero profit, because I happen to like the material, and I think more people should benefit from it. I'll use Zortech's C++ compiler, and target it to run under DOS, because most CD-ROM players only have drivers for DOS.

Oh yes - in case anyone is interested, all of the material is written by one woman - Helena Petrovna Blavatsky.

--
Paul Gillingwater, Computer Sciences    Call this BBS - Magic Tower (24 hours)
paul@csnz.nz (vuwcomp!dsiramd!csnz)     NZ +64 4 753 561 8N1 TowerNet software
P.O.Box 929, Wellington, NEW ZEALAND    V21/V23/V22/V22bis/Bell 103/Bell 212A
Vox: +64 4 846194, Fax: +64 4 843924    "All things must parse"-ancient proverb