pbiron@weber.ucsd.edu (Paul Biron) (11/19/90)
Here is a question for those who've gotten into the internals of using ixBuild, etc. for building Librarian indexes.

According to the man page for pword(1) (in the BUGS section :-( ):

    Pword considers a word to be any string of letters, possibly
    including a single apostrophe or hyphen.

This is rather unfortunate, since in many instances this definition is just not suitable. A few cases which show this are:

1. It would be nice if mail addresses could be indexed as one "word" (whereas pword treats "pbiron@weber.ucsd.edu" as 4 "words").

2. Obj-C method names containing ':' should include the ':' in the name (consider the methods "new" and "new:foo:") [this is the case which got me thinking about this].

There are many others as well, as you can guess.

A while back I wrote a set of routines (in C) based on an article in AI Expert ("Extracting Words from Natural Language Text", by Thomas W. Mardon, April 1989, pp 30-35). An excerpt from the article will illustrate the basic method:

    In English the basic concordable characters are A-Z, a-z and 0-9.
    Some characters, however, are sometimes concordable and sometimes
    not. These are termed "optionally concordable," such as hyphens (-),
    periods (.), and slashes (/). All other special characters are
    nonconcordable.

    With the understanding of concordable, optionally concordable, and
    nonconcordable characters, we can define a word as "a finite
    sequence of concordable and optionally concordable characters
    delimited by either a nonconcordable or an optionally concordable
    character adjacent to a nonconcordable or optionally concordable
    one."

This method works very well, especially since you can change which characters fall under each class as the context demands.

I would like to replace pword with something which understands the definition of a "word" given by Mardon (Mardon cites _Principles of Text Processing_, F.N. Teskey, Ellis Horwood Ltd., 1982, as his source for the definition).
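To make the definition concrete, here is a minimal sketch of it in Python. It is only an illustration of the rule as quoted above, not the C routines mentioned in this post; the names extract_words, CONCORDABLE, optional and concordable are hypothetical. It assumes that the start and end of the text count as nonconcordable, so an optionally concordable character is kept only when it has a concordable character on both sides.

```python
import string

# Basic concordable characters per the quoted excerpt: A-Z, a-z, 0-9.
CONCORDABLE = frozenset(string.ascii_letters + string.digits)


def extract_words(text, optional="-./", concordable=CONCORDABLE):
    """Split text into words using the three character classes.

    An optionally concordable character is part of a word only when both
    of its neighbours are concordable; otherwise it acts as a delimiter.
    Assumption: the edges of the text behave like nonconcordable characters.
    """
    optional = set(optional)

    def is_concordable(i):
        return 0 <= i < len(text) and text[i] in concordable

    words = []
    current = []
    for i, ch in enumerate(text):
        if ch in concordable:
            current.append(ch)
        elif ch in optional and is_concordable(i - 1) and is_concordable(i + 1):
            current.append(ch)
        else:
            # Delimiter: close off the word in progress, if any.
            if current:
                words.append("".join(current))
                current = []
    if current:
        words.append("".join(current))
    return words


# Adding '@' to the optional class keeps a mail address together.
print(extract_words("Send mail to pbiron@weber.ucsd.edu.", optional="-./@"))
# -> ['Send', 'mail', 'to', 'pbiron@weber.ucsd.edu']
```

Note that making ':' optionally concordable turns "new:foo:" into "new:foo", because the trailing ':' sits next to a delimiter. In this sketch, keeping the final ':' means moving it into the concordable class instead, e.g. extract_words("new:foo:", concordable=CONCORDABLE | {":"}), at the cost of also keeping the ':' in ordinary text such as "Note:".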
However, I don't want to have to duplicate all of the other functionality of pword.

Now for the question. Does anyone have any ideas how I might do this? The only solution I've come up with so far is a *REAL* kludge. It involves using my routines to get the words, passing the output thru sed to replace each optionally concordable character with some unique sequence of characters which pword will include in a "word", and then passing the output of pword back thru sed to convert those unique sequences back into the original optionally concordable characters.

Is the source to pword available? (I suppose pword uses the routines in the -ltext library; see man text(3).) I know NeXT will give you the source for cc, make (and the other GNU executables) for the price of a floptical. But would pword be included in this? Is there a more appropriate place to send this request than bug_next@next.com?

While I'm on the subject, isn't there a NeXT programmer's mailing list? Does anyone know where I'd send to subscribe?

Any comments or suggestions are appreciated. Also, if anyone wants the source to the routines I wrote, let me know and I'll mail them to you. [come to think of it, I ought to turn them into a class and put them at Purdue]

Paul Biron    pbiron@ucsd.edu    (619) 534-5758
Central University Library, Mail Code C-075-R
Social Sciences DataBase Project
University of California, San Diego, La Jolla, Ca. 92093