[comp.sys.next] indexing using pword

pbiron@weber.ucsd.edu (Paul Biron) (11/19/90)

Here is a question for those who've gotten into the internals of using
ixBuild, etc for building Librarian indexes.

According to the man page for pword(1) (in the BUGS section :-( ):

	Pword considers a word to be any string of letters, possibly
	including a single apostrophe or hyphen. 

This is rather unfortunate, since in many instances this definition is
just not suitable.  A few cases which show this are:

1. It would be nice if mail addresses could be indexed as one "word"
   (whereas pword treats "pbiron@weber.ucsd.edu" as 4 "words")
2. Obj-C method names containing ':' should include the ':' in the
   name (consider the methods "new" and "new:foo:")
   [this is the case which got me thinking about this]

There are many others as well, as you can guess.

A while back I wrote a set of routines (in C) based on an article in
AI Expert ("Extracting Words from Natural Language Text", by Thomas
W. Mardon, April 1989, pp 30-35). An excerpt from the article will
illustrate the basic method:

	In English the basic concordable characters are A-Z, a-z and 0-9.
	Some characters, however, are sometimes concordable and sometimes
	not.  These are termed "optionally concordable," such as
	hyphens (-), periods (.), and slashes (/).  All other special
	characters are nonconcordable.

	With the understanding of concordable, optionally concordable, and
	nonconcordable characters, we can define a word as "a finite sequence
	of concordable and optionally concordable characters delimited by
	either a nonconcordable or an optionally concordable character
	adjacent to a nonconcordable or optionally concordable one."

This method works very well, especially since you can change which
characters fall under each class as the context demands.
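
To make the definition concrete, here is a minimal sketch of it.  It is
written in Python for brevity and is not the C routines mentioned above.
The names extract_words, CONCORDABLE and OPTIONAL are made up for this
illustration, and treating the start and end of the text as
nonconcordable is an assumption the excerpt does not spell out.

```python
import string

# Character classes from the excerpt; both sets are meant to be adjusted
# per context. These names are illustrative only.
CONCORDABLE = set(string.ascii_letters + string.digits)
OPTIONAL = set("-./")


def extract_words(text, concordable=CONCORDABLE, optional=OPTIONAL):
    """Split text into words under the three-class scheme.

    An optionally concordable character is kept only when both of its
    neighbours are concordable. Next to a nonconcordable character,
    another optional character, or the edge of the text (an assumption),
    it acts as a delimiter.
    """
    words = []
    current = []
    last = len(text) - 1
    for i, ch in enumerate(text):
        if ch in concordable:
            current.append(ch)
        elif (ch in optional and 0 < i < last
              and text[i - 1] in concordable
              and text[i + 1] in concordable):
            current.append(ch)
        else:
            # Delimiter: close off the word in progress, if any.
            if current:
                words.append("".join(current))
                current = []
    if current:
        words.append("".join(current))
    return words


if __name__ == "__main__":
    # '@' is added to the optional class so a mail address stays whole.
    print(extract_words("Mail pbiron@weber.ucsd.edu today.",
                        optional=OPTIONAL | {"@"}))
```

Under this reading, adding ':' to the optional class turns "new:foo:"
into "new:foo", because the trailing colon sits next to a delimiter.
Keeping the full method name seems to require moving ':' into the
concordable class instead, e.g. concordable=CONCORDABLE | {":"}.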

I would like to replace pword with something which understands the
definition of a "word" given by Mardon (Mardon cites _Principles of Text
Processing_, F.N. Teskey, Ellis Horwood Ltd., 1982; as his source for
the definition).  However, I don't want to have to duplicate all of the
other functionality of pword.

Now for the question.  Does anyone have any ideas how I might do this?

The only solution I've come up with so far is a *REAL* kludge.  It
involves using my routines to get the words, passing the output thru sed
to replace each optionally concordable character with some unique
sequence of characters which pword will include in a "word", and then
passing the output of pword back thru sed to convert those unique
sequences back into the original optionally concordable characters.
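
The round trip in that kludge can be sketched as follows.  This is
Python standing in for the two sed passes, not the actual pipeline.  The
placeholder table and the names PLACEHOLDERS, encode and decode are
hypothetical.  It also assumes that pword passes letter-only strings
through unchanged (no case folding, for instance) and that the
placeholder sequences never occur in the real text.

```python
# Hypothetical letter-only stand-ins for characters pword would split on.
# Each must be a sequence that does not appear in the text being indexed.
PLACEHOLDERS = {
    "@": "QZATQZ",
    ".": "QZDOTQZ",
    ":": "QZCOLQZ",
    "/": "QZSLQZ",
}


def encode(word):
    """First pass: hide each special character behind its placeholder."""
    return "".join(PLACEHOLDERS.get(ch, ch) for ch in word)


def decode(token):
    """Second pass: turn the placeholders back into the original characters."""
    for ch, seq in PLACEHOLDERS.items():
        token = token.replace(seq, ch)
    return token


if __name__ == "__main__":
    hidden = encode("pbiron@weber.ucsd.edu")
    print(hidden)          # all letters, so pword would see one "word"
    print(decode(hidden))  # the original address again
```

Digits are left alone here, although the man page excerpt suggests
pword would split on them too, so a real version would likely need
placeholders for those as well.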

Is the source to pword available?  (I suppose pword uses the routines in
the -ltext library; see man text(3).)  I know NeXT will give you the source
for cc, make (and the other GNU executables) for the price of a
floptical.  But would pword be included in this? Is there a more appropriate
place to send this request than bug_next@next.com?  While I'm on the
subject, isn't there a NeXT programmer's mailing list?  Does anyone
know where I'd send to subscribe?

Any comments or suggestions are appreciated.  Also, if anyone wants the
source to the routines I wrote, let me know and I'll mail them to
you. [come to think of it, I ought to turn them into a class and put
them at Purdue]

Paul Biron      pbiron@ucsd.edu        (619) 534-5758
Central University Library, Mail Code C-075-R
Social Sciences DataBase Project
University of California, San Diego, La Jolla, Ca. 92093