jaw@eos.UUCP (James A. Woods) (08/30/89)
# "Information is any difference that makes a difference." -- G. Bateson
Here is a tonic to aid with fast USENET subject search. A big assumption
is that you subscribe to 'nn' and its delightful article header database.
For a cheap keyword-based news selector, the operation
nn -mxX -sword all
is useful, except that on many systems (say a VAX 11/785 or Sun 3-class
server with a fortnight of news), it may take upwards of a minute or two
of wall clock time on a moderately-loaded machine. This can still be
faster than ransacking 80MB of /usr/spool/news with 'grep', but it is not
nearly fast enough for newsaholics.
The changes which follow reduce the time to a couple of seconds
or less. The catch involves keeping another file (about 1% of the news
system size) updated a few times a day, just like what 'fastfind' does.
First, define the shell command 'nngrab', where
nngrab word
is simple shorthand for the 'nn' one-liner above:
----------------------------------------------------------------------------
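#!/bin/sh
# nngrab -- quick news retrieval by subject search
# usage: nngrab word
trap "rm -f /tmp/nngrab$$" 0 1 2 15
# collect the 3-digit hex group codes of all subject lines matching the word
egrep -i "^....*$1" /usr/spool/news/.nn/subjects |\
sed 's/^\(...\).*/\1/' | uniq > /tmp/nngrab$$
# translate the codes into newsgroup names
v=`fgrep -f /tmp/nngrab$$ /usr/spool/news/.nn/map | sed 's/.* //'`
# nothing relevant: return silently
case $v in "") exit ;; esac
nn -Q -mxX -s"$1" $v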
----------------------------------------------------------------------------
'nngrab' returns silently if there is no relevant news, and fires up
normal 'nn' otherwise. It operates by matching the word against the
pre-stored subject lines, each of which begins with a three-digit hex
group ID, and then mapping the IDs of the matching lines to real newsgroup
names (along with rare numeric false drops, when an ID also happens to
occur elsewhere in a 'map' line) for subsequent input to 'nn'.
[The hex output is a suboptimal, but simple space-saving code.]
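
To make the two-file lookup concrete, here is a small sketch that runs the
same pipeline against throwaway sample files. The subject and map records
are invented for illustration, 'lookup_groups' is a made-up name, and
'grep -E' / 'grep -F' stand in for the 'egrep' / 'fgrep' used above. It also
assumes a system that has 'mktemp'.

```shell
#!/bin/sh
# Illustration only: the sample records below are invented, not real news data.

# lookup_groups WORD SUBJECTS MAP -- print newsgroups having a matching subject
lookup_groups() {
    word=$1 subjects=$2 map=$3
    codes=`mktemp` || return 1
    # keep the 3-digit hex prefix of every subject line that mentions WORD
    grep -E -i "^....*$word" "$subjects" | sed 's/^\(...\).*/\1/' | uniq > "$codes"
    # translate the codes back to group names; skip the lookup if nothing matched
    if [ -s "$codes" ]; then
        grep -F -f "$codes" "$map" | sed 's/.* //'
    fi
    rm -f "$codes"
}

demo=`mktemp -d` || exit 1
trap 'rm -rf "$demo"' 0

# subjects file: hex group code glued to the subject text, sorted
cat > "$demo/subjects" <<'EOF'
000Re: compress patents
000fast grep wanted
001Boyer-Moore in egrep
002Amdahl benchmarks
EOF

# map file: hex group code, a space, the newsgroup name
cat > "$demo/map" <<'EOF'
000 comp.unix.wizards
001 comp.sources.d
002 comp.arch
EOF

lookup_groups grep "$demo/subjects" "$demo/map"
```

With these sample files the search word "grep" selects codes 000 and 001, so
the function prints comp.unix.wizards and comp.sources.d, which is the group
list 'nngrab' would hand to 'nn'.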
Naturally, you're running fast e?grep (GNU-style) or this is
all for naught. One possible problem is that subject underspecification
might tickle a limit with 'fgrep' in degenerate cases on some systems
("wordlist too large"). Now some folks might consider this good [you
didn't really want to call up all news articles containing the letter
'e', did you?] But if you don't consider this a feature, then
either: (1) raze the limits in the Berkeley 'fgrep' source, or (2)
bug Andrew Hume at AT&T for his implementation of the Commentz-Walter
algorithm in Unix Edition Nine 'fgrep', or (3) write a (slothful)
two-line 'awk' program utilizing associative arrays to map newsgroups
completely, or (4) make C code for same.
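
For what it's worth, option (3) might look something like the sketch below.
The name 'map_groups' is made up for illustration, and the file formats are
assumed to be exactly those produced by 'nnspew'. Since the lookup is keyed on
the first three characters only, it sidesteps both the 'fgrep' wordlist limit
and the numeric false drops.

```shell
#!/bin/sh
# Sketch of option (3); 'map_groups' is an illustrative name, not part of 'nn'.

# map_groups MAP -- read coded subject lines on stdin, print each group once
map_groups() {
    awk -v map="$1" '
        BEGIN { while ((getline line < map) > 0) { split(line, f, " "); group[f[1]] = f[2] } }
        { c = substr($0, 1, 3); if ((c in group) && !seen[c]++) print group[c] }
    '
}

# possible replacement for the sed/uniq/fgrep steps in nngrab:
#   v=`egrep -i "^....*$1" /usr/spool/news/.nn/subjects |
#      map_groups /usr/spool/news/.nn/map`
```

The BEGIN line loads the whole map into an associative array; the second line
prints a group name the first time its code is seen.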
Next, update the auxiliary 'map' and 'subjects' files via a
system-wide 'cron' or 'at' command, using a modified 'nn' (here called
from a script dubbed 'nnspew') to do the hard part:
----------------------------------------------------------------------------
2 6,9,12,15,18,21 * * * root /bin/nice /bin/sh /usr/lib/news/nnspew
----------------------------------------------------------------------------
is what's on one NASA Ames Research Center machine, along with:
----------------------------------------------------------------------------
# nnspew -- generate subject line database and newsgroup map
trap "rm -f /tmp/nnmap$$ /tmp/nnsubj$$; exit" 0 1 2 15
export TERM
TERM=vt100 # arbitrary
MAP=/usr/spool/news/.nn/map
SUBJECTS=/usr/spool/news/.nn/subjects
# list every active newsgroup in root's rc file
awk '{printf "+ 000000 %s\n", $1}' /usr/lib/news/active > /.nn/rc
# subjects arrive on stdout, the newsgroup map on stderr
/usr/lib/news/nnx -Q -mxX -sipsissima_verba all \
2>/tmp/nnmap$$ | sort -u > /tmp/nnsubj$$
# note: $? is the status of 'sort', the last command in the pipeline
case $? in 0) mv /tmp/nnmap$$ $MAP; mv /tmp/nnsubj$$ $SUBJECTS;;
esac
----------------------------------------------------------------------------
What we have here is a mutated form of 'nn' ('nnx') that spits out all
news article subject headers in all groups. (The Latinism is presumably
a non-occurring subject phrase whose sole purpose is to suppress 'nnx'
output other than that from the change below.) Sorting with the "unique"
flag saves disk space (~30-50%) by collapsing repeated followup subjects.
Another 15% could be trimmed by more complicated logic to merge
cross-posted subjects.
Another time/space tradeoff would allow a half-size subject file
by using 'compress'. [Only recommended for severely unbalanced
systems, e.g. an adrenal CPU like an Amdahl or high-freq Mips R2000
feeding a very slow filesystem or datapipe, say NFS<->optical disk.]
In 'nnspew', the awk line builds a master newsgroup file, listing every
active group, for 'root', the executor of 'nnx'. The lines containing "TERM"
are a kludge, frankly, there just to quash the interactive 'nn' milieu. This
hack could go away with more C code mods to 'nnx', I suppose.
----------------------------------------------------------------------------
Finally, there is 'nnx' itself, which is just a recompiled 'nn'
with three changes to file src/nn/group.c:
(1) to suppress the groupname indicator, comment out the line
        printf("\r%s", cur->group_name); clrline();
(2) add two lines of declaration near the top of access_group()
        static int gcount = -1;
        static char gsave[100];
    (more precisely, after the line containing "static char subptext[80];")
(3) immediately after the spot where the subject is read:
        ah->subject = alloc_str((int)hdr.dh_subject_length);
        if (fread(ah->subject, sizeof(char), (int)hdr.dh_subject_length, data)
            != hdr.dh_subject_length) goto data_error;
    add:
        if (strcmp(gsave, gh->group_name)) {
            gcount++;
            strcpy (gsave, gh->group_name);
            fprintf (stderr, "%03x %s\n", gcount, gh->group_name);
        }
        /* %.*s prints exactly the bytes just read, NUL-terminated or not */
        printf ("%03x%.*s\n", gcount, (int)hdr.dh_subject_length, ah->subject);
----------------------------------------------------------------------------
These changes really just feed a subject list to 'stdout' and a newsgroup
mapping to 'stderr'.
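
For anyone who wants to experiment with the file formats without rebuilding
'nn', the two output streams can be mimicked roughly as follows. This is only
a sketch: 'emit_streams' is a made-up name, and the tab-separated
"newsgroup, subject" input is an assumption standing in for the 'nn'
database. The counter starts at -1 and is bumped at each new group, as in
change (3).

```shell
#!/bin/sh
# Sketch only: imitates the output of the patched 'nnx', not 'nn' itself.

# emit_streams -- stdin: "newsgroup<TAB>subject" records, grouped by newsgroup
#                 stdout: hex code glued to each subject
#                 stderr: "hexcode newsgroup" map, one line per group
emit_streams() {
    awk -F '\t' '
        BEGIN { n = -1 }
        $1 != last {                      # new group: bump counter, log mapping
            n++; last = $1
            printf "%03x %s\n", n, $1 > "/dev/stderr"
        }
        { printf "%03x%s\n", n, $2 }      # every record: code plus subject
    '
}

# invented sample records
printf 'comp.arch\tAmdahl benchmarks\ncomp.arch\tRe: Amdahl benchmarks\ncomp.lang.c\tvoid main\n' |
    emit_streams
```

Redirecting as 'nnspew' does (2>map | sort -u > subjects) splits the two
streams into the 'map' and 'subjects' files.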
Wrapping up, all this pre-computation (a three-minute process, including
sorting, per 50 meg of news at one VAX MIP) is worth it even if there are
just a few invocations of 'nngrab' per day. In fact, the setup time of
'nnspew' is completely amortized in fewer than two uses of 'nngrab', so
crontab entries can be fairly liberal.
I have found the script invaluable as shorthand to call up
already-perused as well as unsubscribed topics. My only wish would be for
standard 'nn' to add the "Keywords:" field to its database. Ah yes, all in
keeping with the silk-purse solution to a sow's ear!
James A. Woods (ames!jaw)
NASA Ames Research Center