[news.software.b] NNGRAB -- a netnews subject search speedup for 'nn'

jaw@eos.UUCP (James A. Woods) (08/30/89)

# "Information is any difference that makes a difference."  -- G. Bateson

     Here is a tonic to aid fast USENET subject searches.  A big assumption
is that you subscribe to 'nn' and its delightful article header database.

     For a cheap keyword-based news selector, the operation

	nn -mxX -sword all

is useful, except that on many systems (say a VAX 11/785 or Sun 3-class
server with a fortnight of news), it may take upwards of a minute or two
of wall clock time on a moderately-loaded machine.  This can still be
faster than ransacking 80MB of /usr/spool/news with 'grep', but not
nearly enough for newsaholics. 

      The changes which follow reduce the time to a couple of seconds
or less.  The catch involves keeping another file (about 1% of the news
spool's size) updated a few times a day, much as 'fastfind' does.

      First, define the shell command 'nngrab', where

	nngrab word

is simple shorthand for the 'nn' one-liner above:

----------------------------------------------------------------------------
#!/bin/sh
# nngrab -- quick news retrieval by subject search

trap "rm -f /tmp/nngrab$$" 0 1 2 15

# collect the 3-hex-digit group codes of subjects matching the search word
egrep -i "^....*$1" /usr/spool/news/.nn/subjects |\
	sed 's/^\(...\).*/\1/' | uniq > /tmp/nngrab$$

# translate the surviving codes back to real newsgroup names via the map
v=`fgrep -f /tmp/nngrab$$ /usr/spool/news/.nn/map | sed 's/.* //'`
case $v in "") exit ;; esac	# no relevant news: stay silent

nn -Q -mxX -s$1 $v
----------------------------------------------------------------------------

'nngrab' returns silently if there is no relevant news, and fires up
normal 'nn' otherwise.  It operates by mapping substring-matched subject
lines, via their pre-stored three-digit hex group codes, to real newsgroup
names (along with rare numeric false drops) for subsequent input to
'nn'.  [The hex output is a suboptimal, but simple, space-saving code.]
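To make the mapping concrete, here is a toy run of the same pipeline
stages on made-up 'subjects' and 'map' contents.  The exact file layouts
are assumptions reconstructed from the description, not files lifted from
a real spool:

```shell
#!/bin/sh
# Toy 'subjects' file: 3-hex-digit group code, then the subject text.
# (Layout is an assumption based on the description above.)
cat > /tmp/toy.subjects <<'EOF'
000 Re: fast grep implementations
000 fgrep wordlist limits
001 nn keyword selection
EOF

# Toy 'map' file: group code followed by the real newsgroup name.
cat > /tmp/toy.map <<'EOF'
000 comp.sources.unix
001 news.software.b
EOF

# The same stages as nngrab: match subjects, keep the hex prefixes,
# then translate the surviving prefixes back to newsgroup names.
egrep -i "^....*grep" /tmp/toy.subjects |
    sed 's/^\(...\).*/\1/' | uniq > /tmp/toy.codes
groups=`fgrep -f /tmp/toy.codes /tmp/toy.map | sed 's/.* //'`
echo "$groups"
# -> comp.sources.unix
rm -f /tmp/toy.subjects /tmp/toy.map /tmp/toy.codes
```

Both "grep" hits carry code 000, so 'uniq' collapses them and only
comp.sources.unix comes back out of the map.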

     Naturally, you're running a fast e?grep (GNU-style) or this is
all for naught.  One possible problem is that subject underspecification
might tickle a limit in 'fgrep' in degenerate cases on some systems
("wordlist too large").  Now, some folks might consider this good [you
didn't really want to call up all news articles containing the letter
'e', did you?].  But if you don't consider this a feature, then
either: (1) raze the limits in the Berkeley 'fgrep' source, or (2)
bug Andrew Hume at AT&T for his implementation of the Commentz-Walter
algorithm in Unix Edition Nine 'fgrep', or (3) write a (slothful)
two-line 'awk' program utilizing associative arrays to map newsgroups 
completely, or (4) make C code for same.
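Option (3) might be sketched like so -- an assumption on my part, not
code from any nn distribution, with the function name 'nnmap' invented
for illustration and the file layouts taken from the description above:

```shell
# A sketch of option (3): let awk's associative arrays do the
# code-to-group translation, sidestepping fgrep's wordlist limit.
nnmap() {
    # $1 = search word, $2 = subjects file, $3 = map file
    egrep -i "^....*$1" "$2" | awk '
        NR == FNR { group[$1] = $2; next }   # first file: load code -> group
        { print group[substr($0, 1, 3)] }    # stdin: hex prefix -> group name
    ' "$3" - | sort -u
}
```

The map is small (one line per active group), so holding it all in an
awk array costs little, and there is no fixed wordlist ceiling to hit.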

     Next, update the auxiliary 'map' and 'subjects' files via a
system-wide 'cron' or 'at' command, using a modified 'nn' (here called
from a script dubbed 'nnspew') to do the hard part:

----------------------------------------------------------------------------
2 6,9,12,15,18,21 * * *  root /bin/nice /bin/sh /usr/lib/news/nnspew
----------------------------------------------------------------------------

is what's on one NASA Ames Research Center machine, along with:

----------------------------------------------------------------------------
#!/bin/sh
# nnspew -- generate subject line database and newsgroup map

trap "rm -f /tmp/nnmap$$ /tmp/nnsubj$$; exit" 0 1 2 15
export TERM
TERM=vt100	# arbitrary; just quiets nn's interactive terminal setup

MAP=/usr/spool/news/.nn/map
SUBJECTS=/usr/spool/news/.nn/subjects

# subscribe root to every group in the active file
awk '{printf "+ 000000 %s\n", $1}' /usr/lib/news/active > /.nn/rc

# nnx emits subjects on stdout and the group map on stderr
/usr/lib/news/nnx -Q -mxX -sipsissima_verba all \
   2>/tmp/nnmap$$ | sort -u > /tmp/nnsubj$$

# install the new files only on success
case $? in 0) mv /tmp/nnmap$$ $MAP; mv /tmp/nnsubj$$ $SUBJECTS;;
esac
----------------------------------------------------------------------------

What we have here is a mutated form of 'nn' ('nnx') to spit out all
news article subject headers in all groups.  (The Latinism is presumably
a non-occurring subject phrase whose sole purpose is to suppress 'nnx'
output other than from the change below.)  Sorting with the "unique"
flag saves disk (~30-50%) by condensing followup verbiage.  Another 15%
could be trimmed by more complicated logic to merge cross-posted subjects.
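That merge might look something like the following -- a rough sketch of
mine, not the article's code, with 'merge_crossposts' an invented name.
The cost of this simple-minded version is that a cross-posted subject
then maps to just one of its groups:

```shell
# After sorting, keep each distinct subject text only once, dropping
# repeats that differ only in their 3-hex-digit group-code prefix.
merge_crossposts() {
    awk '{
        subj = substr($0, 4)                  # strip the group-code prefix
        if (!(subj in seen)) { seen[subj] = 1; print }
    }'
}
```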

     Another time/space tradeoff would allow a half-size subject file
by using 'compress'.  [Only recommended for severely unbalanced
systems, e.g. an adrenal CPU like an Amdahl or high-freq Mips R2000
feeding a very slow filesystem or datapipe, say NFS<->optical disk.]
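The compressed variant amounts to making nngrab's first stage a
decompress-and-search.  A sketch, with 'gzip' standing in for the era's
'compress'/'zcat' pair and the function name invented for illustration:

```shell
# Search a compressed subjects file without keeping the plain copy around.
zsubjgrep() {
    # $1 = search word, $2 = compressed subjects file
    gzip -dc "$2" | egrep -i "^....*$1"
}
```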

     In 'nnspew', the awk line fabricates a master newsgroup file for
'root' (whose home directory is '/', hence /.nn/rc), the executor of
'nnx'.  The lines containing "TERM" are, frankly, a kludge, there just
to quash the interactive 'nn' milieu.  This hack could go away with more
C code mods to 'nnx', I suppose.

----------------------------------------------------------------------------
     Finally, there is 'nnx' itself, which is just a recompiled 'nn'
with three changes to file src/nn/group.c:

(1) to suppress the groupname indicator, comment out the line

      printf("\r%s", cur->group_name); clrline();

(2) add two lines of declaration near the top of access_group()

      static int gcount = -1;
      static char gsave[100];

(more precisely, after the line containing "static char	subptext[80];")
      
(3) immediately after the spot where the subject is read:

      ah->subject = alloc_str((int)hdr.dh_subject_length);
      if (fread(ah->subject, sizeof(char), (int)hdr.dh_subject_length, data)
  	  !=  hdr.dh_subject_length) goto data_error;
add:
      if (strcmp(gsave, gh->group_name)) {
	  gcount++;
	  strcpy (gsave, gh->group_name);
	  fprintf (stderr, "%03x %s\n", gcount, gh->group_name);
      }
      printf ("%03x%*s\n", gcount, (int)hdr.dh_subject_length, ah->subject);
----------------------------------------------------------------------------

These changes really just feed a subject list to 'stdout' and a newsgroup
mapping to 'stderr'.
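The two-stream trick in miniature (toy data, not actual 'nnx' output):
stderr bypasses the sort pipeline while stdout flows through it, so one
pass over the articles yields both files.

```shell
# map lines go to stderr, subject lines to stdout
emit() {
    echo "000 news.software.b" >&2    # map entry, sidesteps the pipe
    echo "000 zzz subject"            # subjects fall through to sort
    echo "000 aaa subject"
}
emit 2>/tmp/demo.map | sort -u > /tmp/demo.subjects
```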

    Wrapping up, all this pre-computation (a three-minute process, including
sorting, per 50 meg of news at one VAX MIP) is worth it if there are just
a few invocations of 'nngrab' per day.  In fact, the setup time of 'nnspew'
is completely amortized in less than two uses of 'nngrab', so crontab entries
can be fairly liberal.

     I have found the script invaluable as shorthand to call up already-
perused as well as unsubscribed topics.  My only wish would be for standard
'nn' to add the "Keywords:" fields to its database.  Ah yes, all in keeping
with the silk-purse solution to a sow's ear!


James A. Woods (ames!jaw)
NASA Ames Research Center