[comp.lang.perl] selective dissemination of usenet information

wyle@lavi.uucp (Mitchell Wyle) (03/02/90)

Usenet News readers present you with *ALL* new articles in the groups you
are reading.  Kill files help filter, but you are still bombarded with
articles.

Information Retrieval systems (boolean searches over full-text databases)
present you with all articles matching your regular expression, regardless
of whether you have seen them previously.  You actively search, instead of
passively reading, junking, or killing.

SDI systems (Selective Dissemination of Information) are sort of like
boolean retrieval, but they give you only the items that are new since the
last query, and they run the query for you automatically and periodically.
The system searches for you.

I want to build a very simple e-mail based SDI system for usenet news.
Perl seems ideal for such an application.

If I create a directory of user-profiles with files in them like:

          |-.newsrc
          |-address 
|-person1-|-limit   
          |-query

and have the program maintain the individual .newsrc's, it should be
easier than using find based on date and a "touchfile."  query contains
an egrep regular expression, and limit contains the maximum number of
articles the user wants to receive.
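
For example (contents hypothetical), person1's files might hold:

          address:  person1@somehost.uucp
          limit:    10
          query:    perl|awk|sed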

Here is a rather weak first attempt in awk:

#!/bin/sh

# @(#)server    1.2   90/02/28 11:50:36

echo " " |

awk 'BEGIN {
  sp="/usr/spool/news/spool/"
  ac="/usr/spool/news/lib/active"
  pr="/usr/wyle/pasadina/3/profiles"

  while ( (getline < ac) > 0 ) {  # load active file in assoc arrays
    active_high[$1] = $2 + 0    # highest active article
  }

  "ls " pr | getline t          # load all users profile dirs
  split(t,profs)                # with a hack (using split)
  close("ls " pr);
}
{
  for (ind in profs) {
    prof = profs[ind]
    ne = pr "/" prof "/newsrc"
    system("rm /tmp/re " ne)                    # remove old work file(s)
    f = pr "/" prof "/.newsrc"
    while ( getline < f > 0 ) {                 # go through the .newsrc
      high_nrc = substr($2,index($2,"-")+1,9)   # get list of read articles
      low_nrc  = substr($2,1,index($2,"-")-1) 
      ng=substr($1,1,(length($1)-1))            # get newsgroup name
      upper=active_high[ng]                     # get active files high val
      gsub("\\.","/",ng)                        # convert . to / for spool path
      cmd = "xargs egrep -l \"`cat " pr "/" prof "/query`\" >> /tmp/re"
      for ( i = high_nrc+1 ; i <= upper ; i++ ) # for all unread articles,
        print sp ng "/" i | cmd                 # hand the file name to egrep
      close(cmd)                                # one fork per group, not per article
      gsub("/",".",ng)                          # convert back to dotted name
      print ng ": " low_nrc "-" upper >> ne     # mark everything read
    }    # end while
    close(ne);
    system("mv " ne " " f)  # update current .newsrc

# We have a list of articles which have matched the regex in
# the file /tmp/re.   Loop through it and send stuff to subscribers

    af = pr "/" prof "/address"; getline ad < af; close(af)  # user's address
    lf = pr "/" prof "/limit";   getline li < lf; close(lf)  # message limit
    li += 0
    mn = 0

    while ( ((getline < "/tmp/re") > 0) && (mn++ < li) ) {
      system("cat " $0 " | mail " ad)
    }
    close("/tmp/re")   # so the next profile re-reads the file from the top
  }
}'


This system is pretty slow because it forks and execs an 
"xargs egrep ..."   Anyone want to give this a whirl
in perl?  You could use it to supplement your usenet habit ;-)
I'd be much obliged.
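
As a rough starting point, here is how the same loop might look in
perl -- an untested sketch, assuming the same paths and profile layout
as above, but with the matching done in-process so nothing at all is
forked per article:

-----------------------------------------------------------------
#!/usr/bin/perl
# sdi.pl -- untested perl sketch of the awk server above

$sp = "/usr/spool/news/spool";
$ac = "/usr/spool/news/lib/active";
$pr = "/usr/wyle/pasadina/3/profiles";

open(AC, $ac) || die "can't open $ac: $!\n";
while (<AC>) {                            # load the active file
    ($ng, $high) = split;
    $active_high{$ng} = $high;
}
close(AC);

opendir(PR, $pr) || die "can't read $pr: $!\n";
foreach $prof (grep(!/^\./, readdir(PR))) {
    open(Q, "$pr/$prof/query")   || next;  chop($query = <Q>);  close(Q);
    open(A, "$pr/$prof/address") || next;  chop($addr  = <A>);  close(A);
    open(L, "$pr/$prof/limit")   || next;  chop($limit = <L>);  close(L);

    @hits = ();
    open(NRC, "$pr/$prof/.newsrc")  || next;
    open(NEW, "> $pr/$prof/newsrc") || next;
    while (<NRC>) {                       # go through the .newsrc
        next unless ($ng, $low, $high) = /^(\S+): (\d+)-(\d+)/;
        $upper = $active_high{$ng};
        ($dir = $ng) =~ s,\.,/,g;         # comp.lang.perl -> comp/lang/perl
        for ($i = $high + 1; $i <= $upper; $i++) {
            next unless open(ART, "$sp/$dir/$i");  # skip expired articles
            while (<ART>) {
                push(@hits, "$sp/$dir/$i"), last if /$query/;
            }
            close(ART);
        }
        print NEW "$ng: $low-$upper\n";   # mark everything read
    }
    close(NRC); close(NEW);
    rename("$pr/$prof/newsrc", "$pr/$prof/.newsrc");

    $mn = 0;                              # mail at most $limit hits
    foreach $art (@hits) {
        last if $mn++ >= $limit;
        system("mail $addr < $art");
    }
}
closedir(PR);
-----------------------------------------------------------------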

Another hack using rn:

-----------------------------------------------------------------
#!/bin/sh

export RNMACRO RNINIT WODIR

   TDIR=/home/antares1/wyle/research/wan/pasadina/3
  WODIR=$TDIR"/work/news"
  PRDIR=$TDIR"/profiles"
 RNINIT="-s -T -t -d$WoDir"
RNMACRO=/tmp/rnkill$$

trap 'rm -f $RNMACRO; exit' 1 2 3 15

/bin/rm -f $PRDIR/*/hits   # remove old hitlists
/bin/rm -rf $WODIR/*    # remove old KILL file macros

# create all KILL file directories, and empty KILL files:

cat $PRDIR/*/groups | 
sed -e 's;\.;/;g' -e "s;^;mkdir -p $WODIR/;" > $WODIR/x
. $WODIR/x
sed -e "s/mkdir -p/touch/" -e "s;$;/KILL;" < $WODIR/x > $WODIR/y
. $WODIR/y
/bin/rm -f $WODIR/x $WODIR/y
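
# e.g. a groups entry of comp.lang.perl yields (with $WODIR expanded)
# one line per group in each file:
#     x:  mkdir -p $WODIR/comp/lang/perl
#     y:  touch $WODIR/comp/lang/perl/KILL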


# create all KILL file entries:

cd $PRDIR
for p in * ; do
  touch $p/hits
  groups=`cat $p"/groups"`
  c=`cat $p/query`
  c=`echo $c | sed "s;^;/:\.\*;"`
  c=`echo $c | sed "s;$;\.\*/a:\!echo \%A \>\> $PRDIR/$p/hits;"`
  for group in $groups; do
    kf=$WODIR"/"`echo $group | sed 's;\.;/;g'`"/KILL"
    echo $c >> $kf
  done
done
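
# e.g. a query of "perl" for person1 leaves in each listed group's
# KILL file a line like (with $PRDIR expanded):
#     /:.*perl.*/a:!echo %A >> $PRDIR/person1/hits
# i.e. append the matching article's file name to that subscriber's
# hit list.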

# We have to hack the .newsrc here for testing:

export RNMACRO RNINIT WODIR
  WODIR=$TDIR"/work"
# RNINIT="-s -T -t -d$WODIR"
 RNINIT="-s -T -d$WODIR"
RNMACRO=/tmp/rnkill$$

mv $HOME/.newsrc $HOME/s.newsrc
cp $WODIR/.newsrc $HOME/.newsrc

# use rn to create hit lists:

echo "z %(%m=n?.qcy^M:n)^(z^)" > $RNMACRO
echo "z" | rn
rm $RNMACRO

# mv $HOME/s.newsrc $HOME/.newsrc


# Now we should have, for each subscriber, a new "hit" list of
# articles which match the regular expressions in the rn KILL files

exit 0
-------------------------------------------------------------------


Does anything (other than newsclip) like this exist?

-Mitchell F. Wyle
Institut fuer Informationssysteme         wyle@inf.ethz.ch 
ETH Zentrum / 8092 Zurich, Switzerland    +41 1 254 7224
Kleptomaniac, n.:
	A rich thief.
		-- Ambrose Bierce, "The Devil's Dictionary"

emv@math.lsa.umich.edu (Edward Vielmetti) (03/03/90)

Mitchell Wyle, in <1190@gorath.cs.utexas.edu>, writes:

  SDI systems (Selective Dissemination of Information) are sort of like
  boolean retrieval but they give you only new items since the last query,
  and query for you automatically and periodically.  The system searches for
  you.

  I want to build a very simple e-mail based SDI system for usenet news.
  Perl seems ideal for such an application.

I do this in 'gnus', the gnu emacs newsreader.

Here's a sample kill file:

~/News/comp.lang.c.KILL
(gnus-kill "" "FTP" "u")
(gnus-kill "From" "Torek\\|Spencer\\|Gwyn\\|pardo" "u")
(gnus-kill "Subject" ".")
(gnus-expunge "X")

i.e. search everywhere for FTP, read stuff by these folks, and
ditch the rest.  For comp.sys.amiga.hardware:

(gnus-kill "" "FTP\\|SCSI" "u")
(gnus-kill "Subject" ".")
(gnus-expunge "X")

No doubt there's an equally good way with perl.
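
For instance (untested, spool path assumed), a quick throwaway
version of that amiga filter:

  perl -ne 'if (/FTP|SCSI/) { print "$ARGV\n"; close(ARGV); }' \
       /usr/spool/news/comp/sys/amiga/hardware/*

close(ARGV) moves <> on to the next article after the first hit, so
each matching article's file name is printed just once.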