[comp.binaries.ibm.pc] v03i022: unnews, automatic news extract & uudecode

rusty@cadnetix.com (Rusty Carruth) (06/03/89)

Checksum: 3093563129  (Verify with "brik -cv")
Posting-number: Volume 03, Issue 022
Submitted-by: rusty@cadnetix.com (Rusty Carruth)
Archive-name: un_news/un_news.sha

[ This is a package of shell and awk scripts that will automatically
extract and uudecode Usenet postings.  Because it will be useful to
readers of this newsgroup, and because not all of them might read
comp.sources.misc, I decided it was appropriate for posting here.
Since these scripts are probably only usable on Berkeley UNIX systems,
they are posted as they arrived, in shar form.  They have not been
tested here.  -- R.D.  ]

[ From: ]

Rusty Carruth  UUCP:{uunet,boulder}!cadnetix!rusty  DOMAIN: rusty@cadnetix.com
Daisy/Cadnetix Corp. (303) 444-8075\  5775 Flatiron Pkwy. \ Boulder, Co 80301
Radio: N7IKQ    'home': P.O.B. 461 \  Lafayette, CO 80026

#! /bin/sh
# This is a shell archive.  Remove anything before this line, then unpack
# it by saving it into a file and typing "sh file".  To overwrite existing
# files, type "sh file -c".  You can also feed this as standard input via
# unshar, or by typing "sh <file", e.g..  If this archive is complete, you
# will see the following message at the end:
#		"End of shell archive."
# Contents:  README news.un-news-er .news.autodearc news.unnews.awk1
#   news.unnews.awk2
# Wrapped by rusty@rusty on Tue Apr  4 16:11:39 1989
PATH=/bin:/usr/bin:/usr/ucb ; export PATH
if test -f 'README' -a "${1}" != "-c" ; then 
  echo shar: Will not clobber existing file \"'README'\"
else
echo shar: Extracting \"'README'\" \(4874 characters\)
sed "s/^X//" >'README' <<'END_OF_FILE'
April 4, 1989
X
X
X
Well, here it is, as promised.  This little set of shell scripts,
awk programs, and other junk is designed to make my life a bit
nicer.  I was regularly grabbing ALL articles in the source
and binary newsgroups and then trying to find time to un-news
them (un-news, n. - a term created by Rusty Carruth (at least,
I think I created it) to denote the process of examining saved
news articles for complete packages and applying the appropriate
programs for reconstruction of the original package).
X
XFiguring that there had to be a better way, I wrote an automatic
news-grabber which would copy all news articles from the
newsgroups I was interested in over to my machine for later
un-newsing.  That kept me from missing any news articles.
X
However, I had a problem - I had megabytes of articles which
had not yet been un-newsed.  My disk was filling up!  (Pretty 
bad for a 50 meg disk)  So, the un-news scripts were born.
X
The version I now have (and am now posting) will attempt to
process all uuencoded archives and all multiple-part shell archives.
Both single and multiple-part uuencoded archives are handled.
I have attempted to be sure that the articles really are
uuencoded (or shar-ed, in the case of shell archives); I
currently have no procedure for automatically applying patches,
nor for automatically un-shar'ing the patch file.  Because
of the way this checking is done, it is highly likely that some
valid articles will be missed.
X
When the scripts have finished running, they will email the 
results to whatever userid you specify.  (I have my scripts 
set up to run automatically every day; you may not wish to use
this feature.)
X
One thing to be aware of is that these scripts automatically
run sh on archives which are recognized.  If you wish to be
very careful, you will not want un-checked programs running
automatically.  The first line of caution would be to run 
a 'safe shell' which disallows certain actions.  To be
really safe, remove the automatic un-shar portion of
the script and inspect all shar archives by hand.
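X
X(Purely as an illustration - this is not part of the scripts - one way
to look before you leap is to pull out the same range of lines that the
script would otherwise feed to sh and page through it by hand, e.g.
X
X	awk '/(^#! *\/bin)/,/(^exit)/ {print}' article.1234 | more
X
where 'article.1234' stands for whichever saved article you want to
inspect.)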
X
You will need to move the files to their 'correct' place
after un-sharing them; see the file news.un-news-er for
more info.  If you want to run the script automatically, add
a line like the following to your /usr/lib/crontab (or
use at, or whatever):
X
X0 6 * * * su rusty < /usr/rusty/bin/news.un-news-er.csh >/dev/console
X
In my case, I run as rusty, and then run the 'news.un-news-er.csh'
script, which is simply:
X
csh ~rusty/bin/news.un-news-er
X
to make the script run under the C shell rather than the Bourne shell.
X
These scripts are probably very BSD-dependent; sorry about that.
If you have SysV, you can remove a bunch of junk in the first
awk script (<blah>.awk1) which makes up for not having a 'match'
function in BSD awk.  (Look at the long IF string...)
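X
X(For what it is worth, on an awk that does have match() the whole
digit-hunting loop could probably be collapsed to something like
X
X	if (match($NF, "[0-9]+")) FIRSTNUM = substr($NF, RSTART, RLENGTH)
X
but that is only an untested sketch; BSD awk has no match(), RSTART
or RLENGTH, which is exactly why the long IF string is there.)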
X
I have probably forgotten to tell you something very important,
so I will apologize in advance:  I apologize for forgetting something
which made this not work on your system! :-)
X
Don't forget to move the files from where you un-shar them to
wherever you decide you want them to live.
X
And a final note about the directory structure this thing expects.
X
The assumption I made about directory structure is that you will run
the script from the top of a tree which matches the news directory
structure, at least as far down as you travel (**INCLUDING**
any directories contained at the level you descend to, e.g.
if you go to comp/binaries/ibm/pc, then you must have the 'd'
directory even if you don't save the comp/binaries/ibm/pc/d
articles... THIS IS VERY IMPORTANT!).
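X
X(As a concrete example - the paths here are only illustrative - if you
keep the tree under ~/News and only follow comp.binaries.ibm.pc, the
directories could be created with something like
X
X	cd ~/News
X	mkdir comp comp/binaries comp/binaries/ibm comp/binaries/ibm/pc
X	mkdir comp/binaries/ibm/pc/d
X
with the 'd' directory created even though nothing may ever be saved
in it.)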
X
Another assumption I made about directory structure is that only
news articles would be 'visible'.  All files which you do not wish 
X'un-news' to look at had better be hidden (start with a '.').  
That is why '.TOTAKE' and '.COMPRESSED.STUFF' start with a '.',
to keep them from being seen by the 'grep *' command.
X
You will find the un-newsed articles in the '.TOTAKE' subdir of
the newsgroup from which the article originated, and the compressed
article ends up in the '.COMPRESSED.STUFF' subdir.  You may wish to 
link that one off somewhere (oops, BSD creeping in again!) so that
the space is not used on your machine.
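X
X(For example - the path on the other end is made up - you could, before
the scripts have ever created .COMPRESSED.STUFF, do something like
X
X	mkdir /usr/spool/junk/pc.compressed
X	ln -s /usr/spool/junk/pc.compressed .COMPRESSED.STUFF
X
so the compressed copies end up on some other disk.)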
X
I do not have 'zoo' running on my Sun yet, so the source files
are not zoo-ed together.  My "final" version will make zooing
the last step of the un-shar process.  If you have zoo on your
Un*x box, you may wish to do this also.
X
This 'software' is 'unleashed', 1989 Carroll D. Carruth, Jr. 
X(that's my legal name).  Please use/abuse/trash/ignore these
scripts as you see fit.  I would appreciate it if you send
me bug fixes and enhancements (I think I will, anyway) at
uunet!cadnetix!rusty or rusty@cadnetix.com.
X
Please don't sell these things; they are not worth it!
X
XFor all practical purposes, this is version 2.0 of this thing.
X
X--rusty carruth 
END_OF_FILE
if test 4874 -ne `wc -c <'README'`; then
    echo shar: \"'README'\" unpacked with wrong size!
fi
# end of 'README'
fi
if test -f 'news.un-news-er' -a "${1}" != "-c" ; then 
  echo shar: Will not clobber existing file \"'news.un-news-er'\"
else
echo shar: Extracting \"'news.un-news-er'\" \(7270 characters\)
sed "s/^X//" >'news.un-news-er' <<'END_OF_FILE'
X#! /bin/csh
X# this script is a test script to be run in a subdir in which you wish to
X# convert news articles into their uudecoded or unshar'd (as appropriate)
X# pieces.  Output goes into a directory called '.TOTAKE', compressed 
X# copies of the original articles go into '.COMPRESSED.STUFF'.  A list
X# of all subject lines can be found in '.subjects', and a 
X# list of sorted subject lines is in '.subjects.sorted'.
X#
X# Variables used by this script to access other files or users,
X# and their default values
X#	HOMEDIR		~/News		Where you have those
X#					directories set up
X#	GROUPLIST	~/.news.autodearc  list of newsgroup dirs
X#						to scan
X#	SCRIPTDIR	~/bin		where to find the secondary
X#					files
X#	MAILFILE	~/news.autodearc.results
X#					what file to use to put the
X#					results of this script
X#	user		<from shell>	who to mail results to.
X#	AWKSCRIPT	news.unnews.awk1 name of awk script which
X#					finds subject lines of the
X#					form '<blah> part number/number'
X#					(and variations thereof) where
X#					both numbers are the same
X#	AWKSCRIPT2	news.unnews.awk2 same as above, but does not
X#					require numbers to be equal
X#
X#  Directory structure example for comp.binaries.ibm.pc is:
X#   News/comp/binaries/ibm/pc		<----ibm pc binary directory
X#		1234	1235	1236	<----articles
X#		.TOTAKE			<----where the uncrunched goes
X#		.COMPRESSED.STUFF	<----where the article gets
X#						sent after un-doing
X#		d			<----directory (empty, in 
X#						this case) which
X#						corresponds to the
X#						newsgroup c.b.i.p.d
X#						THIS IS NEEDED!
X#
X# Note that there is a directory 'd' which is required to be kept
X# even though you may not use it.  The script checks to be sure
X# that it is not trying to deal with directories on the news
X# server, and it uses the local environment to filter them out.
X# (Since we are not supposed to be munging around on the news
X# server!)
X#
X# For shell archives, .TOTAKE will hold directories which contain
X# the un-shar'd files for the archive.  The name of the directory
X# will be the name of the first archive volume.
X#
X# For the multiple-part decoder to work, the subject line must
X# be of the form "Subject: vxxiyy: <blah> part n/n" (and some 
X# variations on that, see the awk script :-).  Currently, the 
X# following  newsgroups seem to (mostly) follow that convention:
X#	
X#	comp/sources/games
X#	comp/sources/misc
X#	comp/sources/x
X#	comp/binaries/ibm/pc
X# (there may be others as well)
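X#
X# An illustrative subject line (made up, not taken from any real
X# posting) that the first awk script would pick up:
X#	Subject: v99i999: somepackage, Part 3/3
X# i.e. the part number equals the part count, which marks the
X# final piece of a set.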
X
X#set echo ; set verbose
X
set HOMEDIR =   ~/News
set MAILFILE =  ~/news.autodearc.results
X
set GROUPLIST = ~/.news.autodearc
X
set SCRIPTDIR = ~/bin
set AWKSCRIPT =  news.unnews.awk1
set AWKSCRIPT2 = news.unnews.awk2
X
cd $HOMEDIR
set subdirlist = `egrep -v '(^#)' $GROUPLIST`
echo results of automatic de-newsing as of `date` >>$MAILFILE
foreach subdir ($subdirlist)
X   pushd $subdir
X   echo "-----Doing $subdir-------------" >> $MAILFILE
X   if( ! -d .COMPRESSED.STUFF ) mv .COMPRESSED.STUFF COMPRESSED.STUFF.WHATSIT
X   if( ! -e .COMPRESSED.STUFF ) mkdir .COMPRESSED.STUFF
X   if( ! -d .TOTAKE ) mv .TOTAKE TOTAKE.WHATSTHIS
X   if( ! -e .TOTAKE ) mkdir .TOTAKE
X   set nonomatch ; rm .subjects* ; unset nonomatch
X   egrep '(^Subject: )' * |  sort -f +2.0 > .subjects.sorted
X   awk -f $SCRIPTDIR/$AWKSCRIPT .subjects.sorted > .subjects.todo
X   set SUBLIST = `awk -F: '{print $1}' .subjects.todo`
X# the above awk script returns the list of files which have 'part x/x' in them
X   set SETLIST = ''
X   foreach SET ($SUBLIST)
echo $SET
X      set ZZZ = "(^$SET)"
X      set LINE = `egrep $ZZZ .subjects.todo`
X      set ARCNAME = `echo $LINE | awk -F: '{print $3}'`
X      @ NFILES   = `echo $LINE | awk '{print $(NF-1)}'`
X      @ ENDLINE = `echo $LINE | awk '{print $NF}'`
X      @ STARTLINE = $ENDLINE + 1 - $NFILES 
X      awk "NR<=$ENDLINE {print}" .subjects.sorted |awk "NR>=$STARTLINE {print}" > .garbage.list
X      set FILELIST = `awk -F: '{print $1}' .garbage.list`
X      set FIRSTFILE = `echo $FILELIST | awk '{print $1}'`
X# I found a problem when the number of articles received did not equal
X#	the number in the group, so the following stuff tries to
X#	check for this.  First, there must be the right number of files
echo $FILELIST
X      if (`echo $FILELIST | wc -w` != $NFILES ) then
X	 echo "Wrong number of files for $LINE">> $MAILFILE
X      else if( `awk -f $SCRIPTDIR/$AWKSCRIPT2 .garbage.list | wc -w` != $NFILES) then
X# note that I am not doing nearly as much as I could here.  A full test
X# would be to make sure that the numbers went 1,2,3,... to $NFILES.
X#  I'll do all that if it turns out I need to!
X	 echo "Strange error in file list: $LINE" >>$MAILFILE
X# see if this is a uuencoded mess, below returns true if it is
X      else if ( "` egrep '(^begin *[0-9]+)' $FIRSTFILE`" != '' ) then
X	 echo "uuencoded data: $LINE" >> $MAILFILE
X	 pushd .TOTAKE
X	 set UUFILENAME = `egrep '(^begin)' ../$FIRSTFILE |awk '{print $3}'`
X	 set UUFILENAME = `echo $UUFILENAME | awk -F. '{print $1}'`
X	 awk 'NR<=1 , /(^BEGIN)/ {print}' ../$FIRSTFILE >> $UUFILENAME.hdr
X	 if ( -e .trashit. ) rm .trashit. 
X	 foreach SHARC ($FILELIST)
X	    cat ../$SHARC |awk '/(^BEGIN)/,/(^END)/ {print}'|egrep -v '(^BEGIN)|(^END)' >>.trashit.
X	 end
X	 uudecode .trashit. && rm .trashit.
X	 if( -e .trashit.) echo "uudecode failed" >> $MAILFILE
X	 popd
X	 mv $FILELIST .COMPRESSED.STUFF 
X#	   it could be argued that compressing uuencoded stuff is useless.
X#	   It probably is a waste of time, but let's do it anywho.
X	 pushd .COMPRESSED.STUFF ; compress $FILELIST ; popd
X# see if it is a shar archive
X      else if ("`egrep '(^#! *\/bin)' $FIRSTFILE`" != '' ) then
X	 echo "shar archive  : $LINE" >> $MAILFILE
X	 mkdir .TOTAKE/$ARCNAME
X	 pushd .TOTAKE/$ARCNAME
X	 awk 'NR<=1 , /(^#! *\/bin)/ {print}' ../../$FIRSTFILE > $FIRSTFILE.hdr
X	 foreach SHARC ($FILELIST)
X	    echo "Doing file $SHARC" >> unshar.report
X	    cat ../../$SHARC |awk '/(^#! *\/bin)/,/(^exit)/ {print}' |sh >> unshar.report
X	 end
X	 popd
X	 mv $FILELIST .COMPRESSED.STUFF 
X	 pushd .COMPRESSED.STUFF ; compress $FILELIST ; popd
X      else
X	 echo "not recognized: $LINE" >> $MAILFILE
X      endif
X    end
X#set echo ; set verbose
X# now let's see if we can handle single files...
X    set FILES = `egrep -l '(^BEGIN-+)' *`
X    set FILES = `egrep -l '(^END-+)' $FILES`
X    set FILES = `egrep -l '(^begin *[0-9]+ )' $FILES`
X    set FILES = `egrep -l '(^end)' $FILES`
X    foreach file ($FILES)
X	if( "`awk '/(^BEGIN-)/,/(^END-)/{print}' $file|awk '/(^begin *[0-9]+)/,/(^end)/{if(NR == 2) print}'`" != '') then
X	    echo "uuencoded data: `egrep '(^Subject)' $file`" >> $MAILFILE
X	    pushd .TOTAKE
X	    set UUFILENAME = `egrep '(^begin *[0-9]+)' ../$file |awk '{print $3}'`
X	    if ( -e $UUFILENAME ) then
X		echo "WARNING -- $UUFILENAME already exists, not processed" >> $MAILFILE
X	    else
X	       set UUFILENAME = `echo $UUFILENAME | awk -F. '{print $1}'`
X	       awk 'NR<=1 , /(^BEGIN-)/ {print}' ../$file >> $UUFILENAME.hdr
X
X	       cat ../$file | awk '/(^BEGIN-)/,/(^END-)/{print}' |awk '/(^begin *[0-9]+)/,/(^end)/{print}' | uudecode 
X	    endif
X	    popd
X	    mv $file .COMPRESSED.STUFF
X	    pushd .COMPRESSED.STUFF ; compress $file ; popd
X	endif
X    end
X    echo "Still have `ls |wc -w` files left" >> $MAILFILE
X    popd
end
echo END of results >> $MAILFILE
echo Mailing to $USER
cat $MAILFILE | mail  $USER && rm $MAILFILE
END_OF_FILE
if test 7270 -ne `wc -c <'news.un-news-er'`; then
    echo shar: \"'news.un-news-er'\" unpacked with wrong size!
fi
chmod +x 'news.un-news-er'
# end of 'news.un-news-er'
fi
if test -f '.news.autodearc' -a "${1}" != "-c" ; then 
  echo shar: Will not clobber existing file \"'.news.autodearc'\"
else
echo shar: Extracting \"'.news.autodearc'\" \(75 characters\)
sed "s/^X//" >'.news.autodearc' <<'END_OF_FILE'
comp/sources/x
X#comp/sources/games
X#comp/sources/unix
comp/binaries/ibm/pc
END_OF_FILE
if test 75 -ne `wc -c <'.news.autodearc'`; then
    echo shar: \"'.news.autodearc'\" unpacked with wrong size!
fi
# end of '.news.autodearc'
fi
if test -f 'news.unnews.awk1' -a "${1}" != "-c" ; then 
  echo shar: Will not clobber existing file \"'news.unnews.awk1'\"
else
echo shar: Extracting \"'news.unnews.awk1'\" \(2628 characters\)
sed "s/^X//" >'news.unnews.awk1' <<'END_OF_FILE'
X# this awk script will return only those lines which have 'part' near the
X# end of the line and which also have 2 numbers following the 'part' 
X# (separated by either 'of' or '/') where those 2 numbers are equal.
X# It also adds to the end of the line  the number of files making up
X# the set and the line number this line appears in the file.
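X# (An illustrative example, not taken from real input: given the line
X#	1234:Subject: v99i999: somepackage, Part 3/3
X# this script prints the line back with " 3" and its input line number
X# appended; a "Part 2/3" line, by contrast, prints nothing.)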
X/(part|Part|PART) *[0-9]+\/[0-9]+ *$/ {  
X	# match Part<num>/<num>$,  now get last number
X	#STRNG = "/[0-9]*"
X	# TEST1 = match( $NF, STRNG )
X	I = length ( $NF )
X	STRNG = $NF
X	while ( substr(STRNG,I,1) != "/" ) I--
X	LASTNUM = substr( $NF, I + 1 )
X	#TEST1 = match($NF, "[0-9]*")
X	I = 1
X	while ( substr(STRNG,I,1) != "/" ) I++
X	J = 0
X	found = 0
X	while (found == 0) {
X	   J = J + 1
X	   if ( substr(STRNG,J,1) == "0" ) found = 1
X	   else if ( substr(STRNG,J,1) == "1" ) found = 1
X	   else if ( substr(STRNG,J,1) == "2" ) found = 1
X	   else if ( substr(STRNG,J,1) == "3" ) found = 1
X	   else if ( substr(STRNG,J,1) == "4" ) found = 1
X	   else if ( substr(STRNG,J,1) == "5" ) found = 1
X	   else if ( substr(STRNG,J,1) == "6" ) found = 1
X	   else if ( substr(STRNG,J,1) == "7" ) found = 1
X	   else if ( substr(STRNG,J,1) == "8" ) found = 1
X	   else if ( substr(STRNG,J,1) == "9" ) found = 1
X	   endif endif endif endif endif endif endif endif endif endif
X	   }
X	FIRSTNUM =substr( $NF, J ,I - J )
X	if( FIRSTNUM == LASTNUM )
X	   print $0 " " LASTNUM " " NR
X	}
X
X/((part|Part|PART) *[0-9]+\/[0-9]+\) *$)/ {  
X	# match Part<num>/<num>$,  now get last number
X	#STRNG = "/[0-9]*"
X	# TEST1 = match( $NF, STRNG )
X	I = length ( $NF ) - 1
X	STRNG = $NF
X	while ( substr(STRNG,I,1) != "/" ) I--
X	LASTNUM = substr( $NF, I + 1, length($NF) - I - 1)
X	#TEST1 = match($NF, "[0-9]*")
X	I = 1
X	while ( substr(STRNG,I,1) != "/" ) I++
X	J = 0
X	found = 0
X	while (found == 0) {
X	   J = J + 1
X	   if ( substr(STRNG,J,1) == "0" ) found = 1
X	   else if ( substr(STRNG,J,1) == "1" ) found = 1
X	   else if ( substr(STRNG,J,1) == "2" ) found = 1
X	   else if ( substr(STRNG,J,1) == "3" ) found = 1
X	   else if ( substr(STRNG,J,1) == "4" ) found = 1
X	   else if ( substr(STRNG,J,1) == "5" ) found = 1
X	   else if ( substr(STRNG,J,1) == "6" ) found = 1
X	   else if ( substr(STRNG,J,1) == "7" ) found = 1
X	   else if ( substr(STRNG,J,1) == "8" ) found = 1
X	   else if ( substr(STRNG,J,1) == "9" ) found = 1
X	   endif endif endif endif endif endif endif endif endif endif
X	   }
X	FIRSTNUM =substr( $NF, J ,I - J )
X	if( FIRSTNUM == LASTNUM )
X	   print $0 " " LASTNUM " " NR
X	}
X
X/(part|Part|PART) +[0-9]+ +of +[0-9]+ *$/ {
X        # found Part<num> of <num>$, now get last number
X	if( $NF == $(NF-2) )
X	   print $0 " " $NF " " NR
X	}
X
END_OF_FILE
if test 2628 -ne `wc -c <'news.unnews.awk1'`; then
    echo shar: \"'news.unnews.awk1'\" unpacked with wrong size!
fi
# end of 'news.unnews.awk1'
fi
if test -f 'news.unnews.awk2' -a "${1}" != "-c" ; then 
  echo shar: Will not clobber existing file \"'news.unnews.awk2'\"
else
echo shar: Extracting \"'news.unnews.awk2'\" \(1148 characters\)
sed "s/^X//" >'news.unnews.awk2' <<'END_OF_FILE'
X# this awk script looks at lines which have 'part' near the
X# end of the line and which also have 2 numbers following the 'part' 
X# (separated by either 'of' or '/'), and returns the first of those 2 numbers.
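X# (An illustrative example, not taken from real input: for the line
X#	1234:Subject: v99i999: somepackage, part 2 of 3
X# this script prints "2".)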
X/(part|Part|PART) *[0-9]+\/[0-9]+[ |/)] *$/ {  
X	# match Part<num>/<num>$
X	I = length ( $NF )
X	STRNG = $NF
X	I = 1
X	while ( substr(STRNG,I,1) != "/" ) I++
X	J = 0
X	found = 0
X	while (found == 0) {
X	   J = J + 1
X	   if ( substr(STRNG,J,1) == "0" ) found = 1
X	   else if ( substr(STRNG,J,1) == "1" ) found = 1
X	   else if ( substr(STRNG,J,1) == "2" ) found = 1
X	   else if ( substr(STRNG,J,1) == "3" ) found = 1
X	   else if ( substr(STRNG,J,1) == "4" ) found = 1
X	   else if ( substr(STRNG,J,1) == "5" ) found = 1
X	   else if ( substr(STRNG,J,1) == "6" ) found = 1
X	   else if ( substr(STRNG,J,1) == "7" ) found = 1
X	   else if ( substr(STRNG,J,1) == "8" ) found = 1
X	   else if ( substr(STRNG,J,1) == "9" ) found = 1
X	   endif endif endif endif endif endif endif endif endif endif
X	   }
X	FIRSTNUM =substr( $NF, J ,I - J )
X	print FIRSTNUM
X	}
X
X/(part|Part|PART) +[0-9]+ +of +[0-9]+ *$/ {
X        # found Part<num> of <num>$, 
X	print $(NF-2)
X	}
X
END_OF_FILE
if test 1148 -ne `wc -c <'news.unnews.awk2'`; then
    echo shar: \"'news.unnews.awk2'\" unpacked with wrong size!
fi
# end of 'news.unnews.awk2'
fi
echo shar: End of shell archive.
exit 0