[comp.lang.perl] buffering and a news archiver

rae%alias@csri.toronto.edu (Reid Ellis) (03/27/90)

A problem and a program.

The problem -

    I have a program that archives all of the source groups for me at
    night [comp.sources.*] which uses an 'rc' file in the home
    directory.  The program is called 'narc' and the rc file is named
    '.narcrc' and everytime I run narc, which is a perl script, I get
    8k of NULs appended to the end of my '.narcrc' [which JUST SO
    HAPPENS to be == BUFSIZ]..

The program -

    To help people find my problem :) here is my news archival
    program, written in Perl.

#!/bin/sh
# This is a shell archive, meaning:
# 1. Remove everything above the #!/bin/sh line.
# 2. Save the resulting text in a file.
# 3. Execute the file with /bin/sh (not csh) to create the files:
#	narc.pl
# This archive created: Mon Mar 26 15:46:44 1990
# By:	Reid Ellis (Alias Research)
export PATH; PATH=/bin:$PATH
if test -f 'narc.pl'
then
	echo shar: over-writing existing file "'narc.pl'"
fi
cat << \SHAR_EOF > 'narc.pl'
#!/usr/local/bin/perl
#
# $Source: /repousr/u/rae/src/TEXT/perl/RCS/narc.pl,v $
# $Revision: 2.0 $
# $Author: rae $
# $Date: 90/03/25 15:43:00 $

# narc -- News ARChiver

# to avoid the buffering problems
#
$| = 0;

# Check for a NARCRC environment variable and use it if
# it's there.
#
if($ENV{'NARCRC'})
	{
	$narcrc = $ENV{'NARCRC'};

	# if NARCRC doesn't start with a '/', prepend
	# $HOME to it
	#
	if($narcrc !~ /\/.*/)
		{
		$narcrc = $ENV{"HOME"} . "/" . $narcrc;
		}
	}
else
	{
	# Default NARCRC: "$HOME/.narcrc"
	#
	$narcrc = $ENV{'HOME'} . "/.narcrc";
	}

# testing flag. Usually zero.
#
$verbose=0;
$standalone = "stand";

# get the name of the news server.  Can be replaced with
# a hardcoded '$server=machine;' if you like
#
$server_file = "/usr/local/lib/news/server";
$server = `cat $server_file` || die "Can't open $server_file";
chop $server;

# Where to find nntp
#
$nntp = "/usr/local/bin/nntp $server";

# open narcrc for input.  The format is:
#
#	/full/path/to/archive/directory
#	dirname   newsgroup   last-time-of-access
#	dirname   newsgroup   last-time-of-access
#	dirname   newsgroup   last-time-of-access
#	 ...
#
# 'dirname' is appended to the initial path.  The special
# case dirname of 'default' will use the last element of
# the newsgroup name.  For example, 
#	default  comp.sources.unix  19900101 000000
# will use the dirname 'unix'

open(NARCRC, "<$narcrc");
$SIG{'INT'} = 'handler';
$SIG{'QUIT'} = 'handler';

# The first line of the narcrc is where to store everythign
#
$repo = <NARCRC>;
chop($repo);
print "Saving files in $repo\n" if $verbose;

# $narc_line[$narc_index] is where each new line of the narcrc is
# stored .  This only gets written out after a successful 'narc'.
#
$narc_index = 1;

# for each line in the narcrc
#	chop it into separate bits
#	ask nntp for all new news for the group
#	for each new news item
#		read article into /tmp/narc$$
#		scan for magic string 'Archive-name'
#		get the subdir and part-name from this line
#	if hit eof, then no magic string was found, do next news item
#	create all leading directories if they're not there already
#	move /tmp/narc$$ to full pathname of archived article
while($narc_entry = <NARCRC>)
	{
	chop($narc_entry);

	# This is a software end-of-file because I am having trouble with
	# garbage showing up at the ends of file.  With shar files it's
	# not such a big deal, since they usually have 'exit 0' at the end
	# of them, but with the .narcrc, it might be accidentally parsed.
	#
	last if($narc_entry eq "EOF");

	($real_dirname, $newsgroup, $lastdate, $lasttime) = split(' ', $narc_entry);

	# check for degenerate empty case
	#	
	last if $newsgroup eq "" || $lastdate eq "" || $lasttime eq "";

	# if "default", set dirname to last element
	# of the newsgroup name
	#
	if($real_dirname eq "default" && $newsgroup =~ /.*\.([^.]*)$/)
		{
		$dirname = $1;
		}
	else
		{
		$dirname = $real_dirname;
		}

	# tell the user what's going on
	#
	"$lastdate $lasttime" =~ /^(..)(..)(..) (..)(..)(..)$/ &&
		printf("%s %s 19%s-%s-%s %s:%s\n",
			$newsgroup, "[$dirname]", $1, $2, $3, $4, $5);

	# open the stdout of the nntp command
	#
	open(MSGIDS, "$nntp newnews $newsgroup $lastdate $lasttime|")
		|| die "Can't get message ID's for $newsgroup from $server";

	# flag to see if we do anything
	$we_did_something = 0;

	# for each message ID..
	#
	while($msgid = <MSGIDS>)
		{
		# set the we_did_something flag so we know to update
		# the narcrc
		#
		$we_did_something = 1;

		# Zap the trailing newline on $msgid
		#
		chop($msgid);
		print "\t$msgid\n" if $verbose;

		# Get the article and store it in /tmp so we don't
		# need to use multiple [slow] nntp's for the same article
		#
		system("$nntp article '$msgid' > /tmp/narc$$");

		# Open the article and check for "Archive-name"
		#
 		open(BODY, "</tmp/narc$$") || die "Can't read retrieved article $msgid in file /tmp/narc$$";
		while(<BODY>)
			{
			if(/^Archive-name:/)
				{
				($foo, $arc_name) = split;
				@tree = split('/', $arc_name);
				# Put it in the '$standalone' dir if it's a single shar
				unshift(tree, $standalone)	if $#tree == 0;
				print "\t", join('/', @tree), "\n";
				last;
				} #if Archive-name
			} # while BODY

		# if eof then "Archive-name" wasn't found
		#
		last if eof;

		# close the file
		#
		close(<BODY>);

		# Now build the full pathname of where the file is
		# supposed to live
		$fullpath = $repo . '/' . $dirname . '/' . join('/', @tree);
		$name = pop(@tree);
		$relpath = $dirname . '/' . join('/', @tree);

		# make sure the dirs are there
		#
		&mkpath($repo . '/' . $dirname . '/' . join('/', @tree));

		if(-f $fullpath)
			{
			# A copy of the file already exists, so name this one the
			# same as the existing one, but with a number appended to
			# it to delineate its version.
			#
			for($i=2; -f $fullpath . $i; $i++)
				{ }

			# tack the number onto the end of the file name
			#
			$fullpath .= $i;
			}

		# and finally, put it there
		system("mv /tmp/narc$$ $fullpath");

		} # while MSGIDS
	close(<MSGIDS>);

	# only update the timestamp if we did anything
	#
	$timestamp = $we_did_something ? &now() : "$lastdate $lasttime";

	# Save the narcrc lines in memory in order so we can save them in
	# their original order.  Just using 'each' on %NARCRC currently
	# inverts their order [as of Perl 3.0]
	#
	$narc_line[$narc_index++] = sprintf("%s\t%s\t%s", $real_dirname, $newsgroup, $timestamp);
	} # while narc_entry
close(<NARCRC>);

# Save an updated narcrc
#
print "Saving $narcrc..";
open(NARCRC, ">$narcrc");
print NARCRC $repo, "\n";
for($i=1; $i < $narc_index; $i++)
	{
	print NARCRC $narc_line[$i], "\n";
	}
# software EOF
print NARCRC "EOF\n";
close(<NARCRC>);
print "\n";

# Subroutine to return the current date and time in
# nntp format, which is "YYMMDD HHMMSS"
#
sub now
	{
	($sec,$min,$hour,$mday,$mon,$year,$wday,$yday,$isdst) = localtime(time());
	$mon++;
	sprintf("%02d%02d%02d %02d%02d%02d", $year,$mon,$mday,$hour,$min,$sec);
	}

sub handler
	{
	local($sig) = @_;
	print "\n\nCaught signal $sig -- exiting $ARGV[0]\n";
	close(<ARTICLE>);
	close(<BODY>);
	close(<MSGIDS>);
	close(<NARCRC>);
	close(<SAVEFILE>);
	exit 0;
	}

sub mkpath
	{
	local($fullpath) = @_;
	local(@tree) = split('/', $fullpath);
	local(@tmp) = ();

	if($tree[0] eq "")
		{
		push(@tmp, shift(@tree));
		}
	foreach $dir (@tree)
		{
		push(@tmp, $dir);
		$path = join('/', @tmp);
		print "\t%% checking for '$path'\n" if $verbose;
		mkdir("$path", 0777) || die "Can't create directory $path"
			if ! -d "$path";
		}
	}	

# Emacs cutsomisation
#
# Local Variables:
# mode:fundamental
# tab-width:4
# End:
SHAR_EOF
chmod +x 'narc.pl'
#	End of shell archive
exit 0
Reid Ellis  264 Broadway Avenue, Toronto ON, M4P 1V9               Canada
rae@gpu.utcs.toronto.edu || rae%alias@csri.toronto.edu || +1 416 487 1383

piet@cs.ruu.nl (Piet van Oostrum) (03/27/90)

In article <814@alias.UUCP>, rae%alias (Reid Ellis) writes:
 `A problem and a program.
 `
 `The problem -
 `
 `    I have a program that archives all of the source groups for me at
 `    night [comp.sources.*] which uses an 'rc' file in the home
 `    directory.  The program is called 'narc' and the rc file is named
 `    '.narcrc' and everytime I run narc, which is a perl script, I get
 `    8k of NULs appended to the end of my '.narcrc' [which JUST SO
 `    HAPPENS to be == BUFSIZ]..
 `
 `		close(<BODY>);
 `
 `	close(<MSGIDS>);
 `
 `close(<NARCRC>);
 `
 `close(<NARCRC>);
 `	close(<ARTICLE>);
 `	close(<BODY>);
 `	close(<MSGIDS>);
 `	close(<NARCRC>);
 `	close(<SAVEFILE>);

All these closes are wrong.
You have to say close (FILEHANDLE), not close (<FILEHANDLE>).

What happens in the case of NARCRC, is that you read an extra line
<NARCRC>, which apparently reads a new buffer into memory. The buffer is
then written out when perl stops. The resulting string is used as an
argument for close. I would say that close should accept only filehandles
as parameters, not strings. I have no idea what close does with a string
parameters. Maybe it is an oversight in perl to accept this. This may well
depend on your particular stdio implementation.
-- 
Piet* van Oostrum, Dept of Computer Science, Utrecht University,
Padualaan 14, P.O. Box 80.089, 3508 TB Utrecht, The Netherlands.
Telephone: +31-30-531806   Uucp:   uunet!mcsun!ruuinf!piet
Telefax:   +31-30-513791   Internet:  piet@cs.ruu.nl   (*`Pete')

lwall@jpl-devvax.JPL.NASA.GOV (Larry Wall) (03/28/90)

In article <2739@ruuinf.cs.ruu.nl> piet@cs.ruu.nl (Piet van Oostrum) writes:
: In article <814@alias.UUCP>, rae%alias (Reid Ellis) writes:
:  `A problem and a program.
:  `
:  `The problem -
:  `
:  `    I have a program that archives all of the source groups for me at
:  `    night [comp.sources.*] which uses an 'rc' file in the home
:  `    directory.  The program is called 'narc' and the rc file is named
:  `    '.narcrc' and everytime I run narc, which is a perl script, I get
:  `    8k of NULs appended to the end of my '.narcrc' [which JUST SO
:  `    HAPPENS to be == BUFSIZ]..
:  `
:  `		close(<BODY>);
:  `
:  `	close(<MSGIDS>);
:  `
:  `close(<NARCRC>);
:  `
:  `close(<NARCRC>);
:  `	close(<ARTICLE>);
:  `	close(<BODY>);
:  `	close(<MSGIDS>);
:  `	close(<NARCRC>);
:  `	close(<SAVEFILE>);
: 
: All these closes are wrong.
: You have to say close (FILEHANDLE), not close (<FILEHANDLE>).

Use of perl -w would have spotlighted this.

: What happens in the case of NARCRC, is that you read an extra line
: <NARCRC>, which apparently reads a new buffer into memory. The buffer is
: then written out when perl stops. The resulting string is used as an
: argument for close. I would say that close should accept only filehandles
: as parameters, not strings. I have no idea what close does with a string
: parameters. Maybe it is an oversight in perl to accept this. This may well
: depend on your particular stdio implementation.

To quote the manual under close:
    FILEHANDLE may be an expression whose value gives the real filehandle name.

This is to support indirect filehandles.  If somebody wants the expression
<NARCRC> to return the next filehandle from a file, that's up to them.

I guess it's a case of "Live Free AND Die".

Use -w.

Larry

rae%alias@csri.toronto.edu (Reid Ellis) (03/28/90)

piet@cs.ruu.nl (Piet van Oostrum) writes:
|All these closes are wrong.
|You have to say close (FILEHANDLE), not close (<FILEHANDLE>).

Oops, you're quite right.  I made the change and it works just fine
now.  Thanks muchly, and if anyone actually finds that script useful,
send me mail with your suggestions.  Note that it handles n-deep
"Archive-name" headers.

					Reid
Reid Ellis  264 Broadway Avenue, Toronto ON, M4P 1V9               Canada
rae@gpu.utcs.toronto.edu || rae%alias@csri.toronto.edu || +1 416 487 1383