[comp.binaries.ibm.pc.d] Awk script updates SIMIBM.IDX from monthly update

ray@ole.UUCP (Ray Berry) (02/24/91)

   Over the past year I've probably downloaded the massive SIMIBM.ARC file
2-3 times in an attempt to keep a reasonably current list on hand.  Needless
to say, this is not at all efficient from the standpoint of net bandwidth.
OTOH, I've noticed that Keith has been pretty dependable lately about posting 
a monthly "update" file that specifies all the new entries on a per-calendar-
month basis.  So it seemed like a good idea to create a method for updating
my master SIMTEL list with these monthly update files.  The AWK script below
is my first cut at the problem.  I decided to use AWK rather than write a
C program because of the uncertainty of both the availability and format of
the files involved.

   Fortunately, both the master SIMTEL list and the updates are sorted, first
by directory and then by file name within each directory, so the job is
basically a simple merge of two sorted lists.
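
   A minimal sketch of that merge step on its own, assuming two plain sorted
text files and ignoring the Directory headers entirely.  The name
merge_sorted is made up for illustration, and this is not the script posted
below, just the bare idea behind it:

```shell
# merge_sorted A B: print the lines of two sorted files in sorted order.
# Hypothetical helper; both files are assumed to be sorted already.
merge_sorted() {
    awk -v other="$2" '
        BEGIN { have = ((getline x < other) > 0) }
        {
            # emit every line of B that sorts before the current line of A
            # (appending "" forces a string comparison)
            while (have && (x "") < ($0 "")) {
                print x
                have = ((getline x < other) > 0)
            }
            print
        }
        END {
            # A is exhausted; flush whatever is left of B
            while (have) {
                print x
                have = ((getline x < other) > 0)
            }
        }
    ' "$1"
}
```

   Something like "merge_sorted master.lst update.lst > new.lst" would then
produce the combined list (file names are placeholders).  Lines present in
both inputs come out twice.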

   Obviously, this script doesn't address the question of identifying files
that are deleted from SIMTEL, presumably because they are superseded by
newer versions.  Perhaps there could be some formalization of this update
process, and a method found to handle deletions as well as insertions. 
(Starting to sound like a job for a diff'ed 'ed' script...).

   The script was developed (in DOS) with Thompson Automation AwkPlus 
(formerly "PolyAwk").  Advisory messages to the "CON" device are nonstandard
and won't work on other AWKs.   Also, the 'ctime()' function in the BEGIN
block is nonstandard.  Sorry.  For DOS environments, MKS awk has other
ways of getting the date; OTOH, I don't know how to make MKS awk write to
stderr or the CON device.   At any rate, it should be very simple to adapt
the script to your particular environment. 
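
   For what it's worth, on a Unix-like system (an assumption; this is not
the DOS setup described above) the two nonstandard pieces might be replaced
roughly as follows.  The date comes from the external "date" command via
the standard "command | getline" form, and advisories go to stderr through a
pipe.  The function name "stamp" and the advisory text are made up:

```shell
# stamp FILE: print a "merged file" header line like the BEGIN block does,
# but take the date from the external date command instead of ctime().
stamp() {
    awk -v name="$1" 'BEGIN {
        "date" | getline now          # command | getline is standard awk
        close("date")
        print "merged file \"" name "\" on " now
        # advisory goes to stderr through a pipe rather than to "CON"
        print "adding directory (advisory)" | "cat 1>&2"
        close("cat 1>&2")
    }'
}
```

   Both tricks rely on awk being able to start a shell command, so they may
not carry over to every DOS awk.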

   Needless to say, corrections and/or improvements would be welcomed.

	Ray Berry
---SNIP----
#  This awk script is intended to merge the monthly SIMTEL updates into the
#  master SIMTEL index list.  Both lists are assumed to be sorted, both in 
#  terms of the directory names, as well as the file lists for each directory.

#  usage :  awk -f this_script update_file_name > new_master_file

#  No attempt is made to identify/delete older versions of archive entries
#  as newer versions are introduced.  Matching the leading alpha portion
#  of filename entries isn't enough- too many false alarms are produced.

#  When new directory names are encountered, an advisory message is written
#  to the DOS CONsole.  When update lines are encountered that match lines
#  already in the master file, a warning message is printed.

#  original author: Ray Berry - uucp:...sumax!ole!ray; CS:73407,3152 2/23/91

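function print_thru_blank() {
	# copy update-file lines through the next blank line; stop at end
	# of file too, so a missing trailing blank line cannot loop forever
	while ((getline) > 0) {
		print;
		if ($0 == "")
			break;
	}
}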

BEGIN {	INDEX = "simtel.lst"   # whatever you call your master catalog file

	# document update in new master list
	print "merged file \"" ARGV[1] "\" on " ctime();
}
 	

$1=="Directory"	{  
	update_dir = $2;

	for(;;)	{

		if (index(last_dir, update_dir)) {
			break;
		}

		# seek the next directory header in the master file
		for (;;) {
			if ( (getline < INDEX) <= 0 )	{
				# a new directory name sorts behind the
				# last name currently in the master list. 
				print "eof on " INDEX >"CON";
				# copy the new directory to the output
				print "";
				print "Directory " update_dir
				print "adding directory " update_dir > "CON"
				print_thru_blank();
				next;
			}
			if ($1 == "Directory" )
				break;
			print;
		} 

		# check to see if update directory is not in master file
		if (update_dir < $2 ) {
			# copy the update directory data to the index file
			last_dir = $0;	# save listfile directory name
			print "Directory " update_dir;
			print "adding directory " update_dir > "CON"
			print_thru_blank();
			print last_dir;		# print old directory name
			next;
		}

		print;		# the Directory line
		if ( update_dir == $2 ) {
			break;
		}
	}

	# merge the file listings from main list & update file
	getline;
	getline x < INDEX
	for (;;) {
		if (x == $0)
			print "warning- duplicate lines!" >"CON";
		if (x < $0) {	
			print x;
			if ( (getline x < INDEX) <= 0 ) {
				# master exhausted: copy remainder of 'new' items
				print "eof on "INDEX > "CON"
				print;
				print_thru_blank();
				next;	# later update directories still get appended
			}
			if (x=="") {
				# read remainder of 'new' items
				print;
				print_thru_blank();
				next;
			}
		}
		else {
			print;
			getline;
			if (!NF) {
				print x;
				next;	#remainder of this directory list
					#gets printed in next Directory search
			}
		}
	}
}

END {		# output remainder of INDEX file
	while ( (getline < INDEX) > 0 )
		print;
}
----SNIP----
-- 
Ray Berry  kb7ht  uucp: ...sumax!ole!ray CIS: 73407,3152 /* "inquire within" */

raymond@math.berkeley.edu (Raymond Chen) (02/24/91)

If there is interest, I could post `monthly updates' of SIMTEL20 in
the following form:

 PD1:<MSDOS.WHATEVER>
 Filename    Type Length  Date    Description
 ==============================================================================
-FOO11.ZIP     B    1234  900101  Do something weird, version 1.1
+FOO11.ZIP     B    5678  910101  Do something weird, version 1.2

where the `-' indicates files that have been deleted and the `+'
indicates files that have been added.
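
A rough sketch of how such a listing could be applied to one directory's
file list, assuming the text after the `+' or `-' matches the master line
exactly and the `+' lines arrive already sorted.  The name apply_update and
its two arguments are hypothetical, and lines that start with anything else
(the directory name and column headers) are simply ignored:

```shell
# apply_update MASTER UPDATE: drop the `-' entries, merge in the `+' entries.
# Hypothetical helper for a single directory's sorted file list.
apply_update() {
    awk -v upd="$2" '
        BEGIN {
            # read the whole update first: deletions into a set,
            # additions into an ordered list
            while ((getline u < upd) > 0) {
                tag = substr(u, 1, 1)
                body = substr(u, 2)
                if (tag == "-") del[body] = 1
                else if (tag == "+") add[++n] = body
            }
            close(upd)
            i = 1
        }
        ($0 in del) { next }          # deleted entry: do not copy it
        {
            # emit additions that sort before the current master line
            while (i <= n && (add[i] "") < ($0 "")) print add[i++]
            print
        }
        END { while (i <= n) print add[i++] }   # additions past the end
    ' "$1"
}
```

A deletion whose text does not match any master line is silently skipped
here; a real tool would probably want to warn about that.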

Note also that I posted several months ago a package of perl scripts
that automatically incorporate Keith's monthly updates into a
locally-maintained copy of the SIMTEL20 index.  These are the same
scripts that are used at the math.princeton.edu server.

I will mail the scripts to any interested parties.  (As Ray Berry pointed
out, they really aren't very difficult scripts to write, since Keith
Petersen did all of the hard work.)
--
raymond@math.berkeley.edu
Your friendly comp.sys.ibm.pc.misc archives administrator.