[comp.lang.perl] Needed: a pointer for a perl compare script

djo7613@hardy.u.washington.edu (Dick O'Connor) (11/30/90)

I've been following this group for awhile, cutting and saving sample scripts
like my Mom clips recipes, but I still haven't picked up enough pointers to
migrate one of my old Fortran "compare" programs to perl.   If I could do
this, the users could make their own runs and I'd be free to contemplate a
higher order of existence (programming!  :)

My program reads two files of differing format which are sorted by a unique
5-character label.  When two labels match, a new record is written, with
info from file A (moved around a bit) written to the "left" and info from
file B (again, reformatted a little) written to the "right".  Where a 
given record from file A or B has no counterpart, the same new record is
written, with blanks on the "side" without counterpart information.

I know this is simple; it's a short program now.  But there's something
I'm just not seeing that blocks my conversion to perl.  A pointer to a  
suggested construct would be wonderful...I'm happy to work out the details.

BTW I've been away for two weeks; did I miss The Announcement about The Book?

"Moby" Dick O'Connor                         djo7613@u.washington.edu 
Washington Department of Fisheries           *I brake for salmonids* 

merlyn@iwarp.intel.com (Randal Schwartz) (12/01/90)

In article <12020@milton.u.washington.edu>, djo7613@hardy (Dick O'Connor) writes:
| My program reads two files of differing format which are sorted by a unique
| 5-character label.  When two labels match, a new record is written, with
| info from file A (moved around a bit) written to the "left" and info from
| file B (again, reformatted a little) written to the "right".  Where a 
| given record from file A or B has no counterpart, the same new record is
| written, with blanks on the "side" without counterpart information.
| 
| I know this is simple; it's a short program now.  But there's something
| I'm just not seeing that blocks my conversion to perl.  A pointer to a  
| suggested construct would be wonderful...I'm happy to work out the details.


Hmm... (warning... untested code follows)...

If I could fit all of file A and B into memory (my preferred tactic), I'd
do something like this:

open(A,"Afile");
while(<A>) {
	chop;
	($label,$rest) = unpack("a5a*",$_);
	$a{$label} = $rest;
	$both{$label}++;
}
close(A);
open(B,"Bfile");
while(<B>) {
	chop;
	($label,$rest) = unpack("a5a*",$_);
	$b{$label} = $rest;
	$both{$label}++;
}
close(B);
for (sort keys both) {
	$left = defined($a{$_}) ? $a{$_} : "left default";
	$right = defined($b{$_}) ? $b{$_} : "right default";
	print "$_ $left $right\n";
}

This'd probably take some massaging, but I hope you get the general
idea.  If they don't both fit into memory, you will have to do some
juggling to read from a or b depending on which of the current labels
are lower.  This solution here is much simpler (and elegant :-) by
comparison.

| BTW I've been away for two weeks; did I miss The Announcement about The Book?

The only announcement you may have missed is that Larry and I are
working intensely to incorporate the review comments and finish up the
final draft so that we can still make the Usenix deadline.

print "Just another Perl [book] hacker,"
-- 
/=Randal L. Schwartz, Stonehenge Consulting Services (503)777-0095 ==========\
| on contract to Intel's iWarp project, Beaverton, Oregon, USA, Sol III      |
| merlyn@iwarp.intel.com ...!any-MX-mailer-like-uunet!iwarp.intel.com!merlyn |
\=Cute Quote: "Intel: putting the 'backward' in 'backward compatible'..."====/

les@chinet.chi.il.us (Leslie Mikesell) (12/05/90)

In article <12020@milton.u.washington.edu> djo7613@hardy.u.washington.edu.acs.washington.edu (Dick O'Connor) writes:
>I've been following this group for awhile, cutting and saving sample scripts
>like my Mom clips recipes, but I still haven't picked up enough pointers to
>migrate one of my old Fortran "compare" programs to perl.   If I could do
>this, the users could make their own runs and I'd be free to contemplate a
>higher order of existence (programming!  :)

>My program reads two files of differing format which are sorted by a unique
>5-character label.  When two labels match, a new record is written, with
>info from file A (moved around a bit) written to the "left" and info from
>file B (again, reformatted a little) written to the "right".  Where a 
>given record from file A or B has no counterpart, the same new record is
>written, with blanks on the "side" without counterpart information.

Perl is the language of choice for this kind of thing but it may still
turn out to be non-trivial.  It is also fairly hard to describe
so examples are generally needed.  The merging subroutine is > half the
file so I'll just include the whole thing.  The concept here is to store
old and new items into different associative arrays, sort the keys,
then make the comparison from the top of each list.

Here is a sample that takes a stream that looks like this from a
legislative database:
NOXIOUS WEEDS, PEST ERADICATION - 1.2.5

VT H 2        AUTHOR:     ...
              TOPIC:      ... 
              SUBTOPIC:   ...

              SUMMARY:
              ........
              ........
 
              STATUS:
              .......
              .......

VT H 5        AUTHOR:
etc...

and files items under /dir/state/number, where state is taken from the
first 2 characters of the bill id, and number is last portion of the
topic line.  Within the file, items are sorted by their bill id with
an additional header added to note the date and whether the bill has
been signed.
A subsequent entry (possibly an update) with the same bill id will be
merged by extracting the SUMMARY: portion of the old entry and stuffing
it into the new data which will contain the current status.
The merging portion is done in the writestate subroutine.

Les Mikesell
  les@chinet.chi.il.us
#----------------
# merge.pl
# put legislative info into files:
#  1 directory per state, 1 file per topic
#  merge w/current - if new includes summary, use it,
#    else snarf summary from old
#  collect one state from current input - then read current info & merge
#
#  top dir of tree:
$dir = './test';
open (ERR,">>errlog") ;
#
%nitems=();
$haveitem = 0;
$havetopic = ""; 
$havestate = "";
$havesum = 0;
$instatus = 0;
($sec,$min,$hour,$mday,$mon,$year,$wday,$yday,$isdst)=localtime(time);
$mon += 1;
$signed = 'n';
while (<>) { # read input file(s)
    # strip page breaks and some other junk
    if (/              Page/) { next;}
    if ( //)    { next;}   # these are always blank lines
    if ( /^ *\f/) { next;}   # form feeds - ignore
    if ( /^ *$/) {
        $bl=1;
        next;            # blank lines
    }
    unless (/\n$/)  {    # possible junk at EOF
        if ($instatus) { # past a complete item
            do saveitem();
            do writestate();
        } else {         # incomplete item, junk it
            $haveitem = 0;
            $havestate = "";
            $instatus = 0;
            $havesum = 0;
        }
        next;
    } 
# find:  AGRICULTURAL LOANS - 1.4.1
    if ( /^[A-Z].* \- [0-9][0-9]*\.[0-9][0-9]*/) { # new topic
        $topic=$_;
#        print "$topic \n";
        $topic =~ s/.* \- // ;
#        print "$topic \n";
        $topic =~ s/[^0-9]*$// ;
#        print "$topic \n";
        if ( $haveitem != 0 ) {
            do saveitem();      # save current entry
            do writestate();    # write current state
        }
        $havetopic = $topic;
        print "Topic = $topic\n";
        next;
    }
    # if start of new item, save last in %nitems
    # if start of new state, merge last state
    if (/^[A-Z][A-Z].*AUTHOR:/) { #new item
        if ($haveitem)  {do saveitem();} 
        /^([A-Z][A-Z])  *([0-9A-Za-z][0-9A-Za-z]*)  *([0-9].*)AUTHOR:/;
        if (length($1) != 2) {
            print "**** BILL ID ERROR in $_ **** \n";
            print ERR "**** BILL ID ERROR in $_ **** \n";
            next;
        }
        $sortid=sprintf("%2.2s %s %s",$1,$2,$3);
        $sortid =~ s/ *$// ;
        $state=substr($_,0,2);
        $state =~ y/A-Z/a-z/;
        if ($havestate ne "" && $havestate ne $state) {
#            $save = $_ ;
            do writestate();
#            print "Input line: was $save\n now $_\n";
        }
        $id = substr($_,0,14);
        $id =~ s/ *$//;  #trim trailing space
        $havestate=$state;
        $_ =~ s/.*AUTHOR:/AUTHOR:/ ;
        $_ =~ s/ *$//;
        $item = $_ ;
        #print "New: Itemid = $id $sortid\n";
        $haveitem = 1;
        next;
    }
    
    if (/SUMMARY:/) { $havesum = 1;}
    if ($haveitem == 0 ) { next;}
    if (/STATUS:/) { $instatus = 1;}
    if ($instatus ) {
        if ( /RATIFIED/) { $signed = 's';}  #NC variation
        if ( /[Ss]igned/) { $signed = 's';}
    }    
    if (/END OF REPORT/) {
        do saveitem();
        do writestate();
        $havetopic="";
        next;
    }
    $_ =~ s/              // ;  # strip right indent
    $item .= $_ ;               # append current line to item
}

do saveitem();
do writestate();
exit;

# save item to %nitems array w/sortid as key
# add header line of:
# >smmddyy
# where s = s or n (signed or not)
# m = month
# d = day  date item is written to database (now)
# y = year (2 digits)
# if duplicate keep one with summary field

sub saveitem  {
    if ($haveitem = 0 )  {return; }
    if ($havetopic eq "" ) {return; }
    $haveitem = 0;
    $instatus = 0;
    if ($havesum == 0) { # no summary, check for alternate
        if ($nitems{$sortid} && $nitems{$sortid} =~ /SUMMARY:/) {
            return;
        }
    }
    $nitems{$sortid} = sprintf (">%s%02d%02d%02d\n%s%s\n%s",$signed,$mon,$mday,$year,"BILL ID:    ",$id,$item) ;
    $signed = 'n';
    $havesum = 0;
    return;
}

# load any current items in for merging
sub writestate  {
    local ($_) ;      # important to not alter upper $_
    local ($*) = 1 ;  # multi-line match needed
    %oitems=();
    @olist=();
    # sanity check - may not have any new input 
    if ($havetopic eq "" ) {return; }
    if ($havestate eq "" ) {return; }
    $nname = sprintf ("%s/%s/%s",$dir,$havestate,$havetopic);
    print "Loading $nname \n" ;
    if (open (IN,"<$nname")) {
# read in old items keeping old date 
        $wsortid="";
        $item="";
        while (<IN>) {
            if (/^>/) {  # added header line 
                if ($wsortid ne "") {
                    $oitems{$wsortid} = $item; #store previous item
                    $wsortid = "";
                    $item="";                  # start new one
                }
            }
            $item .= $_ ;  # collect lines of item
            # normalize key to match original input
            if (/^BILL ID:    (..)  *([0-9A-Za-z][0-9A-Za-z]*)  *([0-9].*)/) {
                $wsortid=sprintf("%2.2s %s %s",$1,$2,$3);
                #print "Old id: $wsortid\n";
            }
        }
        if ($wsortid ne "") {
            $oitems{$wsortid} = $item; # save the last one
        }
        close(IN);
        @olist = sort (keys(%oitems)); # sort the old keys
        $howmany = $#olist +1;
        print "$howmany old bills\n" ;
    }
    @nlist = sort(keys(%nitems));  # sort the new keys
    $howmany = $#nlist +1;
    print "$howmany updates\n" ;
    #now merge the lists and write out
    print "Writing $nname \n" ;
    if ($nname ne $lname ) {
        close OUT;
        unless (open (OUT,">$nname"))  {
            $dirname = sprintf ("%s/%s",$dir,$havestate);
            printf "Creating $dirname\n";
            mkdir ($dirname,0777);
             open (OUT,">$nname") || die "Can't open $nname";
            $lname = $nname ;
        }
    }
    #print "@olist\n";
    #print "@nlist\n";
    $oldid=shift(@olist);  # start with top two keys
    $newid=shift(@nlist);
    $current = "" ;        # sanity check 
    while ( $oldid && $newid ) {  # compare and merge
        #print " oldid = $oldid  newid = $newid\n";
        if ($current ge $oldid || $current ge $newid) {
            print "***** MERGE ERROR at $current *****\n";
            print ERR "***** MERGE ERROR at $current *****\n";
        }
        if ($oldid eq $newid ) {  # merge summary w/new
            # if anything beyond date is changed use new
            # this keeps the old date on duplicates
            if ( ( ($t1) = $oitems{$oldid} =~ /BILL ID:([^\0]*)/ ) &&
                 ( ($t2) = $nitems{$newid} =~ /BILL ID:([^\0]*)/ ) &&
                 ($t1 eq $t2)) {
                print "Match: unchanged using OLD $oldid\n";
                print OUT $oitems{$oldid} ;
                $current = $oldid ;
            } else {
                print "Match: using NEW $newid \n";
                if ($nitems{$newid} =~ /SUMMARY:/) { #new has summary, toss old
                    print OUT $nitems{$newid} ;
                    $current = $newid ;
                    print "NEW has summary\n";
                } else {
                    # snarf summary from old - note multi-line wierdness
                    if (($status) = $oitems{$oldid} =~ /(^SUMMARY:\n[^\0]*)^STATUS:/) {
                        # and insert into new - that was easy...
                        substr($nitems{$newid},index($nitems{$newid},"STATUS:\n"),0) = $status ;
                    }
                    printf OUT  $nitems{$newid} ;
                    $current = $newid ;
                    print "OLD has summary\n";
                }
            }
            # this was a match, shift both lists to next item
            $oldid=shift(@olist);
            $newid=shift(@nlist);
            next;
        }
        # not a match, use alphabetically first item
        if ($oldid lt $newid ) {
            print "using OLD $oldid \n";
            print OUT $oitems{$oldid} ;
            $current = $oldid ;
            $oldid = shift(@olist);
            next;
        }
        # newid must be > oldid
        print OUT $nitems{$newid} ;
        $current = $newid ;
        print "using NEW $newid \n";
        $newid = shift(@nlist);
        next;
    }
    # one of the arrays is empty - write remaining part of other array
    if ($oldid) {
        print OUT $oitems{$oldid} ;
        print "using OLD $oldid \n";
        foreach $oldid (@olist) {
                print OUT $oitems{$oldid} ;
                print "using OLD $oldid \n";
        }
    }
    if ($newid) {
        print OUT $nitems{$newid} ;
        print "using NEW $newid \n";
        foreach $newid (@nlist) {
            print OUT $nitems{$newid} ;
            print "using NEW $newid \n";
        }
    }
    undef %nitems; #left over from trying to pin down a memory leak
    undef %oitems; # in an old version of perl
    %nitems=();
    %oitems=();
    $havestate="";
    $haveitem=0;
}

goer@quads.uchicago.edu (Richard L. Goerwitz) (12/05/90)

In article <1990Dec04.230436.8432@chinet.chi.il.us>
les@chinet.chi.il.us (Leslie Mikesell) writes:
>
>>My program reads two files of differing format which are sorted by a unique
>>5-character label.  When two labels match, a new record is written, with
>>info from file A (moved around a bit) written to the "left" and info from
>>file B (again, reformatted a little) written to the "right".  Where a 
>>given record from file A or B has no counterpart, the same new record is
>>written, with blanks on the "side" without counterpart information.
>
>Perl is the language of choice for this kind of thing but it may still
>turn out to be non-trivial....

I like reading this newsgroup, but this sort of statement comes up all too
often.  Perl is not the only language around that is optimized for file,
string, and symbol processing, which has associative arrays, and handles
sorting and printing elegantly.  If you can't think of any examples off-
hand then mail me, and I'll be glad to provide you with a few.  This is not
to say that we should not use perl.  It is to say simply that it's a bit
outlandish to call it "*the* language of choice" for tasks like the one
described above.

-Richard (goer@sophist.uchicago.edu)

chip@tct.uucp (Chip Salzenberg) (12/07/90)

According to goer@quads.uchicago.edu (Richard L. Goerwitz):
>Perl is not the only language around that is optimized for file,
>string, and symbol processing, which has associative arrays, and handles
>sorting and printing elegantly.  If you can't think of any examples off-
>hand then mail me, and I'll be glad to provide you with a few.

Come now, Richard.  If you criticize in public, you must put up your
facts in public.  Name these other languages.  Oh yes, and please
include availability and cost information.
-- 
Chip Salzenberg at Teltronics/TCT     <chip@tct.uucp>, <uunet!pdn!tct!chip>
      "I'm really sorry I feel this need to insult some people..."
            -- John F. Haugh II    (He thinks HE'S sorry?)

goer@quads.uchicago.edu (Richard L. Goerwitz) (12/07/90)

In article <275E7B47.2EB9@tct.uucp> chip@tct.uucp (Chip Salzenberg) writes:
>According to goer@quads.uchicago.edu (Richard L. Goerwitz):
>>Perl is not the only language around that is optimized for file,
>>string, and symbol processing, which has associative arrays, and handles
>>sorting and printing elegantly.  If you can't think of any examples off-
>>hand then mail me, and I'll be glad to provide you with a few.
>
>Come now, Richard.  If you criticize in public, you must put up your
>facts in public.  Name these other languages.  Oh yes, and please
>include availability and cost information.

Please, not one of those "come now" responses :-).  I felt it completely
inappropriate to go into language comparisons here.  I assumed that most
readers would know of other alternatives, and that my posting would serve
merely as a reminder not to get too outlandish in our claims about perl.
If you are going to press me, I'll gladly offer you a brief response re-
garding alternatives I had in mind:

In the case mentioned, I didn't see how perl offered distinct advantages
over nawk.  Nawk has most of the traits mentioned above, and is much more
widely available.  If you would like a good example of a language that has
all of the characteristics noted above, then I'd suggest you look at Icon.
Icon is a general purpose programming language with Snobolish string-
handling capabilities, automatic type conversions, associative arrays,
and so on.  Icon would have been just as easily applied to the particular
problem at hand as perl.

As for cost and availability, Icon is supported by government grant, and
has traditionally been PD.  Despite its PD status, Icon is available
through a very fine distribution system, and the software itself is much
less buggy and much more stable than perl's.  Icon is fully documented
in _The Icon Programming Language_ by Griswold & Griswold (2nd ed.; Prent-
ice Hall).  It is available, not only for Unix, MS-DOS, and the Mac (as
in perl's case), but is also available for VM/CMS, Ultrix, MVS/XA, VMS,
Mach, AEGIS, OS/2, Amiga DOS, and probably others I haven't thought of.
You can ftp it from many sites, probably the most accessible being cs.
arizona.edu.

This posting is not intended as an argument against using perl, by the
way.

-Richard (goer@sophist.uchicago.edu)

kaul@icarus.eng.ohio-state.edu (Rich Kaul) (12/08/90)

In article <1990Dec7.083412.8426@midway.uchicago.edu> goer@quads.uchicago.edu (Richard L. Goerwitz) writes:
   In the case mentioned, I didn't see how perl offered distinct advantages
   over nawk.  Nawk has most of the traits mentioned above, and is much more
   widely available.

I would argue that nawk is not nearly as available as perl.  There are
quite a few installed machines in which the old awk is all that is
available.  Even today nawk is not nearly as common as most awk users
would like, since few manufacturers ship it -- if you depend on nawk,
it's best to carry a copy of gawk with you.  If you have a carry a
copy of the sources to your tools with you, I'd take perl over awk
most any time.  Perl has all the options you can ever use and then
some ;-).

-rich
-- 
Rich Kaul                         | It wouldn't be research if we
kaul@icarus.eng.ohio-state.edu    | knew what we were doing.