djo7613@hardy.u.washington.edu (Dick O'Connor) (11/30/90)
I've been following this group for awhile, cutting and saving sample scripts like my Mom clips recipes, but I still haven't picked up enough pointers to migrate one of my old Fortran "compare" programs to perl. If I could do this, the users could make their own runs and I'd be free to contemplate a higher order of existence (programming! :) My program reads two files of differing format which are sorted by a unique 5-character label. When two labels match, a new record is written, with info from file A (moved around a bit) written to the "left" and info from file B (again, reformatted a little) written to the "right". Where a given record from file A or B has no counterpart, the same new record is written, with blanks on the "side" without counterpart information. I know this is simple; it's a short program now. But there's something I'm just not seeing that blocks my conversion to perl. A pointer to a suggested construct would be wonderful...I'm happy to work out the details. BTW I've been away for two weeks; did I miss The Announcement about The Book? "Moby" Dick O'Connor djo7613@u.washington.edu Washington Department of Fisheries *I brake for salmonids*
merlyn@iwarp.intel.com (Randal Schwartz) (12/01/90)
In article <12020@milton.u.washington.edu>, djo7613@hardy (Dick O'Connor) writes: | My program reads two files of differing format which are sorted by a unique | 5-character label. When two labels match, a new record is written, with | info from file A (moved around a bit) written to the "left" and info from | file B (again, reformatted a little) written to the "right". Where a | given record from file A or B has no counterpart, the same new record is | written, with blanks on the "side" without counterpart information. | | I know this is simple; it's a short program now. But there's something | I'm just not seeing that blocks my conversion to perl. A pointer to a | suggested construct would be wonderful...I'm happy to work out the details. Hmm... (warning... untested code follows)... If I could fit all of file A and B into memory (my preferred tactic), I'd do something like this: open(A,"Afile"); while(<A>) { chop; ($label,$rest) = unpack("a5a*",$_); $a{$label} = $rest; $both{$label}++; } close(A); open(B,"Bfile"); while(<B>) { chop; ($label,$rest) = unpack("a5a*",$_); $b{$label} = $rest; $both{$label}++; } close(B); for (sort keys both) { $left = defined($a{$_}) ? $a{$_} : "left default"; $right = defined($b{$_}) ? $b{$_} : "right default"; print "$_ $left $right\n"; } This'd probably take some massaging, but I hope you get the general idea. If they don't both fit into memory, you will have to do some juggling to read from a or b depending on which of the current labels are lower. This solution here is much simpler (and elegant :-) by comparison. | BTW I've been away for two weeks; did I miss The Announcement about The Book? The only announcement you may have missed is that Larry and I are working intensely to incorporate the review comments and finish up the final draft so that we can still make the Usenix deadline. print "Just another Perl [book] hacker," -- /=Randal L. Schwartz, Stonehenge Consulting Services (503)777-0095 ==========\ | on contract to Intel's iWarp project, Beaverton, Oregon, USA, Sol III | | merlyn@iwarp.intel.com ...!any-MX-mailer-like-uunet!iwarp.intel.com!merlyn | \=Cute Quote: "Intel: putting the 'backward' in 'backward compatible'..."====/
les@chinet.chi.il.us (Leslie Mikesell) (12/05/90)
In article <12020@milton.u.washington.edu> djo7613@hardy.u.washington.edu.acs.washington.edu (Dick O'Connor) writes: >I've been following this group for awhile, cutting and saving sample scripts >like my Mom clips recipes, but I still haven't picked up enough pointers to >migrate one of my old Fortran "compare" programs to perl. If I could do >this, the users could make their own runs and I'd be free to contemplate a >higher order of existence (programming! :) >My program reads two files of differing format which are sorted by a unique >5-character label. When two labels match, a new record is written, with >info from file A (moved around a bit) written to the "left" and info from >file B (again, reformatted a little) written to the "right". Where a >given record from file A or B has no counterpart, the same new record is >written, with blanks on the "side" without counterpart information. Perl is the language of choice for this kind of thing but it may still turn out to be non-trivial. It is also fairly hard to describe so examples are generally needed. The merging subroutine is > half the file so I'll just include the whole thing. The concept here is to store old and new items into different associative arrays, sort the keys, then make the comparison from the top of each list. Here is a sample that takes a stream that looks like this from a legislative database: NOXIOUS WEEDS, PEST ERADICATION - 1.2.5 VT H 2 AUTHOR: ... TOPIC: ... SUBTOPIC: ... SUMMARY: ........ ........ STATUS: ....... ....... VT H 5 AUTHOR: etc... and files items under /dir/state/number, where state is taken from the first 2 characters of the bill id, and number is last portion of the topic line. Within the file, items are sorted by their bill id with an additional header added to note the date and whether the bill has been signed. A subsequent entry (possibly an update) with the same bill id will be merged by extracting the SUMMARY: portion of the old entry and stuffing it into the new data which will contain the current status. The merging portion is done in the writestate subroutine. Les Mikesell les@chinet.chi.il.us #---------------- # merge.pl # put legislative info into files: # 1 directory per state, 1 file per topic # merge w/current - if new includes summary, use it, # else snarf summary from old # collect one state from current input - then read current info & merge # # top dir of tree: $dir = './test'; open (ERR,">>errlog") ; # %nitems=(); $haveitem = 0; $havetopic = ""; $havestate = ""; $havesum = 0; $instatus = 0; ($sec,$min,$hour,$mday,$mon,$year,$wday,$yday,$isdst)=localtime(time); $mon += 1; $signed = 'n'; while (<>) { # read input file(s) # strip page breaks and some other junk if (/ Page/) { next;} if ( //) { next;} # these are always blank lines if ( /^ *\f/) { next;} # form feeds - ignore if ( /^ *$/) { $bl=1; next; # blank lines } unless (/\n$/) { # possible junk at EOF if ($instatus) { # past a complete item do saveitem(); do writestate(); } else { # incomplete item, junk it $haveitem = 0; $havestate = ""; $instatus = 0; $havesum = 0; } next; } # find: AGRICULTURAL LOANS - 1.4.1 if ( /^[A-Z].* \- [0-9][0-9]*\.[0-9][0-9]*/) { # new topic $topic=$_; # print "$topic \n"; $topic =~ s/.* \- // ; # print "$topic \n"; $topic =~ s/[^0-9]*$// ; # print "$topic \n"; if ( $haveitem != 0 ) { do saveitem(); # save current entry do writestate(); # write current state } $havetopic = $topic; print "Topic = $topic\n"; next; } # if start of new item, save last in %nitems # if start of new state, merge last state if (/^[A-Z][A-Z].*AUTHOR:/) { #new item if ($haveitem) {do saveitem();} /^([A-Z][A-Z]) *([0-9A-Za-z][0-9A-Za-z]*) *([0-9].*)AUTHOR:/; if (length($1) != 2) { print "**** BILL ID ERROR in $_ **** \n"; print ERR "**** BILL ID ERROR in $_ **** \n"; next; } $sortid=sprintf("%2.2s %s %s",$1,$2,$3); $sortid =~ s/ *$// ; $state=substr($_,0,2); $state =~ y/A-Z/a-z/; if ($havestate ne "" && $havestate ne $state) { # $save = $_ ; do writestate(); # print "Input line: was $save\n now $_\n"; } $id = substr($_,0,14); $id =~ s/ *$//; #trim trailing space $havestate=$state; $_ =~ s/.*AUTHOR:/AUTHOR:/ ; $_ =~ s/ *$//; $item = $_ ; #print "New: Itemid = $id $sortid\n"; $haveitem = 1; next; } if (/SUMMARY:/) { $havesum = 1;} if ($haveitem == 0 ) { next;} if (/STATUS:/) { $instatus = 1;} if ($instatus ) { if ( /RATIFIED/) { $signed = 's';} #NC variation if ( /[Ss]igned/) { $signed = 's';} } if (/END OF REPORT/) { do saveitem(); do writestate(); $havetopic=""; next; } $_ =~ s/ // ; # strip right indent $item .= $_ ; # append current line to item } do saveitem(); do writestate(); exit; # save item to %nitems array w/sortid as key # add header line of: # >smmddyy # where s = s or n (signed or not) # m = month # d = day date item is written to database (now) # y = year (2 digits) # if duplicate keep one with summary field sub saveitem { if ($haveitem = 0 ) {return; } if ($havetopic eq "" ) {return; } $haveitem = 0; $instatus = 0; if ($havesum == 0) { # no summary, check for alternate if ($nitems{$sortid} && $nitems{$sortid} =~ /SUMMARY:/) { return; } } $nitems{$sortid} = sprintf (">%s%02d%02d%02d\n%s%s\n%s",$signed,$mon,$mday,$year,"BILL ID: ",$id,$item) ; $signed = 'n'; $havesum = 0; return; } # load any current items in for merging sub writestate { local ($_) ; # important to not alter upper $_ local ($*) = 1 ; # multi-line match needed %oitems=(); @olist=(); # sanity check - may not have any new input if ($havetopic eq "" ) {return; } if ($havestate eq "" ) {return; } $nname = sprintf ("%s/%s/%s",$dir,$havestate,$havetopic); print "Loading $nname \n" ; if (open (IN,"<$nname")) { # read in old items keeping old date $wsortid=""; $item=""; while (<IN>) { if (/^>/) { # added header line if ($wsortid ne "") { $oitems{$wsortid} = $item; #store previous item $wsortid = ""; $item=""; # start new one } } $item .= $_ ; # collect lines of item # normalize key to match original input if (/^BILL ID: (..) *([0-9A-Za-z][0-9A-Za-z]*) *([0-9].*)/) { $wsortid=sprintf("%2.2s %s %s",$1,$2,$3); #print "Old id: $wsortid\n"; } } if ($wsortid ne "") { $oitems{$wsortid} = $item; # save the last one } close(IN); @olist = sort (keys(%oitems)); # sort the old keys $howmany = $#olist +1; print "$howmany old bills\n" ; } @nlist = sort(keys(%nitems)); # sort the new keys $howmany = $#nlist +1; print "$howmany updates\n" ; #now merge the lists and write out print "Writing $nname \n" ; if ($nname ne $lname ) { close OUT; unless (open (OUT,">$nname")) { $dirname = sprintf ("%s/%s",$dir,$havestate); printf "Creating $dirname\n"; mkdir ($dirname,0777); open (OUT,">$nname") || die "Can't open $nname"; $lname = $nname ; } } #print "@olist\n"; #print "@nlist\n"; $oldid=shift(@olist); # start with top two keys $newid=shift(@nlist); $current = "" ; # sanity check while ( $oldid && $newid ) { # compare and merge #print " oldid = $oldid newid = $newid\n"; if ($current ge $oldid || $current ge $newid) { print "***** MERGE ERROR at $current *****\n"; print ERR "***** MERGE ERROR at $current *****\n"; } if ($oldid eq $newid ) { # merge summary w/new # if anything beyond date is changed use new # this keeps the old date on duplicates if ( ( ($t1) = $oitems{$oldid} =~ /BILL ID:([^\0]*)/ ) && ( ($t2) = $nitems{$newid} =~ /BILL ID:([^\0]*)/ ) && ($t1 eq $t2)) { print "Match: unchanged using OLD $oldid\n"; print OUT $oitems{$oldid} ; $current = $oldid ; } else { print "Match: using NEW $newid \n"; if ($nitems{$newid} =~ /SUMMARY:/) { #new has summary, toss old print OUT $nitems{$newid} ; $current = $newid ; print "NEW has summary\n"; } else { # snarf summary from old - note multi-line wierdness if (($status) = $oitems{$oldid} =~ /(^SUMMARY:\n[^\0]*)^STATUS:/) { # and insert into new - that was easy... substr($nitems{$newid},index($nitems{$newid},"STATUS:\n"),0) = $status ; } printf OUT $nitems{$newid} ; $current = $newid ; print "OLD has summary\n"; } } # this was a match, shift both lists to next item $oldid=shift(@olist); $newid=shift(@nlist); next; } # not a match, use alphabetically first item if ($oldid lt $newid ) { print "using OLD $oldid \n"; print OUT $oitems{$oldid} ; $current = $oldid ; $oldid = shift(@olist); next; } # newid must be > oldid print OUT $nitems{$newid} ; $current = $newid ; print "using NEW $newid \n"; $newid = shift(@nlist); next; } # one of the arrays is empty - write remaining part of other array if ($oldid) { print OUT $oitems{$oldid} ; print "using OLD $oldid \n"; foreach $oldid (@olist) { print OUT $oitems{$oldid} ; print "using OLD $oldid \n"; } } if ($newid) { print OUT $nitems{$newid} ; print "using NEW $newid \n"; foreach $newid (@nlist) { print OUT $nitems{$newid} ; print "using NEW $newid \n"; } } undef %nitems; #left over from trying to pin down a memory leak undef %oitems; # in an old version of perl %nitems=(); %oitems=(); $havestate=""; $haveitem=0; }
goer@quads.uchicago.edu (Richard L. Goerwitz) (12/05/90)
In article <1990Dec04.230436.8432@chinet.chi.il.us> les@chinet.chi.il.us (Leslie Mikesell) writes: > >>My program reads two files of differing format which are sorted by a unique >>5-character label. When two labels match, a new record is written, with >>info from file A (moved around a bit) written to the "left" and info from >>file B (again, reformatted a little) written to the "right". Where a >>given record from file A or B has no counterpart, the same new record is >>written, with blanks on the "side" without counterpart information. > >Perl is the language of choice for this kind of thing but it may still >turn out to be non-trivial.... I like reading this newsgroup, but this sort of statement comes up all too often. Perl is not the only language around that is optimized for file, string, and symbol processing, which has associative arrays, and handles sorting and printing elegantly. If you can't think of any examples off- hand then mail me, and I'll be glad to provide you with a few. This is not to say that we should not use perl. It is to say simply that it's a bit outlandish to call it "*the* language of choice" for tasks like the one described above. -Richard (goer@sophist.uchicago.edu)
chip@tct.uucp (Chip Salzenberg) (12/07/90)
According to goer@quads.uchicago.edu (Richard L. Goerwitz): >Perl is not the only language around that is optimized for file, >string, and symbol processing, which has associative arrays, and handles >sorting and printing elegantly. If you can't think of any examples off- >hand then mail me, and I'll be glad to provide you with a few. Come now, Richard. If you criticize in public, you must put up your facts in public. Name these other languages. Oh yes, and please include availability and cost information. -- Chip Salzenberg at Teltronics/TCT <chip@tct.uucp>, <uunet!pdn!tct!chip> "I'm really sorry I feel this need to insult some people..." -- John F. Haugh II (He thinks HE'S sorry?)
goer@quads.uchicago.edu (Richard L. Goerwitz) (12/07/90)
In article <275E7B47.2EB9@tct.uucp> chip@tct.uucp (Chip Salzenberg) writes: >According to goer@quads.uchicago.edu (Richard L. Goerwitz): >>Perl is not the only language around that is optimized for file, >>string, and symbol processing, which has associative arrays, and handles >>sorting and printing elegantly. If you can't think of any examples off- >>hand then mail me, and I'll be glad to provide you with a few. > >Come now, Richard. If you criticize in public, you must put up your >facts in public. Name these other languages. Oh yes, and please >include availability and cost information. Please, not one of those "come now" responses :-). I felt it completely inappropriate to go into language comparisons here. I assumed that most readers would know of other alternatives, and that my posting would serve merely as a reminder not to get too outlandish in our claims about perl. If you are going to press me, I'll gladly offer you a brief response re- garding alternatives I had in mind: In the case mentioned, I didn't see how perl offered distinct advantages over nawk. Nawk has most of the traits mentioned above, and is much more widely available. If you would like a good example of a language that has all of the characteristics noted above, then I'd suggest you look at Icon. Icon is a general purpose programming language with Snobolish string- handling capabilities, automatic type conversions, associative arrays, and so on. Icon would have been just as easily applied to the particular problem at hand as perl. As for cost and availability, Icon is supported by government grant, and has traditionally been PD. Despite its PD status, Icon is available through a very fine distribution system, and the software itself is much less buggy and much more stable than perl's. Icon is fully documented in _The Icon Programming Language_ by Griswold & Griswold (2nd ed.; Prent- ice Hall). It is available, not only for Unix, MS-DOS, and the Mac (as in perl's case), but is also available for VM/CMS, Ultrix, MVS/XA, VMS, Mach, AEGIS, OS/2, Amiga DOS, and probably others I haven't thought of. You can ftp it from many sites, probably the most accessible being cs. arizona.edu. This posting is not intended as an argument against using perl, by the way. -Richard (goer@sophist.uchicago.edu)
kaul@icarus.eng.ohio-state.edu (Rich Kaul) (12/08/90)
In article <1990Dec7.083412.8426@midway.uchicago.edu> goer@quads.uchicago.edu (Richard L. Goerwitz) writes:
In the case mentioned, I didn't see how perl offered distinct advantages
over nawk. Nawk has most of the traits mentioned above, and is much more
widely available.
I would argue that nawk is not nearly as available as perl. There are
quite a few installed machines in which the old awk is all that is
available. Even today nawk is not nearly as common as most awk users
would like, since few manufacturers ship it -- if you depend on nawk,
it's best to carry a copy of gawk with you. If you have a carry a
copy of the sources to your tools with you, I'd take perl over awk
most any time. Perl has all the options you can ever use and then
some ;-).
-rich
--
Rich Kaul | It wouldn't be research if we
kaul@icarus.eng.ohio-state.edu | knew what we were doing.