[comp.unix.questions] Datafile conversion with AWK !

root@csoftec.csf.com (Cliff Manis (cmanis@csoftec)) (12/14/89)

I am needing help with data conversion, and would appreciate help 
in AWK or SED and/or awk & sed.   Or whatever....

While attempting to convert thie data for sometime with 'awk', I have
learned a lot, but still need some help with this one.

(here is the ASCII datafile I have)

AA1|name1|4you|ABC Co|3|56|a|bbb|c|d|eeee|fff|g|1|
AA1|name1|4you|5th ST||||||||||2|
AA1|name1|4you|Go4it, TX||||||||||3|
ZZ1|name2|5ght|SED Co.|4|88|b|ccc|c|d|eee2|ff2|h|1|
JJ1|name3|6ghi|AWK Inc.|4|98|c|ddd|e|f|gggg|ff3|i|1|
JJ1|name3|6ghi|2nd St||||||||||2|
JJ1|name3|6ghi|POB 34||||||||||3|
JJ1|name3|6ghi|Town, USA||||||||||4|
BB1|name4|7mob|B H Co|2|56|d|eee|f|g|hhhh|ii4|j|1|
BB1|name4|7mob|Athens, TN||||||||||2|

Please note that some records have one line to 4 lines of data, and
the address field is field four.  The last field is the sub-record number.

(and this is the datafile configuration which I need from the data above)

AA1|name1|4you|ABC Co|5th ST|Go4it, TX||3|56|a|bbb|c|d|eeee|fff|g|1
ZZ1|name2|5ght|SED Co.||||4|88|b|ccc|c|d|eee2|ff2|h|1
JJ1|name3|6ghi|AWK Inc.|2nd St|POB 34|Town, USA|4|98|c|ddd|e|f|gggg|ff3|i|1
BB1|name4|7mob|B H Co|Athens, TN|||2|56|d|eee|f|g|hhhh|ii4|j|1

Thanks in advance for any help sent to me or over the net.  As anyone can 
readily see, the top datafile is a BIG waste of space.....cliff
     ______________________________________________________ 
    +      Unix & Xenix today for a better tommorrow!      +
    +      Cliff Manis        ARPA: hcscm@afbs.af.mil      +
    +         INTERNET: cmanis@csoftec.csf.com             +
    +   UUCP: {dpmizar|texbell|swrinde}!csoftec!cmanis     +
-

rock@rancho.uucp (Rock Kent) (12/14/89)

<note I have deleted comp.unix.wizards . . . I'm sure they already know this.>

On 13 Dec 89 23:53:07 GMT, Cliff Manis(cmanis@csoftec) said:

Cliff> I am needing help with data conversion, and would appreciate help 
Cliff> in AWK or SED and/or awk & sed.  
           [to convert from]
Cliff> AA1|name1|4you|ABC Co|3|56|a|bbb|c|d|eeee|fff|g|1|
Cliff> AA1|name1|4you|5th ST||||||||||2|
Cliff> AA1|name1|4you|Go4it, TX||||||||||3|
           [to]
Cliff> AA1|name1|4you|ABC Co|5th ST|Go4it, TX||3|56|a|bbb|c|d|eeee|fff|g|1

Try:   awk -f reform.awk inputfile 
where reform.awk is as follows:
---------snip-snip-snip-snip---------------------------------------------------
BEGIN{FS="|"}
{
if (prior != $1 $2 $3) {
  if (prior != ""){ 
    for (j=1;j<17;j++) {printf "%s|", line[j]};
    printf "%s\n", line[17]; };
  prior = $1 $2 $3;
  split($0,line,"|");
  for (i=17;i>7;i--){line[i]=line[i-3]};
  for (i=5;i<8;i++){line[i]=""};
  l=4}
else {
  l=l+1;
  line[l]=$4;}
}
END{
for (j=1;j<17;j++) {printf "%s|", line[j]};
printf "%s\n", line[17];
}
---------snip-snip-snip-snip---------------------------------------------------

***************************************************************************
*Rock Kent    rock@rancho.uucp        POB 8964, Rancho Sante Fe, CA. 92067*
***************************************************************************

tchrist@convex.COM (Tom Christiansen) (12/14/89)

In article <539@csoftec.csf.com> root@csoftec.csf.com (Cliff Manis (cmanis@csoftec)) writes:
>I am needing help with data conversion, and would appreciate help 
>in AWK or SED and/or awk & sed.   Or whatever....

If you'll accept perl (version 3.0 because I like the array slices), this 
code does it for you, assuming the input files are really as shown:

    #!/usr/bin/perl
    while (<>) {
	chop; split(/\|/);
	if ( $_[$#_] == 1 ) { 
	    &print_line unless $#addrs < 0;
	    @preamble  = @_[0..2];
	    @addrs     = @_[3];
	    @postamble = @_[4..$#_];
	} else {
	    push(@addrs, $_[3]);
	}
    } 
    &print_line;
    sub print_line {
	push(@addrs,'') while $#addrs < 3;
	print join ('|', @preamble, @addrs, @postamble), "\n";
    }

I thought it a bit odd that the input data had a trailing pipe and the
requested output data didn't, but that's what the guy wanted.  If he
really does want a trailing pipe, change the last print line to:
	print join ('|', @preamble, @addrs, @postamble, "\n");

    Tom Christiansen                       {uunet,uiucdcs,sun}!convex!tchrist 
    Convex Computer Corporation                            tchrist@convex.COM
		 "EMACS belongs in <sys/errno.h>: Editor too big!"

johnson@ncrons.StPaul.NCR.COM (Wayne D. Johnson) (12/15/89)

In article <539@csoftec.csf.com> root@csoftec.csf.com (Cliff Manis (cmanis@csoftec)) writes:
>(here is the ASCII datafile I have)
>
>AA1|name1|4you|ABC Co|3|56|a|bbb|c|d|eeee|fff|g|1|
>AA1|name1|4you|5th ST||||||||||2|
>AA1|name1|4you|Go4it, TX||||||||||3|
>ZZ1|name2|5ght|SED Co.|4|88|b|ccc|c|d|eee2|ff2|h|1|
>JJ1|name3|6ghi|AWK Inc.|4|98|c|ddd|e|f|gggg|ff3|i|1|
>JJ1|name3|6ghi|2nd St||||||||||2|
>JJ1|name3|6ghi|POB 34||||||||||3|
>JJ1|name3|6ghi|Town, USA||||||||||4|
>BB1|name4|7mob|B H Co|2|56|d|eee|f|g|hhhh|ii4|j|1|
>BB1|name4|7mob|Athens, TN||||||||||2|
>
Interesting little bugger...

The first problem is to identify the records.  The easiest way to do that is 
with awk.  Try search paremeters like:
/1|$/
/2|$/
This will seperate the records so you can process them.  The 1| part matches the
1| in the record and the $ guarentees that it will only match when the 1| is the
last characters of the line.

The processing should be somthing like:
/1|$/ { addr1=$4; f1=$1; f2=$2; f3=$3...}
/2|$/ { addr2=$4; ...}
/3|$/ { addr3=$4; ...}
/4|$/ { addr4=$4; ...}
This will save off the contents of the address(?) field from record 1 and 2, you
will need to have a process for each record type.  

This allows you to process all the records but how about outputing them.  You
first need to detect when the line read is part of a new record.  I like to use
somthing like:

NR == 1 { old = $1}

This will set the variable old to the first field of the first record

($1 != old) || END { 
	print f1 f2 f3 addr1 addr2 addr3 addr4 ...
	addr1=""
	addr2=""
	addr3=""
	addr4=""
	old=$1
}
This will detect when you have read the first line of the next record.  When
this happens, you print out the information you have gathered from the last
record and set your old variable to the new key.

Lastly, you need to define your field seperator character as | so you put
IFS="|" and OFS="|" on the awk command line.

I've been pretty general and breif about this whole thing, if you need more
help or have a question about anything, just email me.  

Disclaimer: I have not run any of this code, but it should work.  note that 
the code listed has some generalities to it (i.e. "...") that means additional
code could be placed here.

I wasn't shure what your knowlege of awk was so I tried to be pretty simple.

-- 
Wayne Johnson         |  Is a baby's life worth more than the right to 
NCR Comten, Inc.      |  make a choice?  Babies are people too.
Roseville MN 55113    +-----------------------------------------------------
(Voice) 612-638-7665   (E-MAIL) W.Johnson@StPaul.NCR.COM

lwall@jpl-devvax.JPL.NASA.GOV (Larry Wall) (12/15/89)

In article <539@csoftec.csf.com> root@csoftec.csf.com (Cliff Manis (cmanis@csoftec)) writes:
: I am needing help with data conversion, and would appreciate help 
: in AWK or SED and/or awk & sed.   Or whatever....

Here is, of course, the answer in "whatever":        :-)

#!/usr/bin/perl

@key = ('','','');
while (<>) {
    ($key,$addr,$rest,$sub) = /^([^|]*\|[^|]*\|[^|]*)\|([^|]*)\|(.*(\d))/;
    &printone if $key ne $lastkey;
    $addr[$sub-1] = $addr;
}
&printone;

sub printone {
    print join('|',$lastkey,@addr,$lastrest),"\n" if $lastkey ne '';
    $lastkey = $key;
    $lastrest = $rest;
    @addr = ('','','','');
}

I think you should find this fairly readable.

If you're uncomfortable with the hairy regular expression, here's one
that uses split instead:

#!/usr/bin/perl

@key = ('','','');
while (<>) {
    (@key[0..2],$addr,@remainder) = split(/[|]/,$_,14);
    $sub = pop(@remainder);
    $key = join('|',@key);
    &printone if $key ne $lastkey;
    $addr[$sub-1] = $addr;
}
&printone;

sub printone {
    print join('|',$lastkey,@addr,@lastremainder,"1\n") if $lastkey ne '';
    $lastkey = $key;
    @lastremainder = @remainder;
    @addr = ('','','','');
}

Larry Wall
lwall@jpl-devvax.jpl.nasa.gov