[comp.unix.questions] How to merge two files in awk??

@xlab1.uucp (01/19/91)

 I am not sure if this question has been asked before...

 Supposing I have two files with three columns in each. How do
 I merge the files and generate a single file with six or more
 columns using a shell script?  For example, if File A has columns a, c, e
 and File B has columns b, d, f, I want to generate File C
 with columns a,b,c,d,e,f.  Also it would be nice to be able to
 use the arithmetic feature in awk...

 Finally, how do you specify the "rest of the line" in awk??

 thanks
 ashok

tchrist@convex.COM (Tom Christiansen) (01/20/91)

From the keyboard of @xlab1.uucp ():
: I am not sure if this question has been asked before...
:
: Supposing I have two files with three columns in each. How do
: I merge the files and generate a single file with six or more
: columns using a shell script?  For example, if File A has columns a, c, e
: and File B has columns b, d, f, I want to generate File C
: with columns a,b,c,d,e,f.  Also it would be nice to be able to
: use the arithmetic feature in awk...

This originally went also to comp.unix.internals.  I sure wouldn't
say an awk question is a unix internal.

Someone out there may have a paste solution, but I didn't see one.

In old, standard awk, it's really quite cumbersome, as you have to read in
all the first file, then all the second file, holding every line in memory
until the end.

    #!/bin/awk -f
    # usage: script file1 file2  (both files need the same number of lines)
    { a[NR] = $1; b[NR] = $2; c[NR] = $3; }
    END {
	count = NR/2;	# lines per file
	for (i = 1; i <= count; i++) {
	    print a[i], a[i+count], b[i], b[i+count], c[i], c[i+count];
	}
    }


In gawk (and nawk if you're rich), it's a little easier because you can
redirect getline from a file, effectively reading two lines and writing
one line each iteration.

    #!/usr/gnu/bin/gawk -f
    BEGIN {
	for (;;) {
	    if ((getline < ARGV[1]) <= 0) break;
	    a = $1; c = $2; e = $3;
	    if ((getline < ARGV[2]) <= 0) break;
	    b = $1; d = $2; f = $3;
	    print a, b, c, d, e, f;
	}
    }

It's also pretty easy in perl:

    #!/usr/bin/perl
    $[ = 1; $, = " "; $\ = "\n"; # awk emulation
    open(F1, $ARGV[1]); open(F2, $ARGV[2]);
    while ( (@a = split(' ',<F1>)) && (@b = split(' ', <F2>)) ) {
	print $a[1], $b[1], $a[2], $b[2], $a[3], $b[3];
    }

Other advantages of perl are:

    1) you get better error messages for syntax errors 
    2) you can symbolically debug your program
    3) no limits on lines/fields (gawk is better than nawk at this)
    4) can often be made to run faster than awk 
    5) better usage and i/o failure error messages (i didn't do this here)


If you only have awk and not gawk and perl, you should get them, because
they are both free and compile on a vast array (list? :-) of platforms.
Find them wherever GNUware is stored.

: Finally, how do you specify the "rest of the line" in awk??

I'm not really sure what you mean.  The whole line is $0.  What's the rest
of the line?  You mean fields past the third one?
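
If it does mean the fields past the third one, one way to get them is to
loop from field 4 to NF and join them back together.  A minimal sketch,
assuming whitespace-separated input; the name rest_after_third is made up
for illustration, and the original spacing between fields is not preserved:

```shell
# rest_after_third: hypothetical helper; prints fields 4..NF of each
# input line joined by single spaces (the original spacing is lost).
rest_after_third() {
    awk '{
        rest = ""
        for (i = 4; i <= NF; i++)
            rest = rest (i > 4 ? " " : "") $i
        print rest
    }' "$@"
}
```

Called with no file arguments it reads standard input, e.g.
echo "33.5 ZZZ 4564.334 foo bar" | rest_after_third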

--tom
--
"Hey, did you hear Stallman has replaced /vmunix with /vmunix.el?  Now
 he can finally have the whole O/S built-in to his editor like he
 always wanted!" --me (Tom Christiansen <tchrist@convex.com>)

brnstnd@kramden.acf.nyu.edu (Dan Bernstein) (01/21/91)

In article <1991Jan19.194124.2335@convex.com> tchrist@convex.COM (Tom Christiansen) writes:
> : For example, if File A has columns a, c, e
> : and File B has columns b, d, f, I want to generate File C
> : with columns a,b,c,d,e,f.  Also it would be nice to be able to
> : use the arithmetic feature in awk...
> Someone out there may have a paste solution, but I didn't see one.

Is that a challenge?

  #!/bin/sh
  # untested, but too simple to fail in strange ways
  # type X as tab
  awk '{ print $1; print $2; print $3 }' < "$1" > /tmp/file1.$$
  awk '{ print $1; print $2; print $3 }' < "$2" > /tmp/file2.$$
  paste /tmp/file1.$$ /tmp/file2.$$ | (
  while read i
  do
    read j; read k
    echo "${i}X${j}X${k}"
  done
  )
  rm /tmp/file1.$$ /tmp/file2.$$

---Dan

mrd@ecs.soton.ac.uk (Mark Dobie) (01/22/91)

In <3404@d75.UUCP> @xlab1.uucp writes:

> Finally, how do you specify the "rest of the line" in awk??

I am a relative beginner with awk, but I ran into this problem too. My
solution was to set the fields I wasn't interested in to "" and then
use $0.

eg
	$1 = "" ;
	print "rest of line is " $0
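
The snippet above works, with a side effect worth knowing: assigning to a
field makes awk rebuild $0 from the fields joined by OFS, so the emptied $1
still leaves its separator at the front of the line, and runs of blanks
between the remaining fields collapse to a single space.  A minimal sketch
that strips the leftover separator, assuming default FS and OFS and an awk
that has sub() (nawk, gawk); drop_first is a made-up name:

```shell
# drop_first: hypothetical helper; blanks $1 and prints what is left.
drop_first() {
    awk '{
        $1 = ""          # $0 is rebuilt as " f2 f3 ..." (fields joined by OFS)
        sub(/^ /, "")    # remove the separator left behind by the empty $1
        print "rest of line is " $0
    }' "$@"
}
```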

Is this a good way?

			Mark.
-- 

Mark Dobie                              M.Dobie@uk.ac.soton.ecs (JANET)
University of Southampton		M.Dobie@ecs.soton.ac.uk (Bitnet)

gwc@root.co.uk (Geoff Clare) (01/23/91)

In article <1991Jan19.194124.2335@convex.com> tchrist@convex.COM (Tom Christiansen) writes:
> : For example, if File A has columns a, c, e
> : and File B has columns b, d, f, I want to generate File C
> : with columns a,b,c,d,e,f.  Also it would be nice to be able to
> : use the arithmetic feature in awk...
> Someone out there may have a paste solution, but I didn't see one.

In <25041:Jan2017:21:1491@kramden.acf.nyu.edu> brnstnd@kramden.acf.nyu.edu (Dan Bernstein) writes:

}Is that a challenge?

}  #!/bin/sh
}  # untested, but too simple to fail in strange ways
}  # type X as tab
}  awk '{ print $1; print $2; print $3 }' < "$1" > /tmp/file1.$$
}  awk '{ print $1; print $2; print $3 }' < "$2" > /tmp/file2.$$
}  paste /tmp/file1.$$ /tmp/file2.$$ | (
}  while read i
}  do
}    read j; read k
}    echo "${i}X${j}X${k}"
}  done
}  )
}  rm /tmp/file1.$$ /tmp/file2.$$


That is really gross!

Try this:

    paste "$1" "$2" | awk '{print $1, $4, $2, $5, $3, $6}'
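
Because paste puts all six columns on one line, the arithmetic part of the
original question can be handled in the same awk program.  A minimal sketch,
assuming both files hold numeric, whitespace-separated columns; the name
merge_sum and the extra seventh column (first column of file 1 plus first
column of file 2) are illustrative choices only:

```shell
# merge_sum: hypothetical helper; $1 and $2 are the two input files.
merge_sum() {
    # after paste, fields 1-3 come from the first file, 4-6 from the second
    paste "$1" "$2" | awk '{ print $1, $4, $2, $5, $3, $6, $1 + $4 }'
}
```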

-- 
Geoff Clare <gwc@root.co.uk>  (Dumb American mailers: ...!uunet!root.co.uk!gwc)
UniSoft Limited, London, England.   Tel: +44 71 729 3773   Fax: +44 71 729 3273

martin@mwtech.UUCP (Martin Weitzel) (01/30/91)

In article <3404@d75.UUCP> @xlab1.uucp () writes:
> Supposing I have two files with three columns in each. How do
> I merge the files and generate a single file with six or more
> columns using a shell script?  For example, if File A has columns a, c, e
> and File B has columns b, d, f, I want to generate File C
> with columns a,b,c,d,e,f.  Also it would be nice to be able to
> use the arithmetic feature in awk...

IMHO this is not feasible with OLD "awk" for LARGE files.

Small files could be saved in an associative array.

	awk '
	FILENAME == "first" {
		line[NR] = $0
	}
	FILENAME == "second" {
		print line[++i] " " $0
	}
	' first second

Of course, UNIX has enough friendly commands to help you, e.g.:

	pr -tm first second | awk '{ whatever you like }'

With NEW "awk" (nawk) merging is feasible, e.g.:

	nawk '{
		printf "%s ", $0
		getline < "second"
		print
	}' first
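
The loop above does not look at what getline returns: when the second file
runs out first, $0 keeps its old value and the line from the first file is
printed twice.  A minimal sketch that tests the return value, assuming a new
awk (nawk, gawk); the name merge_lines and the warning text are made up, and
extra lines in a longer second file are simply ignored:

```shell
# merge_lines: hypothetical helper; $1 = first file, $2 = second file.
merge_lines() {
    awk -v second="$2" '{
        if ((getline other < second) > 0)
            print $0 " " other
        else {
            # second file exhausted (or unreadable): warn on stderr
            print "line " NR ": no matching line in " second | "cat 1>&2"
            print $0
        }
    }' "$1"
}
```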

> Finally, how do you specify the "rest of the line" in awk??

I don't quite understand this. Do you mean the following:

	33.5 ZZZ 4564.334 foo bar
			  ^^^^^^^--- processed as "rest of line"
	^^^^ ^^^ ^^^^^^^^ ---------- processed as $1, $2, $3

In this case there are several solutions: If in your input data the
first three fields always occupy the same space, say 18 chars, you
can access the "rest of line" as substr($0, 19).

If the $1..$3 have no equal width, but you are sure that there is
only one separator between them, you may sum them up and get the rest
of the line with substr($0, length($1) + length($2) + length($3) + 4)
(three separators, plus one to land on the first character of the rest).
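
A minimal sketch of the length-based variant, assuming no leading blanks and
exactly one separator character after each of the first three fields.  The
offset is the three field widths, plus three separators, plus one to step
onto the first character of the rest; rest_by_length is a made-up name:

```shell
# rest_by_length: hypothetical helper; prints everything after field 3.
rest_by_length() {
    awk '{
        start = length($1) + length($2) + length($3) + 4
        print substr($0, start)
    }' "$@"
}
```

For the sample line the widths are 4, 3 and 8, so start is 19, the same
position as in the fixed-width case.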

In any case my advice would be - if possible - to re-design your
input data, e.g. to put some unique separator before the "rest of
the line", say:

	33.5 ZZZ 4564.334 !foo bar
			  ^------------ unique, i.e. must not appear
as part of $1, $2, $3 or the rest of the line. Then you can use
split($0, xx, "!") and access the rest of the line with xx[2].

My general observation is that "awk" is a real "power tool", but to
get the most out of it with not too complicated programs you should obey
certain design criteria for your input data, e.g. you should use
unique separators in a hierarchical way:

	XXX:a,b,c:YYYYYYY ZZZZZ
	                  ^^^^^---- $2
	^^^^^^^^^^^^^^^^^ --------- $1
            ^^^^^ ----------------- split($1, xx, ":")     -> xx[2]
	        ^ ----------------- split(xx[2], yy, ",")  -> yy[3]
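
The diagram above can be turned into a short program.  A minimal sketch,
assuming records shaped like the sample (blanks between $1 and $2, ":"
inside $1, "," inside the middle part); pick_c is a made-up name:

```shell
# pick_c: hypothetical helper; prints the third comma item and $2.
pick_c() {
    awk '{
        split($1, xx, ":")       # xx[1]="XXX"  xx[2]="a,b,c"  xx[3]="YYYYYYY"
        split(xx[2], yy, ",")    # yy[1]="a"    yy[2]="b"      yy[3]="c"
        print yy[3], $2
    }' "$@"
}
```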

Some other small hint: It's trivial to design a "comment feature" for
your input data using the familiar style that every line starting with
a "#" is thrown away. The following is an excerpt which can be found in
many of my "awk"-programs:

	awk '
	/^[ \t]*#/ { next; }
	......
	...... rest of program
	......
	'
-- 
Martin Weitzel, email: martin@mwtech.UUCP, voice: 49-(0)6151-6 56 83

pbiron@weber.ucsd.edu (Paul Biron) (02/02/91)

In article <1070@mwtech.UUCP> martin@mwtech.UUCP (Martin Weitzel) writes:
>In article <3404@d75.UUCP> @xlab1.uucp () writes:
[stuff deleted]
>> Finally, how do you specify the "rest of the line" in awk??
>
>I don't quite understand this. Do you mean the following:
>
>	33.5 ZZZ 4564.334 foo bar
>			  ^^^^^^^--- processed as "rest of line"
>	^^^^ ^^^ ^^^^^^^^ ---------- processed as $1, $2, $3
>
>In this case there are several solutions: If in your input data the
>first three fields always occupy the same space, say 18 chars, you
>can access the "rest of line" as substr($0, 19).
>
>If the $1..$3 have no equal width, but you are sure that there is
>only one separator between them, you may sum them up and get the rest
>of the line with substr($0, length($1) + length($2) + length($3) + 4).
>
[stuff deleted]

Another, albeit data dependent, way to do "rest of line" in {n,g}awk
is the following:

#!/usr/local/bin/gawk -f
{
        start = index ($0, $3)
        print "first is", $1, $2
        print "rest is", substr ($0, start)
}

This assumes that what you want to process as the "rest of line" does
not occur in the "first" part of the line.
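
When the text of $3 can also appear earlier in the line, index() finds the
wrong spot.  A minimal sketch that avoids the data dependence by stripping
leading fields one at a time with sub(), assuming blank- or tab-separated
input and a new awk (sub() is not in old awk); rest_from is a made-up name
and its argument is the number of fields to skip:

```shell
# rest_from: hypothetical helper; $1 = number of leading fields to skip,
# input is read from stdin.
rest_from() {
    awk -v n="$1" '{
        rest = $0
        for (i = 0; i < n; i++)
            sub(/^[ \t]*[^ \t]+[ \t]*/, "", rest)   # drop one leading field
        print "rest is", rest
    }'
}
```

Unlike rebuilding the line from fields, this keeps the original spacing
inside the part that is left.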

While in general I agree with Martin that structuring your input data
makes your life a lot easier when it comes to writing awk scripts,
that is not always possible.  I don't know whether it *is*
possible in the case of the original poster.

Hope this helps,

--------------------------------------------------------------------------------
STOP THE WAR IN THE GULF --- NOW !!!!!!!
--------------------------------------------------------------------------------
Paul Biron      pbiron@ucsd.edu        (619) 534-5758

sleepy@wybbs.mi.org (Mike Faber) (02/04/91)

In article <1070@mwtech.UUCP> you write:
>In article <3404@d75.UUCP> @xlab1.uucp () writes:
>> Supposing I have two files with three columns in each. How do
>> I merge the files and generate a single file with six or more
>> columns using a shell script?  For example, if File A has columns a, c, e
>> and File B has columns b, d, f, I want to generate File C
>> with columns a,b,c,d,e,f.  Also it would be nice to be able to
>> use the arithmetic feature in awk...
>
>IMHO this is not feasible with OLD "awk" for LARGE files.

[Good discussion of old/new awk and solution]

Aren't we overlooking the easy solution here?

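# assumes the columns inside filea and fileb are themselves "|"-separated;
# after paste, fields 1-3 come from filea and fields 4-6 from fileb
paste -d"|" filea fileb | awk -F"|" ' { printf("%s %s %s %s %s %s\n", \
$1,$4,$2,$5,$3,$6) } ' >outputfile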

OK, it's brute force, but it's simple, easy to read, and flexible, in case the
file changes.

--
Michael Faber
sleepy@wybbs.uucp