[comp.unix.questions] Stupid awk question

dmaustin@vivid.sun.com (Darren Austin) (10/11/89)

Hi all,
	I am trying to split up a large output file into several
smaller files.  The smaller files are named after the value in
the second field.  I tried using the following simple awk script,

(current == $2) {print > current".summary"}
(current != $2) {close(current".summary");current=$2;print > current".summary";}

but it fails with 

awk: too many output files 10

so apparently it is running out of file descriptors.  I thought
that would be taken care of by using the "close" function.  Am I
using it incorrectly, or am I doing something else wrong?  Is
there a better way to do this with other tools?

Any help would be appreciated,
--Darren

P.S. This is on a Sun 3/60 SUNOS 4.0.3.
--
--------------------------------------+-------------------------------
Darren Austin                         | It is a mistake to think you
Sun Microsystems, Mountain View       | can solve any major problem
dmaustin@sun.com	              | just with potatoes.
--------------------------------------+-------------------------------

jik@athena.mit.edu (Jonathan I. Kamens) (10/11/89)

In article <DMAUSTIN.89Oct10145918@vivid.sun.com> dmaustin@vivid.sun.com
(Darren Austin) writes:
>Hi all,
>	I am trying to split up a large output file into several
>smaller files.  The smaller files are named after the value in
>the second field.  I tried using the following simple awk script,
>
>(current == $2) {print > current".summary"}
>(current != $2) {close(current".summary");current=$2;print > current".summary";}
>
>but it fails with 
>
>awk: too many output files 10

  Allow me to quote from "Awk -- A Pattern Scanning and Processing
Language (Second Edition)", by Alfred V. Aho, Brian W. Kernighan, and
Peter J. Weinberger.  I got my copy out of /usr/doc on my system.

  Page 2, in section 1.4 on printing, says, "Naturally there is a
limit on the number of output files; currently it is 10."

  A lot of implementations have overcome this limitation.  Apparently,
the version of awk you are using has not.

  May I suggest either nawk (I don't know where/under what conditions
that's available), or gawk (available for nothing wherever fine GNU
products are distributed)?
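
  For concreteness, a minimal (untested) sketch of how the same idea
might look under nawk or gawk, where close() is actually implemented
(note the parentheses that new awk wants around the concatenated file
name after the > redirection):

# one .summary file per distinct value of $2; assumes lines grouped on $2
$2 != current { if (current != "") close(current ".summary"); current = $2 }
              { print > (current ".summary") }

  As with the original script, if the input is not grouped on the second
field, each reopen with > truncates the file; >> would be safer there.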

Jonathan Kamens			              USnail:
MIT Project Athena				11 Ashford Terrace
jik@Athena.MIT.EDU				Allston, MA  02134
Office: 617-253-4261			      Home: 617-782-0710

P.S.  Despite the title you gave it, it's not really a "stupid" awk
question, especially since the man page doesn't even appear to mention
this limitation, or at least not that I could find.

wyle@inf.ethz.ch (Mitchell Wyle) (10/12/89)

In article <DMAUSTIN.89Oct10145918@vivid.sun.com> 
dmaustin@vivid.sun.com (Darren Austin) writes:

>I am trying to split up a large output file into several
>smaller files.  The smaller files are named after the value in
>the second field.  I tried using the following simple awk script,
>
>(current == $2) {print > current".summary"}
>(current != $2) {close(current".summary");
> current=$2;print > current".summary";}
>
>but it fails with 
>
>awk: too many output files 10


Even though everyone will soon have new awk and all these old awk problems
will go away, I think this question deserves to be in the "Frequently
asked questions and answers" periodic postings.  Who moderates it?  How
should one post to it?

* * *

To answer the question, I shall quote verbatim an old article.

>>I am trying to use AWK to split one file into many formatted, smaller files.
>>The problem I am having is that I cannot output to more than 10 files...
>  
> Well, it won't help you right now, but the long-term fix is to complain
> to your software supplier and ask them to get rid of the silly limit.
> It's not that hard.

The limit is based on the number of file descriptors that can be open
at one time (usually small).  One way I often get around this is by
writing something like the following, which splits up the input on
field $1.

sort +0 |
awk '
# write a "cat > file <<!XYZZY" here-document for each new value of $1
{
        if (last != $1) {
                if (NR > 1) print "!XYZZY";
                print "cat > " $1 " <<!XYZZY";
                last = $1;
        }
        print;
}
END { if (NR > 0) print "!XYZZY"; }' | /bin/sh

        Tony O'Hagan                    tonyo@qitfit.qitcs.oz

* * *

I use Tony's solution all the time.  I have seen it used by at least
two other people (David Goodenough and Amos Shapiro) in shell scripts
posted to the net.

It is very important to put that trailing End_of_Here_Document string
in the END clause of your awk program!  Depending on the complexity of
your parse, you might need other cleanup code there as well.
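
For illustration, if the (sorted) input's first field took the values
alpha and beta (made-up names and data), the awk program above would
feed /bin/sh something like

cat > alpha <<!XYZZY
alpha 17 first line of the alpha group
alpha 42 second line of the alpha group
!XYZZY
cat > beta <<!XYZZY
beta 9 only line of the beta group
!XYZZY

so it is the shell, not awk, that opens and closes the files, one at a
time.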

Happy hacking, 

-Mitchell F. Wyle
Institut fuer Informationssysteme         wyle@inf.ethz.ch 
ETH Zentrum / 8092 Zurich, Switzerland    +41 1 256 5237
--
If this appears in _IN_MODERATION_ or ClariNet, please let me know.
I am forbidden to tell you that you can reach me at:
...!uunet!mcvax!ethz!wyle   or    wyle@rascal.ics.utexas.edu

bink@aplcen.apl.jhu.edu (9704) (10/12/89)

In article <3731@ethz-inf.UUCP> wyle@ethz.UUCP (Mitchell Wyle) writes:
> In article <DMAUSTIN.89Oct10145918@vivid.sun.com> Darren Austin writes:
> 
> >I am trying to split up a large output file into several
> >smaller files.  The smaller files are named after the value in
> >the second field.  I tried using the following simple awk script,
> > [...]
> >but it fails with 
> >awk: too many output files 10
>
> [...]
> sort +0 | awk '[SCRIPT BY TONY O'HAGAN DELETED]' | /bin/sh
> [...]

If you don't have access to the new AWK, but do have the "bs" command
(mini-language) on your machine, I've found it also works well for
splitting a file into subfiles which are named by a field of the input.
The following bs program should solve Mr. Austin's problem in one process:

#!/bin/bs
#  Split the input into files named by the 2nd field.
outfile = ""
while ?(line = get)
	match (line, "[^\t ]*[\t ]*\([^\t ]*\)")
	if mstring(1) != outfile    open ("put", outfile=mstring(1), "w")
	put = line
next
run

(The open seems to automatically close the previous file)
I don't know how portable this is; bs is available in System V.2 anyway.
Type ^N now for a way to do this in 1 line of PERL, by Mr. Schwartz...   ;-)

					-- Greg Ubben
					   bink@aplcen.apl.jhu.edu
					   ...!uunet!mimsy!aplcen!bink

merlyn@iwarp.intel.com (Randal Schwartz) (10/12/89)

In article <15023@bloom-beacon.MIT.EDU>, jik@athena (Jonathan I. Kamens) writes:
| In article <DMAUSTIN.89Oct10145918@vivid.sun.com> dmaustin@vivid.sun.com
| (Darren Austin) writes:
| >(current == $2) {print > current".summary"}
| >(current != $2) {close(current".summary");current=$2;print > current".summary";}
| > [fails]
| [describes why it fails]
|   May I suggest either nawk (I don't know where/under what conditions
| that's available), or gawk (available for nothing whereever fine GNU
| products are distributed)?

Or do it in Perl (of course)...

#!/usr/bin/perl

while (<>) {
	($where) = (/^\s*\S+\s+(\S+)/);	# grab the second whitespace-separated field
	if ($oldwhere ne $where) {
		open(WHERE,">>$where.summary") ||
			die "Cannot open $where.summary ($!)";
		$oldwhere = $where;	# remember it, so we reopen only when the field changes
	}
	print WHERE $_;
}
close(WHERE);

(Perl users may note that the open will automatically close the
previous open.)
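
With made-up names, an invocation would be something like

	perl split2.pl big.out

where split2.pl holds the script above and big.out is the large file to
be carved up.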

Just another Perl hacker,
-- 
/== Randal L. Schwartz, Stonehenge Consulting Services (503)777-0095 ====\
| on contract to Intel's iWarp project, Hillsboro, Oregon, USA, Sol III  |
| merlyn@iwarp.intel.com ...!uunet!iwarp.intel.com!merlyn	         |
\== Cute Quote: "Welcome to Oregon... Home of the California Raisins!" ==/

ber@astbe.UUCP (H.Bernau) (10/12/89)

In article <DMAUSTIN.89Oct10145918@vivid.sun.com> dmaustin@vivid.sun.com (Darren Austin) writes:
>       I am trying to split up a large output file into several
>smaller files.  The smaller files are named after the value in
>the second field.  I tried using the following simple awk script,
>[...deleted...]
>but it fails with
>awk: too many output files 10
>Any help would be appreciated,
>--Darren

Hi.
I ran into the same problem a few days ago :-(
My solution was to make the awk script produce /bin/sh commands:

BEGIN { first = 1 }
(current == $2) { print }
(current != $2) { if (!first) print "_THIS_IS_EOF_";  # end the previous here-document
                  first = 0;
                  current = $2;
                  printf "cat << _THIS_IS_EOF_ > %s.summary\n", $2;
                  print                                # first line of the new group
                }
END { if (!first) print "_THIS_IS_EOF_" }
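
The awk program writes /bin/sh commands to its standard output, so
(with made-up file names) a run would look roughly like

	awk -f bysecond.awk big.out | sh

where bysecond.awk holds the program above and big.out is the file
being split.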

Hope that'll help.
-------------------------------------------------------------------------------
|   Rolf Bernau               |
|   GEI Software Technik mbH  |  Berlin:             astbe!ber
|   Hohenzollerndamm 150      |  USA:                ...!pyramid!tub!astbe!ber
|   1000 Berlin 33            |
|   West-Germany              |
-------------------------------------------------------------------------------

dg@lakart.UUCP (David Goodenough) (10/17/89)

dmaustin@vivid.sun.com (Darren Austin) sez:
> Hi all,
> 	I am trying to split up a large output file into several
> smaller files.  The smaller files are named after the value in
> the second field.  I tried using the following simple awk script,
> 
> (current == $2) {print > current".summary"}
> (current != $2) {close(current".summary");current=$2;print > current".summary";}
> 
> but it fails with 
> 
> awk: too many output files 10

Try this:

-------------------------------------------
#! /bin/sh

awk '{
	print "echo '"'"'" $0 "'"'"' >> " $2 ".summary"
     }' | sh
-------------------------------------------

The only problem is that it tends to chew up process IDs. Oh well .....
It also appends, so you might have a problem if you want to initialise
at the start of each run (one way around that is sketched below). But
you get the general idea. It also has the advantage of not getting into
trouble if your input file hasn't been sorted (i.e. $2 values can come
in any order).
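
A rough sketch of that initialisation, wrapping the same pipeline:

#! /bin/sh
# start each run with a clean slate, then rebuild the .summary files
rm -f ./*.summary
awk '{
	print "echo '"'"'" $0 "'"'"' >> " $2 ".summary"
     }' | sh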
-- 
	dg@lakart.UUCP - David Goodenough		+---+
						IHS	| +-+-+
	....... !harvard!xait!lakart!dg			+-+-+ |
AKA:	dg%lakart.uucp@xait.xerox.com			  +---+

decot@hpisod2.HP.COM (Dave Decot) (10/24/89)

Try csplit(1) or split(1) if you have them.

Dave Decot
decot@hpda