[comp.unix.shell] Breaking large file into pieces

monroe@dg-rtp.dg.com (Mark A Monroe) (09/11/90)

I want to rip a large file into pieces, naming new files according
to an ID string in the large file.  For example, the large file contains
records that look like this:

xxx-00001239	data	data	data
description
       .
       .
(variable length)
       .
						<---blank line
xxx-00001489	data	data	data
description
       .
       .
(variable length)
       .
						<---blank line
xxx-00001326	data	data	data

When I find a line in the large data file that starts
with "xxx-0000", I want to open a file named "xxx-0000<number>",
like "xxx-00001489", and write every line, including
the current one, into it.  When I see another "xxx-0000",
I want to close the file, open a new file named for the new id 
string, and continue writing.  At the end of the large data
file, close all files and exit.

Any suggestions?  

--


-------------------------- END OF MAIN MESSAGE -----------------------------
Mark A. Monroe  			
UNIX Release Integration	   Internet: monroe@dg-rtp.dg.com
Data General Corp.		   UUCP:     {world}!mcnc!rti!dg-rtp!monroe
Research Triangle Park, NC 	   Phone:    (919)248-6234       
----------------------------------------------------------------------------

lwall@jpl-devvax.JPL.NASA.GOV (Larry Wall) (09/12/90)

In article <1990Sep11.134238.20218@dg-rtp.dg.com> monroe@dg-rtp.dg.com (Mark A Monroe) writes:
: I want to rip a large file into pieces, naming new files according
: to an ID string in the large file.  For example, the large file contains
: records that look like this:
: 
: xxx-00001239	data	data	data
: description
:        .
:        .
: (variable length)
:        .
: 						<---blank line
: xxx-00001489	data	data	data
: description
:        .
:        .
: (variable length)
:        .
: 						<---blank line
: xxx-00001326	data	data	data
: 
: When I find a line in the large data file that starts
: with "xxx-0000", I want to open a file named "xxx-0000<number>",
: like "xxx-00001489", and write every line, including
: the current one, into it.  When I see another "xxx-0000",
: I want to close the file, open a new file named for the new id 
: string, and continue writing.  At the end of the large data
: file, close all files and exit.
: 
: Any suggestions?  

In standard shell+awk+sed it's a bit hard because you run out of file
descriptors.  You could do something like run sed over your file
to turn it into a giant script of here-is commands, but that'll be
real slow.

You could do something like this:

while read line; do
    case "$line" in
    xxx-0000*) set $line; exec >$1;;
    esac
    echo "$line"
done

But how well that works depends on the vagaries of your echo command,
such as what it does with lines starting with '-', or containing '\c'.
You don't really want to do this on a machine where echo isn't a builtin.
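
If you have a printf command (not everyone does), the echo question goes
away entirely; a sketch of the same loop using it, reading from a
hypothetical largefile:

while read line; do
    case "$line" in
    xxx-0000*) set $line; exec >"$1";;
    esac
    printf '%s\n' "$line"
done < largefile

(Plain read still eats backslashes, mind you, but that's a separate
problem from echo.)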

If you have Perl, your fastest solution will be to say something like

perl -pe 'open(STDOUT,">$&") if /^xxx-0000\d+/' filename

Change > to >> if the keys aren't unique in your input file.

Larry Wall
lwall@jpl-devvax.jpl.nasa.gov

merlyn@iwarp.intel.com (Randal Schwartz) (09/12/90)

In article <1990Sep11.134238.20218@dg-rtp.dg.com>, monroe@dg-rtp (Mark A Monroe) writes:
| I want to rip a large file into pieces, naming new files according
| to an ID string in the large file.  For example, the large file contains
| records that look like this:
| 
| xxx-00001239	data	data	data
| description
|        .
|        .
| (variable length)
|        .
| 						<---blank line
| xxx-00001489	data	data	data
| description
|        .
|        .
| (variable length)
|        .
| 						<---blank line
| xxx-00001326	data	data	data
| 
| When I find a line in the large data file that starts
| with "xxx-0000", I want to open a file named "xxx-0000<number>",
| like "xxx-00001489", and write every line, including
| the current one, into it.  When I see another "xxx-0000",
| I want to close the file, open a new file named for the new id 
| string, and continue writing.  At the end of the large data
| file, close all files and exit.
| 
| Any suggestions?  

You didn't say "and I don't want it in Perl", so I'm considering
this solution fair game...

perl -pe 'open(STDOUT,">$1") if /^(xxx-\d+)/;' bigdatafile

Pretty durn simple.  The right tool for the job.  Yeah, you could do
it with an awk script feeding into a /bin/sh (or with a smarter awk),
but this is too easy.

Just another Perl hacker,
-- 
/=Randal L. Schwartz, Stonehenge Consulting Services (503)777-0095 ==========\
| on contract to Intel's iWarp project, Beaverton, Oregon, USA, Sol III      |
| merlyn@iwarp.intel.com ...!any-MX-mailer-like-uunet!iwarp.intel.com!merlyn |
\=Cute Quote: "Welcome to Portland, Oregon, home of the California Raisins!"=/

brnstnd@kramden.acf.nyu.edu (Dan Bernstein) (09/12/90)

In article <9466@jpl-devvax.JPL.NASA.GOV> lwall@jpl-devvax.JPL.NASA.GOV (Larry Wall) writes:
>     echo "$line"

C'mon, Larry. You know that should be echo -n "$line$n" where $n has
been initialized to a newline. Or echo -n "$line"; echo.

---Dan

parag@shah.austin.ibm.com (Parag Shah) (09/12/90)

>I want to rip a large file into pieces, naming new files according
>to an ID string in the large file.  For example, the large file contains
>records that look like this:

>xxx-00001239	data	data	data
>description
 
Seems like the "csplit" command may be able to help you.
You may have to rename the newly created files afterwards, using
the first line of each to get the ID number that you want.

Parag      

lwall@jpl-devvax.JPL.NASA.GOV (Larry Wall) (09/12/90)

In article <22842:Sep1121:10:3390@kramden.acf.nyu.edu> brnstnd@kramden.acf.nyu.edu (Dan Bernstein) writes:
: In article <9466@jpl-devvax.JPL.NASA.GOV> lwall@jpl-devvax.JPL.NASA.GOV (Larry Wall) writes:
: >     echo "$line"
: 
: C'mon, Larry. You know that should be echo -n "$line$n" where $n has
: been initialized to a newline. Or echo -n "$line"; echo.

Don't teach your grandmother to suck eggs.  Let's see you come up with
a solution that also works on all the \c machines.

Apart from "Install SVR4".  :-)

Larry

brnstnd@kramden.acf.nyu.edu (Dan Bernstein) (09/12/90)

In article <9469@jpl-devvax.JPL.NASA.GOV> lwall@jpl-devvax.JPL.NASA.GOV (Larry Wall) writes:
> In article <22842:Sep1121:10:3390@kramden.acf.nyu.edu> brnstnd@kramden.acf.nyu.edu (Dan Bernstein) writes:
> : In article <9466@jpl-devvax.JPL.NASA.GOV> lwall@jpl-devvax.JPL.NASA.GOV (Larry Wall) writes:
> : >     echo "$line"
> : C'mon, Larry. You know that should be echo -n "$line$n" where $n has
> : been initialized to a newline. Or echo -n "$line"; echo.
> Don't teach your grandmother to suck eggs.  Let's see you come up with
> a solution that also works on all the \c machines.
> Apart from "Install SVR4".  :-)

Hmph. Just preprocess the file with sed 's-\\-\\\\-g'. Or, if you really
like backslashes, sed s/\\\\/\\\\\\\\\/g. Or, if you're a masochist, sed
s/\\(\\\\\\)/\\\\\\1/g? sh -c "sed s/\\\\(\\\\\\\\\\\\)/\\\\\\\\\\\\1/g"?
There, that's four solutions, apart from ``install BSD''. :-)

Oh, sorry, sed is heresy for a Perl lord, isn't it? :-)

---Dan

lwall@jpl-devvax.JPL.NASA.GOV (Larry Wall) (09/12/90)

In article <23529:Sep1122:52:3790@kramden.acf.nyu.edu> brnstnd@kramden.acf.nyu.edu (Dan Bernstein) writes:
: In article <9469@jpl-devvax.JPL.NASA.GOV> lwall@jpl-devvax.JPL.NASA.GOV (Larry Wall) writes:
: > In article <22842:Sep1121:10:3390@kramden.acf.nyu.edu> brnstnd@kramden.acf.nyu.edu (Dan Bernstein) writes:
: > : In article <9466@jpl-devvax.JPL.NASA.GOV> lwall@jpl-devvax.JPL.NASA.GOV (Larry Wall) writes:
: > : >     echo "$line"
: > : C'mon, Larry. You know that should be echo -n "$line$n" where $n has
: > : been initialized to a newline. Or echo -n "$line"; echo.
: > Don't teach your grandmother to suck eggs.  Let's see you come up with
: > a solution that also works on all the \c machines.
: > Apart from "Install SVR4".  :-)
: 
: Hmph. Just preprocess the file with sed 's-\\-\\\\-g'. Or, if you really
: like backslashes, sed s/\\\\/\\\\\\\\\/g. Or, if you're a masochist, sed
: s/\\(\\\\\\)/\\\\\\1/g? sh -c "sed s/\\\\(\\\\\\\\\\\\)/\\\\\\\\\\\\1/g"?
: There, that's four solutions, apart from ``install BSD''. :-)
: 
: Oh, sorry, sed is heresy for a Perl lord, isn't it? :-)

No, it isn't.  The official Perl Slogan is: There's More Than One Way To Do It.

There are still a few things sed is good for...   :-)

And you'll note that I did mention sed in my original message.

But "hmph", yourself.  You still haven't posted "a solution that ALSO works
on all the \c machines."  Your -n solution doesn't work on a \c machine,
and your \c solution doesn't work on a -n machine.  At least it's symmetrical.

(By the way, what makes you think s/\\\\/\\\\\\\\/ is a solution?  It only
translates \\, not \c.)

Please don't call me a "Perl lord".  I'm merely trying to follow the ancient
wisdom that "he who wants to be the greatest among you must become the
servant of all."

The leader of this particular jihad doesn't believe in holy war.  Pity.

Larry

brnstnd@kramden.acf.nyu.edu (Dan Bernstein) (09/12/90)

Larry:
> Dan:
> : Larry:
> : > Dan:
> : > : Larry:
> : > : >     echo "$line"
        [should be echo -n "$line$n" or echo -n "$line"; echo ]
      [but that fails on \c machines ]
    [preprocess with sed 's-\\-\\\\-g' or various others ]
> But "hmph", yourself.  You still haven't posted "a solution that ALSO works
> on all the \c machines."  Your -n solution doesn't work on a \c machine,
> and your \c solution doesn't work on a -n machine.  At least it's symmetrical.

Oh, gimme a break. Your suggestion of switching to SVR4 doesn't work on
any machine. :-) It's trivial to select between the two scripts with an
echo test. There: a perfectly portable sh solution to the problem that
started this thread.
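
For instance (a sketch; split-bsd and split-sysv are hypothetical names
for the -n and \c variants of the read/echo loop, and largefile stands
in for the input):

if [ "`echo -n x`" = "x" ]; then
	sh split-bsd < largefile                      # echo understands -n
else
	sed 's/\\/\\\\/g' largefile | sh split-sysv   # \c-style echo: double the backslashes first
fi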

> (By the way, what makes you think s/\\\\/\\\\\\\\/ is a solution?  It only
> translates \\, not \c.)

Oh dear. Not paying attention to your quoting? sed s/\\\\/\\\\\\\\/ is
translated by the shell into sed s/\\/\\\\/, which is translated by sed
into s/\/\\/, which is what we want. \c becomes \\c, which is parsed
(correctly) as backslash-c. What makes you think this isn't a solution?

---Dan

Tim.Ouellette@FtCollins.NCR.COM (Tim.Ouellette) (09/12/90)

>>>>> On 11 Sep 90 13:42:38 GMT, monroe@dg-rtp.dg.com (Mark A Monroe) said:

Mark> I want to rip a large file into pieces, naming new files according
Mark> to an ID string in the large file.  For example, the large file contains
Mark> records that look like this:

Mark> xxx-00001239	data	data	data
Mark> description
Mark>        .
Mark>        .
Mark> (variable length)
Mark>        .
Mark> 						<---blank line
Mark> xxx-00001489	data	data	data
Mark> description
Mark>        .
Mark>        .
Mark> (variable length)
Mark>        .
Mark> 						<---blank line
Mark> xxx-00001326	data	data	data

Mark> When I find a line in the large data file that starts
Mark> with "xxx-0000", I want to open a file named "xxx-0000<number>",
Mark> like "xxx-00001489", and write every line, including
Mark> the current one, into it.  When I see another "xxx-0000",
Mark> I want to close the file, open a new file named for the new id 
Mark> string, and continue writing.  At the end of the large data
Mark> file, close all files and exit.


Mark,
   Here's an awk solution.
------------------------split.awk-----------------------

# note: needs the new awk (nawk) -- old awk has no close()
BEGIN { pcFile = "/dev/null" }
/^xxx-[0-9]+/ { close(pcFile); pcFile = $1 }	# new ID line: switch output files
{ print $0 >> pcFile }				# copy every line into the current file

--------------------------------------------------------

Run it with:
awk -f split.awk datafile

Hope this helps
--
Timothy R. Ouellette
NCR Microelectronics                   Tim.Ouellette@FtCollins.ncr.com
Ft. Collins, CO.        	       uunet!ncrlnk!ncr-mpd!bach!timo
             ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
"If all the world is a stage, I want to run the trap door" -- P. Beaty

rbr@bonnie.ATT.COM (4197,ATTT) (09/12/90)

In article <1990Sep11.134238.20218@dg-rtp.dg.com> monroe@dg-rtp.dg.com (Mark A Monroe) writes:
>I want to rip a large file into pieces, naming new files according
>to an ID string in the large file.  For example, the large file contains
>records that look like this:
>
>xxx-00001239	data	data	data
>description
>       .
>       .
>(variable length)
>       .
>						<---blank line
>xxx-00001489	data	data	data
>description
>       .
>       .
>(variable length)
>       .
>						<---blank line
>xxx-00001326	data	data	data
>
>When I find a line in the large data file that starts
>with "xxx-0000", I want to open a file named "xxx-0000<number>",
>like "xxx-00001489", and write every line, including
>the current one, into it.  When I see another "xxx-0000",
>I want to close the file, open a new file named for the new id 
>string, and continue writing.  At the end of the large data
>file, close all files and exit.
>
>Any suggestions?  

Use context split "csplit(1)" to break up the file efficiently. Then
use head/cut/mv to rename the pieces.

csplit -k -f aaa <in-file-name> '/^xxx-0000/' '{99}'
rm aaa00
for FN in aaa*
do
	NFN=`head -1 $FN | cut -f1`
	mv $FN $NFN
done

Bob Rager

lwall@jpl-devvax.JPL.NASA.GOV (Larry Wall) (09/12/90)

In article <24879:Sep1202:43:4690@kramden.acf.nyu.edu> brnstnd@kramden.acf.nyu.edu (Dan Bernstein) writes:
: Larry:
: > Dan:
: > : Larry:
: > : > Dan:
: > : > : Larry:
: > : > : >     echo "$line"
:         [should be echo -n "$line$n" or echo -n "$line"; echo ]
:       [but that fails on \c machines ]
:     [preprocess with sed 's-\\-\\\\-g' or various others ]
: > But "hmph", yourself.  You still haven't posted "a solution that ALSO works
: > on all the \c machines."  Your -n solution doesn't work on a \c machine,
: > and your \c solution doesn't work on a -n machine.  At least it's symmetrical.
: 
: Oh, gimme a break. Your suggestion of switching to SVR4 doesn't work on
: any machine. :-)

True enough, but I wasn't suggesting that.  I was only asking you not
to suggest it.  I suggest we drop this one before someone else suggests
that we do so...

: It's trivial to select between the two scripts with an
: echo test. There: a perfectly portable sh solution to the problem that
: started this thread.

Ah, but as you know, I don't believe in fancy portability checks
in shell scripts...  :-)   :-)   :-)

The quibble still stands that echo isn't built into everyone's shell.

: > (By the way, what makes you think s/\\\\/\\\\\\\\/ is a solution?  It only
: > translates \\, not \c.)
: 
: Oh dear. Not paying attention to your quoting? sed s/\\\\/\\\\\\\\/ is
: translated by the shell into sed s/\\/\\\\/, which is translated by sed
: into s/\/\\/, which is what we want. \c becomes \\c, which is parsed
: (correctly) as backslash-c. What makes you think this isn't a solution?

Oops, you got me there.  It's not that I didn't see the non-quotes, it's
that I didn't see the "sed" on the front.  So I wasn't thinking of it as a
shell command.  (Odd, considering what newsgroup we're in.)  But it
just goes to show you the problems associated with remembering all
the multiple levels of interpretation that might happen.  At least in
my feeble brain.

We outlawed the "goto" statement where it reduced understanding.
Three or more backslashes in a row should be considered harmful.

Larry

skwu@boulder.Colorado.EDU (WU SHI-KUEI) (09/13/90)

The right tool for the job is NOT perl but 'csplit'.

les@chinet.chi.il.us (Leslie Mikesell) (09/13/90)

In article <1990Sep11.134238.20218@dg-rtp.dg.com> monroe@dg-rtp.dg.com (Mark A Monroe) writes:
>I want to rip a large file into pieces, naming new files according
>to an ID string in the large file.  For example, the large file contains
>records that look like this:

>xxx-00001239	data	data	data
>description
>       .
>       .
>(variable length)
>       .
>						<---blank line
>xxx-00001489	data	data	data

The perl suggestions are the best if you have perl, and it can be
done directly in awk if you have the new awk that can close files. 
I did this sort of thing years ago using:
 awk |sh
where the awk program generates a stream something like:
 cat > filename1 <<\!EOF
 data
 !EOF
 cat > filename2 <<\!EOF
 data
 !EOF

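A sketch of such a generator (assuming old awk, with bigdatafile
standing in for the input, and assuming no data line is literally
"!EOF"):

awk '
/^xxx-0000/ { if (out != "") print "!EOF"
              out = $1
              print "cat > " out " <<\\!EOF" }
{ print }
END { if (out != "") print "!EOF" }
' bigdatafile | sh

Since the !EOF delimiter is quoted, sh does no expansion inside the
here-documents, so backslashes and dollar signs in the data pass
through untouched.
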
Les Mikesell
les@chinet.chi.il.us

lwall@jpl-devvax.JPL.NASA.GOV (Larry Wall) (09/13/90)

In article <26116@boulder.Colorado.EDU> skwu@spot.Colorado.EDU.Colorado.EDU (WU SHI-KUEI) writes:
: The right tool for the job is NOT perl but 'csplit'.

"Those words fall too easily from your lips."  --Gandalf

Let us attempt to distinguish fact from dogma.

    1)  As far as I can tell, csplit is AT&T proprietary.  I certainly
	don't have it on all my machines, and don't know offhand where
	I'd find the source for it.  The person we were advising may
	well not have it on his machine.  You should at least say "If
	you have csplit..."

    2)	The man page for csplit (in the AT&T universe of a Pyramid, anyway)
	indicates that you can have a maximum of 99 output files.  The
	application in question could easily have more than that, judging
	by how it was specified.  A general tool should not have
	such limitations.

    3)	csplit won't name the files in the way specified--you'd have to
	follow it up with a loopful of mv commands, one process per file.
	And in the naive implementation, you'd have a sed or awk for each
	file to extract out the filename to hand to mv.

    4)	csplit can't recognize patterns across newlines (not that this
	job required that, but a general tool shouldn't have such
	limitations.)

    5)	csplit can get confused on lines longer than 255 chars.  It can't
	handle embedded nulls.  A general tool should not have such
	limitations.

    6)	Even if I did manage to find a freely available source for csplit,
	I'd have to worry about recompiling it on all my different
	architectures.  That would be okay (after all, I have to do that
	with Perl too), but I have to do it for 50 blue jillion other little
	"must have" tools too.  I'd much rather compile Perl once on
	each architecture, rewrite csplit in Perl, throw it into my
	/u/scripts directory that's mounted everywhere, and never worry about
	recompiling csplit again.

So it's not quite so simple as all that.  You can chop down a tree with
a hatchet, but sometimes you want an industrial strength Swiss Army Chainsaw.
And sometimes not.  There's more than one way to do it.

Larry

rbp@investor.pgh.pa.us (Bob Peirce #305) (09/13/90)

In article <9466@jpl-devvax.JPL.NASA.GOV> lwall@jpl-devvax.JPL.NASA.GOV (Larry Wall) writes:
>In article <1990Sep11.134238.20218@dg-rtp.dg.com> monroe@dg-rtp.dg.com (Mark A Monroe) writes:
>: I want to rip a large file into pieces, naming new files according
>: to an ID string in the large file.  For example, the large file contains
>: records that look like this:
>: 
>: Any suggestions?  
>
>In standard shell+awk+sed it's a bit hard because you run out of file
>descriptors.
In nawk you can close the old file before you open the new.
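
For instance, the split.awk posted earlier boils down to this under nawk
(a sketch; largefile stands in for the input, and the ID is assumed to
be the first whitespace-delimited field):

nawk '/^xxx-0000/ { if (out != "") close(out); out = $1 }
      out != "" { print >> out }' largefile
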
-- 
Bob Peirce, Pittsburgh, PA				  412-471-5320
...!uunet!pitt!investor!rbp			rbp@investor.pgh.pa.us