monroe@dg-rtp.dg.com (Mark A Monroe) (09/11/90)
I want to rip a large file into pieces, naming new files according
to an ID string in the large file. For example, the large file contains
records that look like this:

xxx-00001239 data data data
description
 .
 .
(variable length)
 .
               <---blank line
xxx-00001489 data data data
description
 .
 .
(variable length)
 .
               <---blank line
xxx-00001326 data data data

When I find a line in the large data file that starts
with "xxx-0000", I want to open a file named "xxx-0000<number>",
like "xxx-00001489", and write every line, including
the current one, into it. When I see another "xxx-0000",
I want to close the file, open a new file named for the new id
string, and continue writing. At the end of the large data
file, close all files and exit.

Any suggestions?
--
-------------------------- END OF MAIN MESSAGE -----------------------------
Mark A. Monroe               UNIX Release Integration
Data General Corp.           Internet: monroe@dg-rtp.dg.com
Research Triangle Park, NC   UUCP: {world}!mcnc!rti!dg-rtp!monroe
                             Phone: (919)248-6234
----------------------------------------------------------------------------
lwall@jpl-devvax.JPL.NASA.GOV (Larry Wall) (09/12/90)
In article <1990Sep11.134238.20218@dg-rtp.dg.com> monroe@dg-rtp.dg.com (Mark A Monroe) writes:
: I want to rip a large file into pieces, naming new files according
: to an ID string in the large file. For example, the large file contains
: records that look like this:
:
: xxx-00001239 data data data
: description
: .
: .
: (variable length)
: .
: <---blank line
: xxx-00001489 data data data
: description
: .
: .
: (variable length)
: .
: <---blank line
: xxx-00001326 data data data
:
: When I find a line in the large data file that starts
: with "xxx-0000", I want to open a file named "xxx-0000<number>",
: like "xxx-00001489", and write every line, including
: the current one, into it. When I see another "xxx-0000",
: I want to close the file, open a new file named for the new id
: string, and continue writing. At the end of the large data
: file, close all files and exit.
:
: Any suggestions?
In standard shell+awk+sed it's a bit hard because you run out of file
descriptors. You could do something like run sed over your file
to turn it into a giant script of here-is commands, but that'll be
real slow.
You could do something like this:
while read line; do
    case "$line" in
    xxx-0000*) set $line; exec >$1;;
    esac
    echo "$line"
done
But how well that works depends on the vagaries of your echo command,
such as what it does with lines starting with '-', or containing '\c'.
You don't really want to do this on a machine where echo isn't a builtin.
If you have Perl, your fastest solution will be to say something like
perl -pe 'open(STDOUT,">$&") if /^xxx-0000\d+/' filename
Change > to >> if the keys aren't unique in your input file.
Larry Wall
lwall@jpl-devvax.jpl.nasa.gov
merlyn@iwarp.intel.com (Randal Schwartz) (09/12/90)
In article <1990Sep11.134238.20218@dg-rtp.dg.com>, monroe@dg-rtp (Mark A Monroe) writes:
| I want to rip a large file into pieces, naming new files according
| to an ID string in the large file. For example, the large file contains
| records that look like this:
|
| xxx-00001239 data data data
| description
| .
| .
| (variable length)
| .
|               <---blank line
| xxx-00001489 data data data
| description
| .
| .
| (variable length)
| .
|               <---blank line
| xxx-00001326 data data data
|
| When I find a line in the large data file that starts
| with "xxx-0000", I want to open a file named "xxx-0000<number>",
| like "xxx-00001489", and write every line, including
| the current one, into it. When I see another "xxx-0000",
| I want to close the file, open a new file named for the new id
| string, and continue writing. At the end of the large data
| file, close all files and exit.
|
| Any suggestions?

You didn't say "and I don't want it in Perl", so I'm considering this
solution fair game...

perl -pe 'open(STDOUT,">$1") if /^(xxx-\d+)/;' bigdatafile

Pretty durn simple. The right tool for the job. Yeah, you could do it
with an awk script feeding into a /bin/sh (or with a smarter awk), but
this is too easy.

Just another Perl hacker,
--
/=Randal L. Schwartz, Stonehenge Consulting Services (503)777-0095 ==========\
| on contract to Intel's iWarp project, Beaverton, Oregon, USA, Sol III      |
| merlyn@iwarp.intel.com ...!any-MX-mailer-like-uunet!iwarp.intel.com!merlyn |
\=Cute Quote: "Welcome to Portland, Oregon, home of the California Raisins!"=/
brnstnd@kramden.acf.nyu.edu (Dan Bernstein) (09/12/90)
In article <9466@jpl-devvax.JPL.NASA.GOV> lwall@jpl-devvax.JPL.NASA.GOV (Larry Wall) writes:
> echo "$line"

C'mon, Larry. You know that should be echo -n "$line$n" where $n has
been initialized to a newline. Or echo -n "$line"; echo.

---Dan
parag@shah.austin.ibm.com (Parag Shah) (09/12/90)
>I want to rip a large file into pieces, naming new files according
>to an ID string in the large file. For example, the large file contains
>records that look like this:
>xxx-00001239 data data data
>description

Seems like the "csplit" command may be able to help you. You may have
to rename the newly created files later, using the first line of each
to get the number id that you want.

Parag
lwall@jpl-devvax.JPL.NASA.GOV (Larry Wall) (09/12/90)
In article <22842:Sep1121:10:3390@kramden.acf.nyu.edu> brnstnd@kramden.acf.nyu.edu (Dan Bernstein) writes:
: In article <9466@jpl-devvax.JPL.NASA.GOV> lwall@jpl-devvax.JPL.NASA.GOV (Larry Wall) writes:
: > echo "$line"
:
: C'mon, Larry. You know that should be echo -n "$line$n" where $n has
: been initialized to a newline. Or echo -n "$line"; echo.

Don't teach your grandmother to suck eggs. Let's see you come up with
a solution that also works on all the \c machines.

Apart from "Install SVR4". :-)

Larry
brnstnd@kramden.acf.nyu.edu (Dan Bernstein) (09/12/90)
In article <9469@jpl-devvax.JPL.NASA.GOV> lwall@jpl-devvax.JPL.NASA.GOV (Larry Wall) writes:
> In article <22842:Sep1121:10:3390@kramden.acf.nyu.edu> brnstnd@kramden.acf.nyu.edu (Dan Bernstein) writes:
> : In article <9466@jpl-devvax.JPL.NASA.GOV> lwall@jpl-devvax.JPL.NASA.GOV (Larry Wall) writes:
> : > echo "$line"
> : C'mon, Larry. You know that should be echo -n "$line$n" where $n has
> : been initialized to a newline. Or echo -n "$line"; echo.
> Don't teach your grandmother to suck eggs. Let's see you come up with
> a solution that also works on all the \c machines.
> Apart from "Install SVR4". :-)

Hmph. Just preprocess the file with sed 's-\\-\\\\-g'. Or, if you really
like backslashes, sed s/\\\\/\\\\\\\\\/g. Or, if you're a masochist, sed
s/\\(\\\\\\)/\\\\\\1/g? sh -c "sed s/\\\\(\\\\\\\\\\\\)/\\\\\\\\\\\\1/g"?
There, that's four solutions, apart from ``install BSD''. :-)

Oh, sorry, sed is heresy for a Perl lord, isn't it? :-)

---Dan
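[The effect of the simplest of those preprocessors can be seen directly;
the sample string here is invented for illustration:]

```shell
# What sed 's-\\-\\\\-g' does: double every backslash, so an echo that
# interprets \c later sees \\c and prints it literally instead of
# swallowing the rest of the line.  (Sample input is made up.)
printf 'a\\cb\n' | sed 's-\\-\\\\-g'
# prints: a\\cb
```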
lwall@jpl-devvax.JPL.NASA.GOV (Larry Wall) (09/12/90)
In article <23529:Sep1122:52:3790@kramden.acf.nyu.edu> brnstnd@kramden.acf.nyu.edu (Dan Bernstein) writes:
: In article <9469@jpl-devvax.JPL.NASA.GOV> lwall@jpl-devvax.JPL.NASA.GOV (Larry Wall) writes:
: > In article <22842:Sep1121:10:3390@kramden.acf.nyu.edu> brnstnd@kramden.acf.nyu.edu (Dan Bernstein) writes:
: > : In article <9466@jpl-devvax.JPL.NASA.GOV> lwall@jpl-devvax.JPL.NASA.GOV (Larry Wall) writes:
: > : > echo "$line"
: > : C'mon, Larry. You know that should be echo -n "$line$n" where $n has
: > : been initialized to a newline. Or echo -n "$line"; echo.
: > Don't teach your grandmother to suck eggs. Let's see you come up with
: > a solution that also works on all the \c machines.
: > Apart from "Install SVR4". :-)
:
: Hmph. Just preprocess the file with sed 's-\\-\\\\-g'. Or, if you really
: like backslashes, sed s/\\\\/\\\\\\\\\/g. Or, if you're a masochist, sed
: s/\\(\\\\\\)/\\\\\\1/g? sh -c "sed s/\\\\(\\\\\\\\\\\\)/\\\\\\\\\\\\1/g"?
: There, that's four solutions, apart from ``install BSD''. :-)
:
: Oh, sorry, sed is heresy for a Perl lord, isn't it? :-)

No, it isn't. The official Perl Slogan is: There's More Than One Way
To Do It. There are still a few things sed is good for... :-) And
you'll note that I did mention sed in my original message.

But "hmph", yourself. You still haven't posted "a solution that ALSO works
on all the \c machines." Your -n solution doesn't work on a \c machine,
and your \c solution doesn't work on a -n machine. At least it's symmetrical.

(By the way, what makes you think s/\\\\/\\\\\\\\/ is a solution? It only
translates \\, not \c.)

Please don't call me a "Perl lord". I'm merely trying to follow the
ancient wisdom that "he who wants to be the greatest among you must
become the servant of all." The leader of this particular jihad doesn't
believe in holy war. Pity.

Larry
brnstnd@kramden.acf.nyu.edu (Dan Bernstein) (09/12/90)
Larry:
> Dan:
> : Larry:
> : > Dan:
> : > : Larry:
> : > : > echo "$line"
[should be echo -n "$line$n" or echo -n "$line"; echo ]
[but that fails on \c machines ]
[preprocess with sed 's-\\-\\\\-g' or various others ]
> But "hmph", yourself. You still haven't posted "a solution that ALSO works
> on all the \c machines." Your -n solution doesn't work on a \c machine,
> and your \c solution doesn't work on a -n machine. At least it's symmetrical.

Oh, gimme a break. Your suggestion of switching to SVR4 doesn't work on
any machine. :-) It's trivial to select between the two scripts with an
echo test. There: a perfectly portable sh solution to the problem that
started this thread.

> (By the way, what makes you think s/\\\\/\\\\\\\\/ is a solution? It only
> translates \\, not \c.)

Oh dear. Not paying attention to your quoting? sed s/\\\\/\\\\\\\\/ is
translated by the shell into sed s/\\/\\\\/, which is translated by sed
into s/\/\\/, which is what we want. \c becomes \\c, which is parsed
(correctly) as backslash-c. What makes you think this isn't a solution?

---Dan
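[The echo test Dan alludes to is the classic configure-style probe; a
sketch, with variable names of my own choosing:]

```shell
# Probe which convention this echo follows, then set $n/$c so that
#   echo $n "...$c"
# suppresses the trailing newline either way.  (Sketch only; variable
# names are illustrative.)
if [ "`echo -n foo`" = "-n foo" ]; then
    n=''  c='\c'     # SysV echo: -n not special, \c suppresses newline
else
    n='-n' c=''      # BSD echo: -n suppresses the newline
fi
echo $n "no trailing newline here$c"
```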
Tim.Ouellette@FtCollins.NCR.COM (Tim.Ouellette) (09/12/90)
>>>>> On 11 Sep 90 13:42:38 GMT, monroe@dg-rtp.dg.com (Mark A Monroe) said:
Mark> I want to rip a large file into pieces, naming new files according
Mark> to an ID string in the large file. For example, the large file contains
Mark> records that look like this:
Mark> xxx-00001239 data data data
Mark> description
Mark> .
Mark> .
Mark> (variable length)
Mark> .
Mark> <---blank line
Mark> xxx-00001489 data data data
Mark> description
Mark> .
Mark> .
Mark> (variable length)
Mark> .
Mark> <---blank line
Mark> xxx-00001326 data data data
Mark> When I find a line in the large data file that starts
Mark> with "xxx-0000", I want to open a file named "xxx-0000<number>",
Mark> like "xxx-00001489", and write every line, including
Mark> the current one, into it. When I see another "xxx-0000",
Mark> I want to close the file, open a new file named for the new id
Mark> string, and continue writing. At the end of the large data
Mark> file, close all files and exit.
Mark,
Here's an awk solution.
------------------------split.awk-----------------------
BEGIN{pcFile="/dev/null";}
/^xxx-[0-9]+/{close(pcFile);pcFile = $1;}
{print $0 >> pcFile;}
--------------------------------------------------------
execute it by
awk -f split.awk datafile
Hope this helps
--
Timothy R. Ouellette
NCR Microelectronics Tim.Ouellette@FtCollins.ncr.com
Ft. Collins, CO. uunet!ncrlnk!ncr-mpd!bach!timo
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
"If all the world is a stage, I want to run the trap door" -- P. Beaty
rbr@bonnie.ATT.COM (4197,ATTT) (09/12/90)
In article <1990Sep11.134238.20218@dg-rtp.dg.com> monroe@dg-rtp.dg.com (Mark A Monroe) writes:
>I want to rip a large file into pieces, naming new files according
>to an ID string in the large file. For example, the large file contains
>records that look like this:
>
>xxx-00001239 data data data
>description
> .
> .
>(variable length)
> .
>               <---blank line
>xxx-00001489 data data data
>description
> .
> .
>(variable length)
> .
>               <---blank line
>xxx-00001326 data data data
>
>When I find a line in the large data file that starts
>with "xxx-0000", I want to open a file named "xxx-0000<number>",
>like "xxx-00001489", and write every line, including
>the current one, into it. When I see another "xxx-0000",
>I want to close the file, open a new file named for the new id
>string, and continue writing. At the end of the large data
>file, close all files and exit.
>
>Any suggestions?

Use context split "csplit(1)" to break up the file efficiently. Then
use head/cut/mv to rename the pieces.

csplit -f aaa in-file-name '/^xxx-0000/' '{99}'
rm aaa00
for FN in `ls aaa*`
do
	NFN=`head -1 $FN | cut -d' ' -f1`
	mv $FN $NFN
done

Bob Rager
lwall@jpl-devvax.JPL.NASA.GOV (Larry Wall) (09/12/90)
In article <24879:Sep1202:43:4690@kramden.acf.nyu.edu> brnstnd@kramden.acf.nyu.edu (Dan Bernstein) writes:
: Larry:
: > Dan:
: > : Larry:
: > : > Dan:
: > : > : Larry:
: > : > : > echo "$line"
: [should be echo -n "$line$n" or echo -n "$line"; echo ]
: [but that fails on \c machines ]
: [preprocess with sed 's-\\-\\\\-g' or various others ]
: > But "hmph", yourself. You still haven't posted "a solution that ALSO works
: > on all the \c machines." Your -n solution doesn't work on a \c machine,
: > and your \c solution doesn't work on a -n machine. At least it's symmetrical.
:
: Oh, gimme a break. Your suggestion of switching to SVR4 doesn't work on
: any machine. :-)
True enough, but I wasn't suggesting that. I was only asking you not
to suggest it. I suggest we drop this one before someone else suggests
that we do so...
: It's trivial to select between the two scripts with an
: echo test. There: a perfectly portable sh solution to the problem that
: started this thread.
Ah, but as you know, I don't believe in fancy portability checks
in shell scripts... :-) :-) :-)
The quibble still stands that echo isn't built into everyone's shell.
: > (By the way, what makes you think s/\\\\/\\\\\\\\/ is a solution? It only
: > translates \\, not \c.)
:
: Oh dear. Not paying attention to your quoting? sed s/\\\\/\\\\\\\\/ is
: translated by the shell into sed s/\\/\\\\/, which is translated by sed
: into s/\/\\/, which is what we want. \c becomes \\c, which is parsed
: (correctly) as backslash-c. What makes you think this isn't a solution?
Oops, you got me there. It's not that I didn't see the non-quotes, it's
that I didn't see the "sed" on the front. So I wasn't thinking of it as a
shell command. (Odd, considering what newsgroup we're in.) But it
just goes to show you the problems associated with remembering all
the multiple levels of interpretation that might happen. At least in
my feeble brain.
We outlawed the "goto" statement where it reduced understanding.
Three or more backslashes in a row should be considered harmful.
Larry
skwu@boulder.Colorado.EDU (WU SHI-KUEI) (09/13/90)
The right tool for the job is NOT perl but 'csplit'.
les@chinet.chi.il.us (Leslie Mikesell) (09/13/90)
In article <1990Sep11.134238.20218@dg-rtp.dg.com> monroe@dg-rtp.dg.com (Mark A Monroe) writes:
>I want to rip a large file into pieces, naming new files according
>to an ID string in the large file. For example, the large file contains
>records that look like this:
>xxx-00001239 data data data
>description
> .
> .
>(variable length)
> .
>               <---blank line
>xxx-00001489 data data data

The perl suggestions are the best if you have perl, and it can be done
directly in awk if you have the new awk that can close files. I did
this sort of thing years ago using:
   awk | sh
where the awk program generates a stream something like:

cat > filename1 <<\!EOF
data
!EOF
cat > filename2 <<\!EOF
data
!EOF

Les Mikesell
  les@chinet.chi.il.us
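[Les describes the generator but doesn't show it; here is a guess at
what such an awk program might look like, applied to the original
problem. This is a sketch, not his actual script, and "bigdatafile"
and the variable "started" are invented names:]

```shell
# Sketch of the awk|sh approach: awk emits a stream of cat-with-here-doc
# commands and sh executes them, so no more than one output file is ever
# open at a time -- sidestepping the file-descriptor limit entirely.
awk '
  /^xxx-0000/ {
    if (started) print "!EOF"          # terminate the previous here-doc
    print "cat > " $1 " <<\\!EOF"      # start one for the new id
    started = 1
  }
  started { print }                    # copy record lines into the stream
  END { if (started) print "!EOF" }    # close the last here-doc
' bigdatafile | sh
```

One caveat this shares with Les's original: any data line consisting
of exactly "!EOF" would terminate a here-doc early.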
lwall@jpl-devvax.JPL.NASA.GOV (Larry Wall) (09/13/90)
In article <26116@boulder.Colorado.EDU> skwu@boulder.Colorado.EDU (WU SHI-KUEI) writes:
: The right tool for the job is NOT perl but 'csplit'.
"Those words fall too easily from your lips." --Gandalf
Let us attempt to distinguish fact from dogma.
1) As far as I can tell, csplit is AT&T proprietary. I certainly
don't have it on all my machines, and don't know offhand where
I'd find the source for it. The person we were advising may
well not have it on his machine. You should at least say "If
you have csplit..."
2) The man page for csplit (in the AT&T universe of a Pyramid, anyway)
indicates that you can have a maximum of 99 output files. The
application in question could easily have more than that, judging
by how it was specified. A general tool should not have
such limitations.
3) csplit won't name the files in the way specified--you'd have to
follow it up with a loopful of mv commands, one process per file.
And in the naive implementation, you'd have a sed or awk for each
file to extract out the filename to hand to mv.
4) csplit can't recognize patterns across newlines (not that this
job required that, but a general tool shouldn't have such
limitations.)
5) csplit can get confused on lines longer than 255 chars. It can't
handle embedded nulls. A general tool should not have such
limitations.
6) Even if I did manage to find a freely available source for csplit,
I'd have to worry about recompiling it on all my different
architectures. That would be okay (after all, I have to do that
with Perl too), but I have to do it for 50 blue jillion other little
"must have" tools too. I'd much rather compile Perl once on
each architecture, rewrite csplit in Perl, throw it into my
/u/scripts directory that's mounted everywhere, and never worry about
recompiling csplit again.
So it's not quite so simple as all that. You can chop down a tree with
a hatchet, but sometimes you want an industrial strength Swiss Army Chainsaw.
And sometimes not. There's more than one way to do it.
Larry
rbp@investor.pgh.pa.us (Bob Peirce #305) (09/13/90)
In article <9466@jpl-devvax.JPL.NASA.GOV> lwall@jpl-devvax.JPL.NASA.GOV (Larry Wall) writes:
>In article <1990Sep11.134238.20218@dg-rtp.dg.com> monroe@dg-rtp.dg.com (Mark A Monroe) writes:
>: I want to rip a large file into pieces, naming new files according
>: to an ID string in the large file. For example, the large file contains
>: records that look like this:
>:
>: Any suggestions?
>
>In standard shell+awk+sed it's a bit hard because you run out of file
>descriptors.

In nawk you can close the old file before you open the new.
--
Bob Peirce, Pittsburgh, PA                    412-471-5320
...!uunet!pitt!investor!rbp         rbp@investor.pgh.pa.us