[comp.unix.questions] file too large

beaulieu@netcom.UUCP (Bob Beaulieu) (09/16/89)

I have a text file that is very large (26,000+ lines) and would like
to break it down to 5-6 smaller files. Is there an easy way to handle
this? I have tried vi, but it seems to hold 5000 lines in its buffer.
The same goes for ed and ex.

Thanks for any help.
-- 
Bob Beaulieu
277-b Tyrella Avenue
Mountain View, CA 94043
(415) 967-4678

cpcahil@virtech.UUCP (Conor P. Cahill) (09/16/89)

In article <2388@netcom.UUCP>, beaulieu@netcom.UUCP (Bob Beaulieu) writes:
> I have a text file that is very large (26,000+ lines) and would like
> to break it down to 5-6 smaller files. Is there an easy way to handle
> this?

Try split(1), which splits a file into segments of a given number of
lines.  If you want some form of logical split of the data, use
split(1) to break the file into manageable parts and then piece the
parts you want together.
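
A minimal sketch of that split-then-reassemble approach.  The names
big.file, part. and first-10000.txt and the 5000-line chunk size are
placeholders chosen for illustration; -l is the POSIX spelling of the
line-count option, and the sample input is generated only so that the
example runs on its own.

```shell
# Generate a 26,000-line stand-in for the real file (illustration only).
awk 'BEGIN { for (i = 1; i <= 26000; i++) print "line " i }' > big.file

# Break it into 5000-line pieces named part.aa, part.ab, ...
split -l 5000 big.file part.

# Piece together just the parts you want, e.g. the first two.
cat part.aa part.ab > first-10000.txt

# Show the line count of every piece and of the reassembled file.
wc -l part.* first-10000.txt
```

Concatenating all the pieces in name order gives back the original
file, so nothing is lost by splitting first and deciding later.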
-- 
+-----------------------------------------------------------------------+
| Conor P. Cahill     uunet!virtech!cpcahil             703-430-9247    |
| Virtual Technologies Inc.,    P. O. Box 876,   Sterling, VA 22170     |
+-----------------------------------------------------------------------+

ok@cs.mu.oz.au (Richard O'Keefe) (09/17/89)

In article <2388@netcom.UUCP>, beaulieu@netcom.UUCP (Bob Beaulieu) writes:
> I have a text file that is very large (26,000+ lines) and would like
> to break it down to 5-6 smaller files.

If you have split, it may do what you want:
	split -5000 foobaz
where foobaz contains 26,123 lines, will create
	xaa	# lines      1- 5,000
	xab	# lines  5,001-10,000
	xac	# lines 10,001-15,000
	xad	# lines 15,001-20,000
	xae	# lines 20,001-25,000
	xaf	# lines 25,001-26,123
If you want 'zabbo' used as the prefix instead of 'x', say
	split -5000 foobaz zabbo
and you'll get zabbo{aa,ab,ac,ad,ae,af} produced instead.
If you haven't got split, I can mail a version which is rather sexier.

Of course, you could always do this with 'awk'; use
	awk -f split-5000-by-6.awk foobaz
where the file split-5000-by-6.awk contains these lines:
	    1 <= NR && NR <=  5000 { print $0 > "xaa" }
	 5001 <= NR && NR <= 10000 { print $0 > "xab" }
	10001 <= NR && NR <= 15000 { print $0 > "xac" }
	15001 <= NR && NR <= 20000 { print $0 > "xad" }
	20001 <= NR && NR <= 25000 { print $0 > "xae" }
	25001 <= NR && NR <= 30000 { print $0 > "xaf" }
There Is Always Another Way...
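
The six rules can also be folded into one by computing the output name
from NR.  This is only a sketch: it assumes a POSIX awk (for -v), and
the chunk size n, the prefix "chunk." and the numeric suffixes are
choices made here for illustration; they do not reproduce split's
xaa, xab naming.

```shell
# Stand-in input with 26,123 lines (illustration only).
awk 'BEGIN { for (i = 1; i <= 26123; i++) print "line " i }' > foobaz

# One rule instead of six: line NR belongs to chunk number (NR-1)/n.
awk -v n=5000 -v prefix=chunk. '
    (NR - 1) % n == 0 {                 # first line of a new chunk
        if (out != "") close(out)       # keep only one output file open
        out = sprintf("%s%02d", prefix, (NR - 1) / n)
    }
    { print > out }
' foobaz
```

Closing each file before opening the next keeps the number of open
files at one, which matters on an awk with a small limit on open
output streams.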

fischer@iesd.auc.dk (Lars P. Fischer) (09/17/89)

In article <2388@netcom.UUCP> beaulieu@netcom.UUCP (Bob Beaulieu) writes:
>I have a text file that is very large (26,000+ lines) and would like
>to break it down to 5-6 smaller files. Is there an easy way to handle
>this? I have tried vi but, it seems to hold 5000 lines in its buffer.
>The same goes for ed and ex.

Try emacs(1). Handles files with up to 2^31 characters.

/Lars
--
Copyright 1989 Lars Fischer; you can redistribute only if your recipients can.
Lars Fischer,  fischer@iesd.auc.dk, {...}!mcvax!iesd!fischer
Department of Computer Science, University of Aalborg, DENMARK.

Our audience is programmers, because the UNIX environment was
designed fundamentally for programming.
			-- Kernighan & Pike

max@lgc.UUCP (Max Heffler @ Landmark Graphics) (09/18/89)

In article <2121@munnari.oz.au>, ok@cs.mu.oz.au (Richard O'Keefe) writes:

> If you have split, it may do what you want:

> Of course you could always do this with 'awk', use
> 	awk -f split-5000-by-6.awk foobaz

> There Is Always Another Way...

I forgot about split and just used dd...
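
The post does not say how dd was invoked.  One plausible way, sketched
here under that caveat, is to copy fixed-size blocks by offset; the
64 KB block size and the names dd-input and dd-part.N are assumptions
made for illustration.

```shell
# Stand-in input (illustration only).
awk 'BEGIN { for (i = 1; i <= 26123; i++) print "line " i }' > dd-input

# Cut the file into 64 KB pieces by byte offset: skip= is the number of
# input blocks of size bs to pass over, count= the number to copy.
bs=65536
size=$(( $(wc -c < dd-input) ))
i=0
while [ $((i * bs)) -lt "$size" ]; do
    dd if=dd-input of="dd-part.$i" bs="$bs" skip="$i" count=1 2>/dev/null
    i=$((i + 1))
done

# The pieces concatenate back to the original (fewer than ten pieces
# here, so the glob sorts them in the right order).
cat dd-part.* | cmp - dd-input
```

Unlike split, dd knows nothing about lines, so the cuts fall at
arbitrary byte positions and usually land in the middle of a line.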

-- 
Max Heffler                     uucp: ..!uunet!lgc!max
Landmark Graphics Corp.         phone: (713) 579-4751
333 Cypress Run, Suite 100
Houston, Texas  77094

meissner@tiktok.dg.com (Michael Meissner) (09/19/89)

In article <FISCHER.89Sep17141429@rosser.iesd.auc.dk> fischer@iesd.auc.dk (Lars P. Fischer) writes:
| In article <2388@netcom.UUCP> beaulieu@netcom.UUCP (Bob Beaulieu) writes:
| >I have a text file that is very large (26,000+ lines) and would like
| >to break it down to 5-6 smaller files. Is there an easy way to handle
| >this? I have tried vi but, it seems to hold 5000 lines in its buffer.
| >The same goes for ed and ex.
| 
| Try emacs(1). Handles files with up to 2^31 characters.

That really depends on the emacs implementation.  GNU emacs, for
example, requires that all text, global data, and buffer space fit
within 2^24 bytes.  This is because the upper 8 bits are used to
encode the type and are also used for garbage collection.
--
Michael Meissner, Data General.
Uucp:		...!mcnc!rti!xyzzy!meissner		If compiles were much
Internet:	meissner@dg-rtp.DG.COM			faster, when would we
Old Internet:	meissner%dg-rtp.DG.COM@relay.cs.net	have time for netnews?

stein-c@acsu.Buffalo.EDU (Craig Steinberger) (09/19/89)

In article <2388@netcom.UUCP> beaulieu@netcom.UUCP (Bob Beaulieu) writes:
 >I have a text file that is very large (26,000+ lines) and would like
 >to break it down to 5-6 smaller files. Is there an easy way to handle
 >this? I have tried vi but, it seems to hold 5000 lines in its buffer.
 >The same goes for ed and ex.

There is a program called csplit that should do the trick.

guy@auspex.auspex.com (Guy Harris) (09/22/89)

 > >I have a text file that is very large (26,000+ lines) and would like
 > >to break it down to 5-6 smaller files. Is there an easy way to handle
 > >this? I have tried vi but, it seems to hold 5000 lines in its buffer.
 > >The same goes for ed and ex.
 >
 >There is a program called csplit that should do the trick.

There is a program called "csplit" in some, but not all, versions of
UNIX that might do the trick; it splits based on "context" (which is
presumably what the "c" in "csplit" stands for).  From the SunOS 4.0 man
page:

DESCRIPTION
     csplit reads the file whose name is filename  and  separates
     it  into  n+1  sections,  defined by the arguments argument1
     through argumentn.  If the filename argument is a  `-',  the
     standard  input is used.  By default the sections are placed
     in files named xx00 through xxn.  n may not be greater  than
     99.   These  sections  receive the following portions of the
     file:

     xx00:   From the start of filename up to (but not including)
             the  line  indicated by argument1 (see OPTIONS below
             for an explanation of these arguments.)
     xx01:   From the line indicated by argument1 up to the  line
             indicated by argument2.
     xxn:    From the line referenced by argumentn to the end  of
             filename.

However, as noted, it is not present in all versions of UNIX; it
doesn't come with 4.xBSD, for instance.  "split", which splits based on
line count, has been in every version of UNIX that AT&T has shipped and
is also in 4.xBSD, so it is more likely to be present on any given
system.
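
Where csplit is available, a context split might look like the sketch
below.  The input file book.txt, its "Chapter" headings, and the
prefixes chap. and half. are invented for illustration; -f (output
prefix) and -s (suppress the byte counts) are standard csplit options.

```shell
# Stand-in input: three "chapters" of five lines each (illustration only).
awk 'BEGIN {
    for (c = 1; c <= 3; c++) {
        print "Chapter " c
        for (i = 1; i <= 4; i++) print "text " c "." i
    }
}' > book.txt

# Split before the lines matching /^Chapter 2/ and /^Chapter 3/,
# giving chap.00, chap.01 and chap.02.
csplit -s -f chap. book.txt '/^Chapter 2/' '/^Chapter 3/'

# A bare line number works as an argument too: split before line 6,
# giving half.00 (lines 1-5) and half.01 (the rest).
csplit -s -f half. book.txt 6
```

Each argument marks where the next section starts, so two arguments
produce three files, matching the n+1 sections in the man page.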

davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr) (09/22/89)

You can use sed to break it if you don't have any fancy tools.
	sed -n "1,1000p" big.file >part.1
Obviously you will want to pick the breakpoints by content.
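
To get every part rather than just the first, the same sed command can
be run in a loop.  This is a sketch with fixed 5000-line boundaries;
the names sed-input and sed-part.N and the chunk size are assumptions
for illustration, and breakpoints chosen by content would simply
replace the computed start and end values.

```shell
# Stand-in input (illustration only).
awk 'BEGIN { for (i = 1; i <= 26123; i++) print "line " i }' > sed-input

# Print each 5000-line range to its own file.  The "q" at the end of a
# range stops sed from reading the rest of the file.
n=5000
total=$(( $(wc -l < sed-input) ))
start=1
part=1
while [ "$start" -le "$total" ]; do
    end=$((start + n - 1))
    sed -n "${start},${end}p;${end}q" sed-input > "sed-part.$part"
    start=$((end + 1))
    part=$((part + 1))
done
```

Each pass rereads the file from the top, so this is slower than split
on a large file, but it needs nothing beyond sed and the shell.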

-- 
bill davidsen	(davidsen@crdos1.crd.GE.COM -or- uunet!crdgw1!crdos1!davidsen)
"The world is filled with fools. They blindly follow their so-called
'reason' in the face of the church and common sense. Any fool can see
that the world is flat!" - anon

fischer@iesd.auc.dk (Lars P. Fischer) (09/25/89)

In article <1226@xyzzy.UUCP> meissner@tiktok.dg.com (Michael Meissner) writes:
>| Try emacs(1). Handles files with up to 2^31 characters.
>
>That really depends on the emacs implementation.  GNU emacs for
>example, requires that all text, global data, and buffer space fit
>within 2^24 bytes.  This is because the upper 8 bits are used to
>encode the type and are also used for garbage collection.

OK, so I blew it. Sorry. If you need to edit files with more than 200k
lines (80 chars/line), don't use emacs. In all other cases, do :-).

(Only 16M chars per session? You mean I can't say "emacs /dev/xy0c"?
Does anybody out there have a *real* editor?? :-).
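
The figures are easy to check with shell arithmetic.  This is a rough
sketch: it takes the 2^24-byte limit and the 80 chars/line estimate at
face value and ignores both the newline on each line and the space
emacs needs for its own data.

```shell
# 2^24 bytes, per the limit quoted above.
limit=$((1 << 24))
echo "$limit"            # 16777216 bytes, i.e. 16M

# Whole 80-character lines that fit in that many bytes.
echo $((limit / 80))     # 209715, a little over 200k lines
```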

/Lars
--
Copyright 1989 Lars Fischer; you can redistribute only if your recipients can.
Lars Fischer,  fischer@iesd.auc.dk, {...}!mcvax!iesd!fischer
Department of Computer Science, University of Aalborg, DENMARK.

Our audience is programmers, because the UNIX environment was
designed fundamentally for programming.
			-- Kernighan & Pike