[comp.lang.perl] Splitting by character count in perl?

djo7613@milton.u.washington.edu (Dick O'Connor) (03/06/91)

In order to follow an ancient bitpath left over from the days of HASP and
card punch queues, I need to split a file of 150 character lines into a
file of lines 80 characters wide or less.  Back when our Cyber was active,
we did this with a simple Fortran program; now I've got Ultrix and even
perl (!) at my disposal, and this problem cries out for a filter, not a pgm.

Our old program copied the first 78 characters on a line to the output file
after prepending and appending the character 'A'.  Characters 79-150 were
written to line 2 of the output file with prepended and appended 'B'.  A
simple routine glued things back together at the other end.

Trouble is, I'm enough of a novice that I can't see the simple Unix way.
All of the utilities I've looked at would be a lot happier if I could
move entire lines or fields.  But this is a file of typical scientific
data; no tabs, blanks may or may not separate fields depending on the
width of the (right-justified) number in that field, and an overall
adherence to a pre-defined "format."

After searching the perl man page, I'm stuck.  I keep thinking split
can do what I need, but the examples don't make it clear.  I'd prefer a
perl solution to one using "other" utilities; any takers?  It's that or
I use f77, and we wouldn't want to do that, would we??  :)  Thanks!

"Moby" Dick O'Connor                         djo7613@u.washington.edu 
Washington Department of Fisheries           *I brake for salmonids*

merlyn@iwarp.intel.com (Randal L. Schwartz) (03/07/91)

In article <17806@milton.u.washington.edu>, djo7613@milton (Dick O'Connor) writes:
| Our old program copied the first 78 characters on a line to the output file
| after prepending and appending the character 'A'.  Characters 79-150 were
| written to line 2 of the output file with prepended and appended 'B'.  A
| simple routine glued things back together at the other end.

To split'em:

perl -pe 'chop; ($a,$b) = unpack("a78a*",$_); $_ = "A${a}A\nB${b}B\n";'

To join'em: (presuming alternating lines of A's and B's from above):

perl -pe 's/^A(.*)A\n$/$1/ || s/^B(.*)B\n$/$1\n/;'

If you can have an A line without a B line, you will need to maintain
state between lines.  That's an exercise for you.
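
One way that exercise might go, as a minimal sketch rather than a definitive answer: hold the most recent A half in a variable until a B line arrives, and flush it as a line of its own if another A line (or end of input) shows up first. The sub name join_ab and the pass-through of unrecognized lines are assumptions made for illustration; neither comes from the posts in this thread.

```perl
use strict;
use warnings;

# Rejoin A/B line pairs, tolerating an A line that has no B line after it.
# join_ab() is a hypothetical helper name chosen for this sketch.
sub join_ab {
    my @lines = @_;
    my @out;
    my $pending;                          # an A half still waiting for its B half
    for my $line (@lines) {
        chomp(my $l = $line);
        if ($l =~ /^A(.*)A$/) {
            my $half = $1;
            push @out, $pending if defined $pending;   # previous A had no B
            $pending = $half;
        } elsif ($l =~ /^B(.*)B$/) {
            my $half = $1;
            push @out, (defined $pending ? $pending : '') . $half;
            undef $pending;
        } else {
            # Assumption: lines with neither wrapper are passed through as-is.
            push @out, $pending if defined $pending;
            undef $pending;
            push @out, $l;
        }
    }
    push @out, $pending if defined $pending;           # trailing A at end of input
    return @out;
}

# Small demonstration: the middle A line has no B partner.
my @demo = ("AfirstA\n", "BsecondB\n", "AloneA\n", "AthirdA\n", "BfourthB\n");
print "$_\n" for join_ab(@demo);
```

To use it as a filter, the demonstration lines could be replaced with a loop that feeds <STDIN> to join_ab and prints the result.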

print "Just another Perl hacker,"
-- 
/=Randal L. Schwartz, Stonehenge Consulting Services (503)777-0095 ==========\
| on contract to Intel's iWarp project, Beaverton, Oregon, USA, Sol III      |
| merlyn@iwarp.intel.com ...!any-MX-mailer-like-uunet!iwarp.intel.com!merlyn |
\=Cute Quote: "Intel: putting the 'backward' in 'backward compatible'..."====/

eichin@athena.mit.edu (Mark W. Eichin) (03/07/91)

[My apologies for the tutorial style below; I'm writing this for the
reader that doesn't know perl at all, but needs to use it. I welcome
technical corrections publicly, and style comments privately...]
	pack/unpack does exactly what you want. The man page isn't all
that clear on this, though I think the Camel Book has examples which
make it clear... the pack string is almost exactly analogous to the
FORMAT statement in Fortran (or rather, FORTRAN, since I mean the
"classic" versions as opposed to the new standards effort) to the
extent that someone could probably write a translator with little
difficulty.
	As for your particular example, the one-liner:

perl -ne '@two=unpack("a78a*",$_); print "A",$two[0],"\nB",$two[1];'

should do it. Data follows. unpack is taking the current line ($_) and
unpacking it into a string of 78 chars and a string of "the rest" (*)
and leaving the results in an array called "two" (@two). Then it's
printing the A, the first element of @two ($two[0] - arrays start at
zero, like C, by default, though you can set a variable to adjust
that), then the newline and the B ("\nB"), and then the second element
of two (which *already* contains the trailing newline... $_ is the
*entire* line, and we never did a chop to split off the newline so it
is still there. Using "a78a72" would have also chopped off the
newline, as it is the 151st character...) The -n wraps a loop around
the whole thing, the -e indicates that we're putting the line right
here instead of off in a script.
	I hope this helps; I didn't really want to provide a naked
one-liner, thus the windy explanation. The *important* thing, of
course, is that running the above line, then feeding it the following
three lines of data (78 equals + 72 stars each):

==============================================================================************************************************************************
==============================================================================************************************************************************
==============================================================================************************************************************************

yields:

A==============================================================================
B************************************************************************
A==============================================================================
B************************************************************************
A==============================================================================
B************************************************************************

Hmmm. Double checking your note, you want the A and B *appended* as
well - Ok, fine, I'll leave the above because it makes a point about
newlines, and submit:

perl -ne '@two=unpack("a78a72",$_); print "A",$two[0],"A\nB",$two[1],"B\n";'

A==============================================================================A
B************************************************************************B
A==============================================================================A
B************************************************************************B
A==============================================================================A
B************************************************************************B
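
The point about newlines can be checked directly. The following is a small sketch, assuming a synthetic 150-column record built in the script itself; it only compares the field lengths that the two unpack templates used in this post produce.

```perl
use strict;
use warnings;

# A synthetic 150-column record plus its newline, as -n or -p would supply it.
my $line = ("=" x 78) . ("*" x 72) . "\n";

# "a*" takes everything that is left, so the newline stays in the second field.
my @rest = unpack("a78a*", $line);

# "a72" takes exactly 72 characters, so the newline (character 151) is dropped.
my @fixed = unpack("a78a72", $line);

printf "a78a*:  %d and %d characters\n", length($rest[0]),  length($rest[1]);
printf "a78a72: %d and %d characters\n", length($fixed[0]), length($fixed[1]);
```

The second field comes out one character longer with "a78a*", which is why the first one-liner needs no explicit "\n" after the B field and the second one does.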

Items for further exploration:
	a) the reassembly could be done with pack. 
	b) if an input line is shorter than 150 columns, the output lines
will be short too. I suspect the Fortran code had the same problem - and
that the data *doesn't* have that problem. See what pack("A78") does,
and note how it would solve that problem.
	c) There is a substr function, but you'd have to use it twice;
would that be slower? [probably, since it would still have to create
the temporary values - but it might be more memory efficient, though
not by enough to matter in this example.]
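
The three items above can be tried out with a short script. This is only a sketch under stated assumptions: the sub names (split_padded, join_halves, split_substr) and the sample record are made up for illustration, and padding the whole record to 150 columns before splitting is one possible reading of item (b), not necessarily what was intended.

```perl
use strict;
use warnings;

# (b) Space-pad a short record to 150 columns, then split it 78/72.
# pack "A150" pads with spaces (or truncates); unpack "a" keeps the padding.
sub split_padded {
    my ($rec) = @_;
    chomp $rec;
    my $wide = pack("A150", $rec);
    return unpack("a78a72", $wide);
}

# (a) Reassemble the two halves with pack instead of string concatenation.
sub join_halves {
    my ($head, $tail) = @_;
    return pack("A78A72", $head, $tail);
}

# (c) The same split with two substr calls.  Assumes the record is at
# least 78 characters long; shorter input makes substr warn.
sub split_substr {
    my ($rec) = @_;
    return (substr($rec, 0, 78), substr($rec, 78, 72));
}

my ($head, $tail) = split_padded("1234 5678\n");
printf "padded halves: %d and %d characters\n", length($head), length($tail);
printf "rejoined:      %d characters\n", length(join_halves($head, $tail));

my ($h2, $t2) = split_substr(("=" x 78) . ("*" x 72) . "\n");
printf "substr halves: %d and %d characters\n", length($h2), length($t2);
```

Whether the substr version is actually slower is left unmeasured here.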

Enjoy...
				_Mark_ <eichin@athena.mit.edu>
				MIT Student Information Processing Board
				Watchmaker Computing <eichin@watch.com>

marcl@ESD.3Com.COM (Marc Lavine) (03/08/91)

djo7613@milton.u.washington.edu (Dick O'Connor) writes:
>Our old program copied the first 78 characters on a line to the output file
>after prepending and appending the character 'A'.  Characters 79-150 were
>written to line 2 of the output file with prepended and appended 'B'.  A
>simple routine glued things back together at the other end.

eichin@athena.mit.edu (Mark W. Eichin) writes:
>	As for your particular example, the one-liner:
>perl -ne '@two=unpack("a78a72",$_); print "A",$two[0],"A\nB",$two[1],"B\n";'

I just started hacking with Perl last week (and think it's great -- thanks
for another wonderful tool, Larry), but I'm a long-time fan of regular
expressions (in the distant past, I used to edit files with "ex").  I
really like having Perl's "fancy" regular expressions available.  So,
here's a different solution to the problem using only regular
expressions (which should be quite fast):

To split the lines use:
	perl -pe 's/^(.{78})(.{72})$/A$1A\nB$2B/'

And to join them use:
	perl -pe 's/^A(.{78})A\n/$1/; s/^B(.{72})B$/$1/;'

(which came out very similar to Randal Schwartz's suggestion of:
	perl -pe 's/^A(.*)A\n$/$1/ || s/^B(.*)B\n$/$1\n/;'
)
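
A quick round-trip check of the regular-expression approach, as a hedged sketch: the sub names split_re and join_re and the test records are assumptions for illustration. It also shows one property worth knowing, namely that the split pattern only matches lines of exactly 150 characters, so any other line passes through unchanged.

```perl
use strict;
use warnings;

# Split one 150-character line into an A line and a B line.
sub split_re {
    my ($l) = @_;
    $l =~ s/^(.{78})(.{72})$/A$1A\nB$2B/;
    return $l;
}

# Undo the wrapping on a single line: an A line loses its newline so the
# following B line continues it; a B line keeps its newline.
sub join_re {
    my ($l) = @_;
    $l =~ s/^A(.{78})A\n/$1/;
    $l =~ s/^B(.{72})B$/$1/;
    return $l;
}

my $rec     = ("=" x 78) . ("*" x 72) . "\n";
my $split   = split_re($rec);
my $rebuilt = join '', map { join_re($_) } split /^/, $split;
print $rebuilt eq $rec ? "round trip ok\n" : "round trip failed\n";

# A line that is not exactly 150 characters wide is left alone.
my $short = "too short\n";
print split_re($short) eq $short ? "short line unchanged\n" : "short line modified\n";
```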

BTW, I came up with the following motto for Perl:

	Perl: Kitchen sink included.
--
Marc Lavine			Broken: marcl%3Com.Com@sun.com
Smart: marcl@3Com.Com		UUCP: ...{sun|decwrl}!3com.3com!marcl