[comp.unix.questions] awk or sed question

agw@broadway.columbia.edu (Art Werschulz) (07/02/87)

Hi all.

I have a file consisting of some lines that have 80 chars or fewer,
and some with more than 80 characters (in fact, some may have more
than 160 characters).  I wish to break up only the >80-char lines, so
that no line has more than 80 chars.  

Here's the catch:  I want to break the long lines only at whitespace.
Thus, I can't use 

	% fold -80 foo.tex

since a space in the middle of a word would be disastrous.

Any suggestions?

	Art Werschulz

 	ARPAnet:  agw@columbia.edu
	USEnet:   ... seismo!columbia!agw
	BITnet:   agw%columbia.edu@wiscvm
	CCNET:    agw@columbia
	ATTnet:   Columbia University (212) 280-3610 280-2736
		  Fordham University  (212) 841-5323 841-5396
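
For readers on a current system: POSIX fold accepts -s, which breaks a
long line after the last blank that fits, and -w to set the width. The
thread does not say whether the fold Art had offered anything similar.
A minimal sketch, assuming a POSIX fold is installed; the wrapper name
fold_at_blanks is made up for illustration.

```shell
# fold_at_blanks [WIDTH]: hypothetical wrapper around POSIX fold.
# -s breaks each long line after the last blank that fits in WIDTH columns;
# a word longer than WIDTH is still split, since there is no blank to use.
fold_at_blanks() {
    fold -s -w "${1:-80}"
}
```

Typical use would be "fold_at_blanks 80 < foo.tex". Note that the blank
at each break stays at the end of the line it closes.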

wrp@burdvax.PRC.Unisys.COM (William R. Pringle) (07/03/87)

in article <4780@columbia.UUCP>, agw@broadway.columbia.edu (Art Werschulz) says:
> 
> 
> Hi all.
> 
> I have a file consisting of some lines that have 80 chars or fewer,
> and some with more than 80 characters (in fact, some may have more
> than 160 characters).  I wish to break up only the >80-char lines, so
> that no line has more than 80 chars.  
> 
> Here's the catch:  I want to break the long lines only at whitespace.
> Thus, I can't use 
> 
> 	% fold -80 foo.tex
> 
> since a space in the middle of a word would be disastrous.
> 
> Any suggestions?
> 
> 	Art Werschulz
=====================

Here is a little script that I use to do that. Hope it helps.

---------------- Cut Here ---------
#!/bin/sh
#
# fold lines at whitespace
#
# by Bill Pringle
#	burdvax!wrp
#

sed -e 's/	/        /g' "$@" |		# convert tabs to 8 spaces
awk '
BEGIN { N = 80 }
{   if ((n = length($0)) <= N)
        print
    else {
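        for (i = 1; n > N; n -= LEN) {
           LEN = N                      # restart the search at the fold column
           while (LEN > 1 && substr($0,LEN+i-1,1) != " ") {
               LEN -= 1
           }
           if (LEN == 1) LEN = N        # no blank found: break at column N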
           if (i==1) {
		printf "%s\\\n", substr($0, i, LEN)
	   } else {
		printf ">     %s\\\n", substr($0, i, LEN)
	   }
           i += LEN;
        }
        printf ">     %s\\\n", substr($0,i)
     }
} '

mwm@eris.BERKELEY.EDU (Mike (My watch has windows) Meyer) (07/04/87)

<in article <4780@columbia.UUCP>, agw@broadway.columbia.edu (Art Werschulz) says:
[A request for a sed or awk tool to break 80-character lines at whitespace.]

Some problems just aren't amenable to tackling with sed/awk. I think
this is one of them. It may be doable with sed, but I'm not sure how.
Any awk script to do this will be almost as complicated as a C
program to do the same thing.

For example:

In article <somearticle@somehost> someone writes:
<Here is a little script that I use to do that. Hope it helps.
<
<---------------- Cut Here ---------
<#!/bin/sh
<#
<# fold lines at whitespace
<#
<
<sed -e 's/	/        /g' $* |		# convert tabs to 8 spaces
<awk - '
<BEGIN { N = 80 }
<{   if ((n = length($0)) <= N)
<        print
<    else {
<        LEN = N
<        for (i = 1; n > N; n -= LEN) {
<           while (substr($0,LEN+i-1,1) != " ") {
<               LEN -= 1
<           }
<           if (i==1) {
<		printf "%s\\\n", substr($0, i, LEN)
<	   } else {
<		printf ">     %s\\\n", substr($0, i, LEN)
<	   }
<           i += LEN;
<        }
<        printf ">     %s\\\n", substr($0,i)
<     }
<} '


This is what I mean. First, converting tabs directly to 8 spaces has
*got* to be wrong. Secondly, this fails on files with lines longer
than awk's internal buffer for records (minor, and usually acceptable).
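
The reason the blanket substitution is wrong: a tab moves to the next
tab stop, so it is worth anywhere from one to eight columns depending on
where it falls. A minimal sketch of column-aware expansion in awk,
assuming stops every 8 columns; the name expand_tabs is made up here and
is not part of either posted program.

```shell
# expand_tabs [FILE...]: hypothetical helper, assumes tab stops every 8 columns.
expand_tabs() {
    awk '{
        out = ""
        while ((t = index($0, "\t")) > 0) {
            out = out substr($0, 1, t - 1)      # text before the tab
            out = out " "                       # a tab is at least one column
            while (length(out) % 8 != 0)        # then pad up to the next stop
                out = out " "
            $0 = substr($0, t + 1)              # continue after the tab
        }
        print out $0
    }' "$@"
}
```

It can replace the sed stage of the quoted script, for example
"expand_tabs foo.tex | awk ...".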

The loose problem spec doesn't help much, of course. But that just
means the problem is a "real-life" problem, and not a classroom
exercise. The C code to solve the problem has some differences (no
tags on folded lines, and the whitespace where the fold is doesn't
get printed). It's also a pure filter, but allows for user-specified
fold columns, instead of wiring it to 80.

The main loop of the C code is 26 lines, not counting comments. The
awk script is 19 lines. The C code would shrink to 22 lines by using
printfs instead of fputs/putchar, and formatting if/else the same way
the awk script is.

Since (as far as I'm concerned) sed and awk are for quickly building
programs that would be difficult in C, the small difference between the
two programs - which hopefully indicates a small difference in
construction time - shows that this is a problem for which awk isn't
really suited.

On the other hand, some simple test cases (the first n integers on a
single line, separated by a single space) show the C version can handle
n = 10000 in about the same sys and user times (as reported by
/bin/time on a Sun 3/50 running SunOS 3.3) as the sed/awk version for
n = 100. The sed/awk version drops core for n >= 1000, and the C
version takes less than 1/10th of a second of sys and user time for n
<= 1000, so I didn't do direct comparisons.
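
The test input described above is easy to regenerate. A minimal sketch,
with gen_ints as a made-up name; it only builds the input line and makes
no attempt to reproduce the timings.

```shell
# gen_ints N: hypothetical helper that prints 1..N on one line,
# separated by single spaces.
gen_ints() {
    awk -v n="$1" 'BEGIN {
        for (i = 1; i <= n; i++)
            printf "%s%d", (i > 1 ? " " : ""), i
        print ""                                # terminate the single line
    }'
}
```

For instance, "gen_ints 1000 > test.in" gives the n = 1000 case.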

The shell script to emulate the awk/sed script's user interface, and
the more complex script to combine the two, are left as an exercise for
the reader.

	<mike

/*
 * wfold - fold stdin on column n, n being the first (and only) argument.
 *	If unspecified, n is 80. A throwaway for demo purposes on the
 *	net.
 */

#include <stdio.h>
#include <stdlib.h>	/* atoi, exit */
#include <string.h>	/* strlen and the copy routines */

/*
 * MAXFOLD is the largest fold column we're willing to accept. All others
 * rejected.
 */
#define	MAXFOLD	160

int
main(int argc, char **argv) {
	register int	foldc = 80 ;
	char		buffer[MAXFOLD + 2] ;
	register char	*fold_point, *leftovers ;

	/* Argument processing */
	if (argc > 2) {
		fprintf(stderr, "usage: %s [n]\n", argv[0]) ;
		exit(1) ;
		}
	if (argc == 2) foldc = atoi(argv[1]) ;
	if (foldc <= 0 || foldc > MAXFOLD) {
		fprintf(stderr,
			"%s: only fold columns between 1 and %d supported\n",
			argv[0], MAXFOLD) ;
		exit(1) ;
		}
	/*
	 * The plan is to treat each line + leftovers from last read as
	 * a new line. leftovers points just past the end of the leftovers.
	 * Initially set to the beginning of the buffer, it's set up
	 * correctly each time through the loop.
	 *
	 * We need to get one more character than the maximum fold, as
	 * the first character past the fold column might be whitespace,
	 * and that's a legit fold point. Since fgets reads at most n-1
	 * characters (n is the second argument), we need to ask for foldc+2
	 * characters, minus however much leftovers there are from last loop.
	 */
	leftovers = buffer ;
	while (fgets(leftovers, foldc+2-(leftovers-buffer), stdin) != NULL) {
		/*
		 * If we got a complete line, print it.
		 */
		if (buffer[strlen(buffer) - 1] == '\n') {
			fputs(buffer, stdout) ;
			leftovers = buffer ;
			}
		/*
		 * Got a long line. Find the fold point, print up to the fold,
		 * then shuffle the remaining characters forward and try again.
		 */
		else {
			fold_point = buffer + foldc ;
			while (*fold_point != ' ' && *fold_point != '\t'
			    && fold_point > buffer)
				fold_point -= 1 ;
			/* Test for lines with no whitespace */
			if (fold_point == buffer) {
				fputs(buffer, stdout) ;
				putchar('\n') ;
				leftovers = buffer ;
				}
			else {
				/* Dump up to fold point */
				*fold_point = '\0' ;
				fputs(buffer, stdout) ;
				putchar('\n') ;
				/* Now, deal with the leftovers */
				fold_point += 1 ;
				/* Source and destination overlap, so use memmove */
				memmove(buffer, fold_point, strlen(fold_point) + 1) ;
				leftovers = &buffer[strlen(buffer)] ;
				}
			}
		}
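	/* Flush leftovers stranded when the input lacks a final newline */
	if (leftovers != buffer) {
		fputs(buffer, stdout) ;
		putchar('\n') ;
		}
	exit(0) ;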
	}


--
I'm gonna lasso you with my rubberband lazer,		Mike Meyer
Pull you closer to me, and look right to the moon.	mwm@berkeley.edu
Ride side by side when worlds collide,			ucbvax!mwm
And slip into the Martian tide.				mwm@ucbjade.BITNET

lied@ihuxy.ATT.COM (Bob Lied) (07/06/87)

In article <4780@columbia.UUCP>, agw@broadway.columbia.edu (Art Werschulz) writes:
> 
> Here's the catch:  I want to break the long lines only at whitespace.

All us well-trained C programmers immediately look for a way to
find that last space before column 80.  That was my first impulse.
Here's an awk script which uses split() in a fairly clever way to
try to find the last word before column 80.  Besides not handling
lines over 160 characters, it also has a fatal flaw if that last
word appears earlier in the string.  Other than that, it might
have at least tutorial value:

newform -i file | # Convert those tabs to equivalent spaces!
awk 'length > 80 {
	str = substr($0, 1, 80)
	n = split(str, words)
	divide = index(str, words[n])
	left = substr($0, 1, divide-1)
	right = substr($0, divide)
	print left "\n" right
	}
     length <= 80 { print }'
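
To see the flaw on a small scale: with a width of 18 instead of 80, the
line "the cat sat on the mat" has "the" as the last word in range, and
index() finds it at column 1, so the split lands before the first word.
A sketch of one way around both limitations: scan backward from the
fold column for a blank, and loop until the remainder fits. The name
wordfold and its width argument are assumptions for illustration, and
tabs are assumed to be expanded already.

```shell
# wordfold [WIDTH]: hypothetical helper; reads stdin, default width 80.
wordfold() {
    awk -v width="${1:-80}" '{
        line = $0
        while (length(line) > width) {
            # Look backward from column width+1 for the last usable blank.
            cut = width + 1
            while (cut > 0 && substr(line, cut, 1) != " ")
                cut--
            if (cut == 0) {                     # no blank at all: hard break
                print substr(line, 1, width)
                line = substr(line, width + 1)
            } else {                            # break at the blank, drop it
                print substr(line, 1, cut - 1)
                line = substr(line, cut + 1)
            }
        }
        print line
    }'
}
```

With the example above, "wordfold 18" breaks after the second "the",
and lines of any length are handled by the outer loop.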

Then I had a brainstorm!  What well-known text processing program
already knows how to break text at white space?  Why, of course:
nroff (or its much faster local cousin, sroff).  Use awk to insert
formatter commands, then run the bugger through nroff!  (Use grep -v
to eliminate the extra blank lines.)

newform -i file |
awk 'BEGIN	{print ".nf" ; print ".na"; print ".nh"; print ".ll 80"}
     length>80	{print ".fi" ; print ; print ".nf" }
     length<=80	{print}' | nroff | grep -v '^$'

See!  All the caffeine and Nutrasweet eventually pays off.

	Bob Lied	ihnp4!ihuxy!lied

PAAAAAR%CALSTATE.BITNET@wiscvm.wisc.EDU (07/08/87)

Art Werschulz <agw@broadway.columbia.EDU>
wrote asking for a way to 'word wrap' text.

About 2 months ago I needed to do  something  similar  when  preparing
handouts etc for a class. I designed, coded, documented, and ported  a
program 'br.c' as an example for the class.

It is more useful than I expected. In BSD Mail I  use  '~| br -70'  to
quickly format messages. In vi you can ignore layout when editing  and
then reformat a paragraph with '!{ br'. The speed is magical.  Another
use is for multicolumn printing of a list:
'!20! sort | cat -n | br 20 | pr -3 -w70 -o10'


I am now working on joining up  lines  that  have  been  word-wrapped.
Given 'jn' (join) and 'br' I will have a kind of poor  person's  nroff
by
  'jn old | br new | pr whatever'
Has anyone got a speedy program to unwrap word-wrapped text?
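
One way to approach the unwrapping, as a minimal sketch: treat a blank
line as a paragraph break and glue everything between breaks together
with single spaces. The name unwrap is made up (it is not the 'jn'
mentioned above), and the blank-line rule is an assumption about what
counts as a paragraph.

```shell
# unwrap [FILE...]: hypothetical helper that joins the lines of each
# paragraph into one line; blank lines are passed through as separators.
unwrap() {
    awk '
        /^[ \t]*$/ {                            # blank line: paragraph ends
            if (buf != "") print buf
            buf = ""
            print
            next
        }
        { buf = (buf == "" ? $0 : buf " " $0) } # append to current paragraph
        END { if (buf != "") print buf }        # flush the last paragraph
    ' "$@"
}
```

Lines that were indented before wrapping keep their leading blanks, so
this is only a starting point for text with layout.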


I won't post 'br.c', 'br.1', etc as  they  are  longer  than  previous
examples. If anyone wants source, manual pages,  and/or  full  design
documentation - contact me directly and I'll send the  stuff  back  by
email(inshallah). Here is a list of features/bugs:

1.  'br  width'   reads   standard   input   and   produces   standard
output(only).

2. No output line is longer than 'width' characters.

3. Whole groups of whitespace characters are replaced by a newline (LF
in ASCII).

4. It interprets tabs as 6 characters wide.

5. 'br' treats nonprinting characters as having zero width.  Backspace
causes problems.

6. Words that are longer than 'width' are hyphenated before the break.

7. 'br -width' and 'br width' do the same thing.

8. The default width is 76.


Dick Botting, CSU San Ber'do
5500, State University Pkwy, San Bernardino, CA 92407
714-887-7368(voice), 714-887-7365(modem - login as guest)
paaaar@calstate.bitnet
paaaaar@ccs.csussc.edu
paaaaar%calstate.bitnet@wiscvm.wisc.edu


Disclaimer - I am only an egg.