[comp.unix.questions] sed script to combine blank lines?

young@vlsi.ll.mit.edu (George Young) (10/13/88)

Is there a 'sed' wizard out there?  I often want to take a big ascii file
(like a .c file after cc -E) and collapse each group of 'blank' lines
into exactly one blank line.  'Blank' here is any combination of blanks,
tabs and maybe ^L's.  It looks from the documentation that sed should do this
quite neatly, using the multiple line pattern space commands with imbedded
newlines, but I sure can't figure out how.  I'd prefer the resulting blank
line to be just a newline.

I know there are various other unix tools that could do this, but sed looks
like the most appropriate, and I would like to learn to use its fancier
facilities anyway.

Any ideas?  Just email response if this seems too trivial for broad posting.


George Young,  Rm. B-141		young@ll-vlsi.arpa
MIT Lincoln Laboratory			young@vlsi.ll.mit.edu
244 Wood St.
Lexington, Massachusetts 02173		(617) 981-2756

dhesi@bsu-cs.UUCP (Rahul Dhesi) (10/13/88)

In article <192@vlsi.ll.mit.edu> young@vlsi.ll.mit.edu (George Young) writes:
>I often want to take a big ascii file
>(like a .c file after cc -E) and collapse each group of 'blank' lines
>into exactly one blank line.  'Blank' here is any combination of blanks,
>tabs and maybe ^L's.

Under 4.3BSD, "cat -s" does this, except that I don't know if it also
considers ^L to be a blank line.
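
A minimal sketch of one way to sidestep the ^L question, assuming a cat
that supports -s (4.3BSD as mentioned above; GNU cat has it as well):
first reduce every whitespace-only line to a truly empty line, then let
cat -s squeeze each run. The function name squeeze_blank is made up for
illustration.

```sh
#!/bin/sh
# squeeze_blank: hypothetical helper; reads stdin, writes stdout.
squeeze_blank() {
    # Build the class "space, tab, formfeed" without typing control characters.
    chars=$(printf ' \t\f')
    # Step 1: sed turns whitespace-only lines into empty lines.
    # Step 2: cat -s squeezes each run of empty lines into a single one.
    sed "s/^[$chars]*\$//" | cat -s
}
```

Usage would be something like `cc -E foo.c | squeeze_blank`, where foo.c
stands for any input file.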

Also, for screen displays only, "less" has an option to do the same.

I've never been able to make sed work with patterns that span
newlines.
-- 
Rahul Dhesi         UUCP:  <backbones>!{iuvax,pur-ee}!bsu-cs!dhesi

rar@nascom.UUCP (Alan Ramacher) (10/13/88)

In article <192@vlsi.ll.mit.edu>, young@vlsi.ll.mit.edu (George Young) writes:
> Is there a 'sed' wizard out there?  I often want to take a big ascii file
> (like a .c file after cc -E) and collapse each group of 'blank' lines
> into exactly one blank line.  'Blank' here is any combination of blanks,
> tabs and maybe ^L's.  It looks from the documentation that sed should do this
> quite neatly, using the multiple line pattern space commands with imbedded
> newlines, but I sure can't figure out how.  I'd prefer the resulting blank
> line to be just a newline.

sed is not powerful enough for the job, but a simple awk script will
work. If you have difficulties writing it, let me know and I will
supply one. Good luck.

	Allan Ramacher

maart@cs.vu.nl (Maarten Litmaath) (10/14/88)

In article <192@vlsi.ll.mit.edu> young@vlsi.ll.mit.edu (George Young) writes:
\Is there a 'sed' wizard out there?  I often want to take a big ascii file
\(like a .c file after cc -E) and collapse each group of 'blank' lines
\into exactly one blank line.  'Blank' here is any combination of blanks,
\tabs and maybe ^L's.

How about `awk'?

% cat /usr/local/bin/deblank
#! /bin/sh
exec awk '$0 !~ /^[ \t\f]*$/ { print; prev = 0; next }
		{ if (prev) next }
		{ prev = 1; print "" }
'
%
-- 
Hippic sport:                         |Maarten Litmaath @ Free U Amsterdam:
             a contradiction in terms.|maart@cs.vu.nl, mcvax!botter!maart

rupley@arizona.edu (John Rupley) (10/14/88)

In article <136@nascom.UUCP>, rar@nascom.UUCP (Alan Ramacher) writes:
> In article <192@vlsi.ll.mit.edu>, young@vlsi.ll.mit.edu (George Young) writes:
> > Is there a 'sed' wizard out there?  I often want to take a big ascii file
> > (like a .c file after cc -E) and collapse each group of 'blank' lines
> > into exactly one blank line.  'Blank' here is any combination of blanks,
> > tabs and maybe ^L's.
> 
> sed is not powerful enough for the job, but a simple awk script will
> work. 

The following simple sed script should work (after replacing <sp> etc
by the corresponding ascii characters) -- and I suspect it is shorter and
will run faster than an equivalent awk script.

sed   "/^[<sp><tab><lf>]*$/{
	N
	/\n.*[^<sp><tab><lf>]/{
	b
	}
	D
	}" filename


John Rupley
    internet: rupley@megaron.arizona.edu
    uucp: ..{cmcl2 | hao!ncar!noao}!arizona!rupley
    Dept. Biochemistry, Univ. Arizona, Tucson  AZ  85721

csr@drutx.ATT.COM (Steve Roush) (10/14/88)

#  strip 1st blank line down to nothing, then print newline
#  s/.*//p  should work, but not on my version
#  append next line, then strip thru newline, finally loop to see if non-blank
#############   NOTE, replace ^L with the real thing

sed  '/^[	 ^L]*$/{
s/.*//
p
:bogus
/^[	 ^L]*$/{
N
s/.*\n//
bbogus
}
}' 

steve roush
AT&T - BTL
Denver
303-538-4860
drutx!csr

wu@spot.Colorado.EDU (WU SHI-KUEI) (10/14/88)

In article <7372@megaron.arizona.edu> rupley@arizona.edu (John Rupley) writes:
>In article <136@nascom.UUCP>, rar@nascom.UUCP (Alan Ramacher) writes:
>> In article <192@vlsi.ll.mit.edu>, young@vlsi.ll.mit.edu (George Young) writes:
>> > Is there a 'sed' wizard out there?  I often want to take a big ascii file
>> > (like a .c file after cc -E) and collapse each group of 'blank' lines
>> > into exactly one blank line.  'Blank' here is any combination of blanks,
>> > tabs and maybe ^L's.
>> 
>> sed is not powerful enough for the job, but a simple awk script will
>> work. 
>
>The following simple sed script should work (after replacing <sp> etc
>by the corresponding ascii characters) -- and I suspect it is shorter and
>will run faster than an equivalent awk script.
>
>sed   "/^[<sp><tab><lf>]*$/{
>	N
>	/\n.*[^<sp><tab><lf>]/{
>	b
>	}
>	D
>	}" filename

Just to confirm the above I timed the following sed and awk scripts
on an otherwise idle 3B2/400 running SV3.1. The sed version is
approximately 35% faster when applied to a file that contains 948 lines
before and 250 lines after the extra blank lines are stripped.


------------------------------ cut here ------------------------------
sed -n '
/^[	]*$/ {
	p
: loop
	n
	/^[ 	]*$/b loop
}
/[!-~][!-~]*/p' $*
------------------------------ cut here ------------------------------

------------------------------ cut here ------------------------------
awk '
	BEGIN					{ empty = 1 }
	$0 !~ /^[ \t]+$|^$/			{print; empty = 0}
	$0 ~ /^$|^[ \t]+$/ && empty == 0	{print; empty = 1}
' $*
------------------------------ cut here ------------------------------

Just a guest here.  In real life
Carl Brandauer
{ncar!uunet}!nbires!bdaemon!carl
attmail!bdaemon!carl

maart@cs.vu.nl (Maarten Litmaath) (10/15/88)

In article <7372@megaron.arizona.edu> rupley@arizona.edu (John Rupley) writes:
\sed   "/^[<sp><tab><lf>]*$/{
\	N
\	/\n.*[^<sp><tab><lf>]/{
\	b
\	}
\	D
\	}" filename

Four things:
1)	I guess you meant <ff>, instead of <lf>.
2)	The script doesn't convert lines containing [<sp><tab><ff>]* to JUST
	1 newline.
3)	The script contains non-printable characters.
4)	The script could have been less complex.

For these reasons I suggest the following adjusted script:

------------------------------cut here----------------------------------------
#! /bin/sh

chars=" `echo tf | tr tf '\11\14'`"
exec sed "/^[$chars]*$/{
		N
		/\n[$chars]*$/D
		s/[$chars]*//
	}" $*
------------------------------cut here----------------------------------------

The `awk' solution I posted earlier, had to be modified a bit too:

------------------------------cut here----------------------------------------
#! /bin/sh

exec awk '$0 !~ /^[ \t\f]*$/ { print; prev = 0; next }
	prev == 0 { prev = 1; print "" }' $*
-- 
Hippic sport:                         |Maarten Litmaath @ Free U Amsterdam:
             a contradiction in terms.|maart@cs.vu.nl, mcvax!botter!maart

maart@cs.vu.nl (Maarten Litmaath) (10/15/88)

In article <4057@boulder.Colorado.EDU> wu@spot.Colorado.EDU (WU SHI-KUEI) writes:
\------------------------------ cut here ------------------------------
\sed -n '
\/^[	]*$/ {
    ^
\	p
\: loop
\	n
\	/^[ 	]*$/b loop
\}
\/[!-~][!-~]*/p' $*
\------------------------------ cut here ------------------------------

At the place the caret is pointing to, a space should be added (there's only
the tab)! Furthermore, the original poster counted formfeeds as white space
too. This is an indication that non-printable characters (in essential
places) should be avoided. One way to do it here:

chars=" `echo tf | tr tf '\11\14'`"

The first `p' command should be replaced by `s/.*//p', to meet what the
original poster wanted. For the same reason the `/[!-~][!-~]*/p' command
is doing too much: `p' suffices.
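
Put together, the corrections listed above might look like the following
sketch. It is only one reading of those suggestions, not a tested
replacement for the quoted script, and the function name deblank_sed is
invented for illustration.

```sh
#!/bin/sh
# deblank_sed: illustrative name; filters the named files, or stdin if none.
deblank_sed() {
    # Space, tab and formfeed, built with tr so the script stays printable.
    chars=" $(echo tf | tr tf '\11\14')"
    sed -n "
/^[$chars]*\$/ {
    s/.*//p
    :loop
    n
    /^[$chars]*\$/b loop
}
p" "$@"
}
```

The first blank line of a run is emptied and printed by `s/.*//p`; the
loop then reads and drops further blank lines, and the final `p` prints
whatever non-blank line ends the run.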

\------------------------------ cut here ------------------------------
\awk '
\	BEGIN					{ empty = 1 }
\	$0 !~ /^[ \t]+$|^$/			{print; empty = 0}
\	$0 ~ /^$|^[ \t]+$/ && empty == 0	{print; empty = 1}
\' $*
\------------------------------ cut here ------------------------------

The `next' command of `awk' can simplify this script (see my previous
posting). Why not use `/^[ \t\f]*$/' for the regular expression?
Your `sed' solution (+ modifications) is the best I've seen so far.
My `awk' solution is easier to program, but a lot slower indeed.
-- 
Hippic sport:                         |Maarten Litmaath @ Free U Amsterdam:
             a contradiction in terms.|maart@cs.vu.nl, mcvax!botter!maart

dave@lsuc.uucp (David Sherman) (10/17/88)

Some versions of UNIX (including 4.1BSD) include a program
called ssp(1) which does this; it's a bit faster than the
parallel awk script.

Also, if you know your text doesn't contain duplicate lines
(nroff output almost always qualifies, for example), good
old uniq(1) will do it.

David Sherman
-- 
{ uunet!attcan  att  pyramid!utai  utzoo } !lsuc!dave

leo@philmds.UUCP (Leo de Wit) (10/17/88)

In article <136@nascom.UUCP> rar@nascom.UUCP (Alan Ramacher) writes:
|In article <192@vlsi.ll.mit.edu>, young@vlsi.ll.mit.edu (George Young) writes:
|> Is there a 'sed' wizard out there?  I often want to take a big ascii file
|> (like a .c file after cc -E) and collapse each group of 'blank' lines
|> into exactly one blank line.  'Blank' here is any combination of blanks,
|> tabs and maybe ^L's.  It looks from the documentation that sed should do this
|> quite neatly, using the multiple line pattern space commands with imbedded
|> newlines, but I sure can't figure out how.  I'd prefer the resulting blank
|> line to be just a newline.
|
|sed is not powerful enough for the job, but a simple awk script will
|work. If you have difficulties writing it, let me know and I will
|supply one. Good luck.

I already mailed George a solution, but couldn't leave this one alone...
Sed is most certainly powerful enough, as I'll show in a minute; in
fact, I think for such a typical text processing job sed is to be
preferred.  And a not unimportant reason for that is its speed.

And here's my sed-solution; note that tab and formfeed have been coded
as ^I and ^L so your pager isn't fooled; you should of course use the
real control characters in practice.

(using /bin/sh as command interpreter: )

sed -n -e '
/^[ ^I^L]*$/{
    s/^.*$//p
    : again
    n
    s/^[ ^I^L]*$//
    t again
}
p' your_file

Explanation: whenever you read a line containing only blank characters
(i.e. satisfying the first pattern), print just one newline. Discard
any blank lines that follow (the 'again' loop). When you're through
with the 'first pattern subroutine 8-)' print the non-blank line that's
now in the pattern space. Simple enough, huh ?

                                            Leo.

P.S. I don't doubt it can be done with awk (it could even be programmed
with the shell). I however doubt it will be nearly as fast as the sed
solution.
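
For what it's worth, the "programmed with the shell" remark could be
sketched roughly as below, using only shell built-ins plus printf. This
is an illustration under the assumption of a POSIX sh; the name
deblank_sh and the variable names are made up, and it will most likely
be far slower than sed on a big file.

```sh
#!/bin/sh
# deblank_sh: illustrative name; filters stdin to stdout.
deblank_sh() {
    chars=$(printf ' \t\f')     # space, tab, formfeed
    in_blank_run=0
    # IFS= and -r keep leading whitespace and backslashes intact;
    # the extra test handles a last line that lacks a newline.
    while IFS= read -r line || [ -n "$line" ]; do
        case $line in
            *[!"$chars"]*)      # at least one non-blank character
                printf '%s\n' "$line"
                in_blank_run=0 ;;
            *)                  # empty or whitespace-only line
                if [ "$in_blank_run" -eq 0 ]; then
                    printf '\n' # first blank of a run: emit one newline
                fi
                in_blank_run=1 ;;
        esac
    done
}
```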

allyn@hp-sdd.hp.com (Allyn Fratkin) (10/17/88)

you sed experts are going way overboard here.  this is a really
trivial problem for sed.  try something simple, like the following:

sed 's/^[<sp><tab><ff>]*$//; /^$/N; /^\n[<sp><tab><ff>]*$/D'

this starts by reducing a "blank" line to an empty one.  an empty
line is joined to the next line, and then if this next line is also
"blank", the first line is deleted.

i admit that this does require non-printable chars.  too bad sed
doesn't understand octal escape sequences.

to get around the non-printable chars, you could always do something
like this:

chars=`echo '\040\011\014\c'`	# use -n with bsd echo
sed "s/^[$chars]*$//; /^$/N; /^\n[$chars]*$/D"
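
Since echo handles backslash escapes differently on System V and BSD,
a variation using printf(1) may be more portable where printf is
available. The sketch below is only that, a sketch: deblank_printf is an
invented name, and the first substitution is anchored with $ so that
only whitespace-only lines are emptied and indentation on other lines is
left alone.

```sh
#!/bin/sh
# deblank_printf: illustrative name; filters the named files, or stdin if none.
deblank_printf() {
    # Octal escapes for space, tab and formfeed; printf needs no -n or \c.
    chars=$(printf '\040\011\014')
    # 1. empty out a whitespace-only line
    # 2. if the line is now empty, append the next line with N
    # 3. if the appended line is blank too, D drops the first line and
    #    restarts the script on the remainder without reading more input
    sed "s/^[$chars]*\$//; /^\$/N; /^\n[$chars]*\$/D" "$@"
}
```

What happens to a blank run at the very end of the input depends on how
the local sed treats N on the last line, so trailing blank lines are
worth checking by hand.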

-- 
 From the virtual mind of Allyn Fratkin            allyn@sdd.hp.com
                          San Diego Division       - or -
                          Hewlett-Packard Company  uunet!ucsd!hp-sdd!allyn

mark@adec23.UUCP (Mark Salyzyn) (10/17/88)

In article <192@vlsi.ll.mit.edu>, young@vlsi.ll.mit.edu (George Young) writes:
> I often want to take a big ascii file
> (like a .c file after cc -E) and collapse each group of 'blank' lines
> into exactly one blank line.  'Blank' here is any combination of blanks,
> tabs and maybe ^L's.  It looks from the documentation that sed should do this
> quite neatly, using the multiple line pattern space commands with imbedded
> newlines, but I sure can't figure out how.  I'd prefer the resulting blank
> line to be just a newline.

This is untested, but you should get the idea ...

: loop
N
/^[(blank)(tab)(form feed)]*\n[(blank)(tab)(form feed)]*$/ {
s///
b loop
}

Replace (blank) with the blank character itself. Replace (tab) with the tab
character itself. Replace (form feed) with the form feed character itself.
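
Because the script above is marked as untested, here is one way the same
join-and-collapse idea might be made to run. It is a reworked variant,
not the script as posted: N is applied only after a blank line has been
seen, and `$!` guards against calling N on the last line. The name
deblank_join is invented for illustration.

```sh
#!/bin/sh
# deblank_join: illustrative name; filters the named files, or stdin if none.
deblank_join() {
    chars=$(printf ' \t\f')     # space, tab, formfeed
    sed "
/^[$chars]*\$/ {
    s/.*//
    :loop
    \$!{
        N
        /^\n[$chars]*\$/ {
            s/\n.*//
            b loop
        }
    }
}" "$@"
}
```

A blank line is emptied first; each following blank line is appended
with N and immediately cut off again, so the pattern space stays empty
until a non-blank line arrives and is printed after a single newline.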

-- Mark Salyzyn @ ADEC Systems Inc.

morrell@hpsal2.HP.COM (Michael Morrell) (10/18/88)

/ hpsal2:comp.unix.questions / dave@lsuc.uucp (David Sherman) /  3:49 pm  Oct 16, 1988 /
Some versions of UNIX (including 4.1BSD) include a program
called ssp(1) which does this; it's a bit faster than the
parallel awk script.

Also, if you know your text doesn't contain duplicate lines
(nroff output almost always qualifies, for example), good
old uniq(1) will do it.
----------

I thought the original problem was to replace multiple "blank" lines with
a single newline (\n), where "blank" included lines with only tabs and
formfeeds on them.  ssp and uniq will only help if the lines to be combined
are really blank.