[comp.unix.shell] deleting some empty lines with sed

datpete@daimi.aau.dk (Peter Andersen) (04/27/91)

I have some source-files that I produce documentation from.

I use sed to make a few changes to the text. I have figured
most of it out, but I have one problem remaining:
If two or more blank lines appear, I want to remove all but
one of these.

I have tried the following sed script

    s^ *$//p
1,$ N
    s/\(\n\n\)\n*/\1/gp

but it didn't work.

Does anyone have a way of doing this, perhaps using something
other than sed? I'm not a perl-guru, but if it's possible in perl
I'd like to hear about that too.

Peter

rouben@math16.math.umbc.edu (Rouben Rostamian) (04/28/91)

In article <1991Apr27.143519.26256@daimi.aau.dk> datpete@daimi.aau.dk (Peter Andersen) writes:
>I use sed to make a few changes to the text. I have figured
>most of it out, but I have one problem remaining:
>If two or more blank lines appear, I want to remove all but
>one of these.

This should do it:

sed -n -e '
/^ *$/!p
/^ *$/{
    p
    :loop
    n
    /^ *$/bloop
    p
}'  <input_file

To catch and eliminate lines that have a mixture of tab and space characters,
you may want to modify this by replacing each of the /^ *$/ expressions
by /^[ T]*$/, where I have typed a T to represent a tab.  (In other words,
you should type a tab where I have a T.)
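
A minimal sketch of the same loop, assuming a sed that understands the
POSIX [[:blank:]] class (it matches both space and tab, so no literal tab
has to be typed).  The function name squeeze_blank_sed is made up for this
example and is not part of the original script.

```shell
# squeeze_blank_sed is a hypothetical name; it reads stdin and writes stdout.
squeeze_blank_sed() {
    sed -n -e '
/^[[:blank:]]*$/!p
/^[[:blank:]]*$/{
    p
    :loop
    n
    /^[[:blank:]]*$/bloop
    p
}'
}

# Three blank-looking lines (empty, one space, one tab) between "a" and "b".
printf 'a\n\n \n\t\nb\n' | squeeze_blank_sed
```

Note that the first line of each blank run is printed as it stands, so any
spaces or tabs on that line survive in the output.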

Hope that this helps.

--
Rouben Rostamian                          Telephone: (301) 455-2458
Department of Mathematics and Statistics  e-mail:
University of Maryland Baltimore County   bitnet: rostamian@umbc.bitnet
Baltimore, MD 21228,  U.S.A.              internet: rouben@math9.math.umbc.edu

Tom Christiansen <tchrist@convex.COM> (04/28/91)

From the keyboard of datpete@daimi.aau.dk (Peter Andersen):
:I have some source-files that I produce documentation from.
:
:I use sed to make a few changes to the text. I have figured
:most of it out, but I have one problem remaining:
:If two or more blank lines appear, I want to remove all but
:one of these.
:
:I have tried the following sed script
:
:    s^ *$//p
:1,$ N
:    s/\(\n\n\)\n*/\1/gp
:
:but it didn't work.

You need to set up a loop with labels and branches.  I see that as I type
this another poster has given you a sed solution, so I won't post my
crufty version of the same.

Something you can't do is use \n on the RHS of the expression, nor can you
use ^ or $ with \n.  Well, maybe *you* can, but I sure couldn't.   Maybe
that I even want to do these things shows that I've been doing more perl
than sed, but so it goes.

:Does anyone have a way of doing this, perhaps using something
:other than sed? I'm not a perl-guru, but if it's possible in perl
:I'd like to hear about that too.

Well, your code isn't quite doing what your description is asking
for, since it seems to be trying to trim two or more newlines down to
just one, but I'll assume you are looking for the equivalent
of a `cat -s' in which lines consisting only of blanks also count as
blank lines.

One way you could approach it in perl is via brute force: you could
suck in the whole file (as you seemed to have been trying to do in sed)
and then perform your substitution:

    #!/usr/bin/perl
    undef $/;  			# undefine input record separator
    $_ = <>; 			# read entire input stream into pattern space
    $* = 1; 			# make ^ and $ work multi-linedly
    s/^([ \t]*\n)+/\n/g; 	# compress whitespace+newlines --> \n
    print;			# print pattern space

This is pretty easy to code, but will eat up a lot of memory if you have
a big file, since it needs to hold the entire file in memory at one time.
Instead, you could keep a state variable around that indicates whether the
last line was blank.  We'll use $was_blanks to mean that the last line
was blank.

    #!/usr/bin/perl -n 	
    next if /^\s*$/ && $was_blanks;
    s/^\s*$/\n/; 
    print;				
    $was_blanks = /^$/;		

In fact, now that I stare at it, I used a pattern match *and* a substitute
where the substitute alone would do, so I'll bunch them together.

    #!/usr/bin/perl -n 	
    next if s/^\s*$/\n/ && $was_blanks;
    print;				
    $was_blanks = /^$/;		

--tom

louk@tslwat.UUCP (Lou Kates) (04/29/91)

In article <1991Apr27.143519.26256@daimi.aau.dk> datpete@daimi.aau.dk (Peter Andersen) writes:
>If two or more blank lines appear, I want to remove all but
>one of these.

The following awk script (it's called nawk on my system) will do
this:

		nawk 'NF || !b; {b = !NF}'

b is a flag which is true if the previous input line was blank.
The first statement of the script prints the current line if it
has fields or if b is false (or null, as it will be on the first
line of input). The second statement of the script sets b.
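
To watch the flag at work, here is a small sketch that runs the same
program under plain awk (an assumption; substitute nawk where that is the
name of the newer awk).  The wrapper name squeeze_blank_awk is made up.

```shell
# squeeze_blank_awk is a hypothetical wrapper name.
squeeze_blank_awk() {
    # NF is 0 for empty and whitespace-only lines; b remembers the previous line.
    awk 'NF || !b; {b = !NF}'
}

# An empty line, a line of three spaces, and another empty line sit
# between "a" and "b"; only the first of the run is printed.
printf 'a\n\n   \n\nb\n' | squeeze_blank_awk
```

Because NF counts fields, a line holding only spaces or tabs is treated as
blank too, although the first line of a run is printed unchanged.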

Lou Kates, Teleride Sage Ltd., louk%tslwat@watmath.waterloo.edu

merlyn@iwarp.intel.com (Randal L. Schwartz) (04/29/91)

In article <1991Apr27.143519.26256@daimi.aau.dk>, datpete@daimi (Peter Andersen) writes:
| Does anyone have a way of doing this, perhaps using something
| other than sed? I'm not a perl-guru, but if it's possible in perl
| I'd like to hear about that too.

perl -ne 'print unless /\S/ ? ($skip = 0) : $skip++' <in >out

How this works is left as an exercise to the reader. :-)

print "Just another Perl hacker," unless $skip # :-)
-- 
/=Randal L. Schwartz, Stonehenge Consulting Services (503)777-0095 ==========\
| on contract to Intel's iWarp project, Beaverton, Oregon, USA, Sol III      |
| merlyn@iwarp.intel.com ...!any-MX-mailer-like-uunet!iwarp.intel.com!merlyn |
\=Cute Quote: "Intel: putting the 'backward' in 'backward compatible'..."====/

weimer@garden.ssd.kodak.com (Gary Weimer (253-7796)) (04/29/91)

In article <1991Apr27.211212.18855@convex.com>, Tom Christiansen
<tchrist@convex.COM> writes:
|> From the keyboard of datpete@daimi.aau.dk (Peter Andersen):
|> :I have some source-files that I produce documentation from.
|> :
|> :I use sed to make a few changes to the text. I have figured
|> :most of it out, but I have one problem remaining:
|> :If two or more blank lines appear, I want to remove all but
|> :one of these.
|> :
|> :I have tried the following sed script
|> :
|> :    s^ *$//p
|> :1,$ N
|> :    s/\(\n\n\)\n*/\1/gp
|> :
|> :but it didn't work.
|> 
|> You need to set up a loop with labels and branches.  I see that as I type
|> this another poster has given you a sed solution, so I won't post my
|> crufty version of the same.

[three perl scripts deleted]

Wouldn't 'cat -s' be easier? I know, it's not near as exciting, but...

weimer@ssd.kodak.com ( Gary Weimer )

daveh@marob.uucp (Dave Hammond) (05/01/91)

In article <1991Apr27.143519.26256@daimi.aau.dk> datpete@daimi.aau.dk (Peter Andersen) writes:
>I have some source-files that I produce documentation from.
>
>I use sed to make a few changes to the text. I have figured
>most of it out, but I have one problem remaining:
>If two or more blank lines appear, I want to remove all but
>one of these.
>[sed example deleted]
>Does anyone have a way of doing this, perhaps using something
>other than sed? I'm not a perl-guru, but if it's possible in perl
>I'd like to hear about that too.

Perhaps I'm oversimplifying the problem, but wouldn't

tr -s '\012'

squeeze multiple, consecutive newlines to a single newline?

--
Dave Hammond
daveh@marob.uucp
uunet!rutgers!phri!marob!daveh

dattier@vpnet.chi.il.us (David W. Tamkin) (05/01/91)

weimer@ssd.kodak.com (Gary Weimer) wrote in
<1991Apr29.150244.4378@ssd.kodak.com>:

| In article <1991Apr27.211212.18855@convex.com>, Tom Christiansen
| <tchrist@convex.COM> writes:

| |> From the keyboard of datpete@daimi.aau.dk (Peter Andersen):
| |> :I have some source-files that I produce documentation from.

| |> :If two or more blank lines appear, I want to remove all but
| |> :one of these.

| |> :I have tried the following sed script

[script deleted]

| |> :but it didn't work.

| |> You need to set up a loop with labels and branches.  I see that as I type
| |> this another poster has given you a sed solution, so I won't post my
| |> crufty version of the same.

Well, I haven't seen those sed solutions; they aren't here yet or they're
already gone.

| Wouldn't 'cat -s' be easier? I know, it's not near as exciting, but...

cat -s on a BSD system, yes, but in System V cat -s does something completely
different (it _s_uppresses error messages if input files are nonexistent or
unreadable or if the output file cannot be written).

We went around this subject on comp.editors several months ago.  The best sed
solution was posted by someone whom I cannot name (because at the same time
a person with a very similar name posted one that didn't work, and I can
never remember which name belonged to which person).

The sed script boiled down to three lines (the ^I represents a tab):

s/[ ^I]*$//
/^$/N
/[!-~]/!D

If the input begins with one or more blank lines, a single empty line will
be preserved at the start of the output.  To get rid of that as well, add
/./,$ !d      after the first line.
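
A sketch of both variants as shell functions, assuming a sed that accepts
[[:blank:]] in place of the literal space-and-tab pair, and forcing the C
locale so that the range [!-~] means the printable ASCII characters.  The
function names are made up for illustration.

```shell
# Hypothetical wrapper names; both read stdin and write stdout.
squeeze_keep_leading() {
    LC_ALL=C sed -e 's/[[:blank:]]*$//' -e '/^$/N' -e '/[!-~]/!D'
}

# Same script with the extra command that drops blank lines at the start.
squeeze_strip_leading() {
    LC_ALL=C sed -e 's/[[:blank:]]*$//' -e '/./,$!d' -e '/^$/N' -e '/[!-~]/!D'
}

printf '\n\nHi there!\n\n \n\nDavid\n' | squeeze_keep_leading
printf '\n\nHi there!\n\n \n\nDavid\n' | squeeze_strip_leading
```

What happens to blank lines at the very end of the input depends on how a
given sed treats N when there is no next line, so that case is best
checked on the sed actually in use.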

David Tamkin  PO Box 7002  Des Plaines IL  60018-7002  dattier@vpnet.chi.il.us
GEnie:D.W.TAMKIN  CIS:73720,1570  MCIMail:426-1818  708 518 6769  312 693 0591

"Parker Lewis Can't Lose" mailing list:
 flamingo-request@esd.sgi.com (relay)  flamingo-request@ddsw1.mcs.com (digest)

dattier@vpnet.chi.il.us (David W. Tamkin) (05/02/91)

daveh@marob.uucp (Dave Hammond) wrote in <281DB41B.9E6@marob.uucp>:

| Perhaps I'm oversimplifying the problem, but wouldn't

| tr -s '\012'

| squeeze multiple, consecutive newlines to a single newline?

Yes, but that isn't what we want.  The identical misunderstanding came up
when someone asked the same question a few months ago, and many skimmers
made equivalent suggestions [such as sed '/^$/d'], which don't answer the
question at hand.

Compressing all consecutive newlines to a single newline (1) assumes that
there are no non-empty blank lines (containing spaces and tabs but no
printing text) and (2) removes all empty lines.  By leaving only one newline
character between two consecutive lines that have text, you're removing all
blank lines from the document.

Since that may not be obvious, let me illustrate.  Here is some text.  I'll
represent newline characters with "^J".

^J
^J
Hi there!^J
^J
^J
It's nice to see you again.^J
How was your trip?^J
^J
^J
^J
David^J
^J

The goal is to get this:

Hi there!^J
^J
It's nice to see you again.^J
How was your trip?^J
^J
David^J

All leading newlines are stripped, two or more trailing newlines are
compressed to one, and any string of three or more intermediate newlines is
reduced to two.

That's none at the top, two anywhere in the middle, and one at the end.
If we run it through tr -s '\012' we get one newline no matter where, with 
this result:

^J
Hi there!^J
It's nice to see you again.^J
How was your trip?^J
David^J

The paragraphing is lost and one blank line at the top is still there.

The question is how to squeeze all consecutive blank lines to a single blank
line (which is what cat -s does on a BSD system).  Removing all blank lines
and getting solid blocks of text is not the answer!
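
The tr result above is easy to reproduce.  A small sketch, assuming a
POSIX shell with tr available; sample_text is a made-up helper that prints
the example text:

```shell
# sample_text is a hypothetical helper that emits the example, one
# argument per line (empty arguments become empty lines).
sample_text() {
    printf '%s\n' '' '' 'Hi there!' '' '' \
        "It's nice to see you again." 'How was your trip?' \
        '' '' '' 'David' ''
}

# Every run of newlines collapses to one, so the paragraph breaks vanish
# and a single empty line survives at the top.
sample_text | tr -s '\012'
```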

David Tamkin  PO Box 7002  Des Plaines IL  60018-7002  dattier@vpnet.chi.il.us
GEnie:D.W.TAMKIN  CIS:73720,1570  MCIMail:426-1818  708 518 6769  312 693 0591

"Parker Lewis Can't Lose" mailing list:
 flamingo-request@esd.sgi.com (relay)  flamingo-request@ddsw1.mcs.com (digest)

lwall@jpl-devvax.jpl.nasa.gov (Larry Wall) (05/03/91)

In article <281DB41B.9E6@marob.uucp> daveh@marob.uucp (Dave Hammond) writes:
: Perhaps I'm oversimplifying the problem, but wouldn't
: 
: tr -s '\012'
: 
: squeeze multiple, consecutive newlines to a single newline?

Yes, it would, and yes, you are.  They wanted to squeeze 3 or more
consecutive newlines down to 2.

Incidentally, with perl 3.044 and later you can do it with

	perl -00pe 's/^\n+//'

or, less efficiently (because of file slurping overhead and string copying
due to modifying the middle of a long string, and because the previous
solution uses anchored search),

	perl -0777pe 's/\n{3,}/\n\n/g'

Oddly enough, any octal number will work in place of 0777 except 012, which
would set the input delimiter to newline.

Larry Wall
lwall@netlabs.com

mike@x.co.uk (Mike Moore) (05/03/91)

In article <1991Apr27.143519.26256@daimi.aau.dk> datpete@daimi.aau.dk (Peter Andersen) writes:
>I have some source-files that I produce documentation from.
>
>I use sed to make a few changes to the text. I have figured
>most of it out, but I have one problem remaining:
>If two or more blank lines appear, I want to remove all but
>one of these.
>[sed example deleted]
>Does anyone have a way of doing this, perhaps using something
>other than sed? I'm not a perl-guru, but if it's possible in perl
>I'd like to hear about that too.


The sed script below works, but messes up on lines at the beginning
and end of the file.  It may also have problems with large files.
The awk script works wonderfully!

in=`cat $1 | tr '\012' ''`

echo $in | sed -e 's/[]*/\
\
/g' -e 's//\
/g'

#========================

awk ' BEGIN { blank=0
	      line=0
	    }
            {
# remove here...
              if ( blank == 0 )
	      {
                if ( $0 != "" )
                {
                  blank++

		  if ( line != 0 )
		  {
		    print ""
		    line=0
		  }

                  print $0
		}
		else
		  line++
	      }
	      else
              {
# to here, if you do not want to account for blank lines at beginning

	        if ( $0 == "" )
		  line=1
		else
		{
		  if ( line != 0 )
		  {
		    print ""
		    line=0
		  }

		  print $0
		}

# (and this...)
              }

            }

# remove here...
      END   { if ( line != 0 )
                print ""
            }
# to here, if you do not want to account for blank lines at end

            ' $1


#========================

-- 
---            | usual and obvious disclaimers etc...      | Never take a
Mike Moore     | (anyone daft enough to sue me deserves    | Scorpio seriously.
mike@x.co.uk   |  to win a share in my negative cashflow!) | I know, I am one.

barnett@grymoire.crd.ge.com (Bruce Barnett) (05/03/91)

In article <1991May02.005009.27947@vpnet.chi.il.us> dattier@vpnet.chi.il.us (David W. Tamkin) writes:

>The question is how to squeeze all consecutive blank lines to a
>single blank line (which is what cat -s does on a BSD system).
>Removing all blank lines and getting solid blocks of text is not the
>answer!

No one mentioned uniq(1) as a possible solution. Of course it isn't a
good idea if you have identical consecutive non-empty lines....

--
Bruce G. Barnett	barnett@crdgw1.ge.com	uunet!crdgw1!barnett

krs@uts.amdahl.com (Kristopher Stephens) (05/05/91)

How about this:

	:
	# sh or ksh
	#
	# Reduce all multiple consecutive "empty" lines to singles.  Process
	# stdin (switch it to filename processing, or to either, as you wish).
	#

	# Make the T in the sed line a <TAB> char.  I leave it a T for
	# this posting just for clarity.

	# Use sed to strip all trailing blanks and tabs
	sed 's/[ T]*$//' |

	# Use awk to delete empty lines that follow empty lines
	awk '
		$0 != "" {
			print
			prevnull = 0
			next
		}

		prevnull == 0 {
			print
			prevnull = 1
		}' -

The flow in the awk script is to look for non-null lines first on the
assumption that more input lines will be non-null than null.  If a line
is non-null, print it and set the null-marker off.  No need to test the
marker variable before setting it, that would take longer than simply
making the assignment every time.

If we get past that first block, we are processing a null line.  If the
null-marker is off, print this null line and set the null-marker on.  Any
line not matching one of these two blocks is a null line following a null
line, and is ignored.

This is a pretty standard approach for me to take.  I use sed to do editing
of an input-stream for awk, where I do the conditional work based on line
contents.  Since sed is faster than awk, the sed process typically writes
the data out faster than awk will take it in, so doing the line-editing in
sed instead of within the awk script is no performance penalty at all for
the shell script as a whole.  It is, in fact, a performance gain because
the awk script would take more time if I expanded its definition of a null
line to include blanks and tabs.

...Kris
-- 
Kristopher Stephens, | (408-746-6047) | krs@uts.amdahl.com | KC6DFS
Amdahl Corporation   |                |                    |
     [The opinions expressed above are mine, solely, and do not    ]
     [necessarily reflect the opinions or policies of Amdahl Corp. ]

dattier@ddsw1.MCS.COM (David W. Tamkin) (05/06/91)

barnett@crdgw1.ge.com wrote in <BARNETT.91May3125128@grymoire.crd.ge.com>:

| No one mentioned uniq(1) as a possible solution. Of course it isn't a
| good idea if you have identical consecutive non-empty lines....

Nor is it a good idea if the consecutive blank lines are not identical
because they have different combinations of spaces and tabs.  If all the
blank lines are truly empty, then this isn't an additional problem, though
Bruce Barnett's original caveat (above) still stands.
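
Both caveats show up in a two-line experiment, sketched here with made-up
sample input and assuming a standard uniq:

```shell
# Truly empty lines collapse to one, but so do the two identical "b" lines.
printf 'a\n\n\n\nb\nb\n' | uniq

# Blank lines that differ (empty, one space, one tab) are all kept.
printf 'a\n\n \n\t\nb\n' | uniq
```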

David Tamkin  Box 7002  Des Plaines IL  60018-7002  708 518 6769  312 693 0591
dattier@ddsw1.mcs.com   MCI Mail: 426-1818  CIS: 73720,1570  GEnie: D.W.TAMKIN

"Parker Lewis Can't Lose" mailing list:
 flamingo-request@esd.sgi.com (relay)  flamingo-request@ddsw1.mcs.com (digest)

lugnut@sequent.UUCP (Don Bolton) (05/08/91)

In article <May91.164529.10132@x.co.uk> mike@x.co.uk (Mike Moore) writes:
>In article <1991Apr27.143519.26256@daimi.aau.dk> datpete@daimi.aau.dk (Peter Andersen) writes:
>>I have some source-files that I produce documentation from.
>>
>>I use sed to make a few changes to the text. I have figured
>>most of it out, but I have one problem remaining:
>>If two or more blank lines appear, I want to remove all but
>>one of these.
>>[sed example deleted]
>>Does anyone have a way of doing this, perhaps using something
>>other than sed? I'm not a perl-guru, but if it's possible in perl
>>I'd like to hear about that too.
>

try... cat yourfile | nawk -f "the 2 non-blank lines below" :-) 

BEGIN { RS = "" }
      {print $0; print ""}

the above works fine, unless there are some size limits I'm not hitting here..
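
A sketch of the same two lines run inline rather than from a file,
assuming an awk that supports the empty-RS paragraph mode (nawk and its
descendants do).  The function name paragraphs is invented.

```shell
paragraphs() {
    # RS = "" puts awk in paragraph mode: records are separated by empty lines.
    awk 'BEGIN { RS = "" } { print $0; print "" }'
}

printf 'a\n\n\n\nb\nc\n' | paragraphs
```

One side effect worth knowing: an empty line is printed after every
paragraph, including the last, so the output always ends with a blank line.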

>
>The sed script below works, but messes up on lines at the beginning
>and end of the file.  It may also have problems with large files.
>The awk script works wonderfully!
>
>in=`cat $1 | tr '\012' ''`
>
>echo $in | sed -e 's/[]*/\
>\
>/g' -e 's//\
>/g'
>
>#========================
>
>awk ' BEGIN { blank=0
>	      line=0
>	    }
>            {
># remove here...
>              if ( blank == 0 )
>	      {
>                if ( $0 != "" )
>                {
>                  blank++
>
>		  if ( line != 0 )
>		  {
>		    print ""
>		    line=0
>		  }
>
>                  print $0
>		}
>		else
>		  line++
>	      }
>	      else
>              {
># to here, if you do not want to account for blank lines at beginning
>
>	        if ( $0 == "" )
>		  line=1
>		else
>		{
>		  if ( line != 0 )
>		  {
>		    print ""
>		    line=0
>		  }
>
>		  print $0
>		}
>
># (and this...)
>              }
>
>            }
>
># remove here...
>      END   { if ( line != 0 )
>                print ""
>            }
># to here, if you do not want to account for blank lines at end
>
>            ' $1
>
>
>#========================


Don "simple Simon" Bolton