[comp.unix.questions] sed script to remove cr/lf except at paragraph breaks

rac@sherpa.UUCP (Roger A. Cornelius) (05/19/89)

I'm in need of a sed script to remove MSDOS cr/lf (actually replace each
cr/lf combination with one space) except at the start of a paragraph.
i.e. only the cr/lf preceding a paragraph break should remain.  Paragraphs
are marked only by four leading spaces and nothing else.

Here's where I am now:

N
h
/\n    /{
P
D
}
s/^M\n/ /g

This works correctly for the first match, i.e. the beginning of a paragraph,
but for all other lines, the substitution of a space for cr/lf only
works correctly for the first occurrence in the line (the g flag seems
to have no effect).  But there are two occurrences due to the N function.
How can I match (and substitute for) the terminating nl in the pattern
space?  The sed man pages concerning addresses say you can't.  What am
I missing or how can I get around this?

This is probably simple and has been covered here before.  Sorry.

Any help appreciated.
Roger

bink@aplcen.apl.jhu.edu (Ubben Greg) (05/22/89)

In article <119@sherpa.UUCP> rac@sherpa.UUCP (Roger A. Cornelius) writes:
> I'm in need of a sed script to remove MSDOS cr/lf (actually replace each
> cr/lf combination with one space) except at the start of a paragraph.
> i.e. only the cr/lf preceding a paragraph break should remain.  Paragraphs
> are marked only by four leading spaces and nothing else.
>
> Here's where I am now:
>
> N
> h
> /\n    /{
> P
> D
> }
> s/^M\n/ /g

The h here is useless, because you never use G, g, or x to get the text back.
The problem with using N to gather an arbitrary number of lines in the pattern
space is that SED doesn't keep the pattern space between cycles (unless you
can make the D command work out), so you must code an explicit loop:

	: loop
	$q
	N
	/\n    /{ P; D; }
	s/^M\n/ /
	b loop

Also, the $q is needed because SED will stop dead without printing the pattern
space if an N (or n) is attempted on the last line of the input.  If you don't
care for "gotos" (or correctness), here's an alternative method that makes use
of the hold space and SED's natural cycle for looping:

	/^    /!{ H; $!d; }
	x
	1d
	s/^M\n/ /g

Since this algorithm is based on the transition BETWEEN two paragraphs, the
1d and $! are necessary to handle the special cases of the first and last
lines (and even then it doesn't work right when the first line is not the
beginning of a paragraph or the last line IS the beginning of a paragraph).
This problem requires a 1-line look-ahead, and in general, the x command is
a good way to implement this in SED.
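
As a rough check of the two scripts above, here is a minimal shell sketch that
runs both on a small made-up sample.  The sample text, the variable names (cr,
out_loop, out_hold) and the helper functions (sample, join_loop, join_hold) are
assumptions for illustration only; the literal ^M is supplied through a shell
variable because it is awkward to type.

```shell
#!/bin/sh
# Sketch only: sample text and all names here are made up for illustration.

cr=$(printf '\r')   # a literal carriage return (the ^M in the scripts)

# Two paragraphs, each starting with four spaces, with MSDOS cr/lf endings.
sample() {
    printf '    First paragraph, line one.\r\n'
    printf 'It continues here.\r\n'
    printf '    Second paragraph.\r\n'
    printf 'More text.\r\n'
}

# Explicit-loop version: keep appending lines with N until a paragraph starts.
join_loop() {
    sed -e ':loop' -e '$q' -e 'N' \
        -e '/\n    /{' -e 'P' -e 'D' -e '}' \
        -e "s/$cr\n/ /" -e 'b loop'
}

# Hold-space version: collect a paragraph in the hold space and emit it when
# the next paragraph (or the end of input) is seen.
join_hold() {
    sed -e '/^    /!{' -e 'H' -e '$!d' -e '}' \
        -e 'x' -e '1d' -e "s/$cr\n/ /g"
}

out_loop=$(sample | join_loop)
out_hold=$(sample | join_hold)

# Show the results with control characters made visible.
printf '%s\n' "$out_loop" | od -c
printf '%s\n' "$out_hold" | od -c
```

On this sample both versions should print the same two joined lines, each still
ending in its cr.  The special cases mentioned above (a first line that is not a
paragraph start, a last line that is) are not exercised here.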

> This works correctly for the first match, i.e. the beginning of a paragraph,
> but for all other lines, the substitution of a space for cr/lf only
> works correctly for the first occurrence in the line (the g flag seems
> to have no effect).  But there are two occurrences due to the N function.

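That is because you never gather more than 2 lines in the pattern space at
once: without an explicit loop, the cycle ends after a single N, as explained
above.  Two lines joined by N contain only one cr/lf pair (the second cr has no
newline after it in the pattern space), so the g flag has nothing more to match.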
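
To see this concretely, here is a small sketch that runs the script as
originally posted on a made-up four-line paragraph.  The sample text and the
names cr, sample and out_pairs are assumptions for illustration.

```shell
#!/bin/sh
# Sketch only: the sample text and names are made up for illustration.

cr=$(printf '\r')   # literal carriage return (^M)

# One paragraph of four lines with cr/lf endings.
sample() {
    printf '    Line a.\r\n'
    printf 'Line b.\r\n'
    printf 'Line c.\r\n'
    printf 'Line d.\r\n'
}

# The script as originally posted: one N per cycle, no loop.
out_pairs=$(sample | sed -e 'N' -e 'h' \
    -e '/\n    /{' -e 'P' -e 'D' -e '}' \
    -e "s/$cr\n/ /g")

# Show the result with control characters made visible.
printf '%s\n' "$out_pairs" | od -c
```

Lines a and b come out joined on one output line and lines c and d on another:
each cycle starts with a fresh pattern space, so the lines are only ever joined
in pairs.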

> How can I match (and substitute for) the terminating nl in the pattern
> space?  The sed man pages concerning addresses say you can't.  What am
> I missing or how can I get around this?

The terminating newline can only be matched by a $ because it is not really
there -- it is always tacked on when the line is output.
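
A minimal sketch of that point, using a made-up two-line input (the variable
names are assumptions for illustration): a \n in the pattern only finds
something once N has put a second line into the pattern space.

```shell
#!/bin/sh
# Sketch only: sample input and variable names are made up.

# With one line in the pattern space there is no \n to match,
# so this substitution does nothing.
out_plain=$(printf 'one\ntwo\n' | sed 's/\n/ /')

# $ matches at the end of the pattern space, where the newline will be added.
out_dollar=$(printf 'one\ntwo\n' | sed 's/$/!/')

# After N the pattern space holds both lines with a \n between them,
# and that embedded \n can be matched.
out_joined=$(printf 'one\ntwo\n' | sed -e 'N' -e 's/\n/ /')

printf '%s\n' "$out_plain" "$out_dollar" "$out_joined"
```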

						-- Greg Ubben "A SED fanatic"
						   bink@aplcen.apl.jhu.edu

itwaf@dcatla.UUCP (Bill Fulton [Sys Admin]) (05/23/89)

In article <119@sherpa.UUCP> rac@sherpa.UUCP (Roger A. Cornelius) writes:
> I'm in need of a sed script to remove MSDOS cr/lf (actually replace each
> cr/lf combination with one space) except at the start of a paragraph.
> i.e. only the cr/lf preceding a paragraph break should remain.  Paragraphs
> are marked only by four leading spaces and nothing else.
> Here's where I am now:
> [ sed script deleted]

How about lex, instead? I think the lex input between these lines:
----------
%%
\015\012"    "    ECHO;
\015\012          { putc(' ', yyout); /* one space; leaves yytext alone */ }
----------
should do what you want. Make it with 'lex <filename> ; cc lex.yy.c -ll',
then feed a.out your MSDOS file(s)! You could append a functions section to
do setup, or you could drive it from a front-end script.

I don't want to turn this into a lex vs. sed thing, but it does seem that
lex would be much more direct and easy. I agree that lex is "well ... a
little strange" if you don't work with it a lot, but once you start to mess
around with sed scripts such as you have, it starts to balance out.

Having played with it a little, I've decided that lex is pretty neat as a
standalone utility!

Bill Fulton
dcatla!itwaf@gatech.edu  OR  ..!gatech!dcatla!itwaf