[comp.unix.questions] sed - match newlines on input

bill@hao.UUCP (03/09/87)

I'm trying to match a pattern over multiple lines.  For instance, on the input:

one
two
three

with the sed script:

s/one\ntwo\nthree/one, two, three/g

one would expect to get the following output

one, two, three

OK, so I don't understand the manual (what else is new).  How can I get what I
need?  Also, what about the "Multiple Input-line Functions".  Might that be the
way to go?  An example would really help.  Thanks in advance.

							Bill Roberts
							NCAR/HAO
							Boulder,CO

boykin@custom.UUCP (Joseph Boykin) (03/10/87)

In article <570@hao.UCAR.EDU>, bill@hao.UCAR.EDU (Bill Roberts) writes:
> I'm trying to match a pattern over multiple lines.
> For instance, on the input:
> 
> one
> two
> three
> 
> with the sed script:
> 
> s/one\ntwo\nthree/one, two, three/g
> one would expect to get the following output
> one, two, three
> 
> OK, so I don't understand the manual (what else is new).  How can I get what I
> need?  Also, what about the "Multiple Input-line Functions".  Might that be the
> way to go?  An example would really help.  Thanks in advance.
> 
> 							Bill Roberts
> 							NCAR/HAO
> 							Boulder,CO

The SED documentation on this point is definitely confusing.  When we
did the documentation for PC/SED, we tried to make it better, but I
don't think we did!

Okay, here goes:
SED's regular expression handler was modified to handle embedded newlines.
Compare this to VI (UNIX or PC/VI) which CANNOT match a pattern which crosses
a line boundary.  When searching for a regular expression, you search
a single null terminated string.  That string can have any character
in it, including a new-line (although for the most part the new-line isn't
stored).  Hence, your script will not work, since SED does not read
the \n in your script as "when you get to the end of this string, start
comparing with the next string (pronounced 'line')".  When the documentation
talks about looking for embedded new lines, what it is talking about
is that SED permits the user to 'join' two lines together.  What this
really means is that the NULL in the first line is replaced by a \n and
the second line is concatenated onto the end of the first.
The regular expression can now test for an embedded new line since all you have
is one string which just so happens to have a \n in the middle.
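
The join described above can be tried directly.  What follows is only a
minimal sketch, assuming a sed that accepts \n in a regular expression and
several commands separated by ";" (GNU sed does); the two-line input and
the names sample, unjoined and joined are made up for illustration.

```shell
# Two-line sample input (hypothetical, cut down from the three lines in the question).
sample() { printf 'one\ntwo\n'; }

# Without N, sed sees "one" and "two" as separate strings,
# so the \n in the pattern never has anything to match.
unjoined=$(sample | sed 's/one\ntwo/one, two/')

# With N, the next line is appended to the current one after a newline,
# and the same substitution now finds the embedded \n.
joined=$(sample | sed 'N; s/one\ntwo/one, two/')

printf '%s\n' "without N:" "$unjoined" "with N:" "$joined"
```

The first run should leave the two lines untouched; the second should
print the single line "one, two".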

To be honest, I don't feel like mucking with SED long enough
to give you a script to do what you want (someone else probably will!)
but I think the basic idea is to go through the file, and for each
line join the next two lines together with the 'N' command, then you
can test to see if that new 'line' is the concatenation of the
three you are interested in, if so, do your substitution.

Okay, I just reread this message and I know it isn't very clear.
On the other hand, SED is not the easiest program to understand
either!  If you're still stuck, give me a call.

-- 

Joe Boykin
Custom Software Systems
...{necntc, frog}!custom!boykin

romwa@gpu.utcs.toronto.edu (03/11/87)

In article <570@hao.UCAR.EDU>, bill@hao.UCAR.EDU (Bill Roberts) writes:
> I'm trying to match a pattern over multiple lines.  For instance, on the input:
> 
> one
> two
> three
> 
> with the sed script:
> 
> s/one\ntwo\nthree/one, two, three/g
> 
> one would expect to get the following output
> 
> one, two, three
> 
I have always had trouble with newlines and sed.  The way I
would deal with your problem is to use 'tr' and then sed.
It would go something like this:

tr "\012" "#" < datafile | sed -e 's/#/, /g' > outfile

This translates all new lines to a printable ascii character.
You should use one that does not occur elsewhere in the file.
The output is piped to sed where the character '#' is expanded
to ', '.  
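
A quick check of this pipeline on the three-line sample from the question
may help.  This is a sketch only: it assumes "#" does not occur in the
data and that sed accepts a last line with no trailing newline (GNU sed
does); the variable name flattened is made up.

```shell
# Feed the sample from the question through the tr | sed pipeline.
# tr turns every newline (octal 012) into "#", including the final one,
# so sed receives the single unterminated line "one#two#three#".
flattened=$(printf 'one\ntwo\nthree\n' | tr '\012' '#' | sed -e 's/#/, /g')

# Brackets make the trailing ", " visible.
printf '[%s]\n' "$flattened"
```

Two things stand out: the final newline is translated as well, which
leaves a trailing ", ", and every newline in the file is replaced, not
just those between "one", "two" and "three".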

I, too, would like to see a sed example doing the whole
process.

ras1@mtuxo.UUCP (03/11/87)

In article <570@hao.UCAR.EDU>, bill@hao.UCAR.EDU (Bill Roberts) writes:
>> I'm trying to match a pattern over multiple lines ... with the sed script:
>>      s/one\ntwo\nthree/one, two, three/g
>> one would expect to get the output: one, two, three

In article <572@custom.UUCP> boykin@custom.UUCP (Joseph Boykin) writes
>The documentation on this point within the SED documentation is
>definitely confusing, when we did the documentation for PC/SED,
>...
>To be honest, I don't feel like mucking with SED long enough
>to give you a script to do what you want (someone else probably will!)

Okay, here goes:

## FIRST a relatively clean straightforward "sed -n -f Script":

/^abc$/!d
h
n
/^123$/!d
H
n
/^xyz$/!d
H
g
s/\n/, /g
p
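
For anyone who wants to try the script above without creating a file,
here is a sketch that passes it to "sed -n" as a string.  The -n matters:
without it, each "n" and the end of the script would also print the
pattern space.  The input lines "zero" and "four" and the variable names
are invented for the example.

```shell
# The eleven-line script, stored in a shell variable.
# h copies the pattern space to the hold space, H appends to it,
# n replaces the pattern space with the next input line,
# g copies the hold space back into the pattern space.
script='/^abc$/!d
h
n
/^123$/!d
H
n
/^xyz$/!d
H
g
s/\n/, /g
p'

# Hypothetical input: the abc/123/xyz triple surrounded by unrelated lines.
result=$(printf 'zero\nabc\n123\nxyz\nfour\n' | sed -n "$script")
printf '%s\n' "$result"
```

Only the joined triple is printed; the unrelated lines are deleted by
the "!d" tests and never reach the output.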

## NEXT a couple of "tighter" versions:

sed -n "/^abc$/{h; n
            /^123$/{H; n; H
                /^xyz$/g; s/\n/, /gp;}
       }" $*    ## Above "s...gp" only prints if both the 'H' AND 'g' click

#OR: 

sed -n "/^abc$/!d; h; n; /^123$/{H; n; /^xyz$/!d; x; G; s/\n/, /gp;}" $*

## LAST a grotesque (Byzantine?) WORKING variant
#  is provided (with full apologies in advance) only as
#  POSSIBLE food for thought, discussion and reflection:
#        (or nausea and revulsion but PLEASE !flame:-)

CHUNK="/^abc$/{!d; h; n; /^123$/{H; b outside
       :within
       p;}"

sed -n "#Note:  The 's' and 't' won't click without prior 'h', 'H' AND 'x'
        $CHUNK
        }
        d
        :outside
        n; /^xyz$/{x; G;}
        s/\n/, /g
        t within
        " $*

#Points for consideration:
#       + Quoted multi-line sed script without "-e" (only needed with "-f").
#       + Leading "#" comment within script.
#       + "sed" script CHUNK substituted from the shell environment.
#       + Multiple sed commands per line.
#       + Multiple progressive pattern matching elements on a line.
#       + Control grouping '{' braces:
#               + Nested,
#               + Multiple opens on a line,
#               + Multi-line control groups.
#       + Branching 'b' (and conditional branching 't') to labels that are
#         "outside" AND also back "inside" of control group '{' constructs.
#       + The LITERAL case "^one\ntwo\nthree$"
#         could be handled (matched & output)
#         WITHOUT the use of the hold area.

Enuf sed?/Go4 sam?-)

Dick Stewart; ihnp4!tarpon!stewart; ATT-IS: DISCLAIMER ...

PS:  Aficionados of "awk" try some "time"d comparisons;
     Although limited, "sed" is quick.
     All of the sed scripts above were run on 3B's and a PC7300.

allyn@sdcsvax.UUCP (03/11/87)

In article <570@hao.UCAR.EDU>, bill@hao.UCAR.EDU (Bill Roberts) wants to
use sed to match a pattern over multiple lines.

the trick to sed is that you need to tell it when to join the lines
before it can look for multiple line patterns.  the N command is
used for this.

try this command:

sed -f sedinput < inputfile

with the "sedinput" file containing:
/^one$/N
/^one\ntwo$/N
/^one\ntwo\nthree$/s/\n/, /g

with an "inputfile" consisting of:
zero
one
two
three
four

it produces as output:
zero
one, two, three
four
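
The run above can be reproduced without the two files, as sketched
below.  The second input, with the first line repeated, is hypothetical
and not from the post; it is included because the first N swallows the
second "one", after which the anchored patterns no longer line up.

```shell
# The three-line "sedinput" script, held in a shell variable.
script='/^one$/N
/^one\ntwo$/N
/^one\ntwo\nthree$/s/\n/, /g'

# The posted input: the triple is found and joined.
posted=$(printf 'zero\none\ntwo\nthree\nfour\n' | sed "$script")

# Hypothetical input with a repeated first line: "one\none" is printed
# as it stands, and "two" and "three" are then read as ordinary lines.
repeated=$(printf 'one\none\ntwo\nthree\n' | sed "$script")

printf '%s\n' "posted:" "$posted" "repeated:" "$repeated"
```

The second run should come back unchanged, so the script works for the
posted input but can miss a match that starts inside a partial match.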

-- 
 From the virtual mind of Allyn Fratkin            allyn@sdcsvax.ucsd.edu    or
                          EMU Project              {ucbvax, decvax, ihnp4}
                          U.C. San Diego                         !sdcsvax!allyn

tj@mks.UUCP (03/12/87)

In article <570@hao.UCAR.EDU>, bill@hao.UCAR.EDU (Bill Roberts) writes:
> I'm trying to match a pattern over multiple lines.  For instance, on the input:
>	one
>	two
>	three
> with the sed script:
>	s/one\ntwo\nthree/one, two, three/g
> one would expect to get the following output
>	one, two, three
> OK, so I don't understand the manual (what else is new).  How can I get what I
> need?  Also, what about the "Multiple Input-line Functions".  Might that be the
> way to go?  An example would really help.  Thanks in advance.

sed -e '/one$/{
N
N
s/one\ntwo\nthree/one, two, three/
}'

This solution was found without effort on the first attempt.
Perhaps MKS Toolkit documentation is better than some...
;-)

     ll  // // ,~/~~\'   T. J. Thompson {decvax,ihnp4,seismo}!watmath!mks!tj
    /ll/// //l' `\\\     Mortice Kern Systems Inc.
   / l //_// ll\___/     43 Bridgeport Rd. E., Waterloo, ON, Can. N2J 2J4
O_/                      (519)884-2251

ras1@mtuxo.UUCP (03/12/87)

>>      s/one\ntwo\nthree/one, two, three/g

The previously posted examples fail for:
	...	OR	...
	abc		abc
	abc		123
	123		abc
	xyz		123
	...		xyz
			...

# A "better" approach might be:

sed -n "
       :top
       /^abc$/{h; n
               /^123$/!b top
                      {H; n; /^xyz$/!b top
                       H; g; s/\n/, /gp;}
              }" ${*:-}

#enuf grotesque mucking around w/sed:-)

Dick Stewart; ihnp4!tarpon!stewart; ATT-IS: DISCLAIMER !!!

hoey@nrl-aic.arpa (03/14/87)

>>      s/one\ntwo\nthree/one, two, three/g

It is amazing how many answers you get to questions about sed, and even
more amazing how many turn out to be wrong.  Having seen four persons'
incorrect answers, and no correct answer yet, I suppose you ought to
see one.

If the "g" at the end of the problem is to be believed, the file may
have to be completely read into memory.  This is because a file
containing

    threeone\ntwo\nthreeone\ntwo\n...\nthreeone\ntwo\nthreeone

must be written as a single line

    threeone, two, threeone, two, ..., threeone, two, threeone

and sed refuses to write part of a line.  The following script will
suffice to solve the problem:

    H;$!d;x;s/.//
    s/one\ntwo\nthree/one, two, three/g
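
A sketch of how the two-line script behaves on a short version of the
"threeone" file, assuming a sed in which "." matches a newline in the
pattern space (true of GNU sed) and which accepts "d;x" on one line.
The variable names are made up.

```shell
# H appends each line to the hold space, preceded by a newline;
# $!d discards every line but the last, so nothing is printed early;
# on the last line, x brings the accumulated text into the pattern space
# and s/.// strips the leading newline that the first H put there.
script='H;$!d;x;s/.//
s/one\ntwo\nthree/one, two, three/g'

result=$(printf 'threeone\ntwo\nthreeone\ntwo\nthreeone\n' | sed "$script")
printf '%s\n' "$result"
```

The five input lines should come out as the single line
"threeone, two, threeone, two, threeone".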

Unfortunately, at least on 4.2BSD, there is a limit of about 4K to the
pattern space; on longer files you will get "Output line too long"
diagnostics and sed may core dump.  One solution to the problem is to
ignore the final "g" in the problem statement.  In other words, we
no longer require the script to recognize a "one\ntwo\nthree" that
begins on the same line on which a previous match ends.  The following
script will solve the simpler problem, and will keep at most three
lines of the file in the pattern space at a time.

    /\n/!{$!N;}
    $!N
    /one\ntwo\nthree/s/\n/, /g
    P;D
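
The four-line script can be exercised as sketched below.  Both inputs are
hypothetical; the second repeats the first line to check that a partial
match does not hide the real one.  The variable names are invented.

```shell
# Line 1 tops the window up to two lines when it holds only one;
# line 2 adds a further line, so at most three are held at a time;
# line 3 joins the window if it contains the pattern;
# P prints the first line of the window, and D drops that line and
# restarts the script on what is left without reading new input.
script='/\n/!{$!N;}
$!N
/one\ntwo\nthree/s/\n/, /g
P;D'

plain=$(printf 'zero\none\ntwo\nthree\nfour\n' | sed "$script")
partial=$(printf 'one\none\ntwo\nthree\n' | sed "$script")

printf '%s\n' "plain:" "$plain" "partial:" "$partial"
```

In both runs the triple should be joined, and the surrounding lines,
including the extra leading "one", should be printed unchanged.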

If you are interested in simplifying these scripts, please be careful
to avoid the following common bugs:

1. Script changes "\n" to ", " always, rather than just in a
   one\ntwo\nthree context (Dornfield).
2. Script assumes the input does not contain "#" character, or some
   other character (Dornfield).
3. Script fails to recognize pattern "one\none\ntwo\nthree" (Fratkin,
   Roberts.  Thanks to Chris Torek for pointing out the problem of
   partial matches).
4. Script searches for the pattern /^one\ntwo\nthree$/ rather than
   /one\ntwo\nthree/ (Fratkin, Stewart).
5. Script fails to output the last line (Fratkin, Roberts).  Note that
   an "N" command on the last line will exit without printing the
   pattern space.  Apparently "$!N" solves the problem, though I am
   unconvinced that the documentation guarantees this to be so.
6. Script does not check that the input matches "one\ntwo\nthree".
   (Bill Roberts's script will modify "twone\nthreeleven\nhike".)
7. Script fails to output the non-matching lines.  (Dick Stewart's
   scripts have this problem, as well as gratuitously changing the
   problem to "s/abc\n123\nxyz/abc, 123, xyz/g" without even mentioning
   that this has been done.)
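
Bug 5 is the easiest to guard against.  The sketch below joins lines in
pairs on a made-up three-line input, so the last line has no partner.
With "$!N" the N is skipped on the last line; a bare N there is
implementation-dependent (GNU sed documents that it prints the pattern
space before exiting, while other versions may discard it).

```shell
# Join lines two at a time; the odd line "c" is the last line of input.
# "$!N" means "run N on every line except the last", so "c" is never
# handed to an N that has nothing left to read.
pairs=$(printf 'a\nb\nc\n' | sed '$!N; s/\n/ /')
printf '%s\n' "$pairs"
```

The output should be "a b" followed by "c" on a line of its own.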

My thanks to Allyn Fratkin, for being the first to suggest that the
problem admits of an elegant solution, and for providing some of the
ideas that have gone into the above scripts.

Dan Hoey