[comp.editors] Multiple line regexps

soh@andromeda.trl.OZ.AU (kam hung soh) (06/03/91)

I would like to write a regular expression which can look for patterns
longer than one line.  For example, I want to find the first line of
each paragraph.  If I try this regexp in grep or awk, /^$^.+$/, nothing
happens.  Admittedly, I could replace newlines with a unique character
say '~', before I process my file, but I wondered if regexps can be used
across a newline boundary.

Regards,


Soh, Kam Hung      email: h.soh@trl.oz.au     tel: +61 3 541 6403 
Telecom Research Laboratories, POB 249 Clayton, Victoria 3168, Australia 

tchrist@convex.COM (Tom Christiansen) (06/03/91)

From the keyboard of soh@andromeda.trl.OZ.AU (kam hung soh):
:I would like to write a regular expression which can look for patterns
:longer than one line.  For example, I want to find the first line of
:each paragraph.  If I try this regexp in grep or awk, /^$^.+$/, nothing
:happens.  Admittedly, I could replace newlines with a unique character
:say '~', before I process my file, but I wondered if regexps can be used
:across a newline boundary.

Not in most text processing languages, but I'll offer two alternatives.

Rob Pike once explained to me that his screen editor, sam, could handle
such things because it has no hard-wired notion of what a line is.  Sadly,
sam is not available for free (although it's probably cheap from the 
AT&T toolbox) so I've not used it.  Perhaps someone who has might comment.

Another possibility is to use perl, which isn't really an interactive
editor, but is certainly a superset of sed, awk, and sh, at the very
least.  Perl has no problems with regexps spanning multiple lines.
While the default records processed are a line at a time, you can switch
this to paragraph mode (records delimited by newline pairs) or even
whole-file mode, in which you can slurp the entire file into the pattern
space.  There's no problem saying something like s/\n\nX\n//g then.  In
fact, there's an internal variable you can set to change the definitions
of ^ and $ to mean not just at the beginning or end of a string, but
rather anywhere after and before a newline as well, which is often handy.

--tom
--
Tom Christiansen		tchrist@convex.com	convex!tchrist
	    "Perl is to sed as C is to assembly language."  -me

gast@lanai.cs.ucla.edu (David Gast) (06/03/91)

In article <1991Jun2.231351.10229@trl.oz.au> soh@andromeda.trl.OZ.AU (kam hung soh) writes:
>I would like to write a regular expression which can look for patterns
>longer than one line.  For example, I want to find the first line of
>each paragraph.  If I try this regexp in grep or awk, /^$^.+$/, nothing
>happens.

Most Unix commands are line oriented, so the quick answer is no, but ...
Sed allows patterns to be longer than one line.  With a little bit
of programming, you can have awk recognize patterns across lines:
just save each line in a variable, then test whether the old line
matches one pattern and the new line matches another.  I realize this
statement is not very clear, so let me give a concrete example.  (I
have not checked this code, so it may have a typo or two, but the
example should be clear.)

Suppose you want to define a new paragraph as occurring when the previous
line is null (you may want to make it null or only white space, since people
do put spaces or tabs on otherwise null lines or at the end of lines) and
the current line is non-null (you could also require that it be indented
five spaces, begin with a capital, etc.).  This program prints the first
line of every new paragraph; you can revise it to suit your needs.

awk '
	$0 ~ /./ && oldline ~ /^$/ { print $0 }
	{ oldline = $0 }
' arguments-go-here

Note: If the first line of the file has text on it, it will print it
since oldline is implicitly initialized to null.
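
The whitespace-tolerant variant suggested above might look roughly like
this.  It is only a sketch, not code from the original post: the names
first_lines and prev_blank are made up for illustration, and it assumes
an awk that understands \t inside a bracket expression.

```shell
# Print the first line of every paragraph, treating lines that hold only
# spaces or tabs as blank.  first_lines and prev_blank are illustrative names.
first_lines() {
    awk '
        BEGIN { prev_blank = 1 }            # so a paragraph on line 1 is printed
        /[^ \t]/ && prev_blank { print }    # visible text right after a blank line
        { prev_blank = ($0 !~ /[^ \t]/) }   # remember whether this line was blank
    ' "$@"
}
```

Called as "first_lines some_file", or with no arguments to read standard
input.  Setting prev_blank in BEGIN makes the first-line behaviour
explicit rather than relying on an uninitialized variable.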

Obviously, perl could also do this since perl can do everything.  :-)

David Gast

byron@archone.tamu.edu (Byron Rakitzis) (06/03/91)

In article <1991Jun2.231351.10229@trl.oz.au> soh@andromeda.trl.OZ.AU (kam hung soh) writes:
>I would like to write a regular expression which can look for patterns
>longer than one line.  For example, I want to find the first line of
>each paragraph.

Most Unix utilities are line based, so no dice.

However, emacs will let you put a literal newline in the regexp, just like
any other character. Simply escape it with a C-q first.

I would like to write a "stream sam", after having seen and read about the
sam text editor. It seems to me that sed's usefulness is very limited in
certain circumstances, and writing obfuscated sed scripts making use of
the hold space just doesn't do it for me. 
--
Byron Rakitzis
byron@archone.tamu.edu

tchrist@convex.COM (Tom Christiansen) (06/03/91)

From the keyboard of soh@andromeda.trl.OZ.AU (kam hung soh):
:I would like to write a regular expression which can look for patterns
:longer than one line.  For example, I want to find the first line of
:each paragraph.  

It occurs to me I didn't answer the example question.  In perl, you
could solve that problem in this way (and many others as well):

    perl -00 -ne 'print /(.*\n)/' some_file

The -00 puts us in paragraph mode, and the (.*\n) isolates the first
line of each paragraph for printing.

--tom
--
Tom Christiansen		tchrist@convex.com	convex!tchrist
	    "Perl is to sed as C is to assembly language."  -me

lee@sq.sq.com (Liam R. E. Quin) (06/05/91)

soh@andromeda.trl.OZ.AU (kam hung soh) writes:
>I would like to write a regular expression which can look for patterns
>longer than one line.  For example, I want to find the first line of
>each paragraph.  If I try this regexp in grep or awk, /^$^.+$/, nothing
>happens.

Although you can't match across a newline with /^$^.+$/ in most Unix
software, you can get what you want.  You _could_ do it in lex, by the way,
and that would be sensible if you were going to do the same thing often.

You can do this in sed or awk, and also in ex or vi, with a little cleverness.
Here's how in ex or vi....

First, we could print all blank (empty) lines with
	:g/^$/p
The command
	g/reg-exp/command
tells the editor (vi, ex, ed) to run the command on every line that matches
the pattern.  The command is pretty unrestricted, although it can't be another
global (g) command...

Well, that prints all the blank lines.
We could print all lines after a blank line:
	:g/^$/+1p
but that isn't quite right, because it goes wrong if there are two blank
lines in a row.  Ah! that's why you had /^$^.+$/ and not /^$^.*$/.  I see...
OK, we could do this:
	:g/^$/+1s/./&/p
This says that on the line after each blank line, try to substitute a single
character for itself (&), and if that worked print the line.
This is OK except that if the last line in the file is blank the +1 is wrong,
so we must omit the last line, and do the command on 1,$-1:
	:1,$-1g/^$/.+1s/./&/p
Wow!  well, that's plausible.

In sed, we could use the Hold space.  I won't do that here, as it's a little
confusing to describe...
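
For the curious, one way the hold-space idea could be written is sketched
below.  This is an illustration under assumptions, not the script the
poster had in mind: the function name first_lines_sed is made up, and
only exactly empty lines count as blank.

```shell
# Print the first line of every paragraph with sed, keeping the previous
# line in the hold space.  first_lines_sed is an illustrative name.
first_lines_sed() {
    sed -n '
        # Swap: the pattern space now holds the previous line (empty on
        # line 1) and the hold space keeps the current line for next time.
        x
        # If the previous line was empty, fetch the current line back and
        # print it when it has at least one character.
        /^$/{
            g
            /./p
        }
    ' "$@"
}
```

Because the hold space starts out empty, a paragraph that begins on the
first line of the file is printed too, which matches the awk versions in
this thread.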

In awk, though, we could do this:
	awk '
	/^./ {
	    if (last == "") print
	}

	{
	    last = $0
	}'

You can be terser with some versions of awk:
	awk '/^./{ if (last == "") print} { last = $0 }'

If you have mgrep or GNU grep, you could also grep for blank lines,
with one line of context, and grep for . on the result.
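
A rough sketch of that context-grep pipeline follows.  It assumes a grep
that accepts -A for trailing context (GNU grep does) and that prints a
line containing only -- between separate groups of output; the function
name first_lines_grep is made up for illustration.

```shell
# Print the first line of every paragraph using grep context output.
# first_lines_grep is an illustrative name; -A is assumed to be supported.
first_lines_grep() {
    # Prepend one empty line so a paragraph at the very top of the input
    # also has a blank line in front of it.
    { echo; cat -- "$@"; } |
        grep -A 1 '^$' |      # each blank line plus the line after it
        grep -v '^--$' |      # drop the separator grep prints between groups
        grep .                # keep only the non-empty lines
}
```

One known weakness: a paragraph whose first line is exactly -- is
indistinguishable from the group separator and gets dropped.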

So none of these answer your real, fundamental, can-regexp-do-this question,
but they do address what you're trying to solve.

Lex can do multi-line patterns, and in Dougherty & O'Reilly's Unix Text
Processing (the big blue one) there is an example of a multi-line grep
using sed, as I recall.

Liam

-- 
Liam Quin, lee@sq.com, SoftQuad, Toronto, +1 416 963 8337
the barefoot programmer

kuiper@CS.Cornell.EDU (Matthijs Kuiper) (06/06/91)

byron@archone.tamu.edu (Byron Rakitzis) writes:

>I would like to write a "stream sam", after having seen and read about the
>sam text editor.
You do not have to write anything.
Just buy sam: it has a stream-mode. And, it is true, sam makes
it easy to write patterns that match multiple lines, or, in 
sam-speak, that match the newline-character.

--
Matthijs Kuiper (kuiper@cs.cornell.edu)

gpaa29@udcf.glasgow.ac.uk (F.Burton) (06/06/91)

You might like to look at the article 'The Text Editor sam' by Rob Pike
in Software--Practice and Experience (1987) _17_, 813-845. Pike describes
the way sam handles "structural regular expressions", where the file is
treated as a single string with matchable newlines.

The original reference is: R.Pike, 'Structural Regular Expressions,' Proc.
EUUG Spring Conf., Helsinki 1987, European Unix User's Group, Buntingford,
Herts, UK.
--
Francis Burton      Physiology, Glasgow University, Glasgow G12 8QQ, Scotland
041 339 8855 x6609  | JANET: F.L.Burton@vme.gla.ac.uk   !net: via mcsun & ukc
"A horse! A horse!" | INTERNET: via nsfnet-relay.ac.uk  BITNET: via UKACRL

gwc@root.co.uk (Geoff Clare) (06/06/91)

tchrist@convex.COM (Tom Christiansen) writes:

>    perl -00 -ne 'print /(.*\n)/' some_file

>The -00 puts us in paragraph mode, and the (.*\n) isolates the first
>line of each paragraph for printing.

The exact equivalent in awk is:

    awk 'BEGIN { RS=""; FS="\n" }
	 { print $1 }' some_file
    
The RS="" makes blank lines the record separator, and the FS="\n" allows
the first line of the record to be obtained using "$1".

-- 
Geoff Clare <gwc@root.co.uk>  (Dumb American mailers: ...!uunet!root.co.uk!gwc)
UniSoft Limited, London, England.   Tel: +44 71 729 3773   Fax: +44 71 729 3273

lwall@jpl-devvax.jpl.nasa.gov (Larry Wall) (06/07/91)

In article <2732@root44.co.uk> gwc@root.co.uk (Geoff Clare) writes:
: tchrist@convex.COM (Tom Christiansen) writes:
: 
: >    perl -00 -ne 'print /(.*\n)/' some_file
: 
: >The -00 puts us in paragraph mode, and the (.*\n) isolates the first
: >line of each paragraph for printing.
: 
: The exact equivalent in awk is:
: 
:     awk 'BEGIN { RS=""; FS="\n" }
: 	 { print $1 }' some_file
:     
: The RS="" makes blank lines the record separator, and the FS="\n" allows
: the first line of the record to be obtained using "$1".

That's an exact equivalent except in one Important Respect:

    $ perl -00 -ne 'print /(.*\n)/' u.usa.va.3
    # u.usa.va.3 uucp-map@acsu.buffalo.edu
    #N      ukelele
    #N      un1
    #N      usancon
    #N      usaos
    #N      .uu.net, uunet
    #N      .uucom.com, uucom
    #N      vast
    #N      .verdix.com, vrdxhq
    #N      viar
    #N      virgil
    #N      .virginia.edu, virginia
    #N      virtech
    #N      visenix
    #N      visix
    #N      viusys
    #N      vssadm
    #N      vtserf
    #N      wimpy
    #N      wperkins
    #N      .wsrcc.com, wsrcc.com, wsrcc
    #N      wyvern
    #N      xlisa
    #N      xrxedds
    #N      yendor
    #END u.usa.va.3
    $ awk 'BEGIN { RS=""; FS="\n" }{ print $1 }' u.usa.va.3
    # u.usa.va.3 uucp-map@acsu.buffalo.edu
    #N      ukelele
    #N      un1
    #N      usancon
    #N      usaos
    Segmentation fault (core dumped)

That's on a Vax.  On Suns, at least it's polite enough to give an error
message about the line being too long.

Arbitrary limits are for the birds.  They crap on you when you're already
halfway to the celebration.

Larry Wall
lwall@netlabs.com