soh@andromeda.trl.OZ.AU (kam hung soh) (06/03/91)
I would like to write a regular expression which can look for patterns
longer than one line.  For example, I want to find the first line of
each paragraph.  If I try this regexp in grep or awk, /^$^.+$/, nothing
happens.

Admittedly, I could replace newlines with a unique character, say '~',
before I process my file, but I wondered if regexps can be used across
a newline boundary.

Regards,

Soh, Kam Hung       email: h.soh@trl.oz.au       tel: +61 3 541 6403
Telecom Research Laboratories, POB 249 Clayton, Victoria 3168, Australia
tchrist@convex.COM (Tom Christiansen) (06/03/91)
From the keyboard of soh@andromeda.trl.OZ.AU (kam hung soh):
:I would like to write a regular expression which can look for patterns
:longer than one line.  For example, I want to find the first line of
:each paragraph.  If I try this regexp in grep or awk, /^$^.+$/, nothing
:happens.  Admittedly, I could replace newlines with a unique character,
:say '~', before I process my file, but I wondered if regexps can be
:used across a newline boundary.

Not in most text processing languages, but I'll offer two alternatives.

Rob Pike once explained to me that his screen editor, sam, can handle
such things because it doesn't have a hard-wired notion of what a line
is.  Sadly, sam is not available for free (although it's probably cheap
from the AT&T toolbox), so I've not used it.  Perhaps someone who has
might comment.

Another possibility is to use perl, which isn't really an interactive
editor, but is certainly a superset of sed, awk, and sh, at the very
least.  Perl has no problem with regexps spanning multiple lines.  While
the default record it processes is a line at a time, you can switch this
to paragraph mode (records delimited by newline pairs) or even
whole-file mode, in which you slurp the entire file into the pattern
space.  There's no problem saying something like s/\n\nX\n//g then.  In
fact, there's an internal variable you can set to change the definitions
of ^ and $ to mean not just the beginning or end of the string, but
anywhere just after or before a newline as well, which is often handy.

--tom
--
Tom Christiansen       tchrist@convex.com      convex!tchrist
 "Perl is to sed as C is to assembly language."  -me
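A minimal sketch of the whole-file substitution described above, on the
assumption that -0777 is the switch that reads the entire file as one
record; the file name "some_file" and the literal "X" line are only
placeholders:

    # Read the whole file into the pattern space so regexps can cross
    # newlines, then apply the s/\n\nX\n//g mentioned above: it removes
    # each "X" line that follows a blank line, the blank line itself,
    # and the newline ending the line before, so the neighbours join.
    perl -0777 -pe 's/\n\nX\n//g' some_file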
gast@lanai.cs.ucla.edu (David Gast) (06/03/91)
In article <1991Jun2.231351.10229@trl.oz.au> soh@andromeda.trl.OZ.AU (kam hung soh) writes:
>I would like to write a regular expression which can look for patterns
>longer than one line.  For example, I want to find the first line of
>each paragraph.  If I try this regexp in grep or awk, /^$^.+$/, nothing
>happens.

Most unix commands are line oriented, so the quick answer is no, but ...

Sed allows patterns to be longer than one line.  And with a little bit
of programming, you can have awk recognize patterns across lines: just
save each line into a variable, then test whether the old line matches
one pattern and the new line matches another.  I realize this statement
is not very clear, so let me give a concrete example.  (I have not
checked this code, so it may have a typo or two, but the idea should be
clear.)

Suppose you define a new paragraph as occurring when the previous line
is null (you may want to allow null or white-space-only lines, since
people do put spaces or tabs on otherwise null lines or at the end of
lines) and the current line is non-null (you could instead require that
it be indented five spaces, begin with a capital, etc.).  This program
prints the first line of every new paragraph; you can revise it to suit
your needs.

    awk '
    $0 ~ /./ && oldline ~ /^$/ { print $0 }
                               { oldline = $0 }
    ' arguments-go-here

Note: If the first line of the file has text on it, the program will
print it, since oldline is implicitly initialized to null.

Obviously, perl could also do this, since perl can do everything. :-)

David Gast
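A small variant of the script above, sketched on David's own suggestion
that "blank" might also cover lines holding nothing but spaces or tabs;
"some_file" is again just a placeholder for your input:

    awk '
    $0 ~ /[^ \t]/ && oldline ~ /^[ \t]*$/ { print $0 }
                                          { oldline = $0 }
    ' some_file

As before, the first line of the file prints if it has text on it,
because oldline starts out null.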
byron@archone.tamu.edu (Byron Rakitzis) (06/03/91)
In article <1991Jun2.231351.10229@trl.oz.au> soh@andromeda.trl.OZ.AU (kam hung soh) writes:
>I would like to write a regular expression which can look for patterns
>longer than one line.  For example, I want to find the first line of
>each paragraph.

Most Unix utilities are line based, so no dice.  However, emacs will let
you put a literal newline in the regexp, just like any other character.
Simply escape it with a C-q first.

I would like to write a "stream sam", after having seen and read about
the sam text editor.  It seems to me that sed's usefulness is very
limited in certain circumstances, and writing obfuscated sed scripts
making use of the hold space just doesn't do it for me.
--
Byron Rakitzis
byron@archone.tamu.edu
tchrist@convex.COM (Tom Christiansen) (06/03/91)
From the keyboard of soh@andromeda.trl.OZ.AU (kam hung soh):
:I would like to write a regular expression which can look for patterns
:longer than one line.  For example, I want to find the first line of
:each paragraph.

It occurs to me I didn't answer the example question.  In perl, you
could solve that problem in this way (and many others as well):

    perl -00 -ne 'print /(.*\n)/' some_file

The -00 put us in paragraph mode, and the (.*\n) isolates the first
line of each paragraph for printing.

--tom
--
Tom Christiansen       tchrist@convex.com      convex!tchrist
 "Perl is to sed as C is to assembly language."  -me
lee@sq.sq.com (Liam R. E. Quin) (06/05/91)
soh@andromeda.trl.OZ.AU (kam hung soh) writes:
>I would like to write a regular expression which can look for patterns
>longer than one line.  For example, I want to find the first line of
>each paragraph.  If I try this regexp in grep or awk, /^$^.+$/, nothing
>happens.

Although you can't match across a newline with /^$^.+$/ in most Unix
software, you can get what you want.

You _could_ do it in lex, by the way, and that would be sensible if you
were going to do the same thing often.

You can do this in sed or awk, and also in ex or vi, with a little
cleverness.  Here's how in ex or vi....

First, we could print all blank (empty) lines with
    :g/^$/p
The command
    g/reg-exp/command
tells the editor (vi, ex, ed) to run the command on every line that
matches the pattern.  The command is pretty unrestricted, although it
can't be another global (g) command...

Well, that prints all the blank lines.  We could print all lines after
a blank line:
    :g/^$/+1p
but that isn't quite right, because it goes wrong if there are two blank
lines in a row.  Ah! that's why you had /^$.+$/ and not /^$.*$/.  I
see...  OK, we could do this:
    :g/^$/+1s/./&/p
This says that on the line after each blank line, try to substitute a
single character for itself (&), and if that worked, print the line.
This is OK except that if the last line in the file is blank the +1 is
wrong, so we must omit the last line and run the command on 1,$-1:
    :1,$-1g/^$/.+1s/./&/p
Wow! well, that's plausible.

In sed, we could use the Hold space.  I won't do that here, as it's a
little confusing to describe...

In awk, though, we could do this:
    awk '
        /^./ { if (last == "") print }
             { last = $0 }'
You can be terser with some versions of awk:
    awk '/^./{ if (last == "") print} { last = $0 }'

If you have mgrep or GNU grep, you could also grep for blank lines with
one line of context, and grep for . on the result.

So none of these answer your real, fundamental can-regexp-do-this
question, but they do address what you're trying to solve.  Lex can do
multi-line patterns, and in Dougherty & O'Reilly's Unix Text Processing
(the big blue one) there is an example of a multi-line grep using sed,
as I recall.

Liam
--
Liam Quin, lee@sq.com, SoftQuad, Toronto, +1 416 963 8337
the barefoot programmer
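Since the sed hold-space version is only hinted at above, here is one
way it might look; a minimal, lightly tested sketch, with "some_file"
standing in for your input:

    # Keep the previous line in the hold space.  When it was blank (as
    # it is before the very first line) and the current line is not,
    # print the current line.
    sed -n '
    x
    /^$/{
        x
        /./p
        h
        b
    }
    ' some_file

The first x swaps in the previous line; if that was empty, the inner x
swaps the current line back, /./p prints it when it is non-empty, h
saves it as the new "previous" line, and b simply ends the cycle.  When
the previous line was not empty, nothing is printed, and the hold space
(already exchanged) carries the current line forward to the next cycle.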
kuiper@CS.Cornell.EDU (Matthijs Kuiper) (06/06/91)
byron@archone.tamu.edu (Byron Rakitzis) writes:
>I would like to write a "stream sam", after having seen and read about the
>sam text editor.

You do not have to write anything.  Just buy sam: it has a stream mode.
And, it is true, sam makes it easy to write patterns that match multiple
lines, or, in sam-speak, that match the newline character.
--
Matthijs Kuiper (kuiper@cs.cornell.edu)
gpaa29@udcf.glasgow.ac.uk (F.Burton) (06/06/91)
You might like to look at the article 'The Text Editor sam' by Rob Pike
in Software--Practice and Experience (1987) 17, 813-845.  Pike describes
the way sam handles "structural regular expressions", where the file is
treated as a single string with matchable newlines.

The original reference is:

    R. Pike, 'Structural Regular Expressions,' Proc. EUUG Spring Conf.,
    Helsinki 1987, European Unix User's Group, Buntingford, Herts, UK.
--
Francis Burton       Physiology, Glasgow University, Glasgow G12 8QQ, Scotland
041 339 8855 x6609   | JANET: F.L.Burton@vme.gla.ac.uk    !net: via mcsun & ukc
"A horse! A horse!"  | INTERNET: via nsfnet-relay.ac.uk   BITNET: via UKACRL
gwc@root.co.uk (Geoff Clare) (06/06/91)
tchrist@convex.COM (Tom Christiansen) writes:

> perl -00 -ne 'print /(.*\n)/' some_file

>The -00 put us in paragraph mode, and the (.*\n) isolates the first
>line of each paragraph for printing.

The exact equivalent in awk is:

    awk 'BEGIN { RS=""; FS="\n" }
    { print $1 }' some_file

The RS="" makes blank lines the record separator, and the FS="\n" allows
the first line of the record to be obtained using "$1".
--
Geoff Clare <gwc@root.co.uk>   (Dumb American mailers: ...!uunet!root.co.uk!gwc)
UniSoft Limited, London, England.  Tel: +44 71 729 3773  Fax: +44 71 729 3273
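If you want to try either one-liner without hunting around for a file
with blank-line-separated paragraphs, a throwaway here-document will do;
the sample text below is made up purely for illustration:

awk 'BEGIN { RS=""; FS="\n" } { print $1 }' <<'EOF'
First paragraph, line one.
First paragraph, line two.

Second paragraph, only line.
EOF

This prints the first line of each of the two paragraphs; feeding the
same text to the perl version above gives the same result.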
lwall@jpl-devvax.jpl.nasa.gov (Larry Wall) (06/07/91)
In article <2732@root44.co.uk> gwc@root.co.uk (Geoff Clare) writes:
: Distribution: comp
: Organization: UniSoft Ltd., London, England
: Lines: 18
:
: tchrist@convex.COM (Tom Christiansen) writes:
:
: > perl -00 -ne 'print /(.*\n)/' some_file
:
: >The -00 put us in paragraph mode, and the (.*\n) isolates the first
: >line of each paragraph for printing.
:
: The exact equivalent in awk is:
:
: awk 'BEGIN { RS=""; FS="\n" }
: { print $1 }' some_file
:
: The RS="" makes blank lines the record separator, and the FS="\n" allows
: the first line of the record to be obtained using "$1".

That's an exact equivalent except in one Important Respect:

$ perl -00 -ne 'print /(.*\n)/' u.usa.va.3
# u.usa.va.3 uucp-map@acsu.buffalo.edu
#N ukelele
#N un1
#N usancon
#N usaos
#N .uu.net, uunet
#N .uucom.com, uucom
#N vast
#N .verdix.com, vrdxhq
#N viar
#N virgil
#N .virginia.edu, virginia
#N virtech
#N visenix
#N visix
#N viusys
#N vssadm
#N vtserf
#N wimpy
#N wperkins
#N .wsrcc.com, wsrcc.com, wsrcc
#N wyvern
#N xlisa
#N xrxedds
#N yendor
#END u.usa.va.3

$ awk 'BEGIN { RS=""; FS="\n" }{ print $1 }' u.usa.va.3
# u.usa.va.3 uucp-map@acsu.buffalo.edu
#N ukelele
#N un1
#N usancon
#N usaos
Segmentation fault (core dumped)

That's on a Vax.  On Suns, at least it's polite enough to give an error
message about the line being too long.

Arbitrary limits are for the birds.  They crap on you when you're
already halfway to the celebration.

Larry Wall
lwall@netlabs.com