[net.emacs] Much ado about regular expressions

chris@umcp-cs.UUCP (Chris Torek) (10/22/86)

(Warning: the following article will tell you more than you ever
wanted to know about playing with regular expressions.)

In article <1168@peregrine.UUCP> someone writes:
>Since I have switched from vi to EMACS, there is one thing that I missed
>more than anything else.  The ability to perform an operation on all
>the lines that met a particular criteria (specified by a regular expression).
>For instance in vi, I could type in "/[A-Z][a-z]*/d" to delete all lines
>that met the specified criteria or I could type in 
>"/\([A-Za-z][A-Za-z]*(\).*\()\)/s//\1\2".  How would I do similar operations
>in EMACS?

(You left out the `g': `g/[A-Z][a-z]*/d'.)  Some of these operations
are best done by writing MLisp or elisp code, but note that a global
delete operation is trivial due to the way regular expressions work,
with the addition that Emacs can match newlines explicitly.  Simply
add `.*' at the front of your R.E., and add `.*<^J>' at the end:

	<ESC>x
	: re-replace-string<RET>
	Old pattern: .*[A-Z][a-z]*.*<^Q><^J><RET>
	New string: <RET>

(Note that this should be done after moving to the top of the
buffer, since Emacs's replace operations work from wherever you
are now to the end of the buffer.)  Since `.' matches any character
but newline, and `{class}*' matches the longest possible sequence
of {class}, this will always match full lines containing at least
one [A-Z].
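
As a rough illustration outside Emacs, the same trick can be tried with
Python's `re' module, whose `.' also refuses to match a newline by
default.  The function name and the sample text are made up for the
example; this is a sketch of the idea, not of what Emacs does internally.

```python
import re

def delete_matching_lines(text, core):
    """Delete every full line of text that contains a match for core.

    Wrapping core in `.*` plus a trailing newline makes each match cover
    a whole line, so replacing the match with nothing removes the line.
    """
    return re.sub(".*" + core + ".*\n", "", text)

sample = "alpha\nBeta gamma\ndelta\nepsilon Zeta\n"
result = delete_matching_lines(sample, "[A-Z][a-z]*")
print(result, end="")
```

Note that the sketch assumes every line, including the last, ends with
a newline; a final line without one would be left alone.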

The pattern can be simplified as well.  The [a-z]* part is unnecessary:
it may match zero lowercase letters, and whatever it does match is
already covered by the implied `.*' in vi's global, or the explicit
one in Emacs:

	Old pattern: .*[A-Z].*<^Q><^J><RET>
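
A quick spot check of that claim, again in Python rather than Emacs:
the long and short patterns should accept exactly the same lines.  The
sample lines are invented, and a handful of cases is of course not a
proof.

```python
import re

long_form = re.compile(r".*[A-Z][a-z]*.*\n")
short_form = re.compile(r".*[A-Z].*\n")

lines = ["alpha\n", "Beta gamma\n", "x Y z\n", "12345\n", "ENDS IN CAPS\n"]

# For each line, both patterns should either match it in full or reject it.
agree = all(
    (long_form.fullmatch(line) is None) == (short_form.fullmatch(line) is None)
    for line in lines
)
print(agree)
```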

There is one final possible optimisation that is very useful when
dealing with large files.  Emacs's search code runs faster when it
can do an `anchored search'.  (I am not using `anchored' in quite
the same sense as Snobol here.  There may be a better term, but I
cannot think of it offhand.)  By this I mean that a first character
that is considered `literal' speeds the matching operation.

For example, searching for `[A-Z][A-Z]*' is slow, but searching
for `A[A-Z]*' is fast.  The reason is that a literal match (the
first `A' here) is a common case, and has been optimised by having
the search code first find one `A' before trying the full-blown
regular expression match operation.
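
The idea can be sketched in Python as follows.  This is only a model of
the optimisation described above, not the actual Emacs search code; the
function name and its arguments are hypothetical.

```python
import re

def search_with_literal_prefix(text, first_char, pattern):
    """Find the first match of pattern, which must begin with first_char.

    str.find does the cheap scan for the literal character; the full
    matcher is tried only at positions where that literal occurs.
    """
    compiled = re.compile(pattern)
    pos = text.find(first_char)
    while pos != -1:
        m = compiled.match(text, pos)   # full match attempt at this spot
        if m:
            return m
        pos = text.find(first_char, pos + 1)
    return None

m = search_with_literal_prefix("xyz abc ABBA", "A", "A[A-Z]*")
print(m.start(), m.group())
```

A pattern that opens with a character class has no single character to
scan for, so a matcher built this way would have to attempt the full
match at every position.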

But look at this: our original pattern is required to match a full
line!  It must start at the beginning of a line, find one character
in [A-Z], match the rest of the line, then pick up a newline.
So we should be able to `anchor' it to the beginning of a line.
What begins a line?  Well, `^' in a regular expression should do
this.  We could use the pattern

	^.*[A-Z].*<^J>

Unfortunately, this does not run any faster.  Peeking at the innards
of the regular expression matcher shows why: `^' is not considered
a literal character.  Curses!  (No, not the library.)  But lo! there
is another way to denote the beginning of a line.  Every line begins
after the previous line ends, and every previous line ends with a
newline!  We can use instead the pattern

	<^J>.*[A-Z].*<^J>

But---oops!---we forgot something.  The very first line does not have
a previous line.  Now what can we do?

When all else fails, cheat:  Add a blank line at the top of the file.
Now we have a previous line, and can use our modified pattern:

	Old pattern: <^Q><^J>.*[A-Z].*<^Q><^J><RET>
	New string: <RET>

Whoops, that seems to have deleted all the newlines as well.  That
anchor we added came from the previous line, so we must put it back:

	New string: <^Q><^J><RET>
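
The effect of the two replacement strings can be reproduced in a small
Python sketch (sample text invented for the purpose).  The last case is
an observation about Python's non-overlapping `re.sub'; whether a given
Emacs behaves identically is not confirmed here.

```python
import re

text = "\n" + "keep one\nDrop me\nkeep two\n"   # blank line added at the top
both_newlines = r"\n.*[A-Z].*\n"

merged = re.sub(both_newlines, "", text)        # empty new string
restored = re.sub(both_newlines, "\n", text)    # newline put back
print(repr(merged))
print(repr(restored))

# Two adjacent matching lines: the first match consumes the newline
# that the second match would need in front of it.
adjacent = re.sub(both_newlines, "\n", "\nA\nB\n")
print(repr(adjacent))
```

With the empty replacement the neighbouring lines run together; with
the newline restored they do not.  In the adjacent case the line `B'
survives in this sketch, which is one more reason to look for a pattern
that does not swallow the trailing newline.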

But this is not necessary.  Since we know all about how .* matches
everything it can, we simply notice that that final newline on the
original pattern is not necessary.  If we leave it out, Emacs will
not match the newline between the line we wanted to delete and the
next.  But that is all right:  If we have Emacs leave that newline
behind, it will make up for the newline we stole from the previous
line.  Thus the final pattern is:

	Old pattern: <^Q><^J>.*[A-Z].*<RET>
	New string: <RET>

Of course, when we are all done we have to clean up: we stuck an
extra blank line at the top of the buffer so that we could cheat.

The ultimate sequence of commands, then, is

	ESC-<				(top of buffer)
	^O				(add that extra blank line)
	ESC-x re-replace-string		(do the replace)
	^Q ^J .*[A-Z].* RET		(type in the old pattern)
	RET				(specify a blank new string)
	^D				(delete that extra blank line)

And lo! Emacs deletes every line containing an uppercase letter.
Not only that, it even does it faster than vi!  :-)
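
For readers who want to check the logic away from the keyboard, the
command sequence above can be mimicked in Python.  This is a sketch
under the assumption that Python's `re.sub' stands in for
re-replace-string; the function name is made up.

```python
import re

def delete_lines_anchored(text, core="[A-Z]"):
    """Mimic the editing steps: add a blank first line, replace, remove it."""
    padded = "\n" + text                                   # like ^O at the top
    replaced = re.sub(r"\n.*" + core + ".*", "", padded)   # the replace itself
    return replaced[1:]                                    # like ^D afterwards

sample = "Alpha\nbeta\nGamma\nDelta\nepsilon\n"
result = delete_lines_anchored(sample)
print(repr(result))
```

Because each match takes the newline in front of a line and leaves the
one behind it, adjacent matching lines (`Gamma' and `Delta' here) are
both removed.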

(Actually, chances are that typing

	ESC-< ^@ ESC-> ESC-x filter-region egrep -v "[A-Z]" RET

is just as fast, and easier to remember.  We can use a wrench
as a hammer, but having the hammer too is nice.)
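
The filter approach amounts to keeping every line that does not match.
A minimal Python equivalent of the `egrep -v' step, with a hypothetical
function name, might look like this:

```python
import re

def grep_v(text, pattern):
    """Keep only the lines with no match for pattern, like `egrep -v`."""
    regex = re.compile(pattern)
    kept = [line for line in text.splitlines(keepends=True)
            if not regex.search(line)]
    return "".join(kept)

result = grep_v("Alpha\nbeta\nGamma\nepsilon\n", "[A-Z]")
print(repr(result))
```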
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7690)
UUCP:	seismo!umcp-cs!chris
CSNet:	chris@umcp-cs		ARPA:	chris@mimsy.umd.edu

ihm@minnie.UUCP (Ian Merritt) (10/25/86)

There are some kinds of replacement functions for which I would really
like the good ol' MIT-TECO (or even a reasonable subset) minibuffer.  I
realize this wouldn't be of much value for the newcomers to the EMACS
world, but for many of us who have been using EMACS since ITS/TOPS-20,
that was a really quick escape mechanism for certain transformations of
medium complexity which probably could be performed with regex or other
scenarios, but not as quickly.  How long it has been since I have seen
something like:

jsfoo$.,.+4uxsbar$xi$$


It has been so long that I am not even sure I remember the command set
quite correctly, but I found it quite useful back then and there have
been times recently when I would have found it much faster than a ^X(
macro or other method.

Oh well...

						<>IHM<>
-- 

uucp:	ihnp4!nrcvax!ihm