[net.unix] treatise on regular expressions

kemp@noscvax.UUCP (Stephen P. Kemp) (09/28/84)

The following appeared in the Naval Ocean Systems Center
newsletter COMPUTING HIGHLIGHTS. I thought it would be
useful to USENETers. If you have comments, please mail 
them to ME and NOT to Mike.

--------------------------- * ------------------------------

                     by Mike Bloomberg

            *** Regular expressions in Unix ***
                    Theme and Variations

    A regular expression can be considered to be a string of
characters  with  certain characters having special meanings
as defined below. These special characters  enable  patterns
to be defined in an efficient and general manner.

  The grep family (consisting of grep, egrep for extended
expressions, and fgrep for fixed strings), as well as awk, ed
and sed, makes heavy use of regular expressions. The greps
search the input text file for the defined regular expression
and report any occurrences found.
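
  As a rough sketch of how the three greps differ, the
commands below run each one over a few made-up lines. The
sample text and the function name sample_lines are invented
for illustration; grep -E and grep -F are the POSIX spellings
of egrep and fgrep.

```shell
# Hypothetical sample input; not from the original article.
sample_lines() {
    printf '%s\n' 'unix is here' 'un.x literally' 'unax variant'
}

# grep: the period is a wildcard, so all three lines match.
sample_lines | grep 'un.x'

# grep -F (fgrep): the pattern is a fixed string, so only the
# line containing the literal text "un.x" matches.
sample_lines | grep -F 'un.x'

# grep -E (egrep): extended operators such as | are available,
# so the first and third lines match.
sample_lines | grep -E 'unix|unax'
```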

  Awk is a pattern processing language and is a generalization
of grep. Awk uses regular expressions, which are enclosed in
slashes (/), to locate the lines in the text file to perform
actions upon.
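
  A minimal sketch of an awk pattern and action follows. The
input lines and the function name systems are made up for
illustration only.

```shell
# Hypothetical input: a system name followed by a version.
systems() {
    printf '%s\n' 'unix V7' 'vms 4.0' 'unix 4.2BSD'
}

# The expression between the slashes selects the lines;
# the action in braces prints the second field of each.
systems | awk '/^unix/ { print $2 }'
# prints:
#   V7
#   4.2BSD
```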

  Ed and sed are line editors. Ed interacts with the user
while sed (stream editor) works in a "non-interactive" mode.
Both processors use regular expressions as context addresses.
A context address is the absolute position determined by the
next location of the character string that the regular
expression matches within the text file.
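
  To illustrate a context address, here is a small sed sketch.
The sample text and the function name memo are invented; ed
accepts the same /expression/ form of address interactively.

```shell
# Hypothetical three-line input.
memo() {
    printf '%s\n' 'first line' 'the unix line' 'last line'
}

# /unix/ is a context address: p prints only the line it
# selects (-n suppresses sed's normal output).
memo | sed -n '/unix/p'
# prints: the unix line

# The same address with d deletes that line instead,
# leaving the first and last lines.
memo | sed '/unix/d'
```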

  These special characters are:

                          -----------
. (period)
    matches any single character (wildcard character)

    EXAMPLE: the expression un.x will match any string in
    which the third character can be ANY character. So unax,
    unbx, ... unzx, un3x, un&x, un;x etc. will all match.

                          -----------
[]  any one of the characters or range of characters within
    these square brackets will match.

    EXAMPLE: un[a3z]x will match unax or un3x or unzx only.

    EXAMPLE: un[a-z]x will match unax, unbx ... unzx.
    However, un3x will NOT match.

    NOTE: placing a ^ (caret) immediately after the opening
    bracket will match the COMPLEMENT of the characters listed.

    EXAMPLE: un[^abcde]x will match unfx ... unzx, un1x ...
    un9x, un!x, un$x etc.
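
    The three bracket forms can be tried with grep as in the
    sketch below. The word list and the function name words
    are made up for illustration.

```shell
# Hypothetical word list, one word per line.
words() {
    printf '%s\n' unax un3x unzx unfx 'un!x'
}

# An explicit set: only a, 3 or z may appear third.
words | grep 'un[a3z]x'
# prints: unax, un3x, unzx

# A range: any lower-case letter may appear third.
words | grep 'un[a-z]x'
# prints: unax, unzx, unfx

# A complemented set: anything except a through e.
words | grep 'un[^abcde]x'
# prints: un3x, unzx, unfx, un!x
```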

                          -----------
()  Used for grouping of an expression. The expression
    enclosed in the parentheses can be operated on by
    operators such as * or +.

    EXAMPLE: (unix)+ would match unix or unixunix or
    unixunixunix etc.
    NB: Only for egrep,awk. Ed and sed group with \( and \)
    instead (see below) and do not have the + operator.
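
    A small sketch of what grouping changes, using grep -E
    (the POSIX spelling of egrep). The input and the function
    name candidates are invented; the ^ and $ anchors are used
    only so that partial matches do not hide the difference.

```shell
# Hypothetical input, one candidate string per line.
candidates() {
    printf '%s\n' unix unixunix unixx
}

# With grouping, + repeats the whole string "unix".
candidates | grep -E '^(unix)+$'
# prints: unix, unixunix

# Without grouping, + applies only to the final x.
candidates | grep -E '^unix+$'
# prints: unix, unixx
```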

                          -----------
|   an "or" conditional. Matches EITHER of the expressions
    to the left or right of the | (vertical bar) symbol.

    EXAMPLE: unix|grep will find any line that has either
    (or both) unix OR grep in it.
    NB: Only for egrep,awk
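
    The same alternation can be sketched with grep -E (the
    POSIX spelling of egrep) and with awk. The input lines
    and the function name notes are made up for illustration.

```shell
# Hypothetical input lines.
notes() {
    printf '%s\n' 'unix tools' 'grep and unix' 'sed only'
}

# A line is reported once whether it holds one alternative
# or both.
notes | grep -E 'unix|grep'
# prints: unix tools, grep and unix

# awk accepts the same expression between slashes; with no
# action given it prints the matching lines.
notes | awk '/unix|grep/'
```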

                          -----------
^ (caret)
    placed at the beginning of the expression, means to
    match the expression ONLY if it is at the beginning of
    the line (column 1).  Otherwise, the ^ is taken as a
    literal.

    EXAMPLE: ^unix will match only if the line begins with
    the word unix.

                          -----------
$ (dollar sign)
    placed at the end of the expression means to match the
    expression ONLY if it is at the end of the line.
    Otherwise, the $ is taken as a literal.

    EXAMPLE: unix$ will match only if the line ends with the
    word unix.
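
    Both anchors can be tried together, as in this sketch.
    The input lines and the function name anchor_demo are
    invented for illustration.

```shell
# Hypothetical input lines.
anchor_demo() {
    printf '%s\n' 'unix rules' 'I like unix' 'unix' 'a unix b'
}

# ^ ties the match to the beginning of the line.
anchor_demo | grep '^unix'
# prints: unix rules, unix

# $ ties the match to the end of the line.
anchor_demo | grep 'unix$'
# prints: I like unix, unix

# Both together match only a line that is exactly "unix".
anchor_demo | grep '^unix$'
# prints: unix
```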

                          -----------
\   disables the special meaning of the character that
    follows it.

    EXAMPLE: \. would look for a period in the text. \\
    would look for a backslash in the text.
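
    A short sketch of both escapes with grep. The input and
    the function name escape_demo are made up; the patterns
    are in single quotes so that the shell passes the
    backslashes through to grep unchanged.

```shell
# Hypothetical input: a period, a letter and a backslash
# each sit between a and b on their own line.
escape_demo() {
    printf '%s\n' 'a.b' 'axb' 'a\b'
}

# \. matches only a real period, not any character.
escape_demo | grep 'a\.b'
# prints: a.b

# \\ matches a single backslash in the text.
escape_demo | grep 'a\\b'
# prints: a\b
```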

                  ============================
                  =  The following symbols   =
                  =  apply to the character  =
                  =  immediately PRECEDING   =
                  =  the symbol.             =
                  ============================

* (asterisk)
    Matches any number (including zero) of occurrences of
    the character immediately preceding.

    EXAMPLE: un*x would match ux, unx, unnx, unnnx etc.

                          -----------
+   Similar to * but matches one or more occurrences.

    EXAMPLE: un+x would match unx, unnx, unnnx, unnnnx etc.
    NB: Only for egrep,awk
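
    The difference between * and + shows up on the string ux,
    as in this sketch. The input and the function name
    repeat_demo are invented; grep -E is the POSIX spelling
    of egrep.

```shell
# Hypothetical input, one string per line.
repeat_demo() {
    printf '%s\n' ux unx unnx umx
}

# * allows zero n's, so ux matches.
repeat_demo | grep 'un*x'
# prints: ux, unx, unnx

# + needs at least one n, so ux drops out.
repeat_demo | grep -E 'un+x'
# prints: unx, unnx
```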

                          -----------
\<  matches the expression only where it follows anything
    but a letter, digit or underscore, or starts the line.
    Normally used to find expressions at the front of a word.

    EXAMPLE: \<abc will find all words beginning with the
    letters abc
    NB: Only for grep,egrep (fgrep takes fixed strings only)

                          -----------
\>  matches the expression only where it precedes anything
    but a letter, digit or underscore, or ends the line.
    Normally used to find expressions at the end of a word.

    EXAMPLE: abc\> will find all words ending with the
    letters abc
    NB: Only for grep,egrep (fgrep takes fixed strings only)
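
    A sketch of both word markers follows. It assumes the
    local grep accepts \< and \>, which not every version
    does; the input and the function name word_demo are made
    up for illustration.

```shell
# Hypothetical input lines.
word_demo() {
    printf '%s\n' 'abcdef' 'xabc' 'the abc' 'xyzabc end'
}

# \<abc needs abc at the front of a word.
word_demo | grep '\<abc'
# prints: abcdef, the abc

# abc\> needs abc at the end of a word.
word_demo | grep 'abc\>'
# prints: xabc, the abc, xyzabc end
```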

                          -----------
\( and \)
    Enclosing an expression with a "\(" on the left and a
    "\)" on the right makes it referable later in the
    expression by the syntax "\n". "n" is the numeric order
    of the enclosed expression.

    EXAMPLE: \(abc\)def\1 will match the string abcdefabc

    EXAMPLE: \(abc\)\(def\)ghi\2\1 would match
    abcdefghidefabc
    NB: Only for ed,sed
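
    The two examples above can be checked with sed, using
    each expression as a context address. The input and the
    function name backref_demo are invented for illustration.

```shell
# Hypothetical input, one string per line.
backref_demo() {
    printf '%s\n' abcdefabc abcdefxyz abcdefghidefabc
}

# \1 must repeat whatever the first \( \) pair matched.
backref_demo | sed -n '/\(abc\)def\1/p'
# prints: abcdefabc

# \2 and \1 refer to the second and first pairs in turn.
backref_demo | sed -n '/\(abc\)\(def\)ghi\2\1/p'
# prints: abcdefghidefabc
```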

                     ======================

 For more documentation about regular expressions, see the
"Unix Programmer's Manual, Seventh Edition, November 1980,
Computer Science Division, Univ. of California at Berkeley",
which contains hardcopy versions of the "man" descriptions
of these processors.

 Articles of Interest:
 1. Tutorial Introduction to the Unix Text Editor (Ed)
 2. Advanced Editing on Unix (Ed)
 3. SED - non-interactive text editor.
 4. AWK - A pattern scanning and processing language




--------------------------- * ------------------------------
Steve Kemp	 {ihnp4, decvax, akgua, dcdwest, ucbvax}!sdcsvax!noscvax!kemp
Computer Sciences Corp.         kemp@nosc
Naval Ocean Systems Center
San Diego, CA