[comp.os.minix] bawk.doc

ast@cs.vu.nl (09/26/89)

NAME

        bawk - text processor

SYNOPSIS

        bawk rules [file] ...

DESCRIPTION

        Bawk is a text processing program that searches files for
        specific patterns and performs "actions" for every occurrence
        of these patterns.  The patterns can be "regular expressions"
        as used in the UNIX "ex" editor.  The actions are expressed
        using a subset of the "C" language.

        The patterns and actions are usually placed in a "rules" file
        whose name must be the first argument in the command line.
        All other arguments are taken to be the names of text files on
        which the rules are to be applied.
        The special file name "-" may also be used anywhere on the
        command line to take input from the standard input device.

        The command:

                bawk - prog.c - prog.h

        would read the rules (patterns and actions) from the standard
        input, then apply them to "prog.c", the standard input, and
        "prog.h", in that order.

        The general format of a rules file is:

                <pattern> { <action> }
                <pattern> { <action> }
                ...

        There may be any number of these <pattern> { <action> }
        sequences in the rules file.  Bawk reads a line of input from
        the current input file and applies every <pattern> { <action> }
        in sequence to the line.

        If the <pattern> corresponding to any { <action> } is missing,
        the action is applied to every line of input.  The default
        { <action> } is to print the matched input line.

PATTERNS

        The <pattern>'s may consist of any valid C expression.  If the
        <pattern> consists of two expressions separated by a comma, it
        is taken to be a range and the <action> is performed on all
        lines of input that match the range.  <pattern>'s may contain
        "regular expressions" delimited by an '@' symbol.  Regular
        expressions can be thought of as a generalized "wildcard"
        string matching mechanism, similar to that used by many
        operating systems to specify file names.  Regular expressions
        may contain any of the following characters:

                x       An ordinary character (not mentioned below)
                        matches that character.
                '\'     The backslash quotes any character.
                        "\$" matches a dollar-sign.
                '^'     A circumflex at the beginning of an expression
                        matches the beginning of a line.
                '$'     A dollar-sign at the end of an expression
                        matches the end of a line.
                '.'     A period matches any single character except
                        newline.
                ':x'    A colon matches a class of characters described
                        by the character following it:
                ':a'    ":a" matches any alphabetic;
                ':d'    ":d" matches digits;
                ':n'    ":n" matches alphanumerics;
                ': '    ": " matches spaces, tabs, and other control
                        characters, such as newline.
                '*'     An expression followed by an asterisk matches
                        zero or more occurrences of that expression:
                        "fo*" matches "f", "fo", "foo", "fooo", etc.
                '+'     An expression followed by a plus sign matches
                        one or more occurrences of that expression:
                        "fo+" matches "fo", "foo", "fooo", etc.
                '-'     An expression followed by a minus sign
                        optionally matches the expression.
                '[]'    A string enclosed in square brackets matches
                        any single character in that string, but no
                        others.  If the first character in the string
                        is a circumflex, the expression matches any
                        character except newline and the characters in
                        the string.  For example, "[xyz]" matches "xx"
                        and "zyx", while "[^xyz]" matches "abc" but not
                        "axb".  A range of characters may be specified
                        by two characters separated by "-".  Note that
                        [a-z] matches lower-case alphabetics, while
                        [z-a] never matches.
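
        The colon classes listed above correspond closely to the
        character tests in the C header <ctype.h>.  The following is
        a minimal sketch in C, not bawk's actual source: the function
        name class_match is hypothetical, and treating ": " as "space
        or any control character" is an assumption drawn from the
        description above.

```c
#include <assert.h>
#include <ctype.h>

/* Hypothetical helper: return 1 if character c belongs to the class
 * selected by the character that follows ':' in a pattern, else 0. */
int class_match(char selector, int c)
{
    assert(c >= 0 && c <= 255);   /* <ctype.h> expects unsigned char values */
    switch (selector) {
    case 'a': return isalpha(c) != 0;   /* ":a" alphabetic */
    case 'd': return isdigit(c) != 0;   /* ":d" digit */
    case 'n': return isalnum(c) != 0;   /* ":n" alphanumeric */
    case ' ':                           /* ": " space, tab, other controls (assumed) */
        return c == ' ' || iscntrl(c) != 0;
    default:
        return 0;                       /* unknown class letter: no match */
    }
}
```

        With this sketch, class_match('d', '7') is true while
        class_match('a', '7') is false.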

        For example, the following rules file would print every line
        that contained a valid C identifier:

                @[a-zA-Z][a-zA-Z0-9]@

        And this rules file would print all lines between and including
        the ones that contained the word "START" and "END":

                @START@, @END@
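
        A range pattern behaves like a small two-state machine.  The
        sketch below, in C, illustrates the idea only; it is not taken
        from bawk's source.  The names range_state and range_selects
        are hypothetical, plain substring search (strstr) stands in
        for regular expression matching, and it is an assumption that
        a line containing both words opens and closes the range at
        once.

```c
#include <assert.h>
#include <string.h>

/* State for one range rule: nonzero while we are inside the range. */
struct range_state {
    int active;
};

/* Hypothetical helper: return 1 if `line` falls inside the range opened
 * by a line containing `first` and closed by a line containing `last`.
 * Both boundary lines are selected, as the text describes. */
int range_selects(struct range_state *st, const char *line,
                  const char *first, const char *last)
{
    assert(st != NULL && line != NULL);
    if (!st->active) {
        if (strstr(line, first) == NULL)
            return 0;          /* outside the range and no start here */
        st->active = 1;        /* this line opens the range */
    }
    if (strstr(line, last) != NULL)
        st->active = 0;        /* this line closes the range, still selected */
    return 1;
}
```

        The caller keeps one range_state per rule, initialized to
        zero, and calls range_selects once for every input line.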

ACTIONS

        Actions are expressed as a subset of the C language.  All
        variables are global and default to int's if not formally
        declared.  Variable declarations may appear anywhere within
        an action.  Only char's and int's and pointers and arrays of
        char and int are allowed.  Bawk allows only decimal integer
        constants to be used - no hex (0xnn) or octal (0nn). String
        and character constants may contain all of the special C
        escapes (\n, \r, etc.).

        Bawk supports the "if", "else", "while" and "break" flow of
        control constructs, which behave exactly as in C.

        Also supported are the following unary and binary operators,
        listed in order from highest to lowest precedence:

                operator           type    associativity
                () []              unary   left to right
                ! ~ ++ -- - * &    unary   right to left
                * / %              binary  left to right
                + -                binary  left to right
                << >>              binary  left to right
                < <= > >=          binary  left to right
                == !=              binary  left to right
                &                  binary  left to right
                ^                  binary  left to right
                |                  binary  left to right
                &&                 binary  left to right
                ||                 binary  left to right
                =                  binary  right to left

        Comments are introduced by a '#' symbol and are terminated by
        the first newline character.  The standard "/*" and "*/"
        comment delimiters are not supported and will result in a
        syntax error.

FIELDS

        When bawk reads a line from the current input file, the
        record is automatically separated into "fields".  A field is
        simply a string of consecutive characters delimited by either
        the beginning or end of line, or a "field separator" character.
        Initially, the field separators are the space and tab characters.
        The special unary operator '$' is used to reference one of the
        fields in the current input record (line).  The fields are
        numbered sequentially starting at 1.  The expression "$0"
        references the entire input line.

        Similarly, the "record separator" is used to determine the end
        of an input "line"; initially it is the newline character.
        The field and record separators may be changed programmatically
        by one of the actions and will remain in effect until changed
        again.
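
        Field splitting as described above can be sketched in C as
        follows.  This is an illustration, not bawk's implementation:
        the names split_fields, struct field and MAX_FIELDS are
        hypothetical, and it is an assumption (in line with awk's
        default behavior) that a run of consecutive separators counts
        as a single delimiter.

```c
#include <assert.h>
#include <string.h>

#define MAX_FIELDS 32   /* arbitrary limit chosen for this sketch */

struct field {
    size_t start;   /* offset of the field within the line */
    size_t len;     /* number of characters in the field */
};

/* Hypothetical helper: locate the fields of `line`, treating every
 * character in `seps` as a field separator.  Returns the field count
 * (NF); out[0] describes $1, out[1] describes $2, and so on. */
int split_fields(const char *line, const char *seps,
                 struct field out[MAX_FIELDS])
{
    int nf = 0;
    size_t i = 0;

    assert(line != NULL && seps != NULL);
    while (line[i] != '\0' && nf < MAX_FIELDS) {
        i += strspn(line + i, seps);            /* skip a run of separators */
        if (line[i] == '\0')
            break;                              /* only separators remained */
        out[nf].start = i;
        out[nf].len = strcspn(line + i, seps);  /* field ends at next separator */
        i += out[nf].len;
        nf++;
    }
    return nf;
}
```

        Calling split_fields(line, " \t", fields) mirrors the initial
        separators, space and tab; $0 is simply the unsplit line.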

        Fields behave exactly like strings and can be used in the same
        context as a character array.  These "arrays" can be considered
        to have been declared as:

                char ($n)[ 128 ];

        In other words, they are 128 bytes long.  Notice that the
        parentheses are necessary because the [] operator binds more
        tightly than $; without them, the declaration would have
        parsed as:

                char $(1[ 128 ]);

        which is obviously ridiculous.

        If the contents of one of these field arrays is altered, the
        "$0" field will reflect this change.  For example, this
        expression:

                *$4 = 'A';

        will change the first character of the fourth field to an upper-
        case letter 'A'.  Then, when the following input line:

                120 PRINT "Name         address        Zip"

        is processed, it would be printed as:

                120 PRINT "Name         Address        Zip"

        Fields may also be modified with the strcpy() function (see
        below).  For example, the expression:

                strcpy( $4, "Addr." );

        applied to the same line above would yield:

                120 PRINT "Name         Addr.        Zip"

PREDEFINED VARIABLES

        The following variables are pre-defined:

                FS              Field separator (see FIELDS above).
                RS              Record separator (see FIELDS above).
                NF              Number of fields in current input
                                record (line).
                NR              Number of records processed thus far.
                FILENAME        Name of current input file.
                BEGIN           A special <pattern> that matches the
                                beginning of input text, before the
                                first record is read.
                END             A special <pattern> that matches the
                                end of input text, after the last
                                record has been read.

        Bawk also provides some useful builtin functions for string
        manipulation and printing:

                printf(arg..)   Exactly the printf() function from C.
                getline()       Reads the next record from the current
                                input file and returns 0 on end of file.
                nextfile()      Closes out the current input file and
                                begins processing the next file in the
                                list (if any).
                strlen(s)       Returns the length of its string argument.
                strcpy(s,t)     Copies the string "t" to the string "s".
                strcmp(s,t)     Compares "s" to "t" and returns 0 if
                                they match.
                toupper(c)      Returns its character argument converted
                                to upper-case.
                tolower(c)      Returns its character argument converted
                                to lower-case.
                match(s,@re@)   Compares the string "s" to the regular
                                expression "re" and returns the number
                                of matches found (zero if none).

EXAMPLES

        The following rules file will scan a C program, counting the
        number of mismatched parentheses, brackets, and braces.

                @[()\[\]{}]@
                {
                        # add each opener found, subtract each closer
                        parens = parens + match( $0, @(@ );
                        parens = parens - match( $0, @)@ );
                        bracks = bracks + match( $0, @\[@ );
                        bracks = bracks - match( $0, @\]@ );
                        braces = braces + match( $0, @{@ );
                        braces = braces - match( $0, @}@ );
                }
                END { printf("parens=%d, brackets=%d, braces=%d\n",
                                parens, bracks, braces );
                }
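
        For comparison, the per-line tally that the first rule keeps
        can be sketched in plain C.  This is only an illustration of
        the arithmetic; count_char and balance are hypothetical names
        and are not bawk builtins.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: count how many times character c occurs in s. */
int count_char(const char *s, char c)
{
    int n = 0;

    assert(s != NULL);
    for (; *s != '\0'; s++)
        if (*s == c)
            n++;
    return n;
}

/* Openers minus closers on one line; zero means the pair balances. */
int balance(const char *s, char opener, char closer)
{
    return count_char(s, opener) - count_char(s, closer);
}
```

        Summing balance(line, '(', ')') over every line of a file
        gives the same figure the rules file prints as "parens".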

        This program will capitalize the first word in every sentence of
        a document:

                BEGIN
                {
                        RS = '.';  # set record separator to a period
                }
                {
                        if ( match( $1, @^[a-z]@ ) )
                                *$1 = toupper( *$1 );
                        printf( "%s\n", $0 );
                }

LIMITATIONS

        Bawk was originally written in BDS C, but every attempt was made
        to keep the code as portable as possible.  The program should
        be compilable with any "standard" C compiler.  On CP/M systems
        compiled with BDS C, bawk takes up about 24K.

        An input record may be no longer than 128 characters.  If a
        longer record is encountered, it is cut off at that length and
        the next record starts where the previous one was cut off.

        A single pattern or action statement may be no longer than about
        4K characters, excluding comments and whitespace.  Since the
        program is semi-compiled, the tokenized version will probably
        wind up being smaller than the source code, so the 4K figure is
        only approximate.

AUTHOR

        Bob Brodt
        486 Linden Ave.
        Bogota, NJ 07603

ACKNOWLEDGEMENTS

        The concept for bawk (and 3/4 of the name!) was taken from
        the program "awk" written by Alfred V. Aho, Brian W. Kernighan
        and Peter J. Weinberger.  My apologies for any irreverences.

        The regular expression compiler/parser was borrowed from a
        program called "grep" and has been highly modified.  Grep is
        distributed by the DEC Users Society (DECUS) and is Copyright
        (C) 1980 by DECUS.  The author acknowledges DECUS with a nod of
        thanks for giving their general permission and okey-dokey to
        copy or modify the grep program.

        UNIX is a trademark of AT&T Bell Labs.