[comp.os.minix] manual page for BAWK

hall@cod.NOSC.MIL (Robert R. Hall) (09/26/89)

/*
	HEADER:		CUG000.00;
	TITLE:		BAWK User Manual;
	DATE:		05/17/1987;
	VERSION:	1.1;
	FILENAME:	BAWK.DOC;
	SEE-ALSO:	BAWK.C;
	AUTHORS:	B. Brodt;
*/

NAME
	bawk - text processor

SYNOPSIS

	bawk rules [file] ...

DESCRIPTION

	Bawk is a text processing program that searches files for
	specific patterns and performs "actions" for every occurrance
	of these patterns.  The patterns can be "regular expressions"
	as used in the UNIX "ex" editor.  The actions are expressed
	using a subset of the "C" language.

	The patterns and actions are usually placed in a "rules" file
	whose name must be the first argument in the command line,
	preceeded by the flag -f.  Otherwise, the first argument on the
	command line is taken to be a string containing the rules
	themselves. All other arguments are taken to be the names of text
	files on which the rules are to be applied, with '-' being the
	standard input.

	The command:

		bawk -f rules prog.*

	would read the patterns and actions rules from the file "rules"
	and apply them to all the files found by "prog.*".

		bawk '{ print $0 }' prog.*

	would print all the lines of all the files found by prog.*.

	The general format of a rules file is:

		<pattern> { <action> }
		<pattern> { <action> }
		...

	There may be any number of these <pattern> { <action> }
	sequences in the rules file.  Bawk reads a line of input from
	the current input file and applies every <pattern> { <action> }
	in sequence to the line.
	
	If the <pattern> corresponding to any { <action> } is missing,
	the action is applied to every line of input.  The default
	{ <action> } is to print the matched input line.

PATTERNS

	The <pattern>'s may consist of any valid C expression.  If the
	<pattern> consists of two expressions separated by a comma, it
	is taken to be a range and the <action> is performed on all
	lines of input that match the range.  <pattern>'s may contain
	"regular expressions" delimited by an '@' symbol.  Regular
	expressions can be thought of as a generalized "wildcard"
	string matching mechanism, similar to that used by many
	operating systems to specify file names.  Regular expressions
	may contain any of the following characters:

		x	An ordinary character (not mentioned below)
			matches that character.
		'\'	The backslash quotes any character.
			"\$" matches a dollar-sign.
		'^'	A circumflex at the beginning of an expression
			matches the beginning of a line.
		'$'	A dollar-sign at the end of an expression
			matches the end of a line.
		'.'	A period matches any single character except
			newline.
		':x'	A colon matches a class of characters described
			by the character following it:
		':a'	":a" matches any alphabetic;
		':d'	":d" matches digits;
		':n'	":n" matches alphanumerics;
		': '	": " matches spaces, tabs, and other control
			characters, such as newline.
		'*'	An expression followed by an asterisk matches
			zero or more occurrances of that expression:
			"fo*" matches "f", "fo", "foo", "fooo", etc.
		'+'	An expression followed by a plus sign matches
			one or more occurrances of that expression:
			"fo+" matches "fo", "foo", "fooo", etc.
		'-'	An expression followed by a minus sign
			optionally matches the expression.
		'[]'	A string enclosed in square brackets matches
			any single character in that string, but no
			others.  If the first character in the string
			is a circumflex, the expression matches any
			character except newline and the characters in
			the string.  For example, "[xyz]" matches "xx"
			and "zyx", while "[^xyz]" matches "abc" but not
			"axb".  A range of characters may be specified
			by two characters separated by "-".  Note that,
			[a-z] matches alphabetics, while [z-a] never
			matches.

	For example, the following rules file would print every line
	that contained a valid C identifier:

		@[a-zA-Z][a-zA-Z0-9]@

	And this rules file would print all lines between and including
	the ones that contained the word "START" and "END":

		@START@, @END@

ACTIONS

	Actions are expressed as a subset of the C language.  All
	variables are global and default to int's if not formally
	declared.  Variable declarations may appear anywhere within
	an action.  Only char's and int's and pointers and arrays of
	char and int are allowed.  Bawk allows only decimal integer
	constants to be used - no hex (0xnn) or octal (0nn). String
	and character constants may contain all of the special C
	escapes (\n, \r, etc.).

	Bawk supports the "if", "else", "while" and "break" flow of
	control constructs, which behave exactly as in C.

	Also supported are the following unary and binary operators,
	listed in order from highest to lowest precedence:

		operator           type    associativity
		() []              unary   left to right
		! ~ ++ -- - * &    unary   right to left
		* / %              binary  left to right
		+ -                binary  left to right
		<< >>              binary  left to right
		< <= > >=          binary  left to right
		== !=              binary  left to right
		&                  binary  left to right
		^                  binary  left to right
		|                  binary  left to right
		&&                 binary  left to right
		||                 binary  left to right
		=                  binary  right to left

	Comments are introduced by a '#' symbol and are terminated by
	the first newline character.  The standard "/*" and "*/"
	comment delimiters are not supported and will result in a
	syntax error.

FIELDS

	When bawk reads a line from the current input file, the
	record is automatically separated into "fields".  A field is
	simply a string of consecutive characters delimited by either
	the beginning or end of line, or a "field separator" character
	Initially, the field separators are the space and tab character.
	The special unary operator '$' is used to reference one of the
	fields in the current input record (line).  The fields are
	numbered sequentially starting at 1.  The expression "$0"
	references the entire input line.

	Similarly, the "record separator" is used to determine the end
	of an input "line", initially the newline character.  The field
	and record separators may be changed programatically by one of
	the actions and will remain in effect until changed again.
	Multiple (up to 10) field separators are allowed at a time, but
	only one record separator.

	Fields behave exactly like strings; and can be used in the same
	context as a character array.  These "arrays" can be considered
	to have been declared as:

		char ($n)[ 128 ];

	In other words, they are 128 bytes long.  Notice that the
	parentheses are necessary because the operators [] and $
	associate from right to left; without them, the statement
	would have parsed as:

		char $(1[ 128 ]);

	which is obviously ridiculous.

	If the contents of one of these field arrays is altered, the
	"$0" field will reflect this change.  For example, this
	expression:

		*$4 = 'A';

	will change the first character of the fourth field to an upper-
	case letter 'A'.  Then, when the following input line:

		120 PRINT "Name         address        Zip"

	is processed, it would be printed as:

		120 PRINT "Name         Address        Zip"

	Fields may also be modified with the strcpy() function (see
	below).  For example, the expression:

		strcpy( $4, "Addr." );

	applied to the same line above would yield:

		120 PRINT "Name         Addr.        Zip"

PREDEFINED VARIABLES

	The following variables are pre-defined:

		FS		Field separator (see below).
		RS		Record separator (see below also).
		NF		Number of fields in current input
				record (line).
		NR		Number of records processed thus far.
		FILENAME	Name of current input file.
		BEGIN		A special <pattern> that matches the
				beginning of input text, before the
				first record is read.
		END		A special <pattern> that matches the
				end of input text, after the last
				record has been read.

	Bawk also provides some useful builtin functions for string
	manipulation and printing:

		print(arg)	Simple printing of strings only,
				automatically terminated by '\n'.
		printf(arg..)	Exactly the printf() function from C.
		getline()	Reads the next record from the current
				input file and returns 0 on end of file.
		nextfile()	Closes out the current input file and
				begins processing the next file in the
				list (if any).
		strlen(s)	Returns the length of its string argument.
		strcpy(s,t)	Copies the string "t" to the string "s".
		strcmp(s,t)	Compares the "s" to "t" and returns 0 if
				they match.
		toupper(c)	Returns its character argument converted
				to upper-case.
		tolower(c)	Returns its character argument converted
				to lower-case.
		match(s,@re@)	Compares the string "s" to the regular
				expression "re" and returns the number
				of matches found (zero if none).

EXAMPLES

	The following rules file will scan a C program, counting the
	number of mismatched parentheses, brackets, and braces.

		/[()\[\]{}]/
		{	parens = parens + match( $0, @(@ );
			parens = parens - match( $0, @)@ );
			bracks = bracks + match( $0, @[@ );
			bracks = bracks - match( $0, @]@ );
			braces = braces + match( $0, @{@ );
			braces = braces - match( $0, @}@ );
		}
		END { printf("parens=%d, brackets=%d, braces=%d\n",
				parens, bracks, braces );
		}
	This program will capitalize the first word in every sentence of
	a document:

		BEGIN
		{
			strcpy( RS, "." );  # set record separator to a period
		}
		{
			if ( match( $1, @^[a-z]@ ) )
				*$1 = toupper( *$1 );
			printf( "%s\n", $0 );
		}

LIMITATIONS

	Bawk was originally written in BDS C, but every attempt was made
	to keep the code as portable as possible.  The program should
	be compilable with any "standard" C compiler.  On CP/M systems
	compiled with BDS C, bawk takes up about 24K.

	An input record may be no longer than 128 characters. If longer
	records are encountered, they terminate prematurely and the
	next record starts where the previous one was hacked off.

	A single pattern or action statement may be no longer than about
	4K characters, excluding comments and whitespace.  Since the
	program is semi-compiled the tokenized version will probably
	wind up being smaller than the source code, so the 4K figure is
	only approximate.

AUTHOR

	Bob Brodt
	486 Linden Ave.
	Bogota, NJ 07603

ACKNOWLEDGEMENTS

	The concept for bawk (and 3/4 of the name!) was taken from
	the program "awk" written by Afred V. Aho, Brian W. Kernighan
	and Peter J. Weinberger.  My apologies for any irreverences.

	The regular expression compiler/parser was borrowed from a
	program called "grep" and has been highly modified.  Grep is
	distributed by the DEC Users Society (DECUS) and is Copyright
	(C) 1980 by DECUS.  The author acknowledges DECUS with a nod of
	thanks for giving their general permission and okey-dokey to
	copy or modify the grep program.

	UNIX is a trademark of AT&T Bell Labs.

AMENDMENTS

	Addition of command line input option, the simple print function,
	and MSDOS wildexp.c from the CUG library by C W Rose.  Program
	amended to compile under Microsoft Quick C version 1.0.  Program
	subsequently compiled unchanged under MINIX.