[comp.unix.questions] Import variables into awk.

warner@unc.cs.unc.edu (Byron Warner) (11/15/89)

My question is how do you import csh variables into an awk script.
for example if I have a file called foo, which contains:
{
	print import,$0
}

and I issue the command 
awk -F: -f foo /etc/passwd import='hello
why do I get just a list of logins?
Thanx in Advance

jik@athena.mit.edu (Jonathan I. Kamens) (11/15/89)

In article <10531@thorin.cs.unc.edu> warner@unc.cs.unc.edu (Byron Warner)
writes:
>My question is how do you import csh variables into an awk script.
>for example if I have a file called foo, which contains:
>{
>	print import,$0
>}
>
>and I issue the command 
>awk -F: -f foo /etc/passwd import='hello
>why do I get just a list of logins?
>Thanx in Advance

  First of all, I have never known the C-shell to allow the syntax
"foo=bar" on a command-line to import a variable into a program.  C
shell doesn't have anything like that.

  Second, the only way to do what you want is to actually make the
creation of this variable part of the awk script.  Like this:

% set import = 'hello'
% awk 'BEGIN { import = "'"$import"'" } { print import, $0}' /etc/passwd

The $import is evaluated before awk is actually called, and replaced
by 'hello' (sans quotes).

Jonathan Kamens			              USnail:
MIT Project Athena				11 Ashford Terrace
jik@Athena.MIT.EDU				Allston, MA  02134
Office: 617-253-8495			      Home: 617-782-0710

chris@mimsy.umd.edu (Chris Torek) (11/15/89)

>In article <10531@thorin.cs.unc.edu> warner@unc.cs.unc.edu (Byron Warner)
>writes:
[file foo]
>>{ print import,$0 }
[command]
>>awk -F: -f foo /etc/passwd import='hello
>>why do I get just a list of logins?

In article <15919@bloom-beacon.MIT.EDU> jik@athena.mit.edu
(Jonathan I. Kamens) writes:
>  First of all, I have never known the C-shell to allow the syntax
>"foo=bar" on a command-line to import a variable into a program.

It does not.  However, awk does.  That is, you are looking at the wrong
program.

>  Second, the only way to do what you want is to actually make the
>creation of this variable part of the awk script.  Like this:

Not so:  Within some limits, you can set awk variables from its
invocation.  For instance:

	% cat t
	BEGIN { print "BEGIN: " this; }
	{ print "INPUT: " this " " $0; }
	END { print "END: " this; }
	% cat u
	first line
	second line
	% awk -f t u this=that
	BEGIN: 
	INPUT:  first line
	INPUT:  second line
	END: that
	% awk -f t this=that u
	BEGIN:
	INPUT: that first line
	INPUT: that second line
	END: that
	% rm t u

The `BEGIN' statement is done before any `files' are opened; the `END'
statement is done after all `files' have been read.  Any `files' of
the form `a=b' set variable `a' to value `b'.

All of the above is with respect to the 4.3BSD flavour of `awk'.  The
new awk (as described in the awk book) appears to open the first `file'
before executing the BEGIN statement, so that any assignments that
appear before the first real file happen before the BEGIN.  What GNU
awk does, I do not know (but the above technique will tell you).
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163)
Domain:	chris@cs.umd.edu	Path:	uunet!mimsy!chris

steinbac@hpl-opus.HP.COM (Gunter Steinbach) (11/16/89)

> / hpl-opus:comp.unix.questions / warner@unc.cs.unc.edu (Byron Warner)
> / 1:15 pm Nov 14, 1989 /

> My question is how do you import csh variables into an awk script.

> [ deleted ]

> awk -F: -f foo /etc/passwd import='hello

The variable assignment has to come before the input file name.
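For example, something like

	awk -F: -f foo import=hello /etc/passwd

(with the stray quote dropped) should print "hello" in front of each
password-file line.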

	 Guenter Steinbach	 |	 hplabs!gunter_steinbach
				 |	 gunter_steinbach@hplabs.hp.com

jik@athena.mit.edu (Jonathan I. Kamens) (11/16/89)

In article <20774@mimsy.umd.edu> chris@mimsy.umd.edu (Chris Torek) writes:
>...
>The `BEGIN' statement is done before any `files' are opened; the `END'
>statement is done after all `files' have been read.  Any `files' of
>the form `a=b' set variable `a' to value `b'.

  Nifty!  Two questions:

1. Why isn't this mentioned in the BSD man page awk(1), or in the
   /usr/doc documentation about awk?
2. What happens if you actually want to read in a file that has = in
   the filename?  How am I supposed to know what happens if the
   feature isn't mentioned in documentation? :-)

Jonathan Kamens			              USnail:
MIT Project Athena				11 Ashford Terrace
jik@Athena.MIT.EDU				Allston, MA  02134
Office: 617-253-8495			      Home: 617-782-0710

tale@pawl.rpi.edu (David C Lawrence) (11/16/89)

In <10531@thorin.cs.unc.edu> warner@unc.cs.unc.edu (Byron Warner) writes:
Byron>  [file foo]
Byron> { print import,$0 }
Byron>  [command]
Byron> awk -F: -f foo /etc/passwd import='hello
Byron> why do I get just a list of logins?

Because the variable assignment has to come before the file name.  I'm
also assuming here that the ' is a typo, or that its missing mate is;
either way, variable assignments come before the file list.  If you
change it to "awk -F: -f foo import=hello /etc/passwd" it will work.

This applies to V7 awk, nawk and gawk.

In <20774@mimsy.umd.edu> chris@mimsy.umd.edu (Chris Torek) writes:
Chris> All of the above is with respect to the 4.3BSD flavour of `awk'.  The
Chris> new awk (as described in the awk book) appears to open the first `file'
Chris> before executing the BEGIN statement, so that any assignments that
Chris> appear before the first real file happen before the BEGIN.  What GNU
Chris> awk does, I do not know (but the above technique will tell you).

Variables set as above are not available in the BEGIN block with gawk,
but a special option, -v, is provided to do this.  -v VAR=VAL will
assign VAL to VAR before script execution begins; another -v must be
specified for each variable you want to declare this way.
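For example (a sketch, reusing the script file foo from the original
question):

	% gawk -F: -v import=hello -f foo /etc/passwd

would set import before anything runs, so it would be visible even in
a BEGIN block.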

Dave
-- 
 (setq mail '("tale@pawl.rpi.edu" "tale@ai.mit.edu" "tale@rpitsmts.bitnet"))

tale@pawl.rpi.edu (David C Lawrence) (11/16/89)

In <15924@bloom-beacon.MIT.EDU> jik@athena.mit.edu (Jonathan I. Kamens) writes:

Jon> 1. Why isn't this mentioned in the BSD man page awk(1), or in the
Jon>    /usr/doc documentation about awk?

Oversight, I suppose.  The SunOS manual page has it.

Jon> 2. What happens if you actually want to read in a file that has = in
Jon>    the filename?  How am I supposed to know what happens if the
Jon>    feature isn't mentioned in documentation? :-)

Good question.  I just tried a few different things which I thought
might work and none of them did.  It appears as though .*=.* patterns
which appear after a file name (/dev/null in my test case) are simply
ignored; they are neither interpreted as variable assignments nor as
file names.  I also tried passing it an arg of foo\=bar (my test case)
and it still did nothing.  In fact, it didn't even read stdin.  Hmm ...

Dave
-- 
 (setq mail '("tale@pawl.rpi.edu" "tale@ai.mit.edu" "tale@rpitsmts.bitnet"))

merlyn@iwarp.intel.com (Randal Schwartz) (11/16/89)

In article <10531@thorin.cs.unc.edu>, warner@unc (Byron Warner) writes:
| My question is how do you import csh variables into an awk script.
| for example if I have a file called foo, which contains:
| {
| 	print import,$0
| }
| 
| and I issue the command 
| awk -F: -f foo /etc/passwd import='hello
                                          ^ missing quote,  perhaps?
| why do I get just a list of logins?

The order of command-line options is significant:

% awk -F: -f foo import='hello' /etc/passwd

yields the result you want.  Also note that these variables are not
available in the "BEGIN" action (unless something happened after the
V7 version of awk).

Just another UNIX old-timer,
-- 
/== Randal L. Schwartz, Stonehenge Consulting Services (503)777-0095 ====\
| on contract to Intel's iWarp project, Hillsboro, Oregon, USA, Sol III  |
| merlyn@iwarp.intel.com ...!uunet!iwarp.intel.com!merlyn	         |
\== Cute Quote: "Welcome to Oregon... Home of the California Raisins!" ==/

richsc@ism780c.isc.com (Rich Scott) (11/17/89)

In article <15919@bloom-beacon.MIT.EDU> jik@athena.mit.edu (Jonathan I. Kamens) writes:
>In article <10531@thorin.cs.unc.edu> warner@unc.cs.unc.edu (Byron Warner)
>writes:
>>My question is how do you import csh variables into an awk script.
>>for example if I have a file called foo, which contains:
>>{
>>	print import,$0
>>}
>>
>>and I issue the command 
>>awk -F: -f foo /etc/passwd import='hello
>>why do I get just a list of logins?

	Well, apparently awk wants its 'imported' variables specified on
the command line *before* the datafile(s), but this isn't obvious from the
manual page. Someone here told me that the argument parsing may not be done
correctly. Anyway, on my system, which runs SunOS 3.5, I get the desired effect
(using csh) by doing:  awk -F: -f foo import='hello' /etc/passwd

(This is running the 4.2 or 4.3 BSD 'awk'; I can't speak for the "new" awk.)

>
>  First of all, I have never known the C-shell to allow the syntax
>"foo=bar" on a command-line to import a variable into a program.  C
>shell doesn't have anything like that.

	Umm, I don't think it's up to the shell in this case to do
anything with it; it's simply an argument to the program. Perhaps Byron,
if he really wants to import a C-shell variable into awk, should do:

	hostname% setenv VAR hello
	hostname% awk -F: -f foo.awk import=$VAR /etc/passwd

	The first example doesn't set any C-shell variables.
 
----------------
        rich scott                              rls@i88.isc.com
        interactive systems corporation         voice: (800) LAI-UNIX x255
        (formerly lachman associates)           naperville, il, usa

lang@PRC.Unisys.COM (Francois-Michel Lang) (11/17/89)

It's time once again to post to this group a document I have that
explains some important things about (vanilla) AWK
that are not documented elsewhere....

****************************************************************

\" to print this document, do ditroff -ms -Pip2 awk.supp
.RP
.TL
.B
A Supplemental Document For AWK
.sp
.R
- or -
.sp
.I
Things Al, Pete, And Brian Didn't Mention Much
.R
.AU
John W. Pierce
.AI
Department of Chemistry
University of California, San Diego
La Jolla, California  92093
jwp%chem@sdcsvax.ucsd.edu
.AB
As
.B awk
and its documentation are distributed with
.I
4.2 BSD UNIX*
.R
there are a number of bugs, undocumented features,
and features that are touched on so briefly in the
documentation that the casual user may
not realize their full significance.  While this document
applies primarily to the \fI4.2 BSD\fR version of \fIUNIX\fR,
it is known that the \fI4.3 BSD\fR version does not have
all of the bugs fixed, and that it does not have updated
documentation.  The situation with respect to the versions
of \fBawk\fR distributed with other versions of \fIUNIX\fR and
similar systems is unknown to the author.
.FS
*UNIX is a trademark of AT&T
.FE
.AE
.LP
In this document references to "the user manual" mean
.I
Awk - A Pattern Scanning and Processing Language (Second Edition)
.R
by Aho, Kernighan, and Weinberger.  References to "awk(1)" mean
the entry for
.B awk
in the
.I
UNIX Programmer's Manual, 4th Berkeley Distribution.
.R
References to "the documentation" mean both of those.
.LP
In most examples, the outermost set of braces ('{ }') has been
omitted.  They would, of course, be necessary in real scripts.
.NH
Known Bugs
.LP
There are three main bugs known to me.  They involve:
.IP
Assignment to input fields.
.IP
Piping output to a program from within an \fBawk\fR script.
.IP
Using '*' in \fIprintf\fR field width and precision specifications
does not work, nor do '\\f' and '\\b' print formfeed and backspace
respectively.
.NH 2
Assignment to Input Fields
.LP
[This problem is partially fixed in \fI4.3BSD\fR;
see the last paragraph of this section regarding the unfixed portion.]
.LP
The user manual states that input fields may be objects of assignment
statements.  Given the input line
.DS
field_one field_two field_three
.DE
the script
.DS
$2 = "new_field_2"
print $0
.DE
should print
.DS
field_one new_field_2 field_three
.DE
.LP
This does not work; it will print
.DS
field_one field_two field_three
.DE
That is, the script will behave as if the
assignment to $2 had not been made.  However,
explicitly referencing an "assigned to" field
.I does
recognize that the assignment has been made.
If the script
.DS
$2 = "new_field_2"
print $1, $2, $3
.DE
is given the same input it will [properly] print
.DS
field_one new_field_2 field_three
.DE
Therefore, you can
get around this bug with, e.g.,
.DS
$2 = "new_field_2"
output = $1                       # Concatenate output fields
for(i = 2; i <= NF; ++i)          # into a single output line
	output = output OFS $i    # with OFS between fields
print output
.DE
.LP
In \fI4.3BSD\fR, this bug has been fixed to the extent that
the failing example above works correctly.  However, a script like
.DS
$2 = "new_field_2"
var = $0
print var
.DE
still gives incorrect output.  This problem can be bypassed by using
.DS
\fIvar\fR = sprintf("%s", $0)
.DE
instead of "\fIvar\fR = $0"; \fIvar\fR will have the correct value.
.NH 2
Piping Output to a Program
.LP
[This problem appears to have been fixed in \fI4.3BSD\fR,
but that has not been exhaustively tested.]
.LP
The user manual states that
.I print
and
.I printf
statements may write to a program using, e.g.,
.DS
print | "\fIcommand\fR"
.DE
This would pipe the output into \fIcommand\fR, and it
does work.  However, you should be aware that this causes
.B awk
to spawn a child process (\fIcommand\fR), and that it
.I
does not
.R
wait for the child to exit before it exits itself.  In the case of a
"slow" command like
.B sort,
.B awk
may exit before
.I command
has finished.
.LP
This can cause problems in, for example, a shell script that
depends on everything done by
.B awk
being finished before the next shell command is executed.
Consider the shell script
.DS
awk -f awk_script input_file
mv sorted_output somewhere_else
.DE
and the
.B awk
script
.DS
print output_line | "sort -o sorted_output"
.DE
If
.I input_file
is large
.B awk
will exit long before
.B sort
is finished.  That means that the
.B mv
command will be executed before
.B sort
is finished, and the result is unlikely to be what you wanted.
Other than fixing the source, there is no way to avoid this
problem except to handle such pipes outside of the awk script, e.g.
.DS
awk -f awk_script input_file | sort -o sorted_output
mv sorted_output somewhere_else
.DE
which is not wholly satisfactory.
.LP
See
.I
Sketchily Documented Features
.R
below for other considerations in redirecting
output from within an
.B awk
script.
.NH 2
Printf and '*', '\\f', and '\\b'
.LP
The document says that the \fIprintf\fR function provided is
identical to the \fIprintf\fR provided by the \fIC\fR language
\fBstdio\fR package.  This is incorrect:  '*' cannot be used to
specify a field width or precision, and '\\f' and '\\b' cannot
be used to print formfeeds and backspaces.
.LP
The command
.DS
printf("%*.s", len, string)
.DE
will cause a core dump.  Given \fBawk\fR's age, it is likely
that its \fIprintf\fR was written well before the use of '*'
for specifying field width and precision appeared in the \fBstdio\fR
library's \fIprintf\fR.  Another possibility is that it wasn't
implemented because it isn't really needed to achieve the same effect.
.LP
To accomplish this effect, you can utilize the fact that \fBawk\fR
concatenates variables before it does any other processing on them.
For example, assume a script has two variables \fIwid\fR and
\fIprec\fR which control the width and precision used for printing
another variable \fIval\fR:
.DS
[code to set "wid", "prec", and "val"]

printf("%" wid "." prec "d\en", val)
.DE
If, for example, \fIwid\fR is 8 and \fIprec\fR is 3, then \fBawk\fR
will concatenate everything to the left of the comma in
the \fIprintf\fR statement, and the statement will really be
.DS
printf("%8.3d\en", val)
.DE
These could, of course, have been assigned to some variable \fIfmt\fR before
being used:
.DS
fmt = "%" wid "." prec "d"

printf(fmt "\en", val)
.DE
Note, however, that the newline ("\en") in the second form \fIcannot\fR
be included in the assignment to \fIfmt\fR.
.LP
To allow use of '\\f' and '\\b', \fBawk\fR's \fIlex\fR script must
be changed.  This is trivial to do (it is done at the point
where '\\n' and '\\t' are processed), but requires having source
code.  [I have fixed this and have not seen any unwanted effects.]
.\" .bp
.NH
Undocumented Features
.LP
There are several undocumented features:
.IP
Variable values may be established on the command line.
.IP
A
.B getline
function exists that reads the next input line and starts processing it
immediately.
.IP
Regular expressions accept octal representations of characters.
.IP
A
.B -d
flag argument produces debugging output if
.B awk
was compiled with "DEBUG" defined.
.IP
Scripts may be "compiled" and run later (providing the installer
did what is necessary to make this work).
.NH 2
Defining Variables On The Command Line
.LP
To pass variable values into a script at run time, you may use
.IP
.I variable=value
.LP
(as many as you like) between any "\fB-f \fIscriptname\fR" or
.I program
and the names of any files to be processed.  For example,
.DS
awk -f awkscript today=\e"`date`\e" infile
.DE
would establish for
.I awkscript
a variable named
.B today
that had as its value the output of the
.B date
command.
.LP
There are a number of caveats:
.IP
Such assignments may appear only between
.B -f
.I awkscript
(or \fIprogram\fR or [see below] \fB-R\fIawk.out\fR)
and the name of any
input file (or '-').
.IP
Each
.I variable=value
combination must be a single argument (i.e. there must not be spaces
around the '=' sign);
.I value
may be either a numeric value or a string.  If it is a string,
it must be enclosed in
double quotes at the time \fBawk\fR reads the argument.  That means
that the double quotes enclosing \fIvalue\fR on the command line
must be protected from the shell as in the example above or it will
remove them.
.IP
.I Variable
is not available for use within the script until after the first record
has been read and parsed, but it is available as soon as
that has occurred so that it may be used before any other
processing begins.  It does not exist at the time the
.B BEGIN
block is executed, and if there was no input it will not exist in the
.B END
block (if any).
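.LP
A short (untested) sketch of the last point:
.DS
awk 'BEGIN { print "begin: " x } { print "body: " x }' x=hello infile
.DE
should print an empty value for \fIx\fR on the "begin:" line, but
"hello" on each "body:" line.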
.NH 2
Getline Function
.LP
.B Getline
immediately reads the next input line (which is parsed into \fI$1\fR,
\fI$2\fR, etc) and starts processing it at the location of the call
(as opposed to
.B next
which immediately reads the next input line but starts processing
from the start of the script).
.LP
.B Getline
facilitates performing some types of tasks such as
processing files with multiline records and merging
information from several files.  To use the latter as an example,
consider a case where two files, whose lines do not share
a common format, must be processed together.  Shell and \fBawk\fR
scripts to do this might look something like
.sp
In the shell script
.DS
( echo DATA1; cat datafile1; echo ENDdata1; \e
  echo DATA2; cat datafile2; echo ENDdata2; \e
) | \e
    awk -f awkscript - > awk_output_file
.DE
In the
.B awk
script
.DS
/^DATA1/  {       # Next input line starts datafile1
          while (getline && $1 !~ /^ENDdata1$/)
                 {
                 [processing for \fIdata1\fR lines]
                 }
          }
.sp 1
/^DATA2/  {       # Next input line starts datafile2
          while (getline && $1 !~ /^ENDdata2$/)
                 {
                 [processing for \fIdata2\fR lines]
                 }
          }
.DE
There are, of course, other ways of accomplishing this particular task
(primarily using \fBsed\fR to preprocess the information),
but they are generally more difficult to write and more
subject to logic errors.  Many cases arising in practice
are significantly more difficult, if not impossible, to handle
without \fBgetline\fR.
.NH 2
Regular Expressions
.LP
The sequence "\fI\eddd\fR" (where 'd' is a digit)
may be used to include explicit octal
values in regular expressions.  This is often useful if "nonprinting"
characters have been used as "markers" in a file.  It has not been
tested for ASCII values outside the range 01 through 0127.
.NH 2
Debugging output
.LP
[This is unlikely to be of interest to the casual user.]
.sp
If \fBawk\fR was compiled with "DEBUG" defined, then giving it a
.B -d
flag argument will cause it to produce debugging output when it is run.
This is sometimes useful in finding obscure problems in scripts, though
it is primarily intended for tracking down problems with \fBawk\fR itself.
.NH 2
Script "Compilation"
.LP
[It is likely that this does not work at most sites.  If it does not, the
following will probably not be of interest to the casual user.]
.sp
The command
.DS
awk -S -f script.awk
.DE
produces a file named
.B awk.out.
This is a core image of
.B awk
after parsing the file
.I script.awk.
The command
.DS
awk -Rawk.out datafile
.DE
causes
.B awk.out
to be applied to \fIdatafile\fR (or the standard input if no
input file is given).  This avoids having to reparse large
scripts each time they are used.  Unfortunately, the way this
is implemented requires some special action on the part of the
person installing \fBawk\fR.
.LP
As \fBawk\fR is delivered with \fI4.2 BSD\fR (and \fI4.3 BSD\fR),
.I awk.out
is created by the \fBawk -S ...\fR process by calling
.B sbrk()
with '0', writing out the returned value, then
writing out the core image from location 0 to
the returned address.  The \fBawk -R...\fR process
reads the first word of
.I awk.out
to get the length of the image, calls
.B brk()
with that length, and
then reads the image into itself starting at location 0.
For this to work, \fBawk\fR must have been loaded with its
text segment writeable.  Unfortunately,
the \fIBSD\fR default for \fBld\fR is to load with the text
read-only and shareable.  Thus, the installer must remember to take
special action (e.g. "cc -N ..."
[equivalently "ld -N ..."] for \fI4BSD\fR) if these
flags are to work.
.LP
[Personally, I don't think it is
a very good idea to give \fBawk\fR the opportunity
to write on its text segment; I changed it so that
only the data segment is overwritten.]
.LP
Also, due to what appears to be a lapse in logic, the first
non-flag argument following \fB-R\fIawk.out\fR is discarded.
[Disliking that behavior, I changed it so that the \fB-R\fR flag
is treated like the \fB-f\fR flag:  no flag arguments may follow it.]
.\" .bp
.NH
Sketchily Documented Features
.LP
.NH 2
Exit
.LP
The user manual says that using the
.B exit
function causes the script to behave as if end-of-input has been reached.
Not mentioned explicitly is the fact that this will cause the
.B END
block to be executed if it exists.
Also, two things are omitted:
.IP
\fBexit(\fIexpr\fB)\fR causes the script's exit status to be
set to the value of \fIexpr\fR.
.IP
If
.B exit
is called within the
.B END
block, the script exits immediately.
.NH 2
Mathematical Functions
.LP
The following builtin functions exist and are mentioned in
.I awk(1)
but not in the user manual.
.IP \fBint(\fIx\fB)\fR 10
\fIx\fR truncated to an integer.
.IP \fBsqrt(\fIx\fB)\fR 10
the square root of \fIx\fR for \fIx\fR >= 0, otherwise zero.
.IP \fBexp(\fIx\fB)\fR 10
\fBe\fR-to-the-\fIx\fR for -88 <= \fIx\fR <= 88, zero
for \fIx\fR < -88, and dumps core for \fIx\fR > 88.
.IP \fBlog(\fIx\fB)\fR 10
the natural log of \fIx\fR.
.NH 2
OFMT Variable
.LP
The variable
.B OFMT
may be set to a format such as "%.2f"; purely numeric output will then
use that format in
.B print
statements.  The default value is "%.6g".  Again, this is mentioned in
.I awk(1)
but not in the user manual.
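.LP
For instance, a script such as
.DS
BEGIN { OFMT = "%.2f" }
{ print $1 / 3 }
.DE
should print each non-integral quotient with two digits after the
decimal point instead of the default six significant digits.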
.NH 2
Array Elements
.LP
The user manual states that "Array elements ... spring into existence by
being mentioned."  This is literally true;
.I any
reference to an array element causes it to exist.
("I was thought about, therefore I am.")
Take, for example,
.DS
if(array[$1] == "blah")
	{
	[process blah lines]
	}
.DE
If there is not an existing element of
.B array
whose subscript is the same as the contents of the
current line's first field,
.I
one is created
.R
and its value (null, of course) is then compared
with "blah".  This can be a bit
disconcerting, particularly when later processing is using
.DS
for (i in \fBarray\fR)
        {
        [do something with result of processing
	"blah" lines]
        }
.DE
to walk the array and expects all the elements to be non-null.
Succinct practical examples are difficult to construct, but
when this happens in a 500-line
script it can be hard to determine what has gone wrong.
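.LP
A small (untested) illustration of the effect:
.DS
BEGIN {
	if (arr["x"] == "blah") print "matched"   # this reference creates arr["x"]
	n = 0
	for (i in arr) ++n
	print n " element(s) now exist"
	}
.DE
The comparison alone creates \fIarr\fR["x"], so the loop should report
one element even though nothing was ever assigned to the array.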
.NH 2
FS and Input Fields
.LP
By default any number of spaces or tabs can separate fields (i.e.
there are no null input fields) and trailing spaces and tabs
are ignored.  However, if
.B FS
is explicitly set to any character other than a space
(e.g., a tab: \fBFS = "\et"\fR), then each occurrence of that
character separates fields, and trailing field-separator characters are
not ignored.  For example, if '>' represents a tab then
.DS
one>>three>>five>
.DE
defines six fields, with fields two, four, and six being empty.
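.LP
For instance, an (untested) script consisting of
.DS
BEGIN { FS = "\et" }
{ print NF }
.DE
should print 6 for that line, and only 3 if \fBFS\fR is left at its
default value.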
.LP
If
.B FS
is explicitly set to a space (\fBFS\fR = "\ "), then
the default behavior obtains (this may be a bug); that
is, both spaces
and tabs are taken as field separators, there can be no
null input fields, and trailing spaces and tabs are ignored.
.NH 2
RS and Input Records
.LP
If
.B RS
is explicitly set to the null string (\fBRS\fR = ""), then the input
record separator becomes a blank line, and the newline at the end
of each input line acts as a field separator.  This facilitates
handling multiline records.
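.LP
For instance, an (untested) script along the lines of
.DS
BEGIN { RS = "" }
{ print "record " NR " has " NF " fields" }
.DE
treats each blank-line-separated block of input as one record, with
the newlines inside the block acting as additional field separators.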
.NH 2
"Fall Through"
.LP
This is mentioned in the user manual, but it is important
enough that it is worth pointing out here, also.
.LP
In the script
.DS
/\fIpattern_1\fR/  {
             [do something]
             }
.sp
/\fIpattern_2\fR/  {
             [do something]
             }
.DE
all input lines will be compared with both 
.I pattern_1
and
.I pattern_2
unless the
.B next
function is used before the closing '}' in the
.I pattern_1
portion.
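.LP
To keep a line that matches \fIpattern_1\fR from also being tested
against \fIpattern_2\fR, end the first action with \fBnext\fR, e.g.
.DS
/\fIpattern_1\fR/  {
             [do something]
             next
             }
.DE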
.NH 2
Output Redirection
.LP
Once a file (or pipe) is opened by
.B awk
it is not closed until
.B awk
exits.  This can occasionally cause problems.  For example,
it means that a script that sorts its input lines into
output files named by the contents of their first fields
(similar to an example in the user manual)
.DS
{ print $0 > $1 }
.DE
is going to fail if the number of different first fields exceeds
about 10.
This problem
.I cannot
be avoided by using something like
.DS
{
command = "cat >> " $1
print $0 | command
}
.DE
as the value of the variable
.B command
is different for each different value of
.I $1
and is therefore treated as a different output "file".
.LP
[I have not been able to create a truly satisfactory
fix for this that doesn't involve having \fBawk\fR treat output
redirection to pipes differently from output to files; I
would greatly appreciate hearing of one.]
.NH 2
Field and Variable Types, Values, and Comparisons
.LP
The following is a synopsis of notes included with \fBawk\fR's
source code.
.NH 3
Types
.LP
Variables and fields can be strings or numbers or both.
.NH 4
Variable Types
.LP
When a variable is set by the assignment
.DS
\fIvar\fR = \fIexpr\fR
.DE
its type is set to the type of
.I expr
(this includes +=, ++, etc). An arithmetic
expression is of type
.I number,
a concatenation is of type
.I string,
etc.
If the assignment is a simple copy, e.g.
.DS
\fIvar1\fR = \fIvar2\fR
.DE
then the type of
.I var1
becomes that of
.I var2.
.LP
Type is determined by context; rarely, but always very inconveniently,
this context-determined type is incorrect.  As mentioned in
.I awk(1)
the type of an expression can be coerced to that desired.  E.g.
.DS
{
\fIexpr1\fR + 0
.sp 1
\fIexpr2\fR ""    # Concatenate with a null string
}
.DE
coerces
.I expr1
to numeric type and
.I expr2
to string type.
.NH 4
Field Types
.LP
As with variables, the type of a field is determined by
context when possible, e.g.
.RS
.IP $1++ 8
clearly implies that \fI$1\fR is to be numeric, and
.IP $1\ =\ $1\ ","\ $2 16
implies that $1 and $2 are both to be strings.
.RE
.LP
Coercion is done as needed.
In contexts where types cannot be reliably determined, e.g.,
.DS
if($1 == $2) ...
.DE
the type of each field is determined on input by inspection.  All fields are
strings; in addition, each field that contains only a number
is also considered numeric.  Thus, the test
.DS
if($1 == $2) ...
.DE
will succeed on the inputs
.DS
0       0.0
100     1e2
+100    100
1e-3    1e-3
.DE
and fail on the inputs
.DS
(null)      0
(null)      0.0
2E-518      6E-427
.DE
"only a number" in this case means matching the regular expression
.DS
^[+-]?[0-9]*\e.?[0-9]+(e[+-]?[0-9]+)?$
.DE
.NH 3
Values
.LP
Uninitialized variables have the numeric value 0 and the string value "".
Therefore, if \fIx\fR is uninitialized,
.DS
if(x) ...
if (x == "0") ...
.DE
are false, and
.DS
if(!x) ...
if(x == 0) ...
if(x == "") ...
.DE
are true.
.LP
Fields which are explicitly null have the string value "", and are not numeric.
Non-existent fields (i.e., fields past \fBNF\fR) are also treated this way.
.NH 3
Types of Comparisons
.LP
If both operands are numeric, the comparison is made
numerically.  Otherwise, operands are coerced to type
string if necessary, and the comparison is made on strings.
.NH 3
Array Elements
.LP
Array elements created by
.B split
are treated in the same way as fields.
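.LP
For instance, after an (untested)
.DS
n = split("one::three", part, ":")
.DE
\fIn\fR should be 3, and \fIpart\fR[2] should be an explicitly null,
non-numeric string, just like an explicitly null input field.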
----------------------------------------------------------------------------
Francois-Michel Lang
Paoli Research Center, Unisys         lang@prc.unisys.com      (215) 648-7256
Dept of Comp & Info Science, U of PA  lang@linc.cis.upenn.edu  (215) 898-9511

arnold@mathcs.emory.edu (Arnold D. Robbins {EUCC}) (11/17/89)

OK. Hopefully this is the definitive word on how things work.

V7 awk (old awk, /usr/bin/awk on Suns and other 4.3 based machines)

	awk '....' a=1 b=2 file c=3 file

	a is set to 1, b to 2, then the files are read; no further
	assignments are done until the first file has been read.  This
	feature was undocumented.  On my Sun, the values of a and b are
	NOT available in the BEGIN block.  After the first file is read,
	c gets set to 3.  Then the next one is read.

S5R3.n, n >= 1 nawk (new awk)

	awk '....' a=1 b=2 file c=3 file

	a is set to 1, b to 2, and those values ARE available in the
	BEGIN block.  Then the first file is read, then c is set to 3,
	then the second file is read. The value of c is NOT set in the
	BEGIN block.

	There are inconsistencies here, since conceptually the assignments
	are done when it goes to do a file open, and it "notices" that it's
	really a variable assignment.  But a and b are assigned before
	any program execution begins, while files aren't opened until
	after the BEGIN block has been run.  Note that the assignment of
	c is done correctly, after the BEGIN block.

GNU Awk 2.11 and S5R4 nawk

	awk -v z=26 '....' a=1 b=2 file c=3 file

	z is set to 26 before the BEGIN block is executed.  Then
	the BEGIN block is run. a is set to 1, b to 2, the first file
	is opened and processed, then c is set to 3, and then the
	second file is processed.

Unfortunately, people had come to rely on the way nawk did assignments
before the BEGIN block was run.  And yet the behavior was inconsistent.
So, to have our cake and eat it too, ALL assignments that appear where
file names are supposed to be are done after the BEGIN block.  But,
to make a variable be available in the BEGIN block, the new -v option
was added.  You must supply a -v option for each variable to be assigned.

It is important to note that normal assignments are done AT THE TIME they
would have been opened as files; don't expect c to be set while the
first file is being processed.
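For instance (a sketch of the behavior just described; file1 and file2
are any two input files):

	gawk -v z=26 '{ print FILENAME, a, c }' a=1 file1 c=3 file2

Here a prints as 1 on every line, c prints as empty for lines from
file1 and as 3 for lines from file2, and z would have been available
even in a BEGIN block.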

This is something that took some discussion and hammering out between
the GNU people (me and David Trueman), Brian Kernighan at Bell Labs
(and Al Aho through him), and Randall Howard at MKS.

In fact, when Brian first changed his awk to be consistent he got the loudest
complaints about needing variable assignments to happen before the BEGIN 
block was run (Hi Tom!).  Adding a command line option was the best compromise
we could come up with -- the text of the awk program does not change,
just the command line to invoke it, and everyone felt that while it
wasn't particularly pretty, we could all live with it.

(I mentioned the S5R4 awk above; I can't promise this, but I do know that
Brian has made his version of awk, which works as described above, available
to them for inclusion in S5R4.  Perhaps someone doing S5R4 at AT&T can
let us know if it made it in.  He also should have gotten his version
to the toolchest, but I don't know about that for sure either.)

GNU Awk 2.11.1 (version 2.11 at patchlevel 1) has been sent to
comp.sources.unix and should be appearing there shortly. Some version
of gnu awk will be in 4.4 BSD, when that comes out.

***

There is the separate question, "what if I have a filename with an `=' in it?"
The short answer is "don't do that".  It should perhaps be possible to come
up with a simple and consistent rule.  I don't know what that rule is
right now though, since we haven't given it a lot of thought yet.  But
I suspect you can look for a change in gawk 2.12 to address this.

Any more questions, class? :-)
-- 
Arnold Robbins -- guest account at Emory Math/CS	| Laundry increases
DOMAIN: arnold@emory.mathcs.emory.edu			| exponentially in the
UUCP: gatech!emory!arnold  PHONE: +1 404 636-7221	| number of children.
BITNET: arnold@emory	   				| -- Miriam Hartholz

merlyn@iwarp.intel.com (Randal Schwartz) (11/19/89)

In article <15924@bloom-beacon.MIT.EDU>, jik@athena (Jonathan I. Kamens) writes:
| 1. Why isn't this mentioned in the BSD man page awk(1), or in the
|    /usr/doc documentation about awk?

I got it by looking through the source.  That's the One True Way to
know things about UNIX.  Too bad the commercial world has seen fit to
lock up the sources now that they "support" (ha!) things.

Just another person who's read *every* line of the V7 source,
-- 
/== Randal L. Schwartz, Stonehenge Consulting Services (503)777-0095 ====\
| on contract to Intel's iWarp project, Hillsboro, Oregon, USA, Sol III  |
| merlyn@iwarp.intel.com ...!uunet!iwarp.intel.com!merlyn	         |
\== Cute Quote: "Welcome to Oregon... Home of the California Raisins!" ==/