carl@nrcaero.UUCP (Carl P. Swail) (02/18/85)
UNIX clinic
This column first appeared in the German quarterly
unix/mail (Hanser Verlag, Munich, Germany). It is
copyrighted: copyright 1984 by Axel T. Schreiner, Ulm, West
Germany. It may be reproduced as long as the copyright notice
is included and reference is made to the original
publication.
The column attempts to discuss typical approaches to
problem solving using the UNIX* system. It emphasizes what
the author considers to be good programming practices and
an appropriate choice of tools.
/lib/cpp
This quarter's column deals with uses and abuses of the
C preprocessor. We demonstrate some techniques which can
save a lot of work (and even more errors). The discussion
applies to programming in C in general, and it assumes only
very elementary prerequisites:
C programs are run through a preprocessor before
they are handed to the actual compiler. The
preprocessor performs (parametrized) text
substitution (#define), inserts header files (#include),
and can exclude parts of the source from
compilation (#if).
Since the preprocessor is independent of the
actual compiler - and does not know C at all - one
can use it in particular to extend the C language.
Only one's taste limits one's imagination here...
Excluding text
Every programmer presumably writes occasional comments.
Sometimes we also use a comment quite intentionally to
exclude program parts from a compilation. Since comments may
not be nested in Standard C, there is considerable temptation
to stop commenting the program parts excluded in this way.
__________________________
*UNIX is a Trademark of Bell Laboratories.
February 18, 1985
The following technique for text exclusion is much more
appropriate:
#ifdef not_defined
        crash_the_system(NOW); /* this definitely goes wrong */
#endif /* not_defined */
Of course, the name not_defined should really not be
defined...
Vector dimensions
In principle one can determine the size of a vector
using the sizeof operator. However, sizeof yields the size
in bytes, not in elements. The following macro determines
the number of elements in an arbitrary vector:
#define DIM(x) (sizeof (x) / sizeof ((x)[0]))
sizeof does not really need parentheses if it is used
to determine the size of an object and not of a data type.
One should, however, enclose macro parameters in
parentheses. Then things work out for a vector with more
than one dimension, too:
main()
{       struct { int a; char b; } v[10][20][30];
        printf("%d %d %d\n", DIM(v), DIM(v[1]), DIM(v[1][2]));
}
The program produces the values 10, 20 and 30.
Parentheses should not be necessary in this use of
sizeof since a vector subscript should have precedence over
sizeof. At least my copy of the Mark Williams CP/M-86 C
compiler does not seem to know this...
We can carry these ideas somewhat further. The last
element of a vector is
#define LAST(x) ((x)[DIM(x)-1])
and the customary for loop is, for example,
#define END(x) ((x) + DIM(x)-1)
int vector[10], * vp;
...
for (vp = vector; vp <= END(vector); ++ vp)
...
sizeof is evaluated by the compiler and may appear in
constant expressions. This can be used to determine the length
of constant strings in an efficient and flexible fashion:
#define STRLEN(s) (sizeof s - 1)
char buf[STRLEN("model") + 1];
...
strcpy(buf, "model");
There is the danger, however, that STRLEN is used for other
objects, i.e., non-strings, by mistaking it for strlen...
Trace
It is well known that a macro call is not recognized in
a constant string. Less well known, but perhaps more useful,
is that a macro parameter is recognized and replaced
within the replacement text of a macro definition, even
inside a string there. Rather than
printf("variable = %d\n", variable);
printf("formula = %f\n", formula);
we write
#define SHOW(val,fmt) fprintf(stderr,"SHOW: val = fmt\n",val)
SHOW(variable, %d);
SHOW(formula, %f);
The latter is easier to use and conveys more information
since val is replaced in the format by the entire macro
argument.
A bit of caution is required: if the % operator is used
within val there will be problems with the format. This can
be corrected as follows:
#define SHOW(val,fmt) fprintf(stderr,"%s = fmt\n", "val",val)
A macro can be defined without a replacement text.
Uses of SHOW can thus easily be eliminated from the compiled
program altogether. Alternatively we can specify a
condition:
#ifdef DEBUG
char debugflag;
# define SHOW(val,fmt) (debugflag && fprintf(...))
#else /* ! DEBUG */
# define SHOW(val,fmt) /* null */
#endif /* DEBUG */
In this example SHOW is always used as a statement and
not as an expression. Using && rather than if has two
advantages: this way we do not have to use SHOW as a
statement, and a use of SHOW cannot pair up with an else
that was meant for an enclosing if...
debugflag, by the way, should be used as a bit vector,
e.g.:
#define SHOW(level,val,fmt) (debugflag & 1<<level && \
fprintf(...))
Now we can maintain different sets of trace information at
levels 0 through 7.
Global variables
Of course, you like modular programs, too: with lots
of sources, a makefile, a central header file, and the
(feeble) hope that all global declarations really match? You
like to lint, too?
The following technique simplifies maintaining global
variables. The central header file contains roughly the
following:
#ifndef GLOBAL
# define GLOBAL extern
#endif /* GLOBAL */
GLOBAL int global_variable;
Unless something else is arranged, a variable declared GLOBAL
is thus declared extern.
Within exactly one of the source files which include
the header file we have to take care that the variables
which were declared extern elsewhere are really defined. In
the main source file we therefore write:
#define GLOBAL /* to define global variables */
#include "definitions.h"
One can even initialize global variables in this
context without resorting to the -m flag instructing the loader
ld to accept multiple definitions:
#ifdef GLOBAL
# define INIT(x) = x
#else /* ! GLOBAL */
# define GLOBAL extern
# define INIT(x)
#endif /* GLOBAL */
GLOBAL int variable INIT(10);
This technique is not very practical for aggregates.
The following variant is easier to use:
#ifdef GLOBAL
# define INIT(x) = x
# define GINIT
#else /* ! GLOBAL */
# define GLOBAL extern
# define INIT(x) ;
# undef GINIT
#endif /* GLOBAL */
GLOBAL struct { int a; char b; } variable INIT()
#ifdef GINIT
{ 10, 'b' };
#endif /* GINIT */
This method requires that the C preprocessor permits a
macro call with an empty argument list and that the C
compiler does not complain about superfluous semicolons between
global declarations. This method is admittedly no longer
very elegant but it has the significant advantage that the
text of central definitions exists only once in all cases.
/bin/lex
Now you see it...
lex programs have lots in common with fashions: the
effect is not always what the pattern promises... If a
function generated by lex is used as a front end for a
parser generated by yacc it is sometimes very hard to decide
where to place the blame for a bug: is there a bug in the
grammar presented to yacc or are the patterns which were
processed by lex at fault?
The following technique permits the construction of a
source file for lex which is conditionalized so that a
debugging version can be compiled at any time without any
changes to the source. In order to test the results of lex,
all inputs which the parser is to receive later are first
presented to the debugging version. This version of the
front end then prints a mnemonic version of the values which
the parser would receive:
%{
#ifdef TRACE
# include "assert.h"
main()
{       char * cp;
        assert(sizeof(int) >= sizeof(char *));
        while (cp = (char *) yylex())
                printf("%-.10s is \"%s\"\n",cp,yytext);
}
# define token(x) (int) "x"
#else /* ! TRACE */
# include "y.tab.h"
# define token(x) x
#endif /* TRACE */
%}
__________________________
This technique was developed for the book Introduction
to Compiler Construction by A. T. Schreiner and H. G.
Friedman Jr., to be published in January 1985 by
Prentice-Hall.
The technique requires that a pointer to a character
string can be returned in place of an int value. This
is not possible across all implementations of C, e.g.,
it is probably not allowed on the 7300 systems. We
guard against a portability problem using assert().
Normally TRACE is undefined and the tokens, i.e., the
values which are to be returned to the parser, are defined
in the file y.tab.h generated by yacc as:
#define NAME 257
...
These defined names are used directly in the source
presented to lex and are returned as a result of the
function yylex().
If TRACE is defined, y.tab.h need not yet exist. In
this case, i.e., in the debugging version, we want to return
a string as a result of yylex() which is then printed by the
main() program included in this case. Analyzing the
debugging output is most easily accomplished if the output
uses exactly those words which later will appear in y.tab.h,
i.e., which are a result of %token statements in the source
presented to yacc.
We are using the fact that macro parameters are
replaced within strings in the replacement text of a macro.
token(x) either returns x itself (to be passed on to yacc),
or a string "x" for the purposes of TRACE.
The remainder of the lex program is now quite obvious:
%%
[0-9]+ return token(NUMBER);
[a-z_A-Z][a-z_A-Z0-9]* return word();
[ \t\n]+ ;
. return token(yytext[0]);
%%
struct reserved { char * text; int yylex; } reserved[] = {
        { "begin", token(BEGIN) },
        { "end", token(END) },
        (char *) 0 };
int word()
{       struct reserved * rp;
        for (rp = reserved; rp->text; ++ rp)
                if (strcmp(yytext, rp->text) == 0)
                        return rp->yylex;
        return token(NAME);
}
Yes - there should have been a binary search,
but we are dealing only with the principles...
/usr/src/main.c
Argument standards
Command arguments are always good for surprises.
Sometimes several options may be combined into one argument;
sometimes each option must be a separate argument; sometimes
a parameter value follows as part of the argument; sometimes
it does not; all of the above; some of the above... ?
If one consults the sources of certain UNIX utilities,
one learns to appreciate the flexibility of C (or the
infinite patience of the C compiler?): everybody does his own
thing, and most do it differently in every program!
However, it would be so simple to develop a standard:
#include <stdio.h>
#define show(x) printf("x = %d\n", x)
#define USAGE fputs("cmd [-f] [-v #]\n", stderr), exit(1)
main(argc, argv)
        int argc;
        char ** argv;
{       int f = 0, v = 0;
        while (--argc > 0 && **++argv == '-')
        {       switch (*++*argv) {
                case 0:                         /* - */
                        --*argv;
                        break;
                case '-':
                        if (! (*argv)[1])       /* -- */
                        {       ++ argv, -- argc;
                                break;
                        }                       /* else fall through */
                default:
                        do
                        {       switch (**argv) {
                                case 'f':       /* -f */
                                        ++ f;
                                        continue;
                                case 'v':
                                        if (*++*argv)
                                                ;       /* -v# */
                                        else if (--argc > 0)
                                                ++argv; /* -v # */
                                        else
                                                break;
                                        v = atoi(*argv);
                                        *argv += strlen(*argv)-1;
                                        continue;
                                }
                                USAGE;
                        } while (*++*argv);
                        continue;
                }
                break;
        }
        show(f), show(v), show(argc);
        if (argc) puts(*argv);
}
At show() argc contains the number of arguments which
have not yet been processed and *argv is the first one of
these. This argument can be a single - character: in some
ancient (cat) and almost new (tar) utilities this indicates
that standard input or output is to be used in place of a
file argument.
Flags can be combined at will. If an option requires a
value, it can follow immediately (as the rest of the same
argument) or it can be an argument of its own.
Following a standard proposed in the "USENIX login", an
option -- serves to terminate processing of the option list.
Apart from that, options must start with - and they must
precede other arguments. These rules, however, still do not
cover all possibilities of pr...
The skeleton above is useful but anatomically somewhat
terrifying. The following incarnation is perhaps more
attractive:
#include <stdio.h>
#include "main.h"
#define show(x) printf("x = %d\n", x)
#define USAGE fputs("cmd [-f] [-v #]\n", stderr), exit(1)
MAIN
{       int f = 0, v = 0;
        OPT
        ARG 'f':
                ++ f;
        ARG 'v': PARM
                v = atoi(*argv);
                NEXTOPT
        OTHER
                USAGE;
        ENDOPT
        show(f), show(v), show(argc);
        if (argc) puts(*argv);
}
The trick of course is concealed in the header file
main.h: here the macros OPT, ARG, PARM, NEXTOPT, OTHER, and
ENDOPT must be defined using exactly those texts which were
given explicitly in the previous example:
#define MAIN main(argc, argv) \
int argc; \
char ** argv;
#define OPT while (--argc > 0 && **++argv == '-') \
{ switch (*++*argv) { \
case 0: \
--*argv; \
break; \
case '-': \
if (! (*argv)[1]) \
{ ++ argv, -- argc; \
break; \
} \
default: \
do \
{ switch (**argv) {
#define ARG continue; \
case
#define OTHER continue; \
}
#define ENDOPT } while (*++*argv); \
continue; \
} \
break; \
}
#define PARM if (*++*argv); \
else if (--argc > 0)++argv; else break;
#define NEXTOPT *argv += strlen(*argv)-1;
The definitions are not exactly beautiful - especially
if they need to be compacted so that the C preprocessor
accepts the lengthy replacement texts - but they need to be
developed only once to make the argument standard available
for all applications. An application then is almost
self-documenting:
MAIN    is the function header of the main program.
OPT     starts the loop during which the options are
        processed.
ENDOPT  completes this loop.
ARG     within the loop starts the processing of one
        option; the name of the option (a single
        character) enclosed in single quotes and a colon must
        follow.
PARM    follows the option specification if the option has
        a value parameter. The parameter itself is then
        available as *argv.
NEXTOPT is used in particular once such a parameter has
        been processed to advance to the next command
        argument.
OTHER   must follow all options; following this, one
        specifies what should be done if an option could
        not be recognized. NEXTOPT may be specified in
        this case, too. The unknown option itself is
        **argv.
After the OPT ENDOPT loop argc contains the number of
command arguments which have not yet been processed and
*argv is the first such argument. Arbitrarily many
(different) options ARG can be specified. pr would be
implemented approximately as follows:
MAIN
{
        do
        {       OPT
                ARG 'h': PARM
                        header = *argv;
                        NEXTOPT
                ARG 'w': PARM
                        width = atoi(*argv);
                        NEXTOPT
                ARG 'l': PARM
                        length = atoi(*argv);
                        NEXTOPT
                ARG 't':
                        tflag = 1;
                ARG 's': PARM
                        delimeter = **argv;
                        ++*argv;
                        NEXTOPT
                ARG 'm':
                        mflag = 1;
                OTHER
                        if (isdigit(**argv))
                                columns = atoi(*argv), NEXTOPT
                        else
                                USAGE, exit(1);
                ENDOPT
                if (argc)
                {       if (**argv == '+')
                        {       PARM
                                first_page = atoi(*argv);
                                continue;
                        }
                        dopr(*argv);
                }
                else
                        dopr("-");
        } while (argc > 1);
}
There is a blemish: -columns must be specified as a single
argument (since - alone refers to standard input).
--
Carl Swail Mail: National Research Council of Canada
Building U-66, Montreal Road
Ottawa, Ontario, Canada K1A 0R6
Phone: (613) 998-3408
USENET:
{cornell,uw-beaver}!utcsrgv!dciem!nrcaero!carl
{allegra,decvax,duke,floyd,ihnp4,linus}!utzoo!dciem!nrcaero!carl
dave@lsuc.UUCP (David Sherman) (02/22/85)
From Axel Schreiner's UNIX Clinic:
|| One can even initialize global variables in this
|| context without resorting to the -m flag instructing the loader
|| ld to accept multiple definitions:
Question: is there any cost associated with using the -m flag?
I use -m as a matter of course.
Aside from the occasional P-E dependency such as this reference,
I think the UNIX Clinic is excellent general fodder for UNIX
programmers, and warrants posting to net.lang.c.
Dave Sherman
--
{utzoo pesnta nrcaero utcs hcr}!lsuc!dave
{allegra decvax ihnp4 linus}!utcsri!lsuc!dave