carl@nrcaero.UUCP (Carl P. Swail) (02/18/85)
UNIX clinic

This column first appears in the German quarterly unix/mail (Hanser Verlag, Munich, Germany). It is copyrighted: copyright 1984 by Axel T. Schreiner, Ulm, West Germany. It may be reproduced as long as the copyright notice is included and reference is made to the original publication.

The column attempts to discuss typical approaches to problem solving using the UNIX* system. It emphasizes what the author considers to be good programming practices and appropriate choice of tools.

__________________________
*UNIX is a Trademark of Bell Laboratories.

February 18, 1985

/lib/cpp

This quarter's column deals with uses and abuses of the C preprocessor. We demonstrate some techniques which can save a lot of work (and even more errors). The discussion applies to programming in C in general, and it assumes only very elementary prerequisites:

C programs are run through a preprocessor _before_ they are handed to the actual compiler. The preprocessor performs (parametrized) text substitution (#define), inserts header files (#include), and can exclude parts of the source from compilation (#if).

Since the preprocessor is independent of the actual compiler - and does not know C at all - one can use it in particular to extend the C language. Only one's taste limits one's imagination here...

Excluding text

Every programmer presumably writes occasional comments. Sometimes we use comments quite deliberately to exclude parts of a program from compilation. Since comments may not be nested in standard C, there is a considerable temptation to leave such excluded parts without comments of their own. The following technique for text exclusion is much more appropriate:

    #ifdef not_defined
        crash_the_system(NOW);  /* this definitely goes wrong */
    #endif /* not_defined */

Of course, the name not_defined should really not be defined...
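A minimal sketch of the effect, assuming nothing beyond the directives shown above. The function surviving_statements() and its counter are hypothetical and serve only to make the exclusion visible. Note that the excluded part keeps its own comments, which would not be possible if it were wrapped in a comment as a whole.

```c
/* Hypothetical helper: counts the statements which survive preprocessing. */
int surviving_statements(void)
{
    int count = 0;

    ++count;                        /* compiled */
#ifdef not_defined
    crash_the_system(NOW);          /* this definitely goes wrong */
    /* comments inside the excluded part cause no trouble */
    ++count;
#endif /* not_defined */
    ++count;                        /* compiled */
    return count;
}
```

As long as not_defined stays undefined, only the two increments outside the excluded part are compiled, and the undeclared crash_the_system() is never seen by the compiler.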
Vector dimensions

In principle one can determine the size of a vector using the sizeof operator. However, sizeof yields the size in bytes, not in elements. The following macro determines the number of elements in an arbitrary vector:

    #define DIM(x)  (sizeof (x) / sizeof ((x)[0]))

sizeof does not really need parentheses if it is used to determine the size of an object and not of a data type. One should, however, enclose macro parameters in parentheses. Then things work out for a vector with more than one dimension, too:

    main()
    {   struct { int a; char b; } v[10][20][30];

        printf("%d %d %d\n", DIM(v), DIM(v[1]), DIM(v[1][2]));
    }

The program produces the values 10, 20 and 30. Parentheses should not be necessary in this use of sizeof since a vector subscript should have precedence over sizeof. At least my copy of the Mark Williams CP/M-86 C compiler does not seem to know this...

We can carry these ideas somewhat further. The last element of a vector is

    #define LAST(x) ((x)[DIM(x)-1])

and the customary for loop is for example

    #define END(x)  ((x) + DIM(x)-1)

    int vector[10], * vp;
    ...
    for (vp = vector; vp <= END(vector); ++ vp)
        ...

sizeof is evaluated by the compiler and may therefore appear in constant expressions. This can be used to determine the length of constant strings in an efficient and flexible fashion:

    #define STRLEN(s)   (sizeof (s) - 1)

    char buf[STRLEN("model") + 1];
    ...
    strcpy(buf, "model");

There is the danger, however, that STRLEN is mistaken for strlen and applied to other objects, i.e., non-strings. For a character pointer, for example, it yields the size of the pointer less one and not the length of the string...

Trace

It is well known that a macro call is not recognized in a constant string. Less well known, but perhaps more useful, is that a macro parameter is recognized and replaced even within a string in the replacement text of a macro definition.
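Whether a parameter is replaced inside a string depends on the preprocessor. The behaviour relied upon here is that of the traditional /lib/cpp. A preprocessor following the later ANSI C standard leaves strings alone and provides the # operator for the same purpose. The following is only a sketch under that assumption; SHOW_INT is a hypothetical name which does not appear in this column.

```c
#include <stdio.h>

/* Hypothetical macro, assuming an ANSI C preprocessor:
 * #val turns the text of the argument into a string, so the
 * argument text never becomes part of the format itself.
 * val must be an int expression; buf must be large enough. */
#define SHOW_INT(buf, val)  sprintf((buf), "%s = %d", #val, (val))
```

With int count = 41, the call SHOW_INT(buf, count + 1) leaves the text "count + 1 = 42" in buf.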
Rather than

    printf("variable = %d\n", variable);
    printf("formula = %f\n", formula);

we write

    #define SHOW(val,fmt)   fprintf(stderr,"SHOW: val = fmt\n",val)

    SHOW(variable, %d);
    SHOW(formula, %f);

The latter is easier to use and conveys more information since val is replaced in the format by the entire macro argument.

A bit of caution is required: if the % operator is used within val there will be problems with the format, since the text of val becomes part of the format string. This can be corrected as follows:

    #define SHOW(val,fmt)   fprintf(stderr,"%s = fmt\n", "val",val)

A macro can be defined without a replacement text. Uses of SHOW can thus easily be eliminated from the compiled program altogether. Alternatively we can specify a condition:

    #ifdef DEBUG
        char debugflag;
    #   define SHOW(val,fmt)    (debugflag && fprintf(...))
    #else   /* ! DEBUG */
    #   define SHOW(val,fmt)    /* null */
    #endif  /* DEBUG */

In this example we assume that SHOW is always used as a statement and not as an expression, since nothing remains of it if DEBUG is not defined. Using && rather than if has two advantages: this way we do not _have_ to use SHOW as a statement, and a use of SHOW does not invite an unintentional else...

debugflag, by the way, should be used as a bit vector, e.g.:

    #define SHOW(level,val,fmt) (debugflag & 1<<level && \
                                    fprintf(...))

Now we can maintain different sets of trace information at levels 0 through 7.

Global variables

Of course, you like modular programs, too, with lots of sources, a makefile, a central header file, and the (feeble) hope that all global declarations really match? You like to lint, too?

The following technique simplifies maintaining global variables. The central header file contains about the following:

    #ifndef GLOBAL
    #   define GLOBAL   extern
    #endif  /* GLOBAL */

    GLOBAL int global_variable;

If nothing else is arranged, a variable declared GLOBAL thus is declared extern.
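A one-file sketch of what the compiler sees when GLOBAL has not been defined beforehand. The header text is pasted inline here only for illustration. The definition with the initial value 10 and the function read_global() are hypothetical stand-ins for the single source file which, as described next, has to define the variable.

```c
/* text of the central header, pasted inline for this sketch */
#ifndef GLOBAL
#   define GLOBAL   extern
#endif

GLOBAL int global_variable;     /* here: extern int global_variable; */

/* hypothetical stand-in for the one source file which must
 * supply the definition; the extern declaration above allocates nothing */
int global_variable = 10;

/* hypothetical accessor, only to show that both names agree */
int read_global(void)
{
    return global_variable;
}
```

Since every source file includes the same header text, a change to the type of global_variable reaches all declarations at once.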
Within exactly _one_ of the source files which include the header file we have to take care that the variables which were declared extern elsewhere are really defined. In the main source file we therefore write:

    #define GLOBAL  /* to define global variables */
    #include "definitions.h"

One can even initialize global variables in this context _without_ resorting to the -m flag instructing the loader ld to accept multiple definitions:

    #ifdef GLOBAL
    #   define INIT(x)  = x
    #else   /* ! GLOBAL */
    #   define GLOBAL   extern
    #   define INIT(x)
    #endif  /* GLOBAL */

    GLOBAL int variable INIT(10);

This technique is not very practical for aggregates, since the commas within a brace-enclosed initializer would be taken to separate macro arguments. The following variant is easier to use:

    #ifdef GLOBAL
    #   define INIT(x)  = x
    #   define GINIT
    #else   /* ! GLOBAL */
    #   define GLOBAL   extern
    #   define INIT(x)  ;
    #   undef GINIT
    #endif  /* GLOBAL */

    GLOBAL struct { int a; char b; } variable INIT()
    #ifdef GINIT
        { 10, 'b' };
    #endif  /* GINIT */

This method requires that the C preprocessor permits a macro call with an empty argument list and that the C compiler does not complain about superfluous semicolons between global declarations.

This method is admittedly no longer very elegant but it has the significant advantage that the text of central definitions exists only once in all cases.

/bin/lex

Now you see it...

lex programs have lots in common with fashions: the effect is not always what the pattern promises... If a function generated by lex is used as a front end for a parser generated by yacc it is sometimes very hard to decide where to place the blame for a bug: is there a bug in the grammar presented to yacc or are the patterns which were processed by lex at fault?
The following technique permits the construction of a source file for lex which is conditionalized so that a debugging version can be compiled at any time without any changes to the source.

__________________________
This technique was developed for the book Introduction to Compiler Construction by A. T. Schreiner and H. G. Friedman Jr., to be published in January 1985 by Prentice-Hall.

In order to test the results of lex, all inputs which the parser is to receive later are first presented to the debugging version. This version of the front end then prints a mnemonic version of the values which the parser would receive:

    %{
    #ifdef TRACE
    #   include "assert.h"

        main()
        {   char * cp;

            assert(sizeof(int) >= sizeof(char *));
            while (cp = (char *) yylex())
                printf("%-.10s is \"%s\"\n",cp,yytext);
        }

    #   define token(x)    (int) "x"

    #else   /* ! TRACE */

    #   include "y.tab.h"
    #   define token(x)    x

    #endif  /* TRACE */
    %}

Normally TRACE is undefined and the tokens, i.e., the values which are to be returned to the parser, are defined in the file y.tab.h generated by yacc as:

    #define NAME 257
    ...

These defined names are used directly in the source presented to lex and are returned as a result of the function yylex().

If TRACE is defined, y.tab.h need not yet exist. In this case, i.e., in the debugging version, we want to return a string as a result of yylex() which is then printed by the main() program included in this case.

__________________________
The technique requires that a pointer to a character string can be returned in place of an int value. This is not possible across all implementations of C, e.g., it is probably not allowed on the 7300 systems. We guard against a portability problem using assert().

Analyzing the debugging output is most easily accomplished if the output uses exactly those words which later will appear in y.tab.h, i.e., which are a result of %token statements in the source presented to yacc.

We are using the fact that macro parameters are replaced within strings in the replacement text of a macro. token(x) either returns x itself (to be passed on to yacc), or a string "x" for the purposes of TRACE.

The remainder of the lex program is now quite obvious:

    %%

    [0-9]+                  return token(NUMBER);
    [a-z_A-Z][a-z_A-Z0-9]*  return word();
    [ \t\n]+                ;
    .                       return token(yytext[0]);

    %%

    struct reserved { char * text; int yylex; } reserved[] = {
        { "begin", token(BEGIN) },
        { "end",   token(END) },
        (char *) 0 };

    int word()
    {   struct reserved * rp;

        for (rp = reserved; rp->text; ++ rp)
            if (strcmp(yytext, rp->text) == 0)
                return rp->yylex;
        return token(NAME);
    }

Yes - there should have been a binary search, but we are dealing only with the principles...

/usr/src/main.c

Argument standards

Command arguments are always good for surprises. Sometimes several options may be combined into one argument; sometimes each option must be a separate argument; sometimes a parameter value follows as part of the argument; sometimes it does not; all of the above; some of the above... ?

If one consults the sources of certain UNIX utilities, one learns to appreciate the flexibility of C (or the infinite patience of the C compiler?): everybody does his own thing, and most do it differently in every program! However, it would be so simple to develop a standard:

    #include <stdio.h>

    #define show(x) printf("x = %d\n", x)
    #define USAGE   fputs("cmd [-f] [-v #]\n", stderr), exit(1)

    main(argc, argv)
        int argc;
        char ** argv;
    {   int f = 0, v = 0;

        while (--argc > 0 && **++argv == '-')
        {   switch (*++*argv) {
            case 0:             /* - */
                --*argv;
                break;
            case '-':
                if (! (*argv)[1])   /* -- */
                {   ++ argv, -- argc;
                    break;
                }
            default:
                do
                {   switch (**argv) {
                    case 'f':   /* -f */
                        ++ f;
                        continue;
                    case 'v':
                        if (*++*argv)
                            ;   /* -v# */
                        else if (--argc > 0)
                            ++argv; /* -v # */
                        else
                            break;
                        v = atoi(*argv);
                        *argv += strlen(*argv)-1;
                        continue;
                    }
                    USAGE;
                } while (*++*argv);
                continue;
            }
            break;
        }
        show(f), show(v), show(argc);
        if (argc)
            puts(*argv);
    }

At show() argc contains the number of arguments which have not yet been processed and *argv is the first one of these. This argument can be a single - character; in some ancient (cat) and almost new (tar) utilities this indicates that standard input or output is to be used in place of a file argument. Flags can be combined at will. If an option requires a value, it can follow immediately (and then as rest of the argument) or it can be an argument of its own.

Following a standard proposed in the "USENIX login" an option -- serves to terminate processing of the option list. Apart from that, options must start with - and they must precede other arguments. These rules, however, still do not cover all possibilities of pr...

The skeleton above is useful but anatomically somewhat terrifying.
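The control flow of the skeleton is easier to follow, and to try out, once the loop is wrapped in a function. The following is a sketch only, assuming an ANSI C compiler. The names parse_options, struct options and rest, as well as the convention of returning -1, are hypothetical and not part of the column. The loop is the one above, except that a usage error returns -1 instead of calling exit().

```c
#include <stdlib.h>
#include <string.h>

struct options { int f; int v; };       /* hypothetical result record */

/* Hypothetical wrapper around the option loop of the skeleton.
 * Returns the number of arguments not yet processed, or -1 for a
 * usage error.  On success *rest points at the first such argument
 * (only meaningful if the result is greater than 0).
 * As in the skeleton, an empty value argument ("") is not guarded against. */
int parse_options(int argc, char **argv, struct options *opt, char ***rest)
{
    opt->f = opt->v = 0;
    while (--argc > 0 && **++argv == '-') {
        switch (*++*argv) {
        case 0:                         /* - */
            --*argv;
            break;
        case '-':
            if (!(*argv)[1]) {          /* -- */
                ++argv, --argc;
                break;
            }
            /* fall through: "--x" is treated as an unknown option */
        default:
            do {
                switch (**argv) {
                case 'f':               /* -f */
                    ++opt->f;
                    continue;
                case 'v':
                    if (*++*argv)
                        ;               /* -v# */
                    else if (--argc > 0)
                        ++argv;         /* -v # */
                    else
                        return -1;      /* value is missing */
                    opt->v = atoi(*argv);
                    *argv += strlen(*argv) - 1;
                    continue;
                }
                return -1;              /* unknown option */
            } while (*++*argv);
            continue;                   /* next command argument */
        }
        break;                          /* end of the option list */
    }
    *rest = argv;
    return argc;
}
```

Called with the arguments cmd -fv 12 file, the function sets f to 1 and v to 12 and returns 1, with *rest pointing at file. Note that the loop advances the pointers stored in argv, so the caller's argument vector is changed.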
The following incarnation is perhaps more attractive:

    #include <stdio.h>
    #include "main.h"

    #define show(x) printf("x = %d\n", x)
    #define USAGE   fputs("cmd [-f] [-v #]\n", stderr), exit(1)

    MAIN
    {   int f = 0, v = 0;

        OPT
        ARG 'f':
            ++ f;
        ARG 'v':
            PARM
            v = atoi(*argv);
            NEXTOPT
        OTHER
            USAGE;
        ENDOPT
        show(f), show(v), show(argc);
        if (argc)
            puts(*argv);
    }

The trick of course is concealed in the header file main.h: here the macros OPT, ARG, PARM, NEXTOPT, OTHER, and ENDOPT must be defined using exactly those texts which were given explicitly in the previous example:

    #define MAIN    main(argc, argv)                        \
                        int argc;                           \
                        char ** argv;

    #define OPT     while (--argc > 0 && **++argv == '-')   \
                    {   switch (*++*argv) {                 \
                        case 0:                             \
                            --*argv;                        \
                            break;                          \
                        case '-':                           \
                            if (! (*argv)[1])               \
                            {   ++ argv, -- argc;           \
                                break;                      \
                            }                               \
                        default:                            \
                            do                              \
                            {   switch (**argv) {

    #define ARG                 continue;                   \
                                case

    #define OTHER               continue;                   \
                                }

    #define ENDOPT          } while (*++*argv);             \
                            continue;                       \
                        }                                   \
                        break;                              \
                    }

    #define PARM    if (*++*argv);                          \
                    else if (--argc > 0)++argv; else break;

    #define NEXTOPT *argv += strlen(*argv)-1;

The definitions are not exactly beautiful - especially if they need to be compacted so that the C preprocessor accepts the lengthy replacement texts - but they need to be developed only once to make the argument standard available for all applications. An application then is almost self-documenting:

MAIN     is the function header of the main program.

OPT      starts the loop during which the options are processed.

ENDOPT   completes this loop.

ARG      within the loop starts the processing of one option; the name of the option (a single character) enclosed in single quotes and a colon must follow.

PARM     follows the option specification if the option has a value parameter. The parameter itself is then available as *argv.

NEXTOPT  is used in particular once such a parameter has been processed to advance to the next command argument.

OTHER    must follow all options; following this, one specifies what should be done if an option could not be recognized. NEXTOPT may be specified in this case, too. The unknown option itself is **argv.

After the OPT ... ENDOPT loop argc contains the number of command arguments which have not yet been processed and *argv is the first such argument. Arbitrarily many (different) options ARG can be specified. pr would be implemented approximately as follows:

    MAIN
    {
        do
        {   OPT
            ARG 'h':
                PARM
                header = *argv;
                NEXTOPT
            ARG 'w':
                PARM
                width = atoi(*argv);
                NEXTOPT
            ARG 'l':
                PARM
                length = atoi(*argv);
                NEXTOPT
            ARG 't':
                tflag = 1;
            ARG 's':
                PARM
                delimiter = **argv;
                ++*argv;
                NEXTOPT
            ARG 'm':
                mflag = 1;
            OTHER
                if (isdigit(**argv))
                    columns = atoi(*argv),
                    NEXTOPT
                else
                    USAGE, exit(1);
            ENDOPT
            if (argc)
            {   if (**argv == '+')
                {   PARM
                    first_page = atoi(*argv);
                    continue;
                }
                dopr(*argv);
            }
            else
                dopr("-");
        } while (argc > 1);
    }

There is a blemish: -columns must be specified as a _single_ argument (since - alone refers to standard input).

-- 
Carl Swail      Mail: National Research Council of Canada
                      Building U-66, Montreal Road
                      Ottawa, Ontario, Canada K1A 0R6
                Phone: (613) 998-3408
USENET:
{cornell,uw-beaver}!utcsrgv!dciem!nrcaero!carl
{allegra,decvax,duke,floyd,ihnp4,linus}!utzoo!dciem!nrcaero!carl
dave@lsuc.UUCP (David Sherman) (02/22/85)
From Axel Schreiner's UNIX Clinic:

|| One can even initialize global variables in this context
|| _without_ resorting to the -m flag instructing the loader
|| ld to accept multiple definitions:

Question: is there any cost associated with using the -m flag? I use -m as a matter of course.

Aside from the occasional P-E dependency such as this reference, I think the UNIX Clinic is excellent general fodder for UNIX programmers, and warrants posting to net.lang.c.

Dave Sherman
-- 
{utzoo pesnta nrcaero utcs hcr}!lsuc!dave
{allegra decvax ihnp4 linus}!utcsri!lsuc!dave