[pe.cust.general] UNIX Clinic

carl@nrcaero.UUCP (Carl P. Swail) (02/18/85)

                                  UNIX clinic






               This column first appeared in the German quarterly
          unix/mail (Hanser Verlag, Munich, Germany). It is copyrighted:
          copyright 1984 by Axel T. Schreiner, Ulm, West Germany. It may
          be reproduced as long as the copyright notice is included and
          reference is made to the original publication.

               The column attempts to discuss typical approaches to
          problem solving using the UNIX* system. It emphasizes what
          the author considers to be good programming practices and
          appropriate choice of tools.

          /lib/cpp

               This quarter's column deals with uses and abuses of the
          C preprocessor. We demonstrate some techniques which can
          save a lot of work (and even more errors). The discussion
          applies to programming in C in general, and it assumes only
          very elementary prerequisites:

               C programs are run through a preprocessor before
               they are handed to the actual compiler. The
               preprocessor performs (parametrized) text
               substitution (#define), inserts header files (#include),
               and can exclude parts of the source from
               compilation (#if).

               Since the preprocessor is independent of the
               actual compiler - and does not know C at all - one
               can use it in particular to extend the C language.
               Only one's taste limits one's imagination here...

          Excluding text

               Every programmer presumably writes occasional comments.
          Sometimes  we comment quite intentionally to exclude program
          parts from a compilation.  Since in Standard C comments  may
          not  be nested, there is considerable temptation not to com-
          ment such excluded program parts any more.

          __________________________
          *UNIX is a Trademark of Bell Laboratories.

                               February 18, 1985

               The following technique for text exclusion is much more
          appropriate:

                  #ifdef  not_defined
                          crash_the_system(NOW); /* this definitely goes wrong */
                  #endif  /* not_defined */


          Of course, the name not_defined should really not be
          defined...
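
               The effect can be checked with a minimal sketch such as
          the following. The function name excluded_demo is
          hypothetical and not part of the original column; the sketch
          assumes, as is usual, that the preprocessor skips the
          excluded group without examining it as C.

```c
/* excluded_demo is a hypothetical name used only for this illustration. */
int excluded_demo(void)
{
        int result = 1;

#ifdef  not_defined             /* never defined, so the group is skipped */
        result = 0;             /* comments in here need no special care */
        crash_the_system(NOW);  /* undeclared, but never seen by the compiler */
#endif  /* not_defined */

        return result;          /* still 1: the assignment was excluded */
}
```

          The excluded part keeps its own comments, which is exactly
          what commenting it out would have made impossible.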

          Vector dimensions

               In principle one can determine the size of a vector
          using the sizeof operator. However, sizeof yields the size
          in bytes, not in elements. The following macro determines
          the number of elements in an arbitrary vector:

                  #define DIM(x)  (sizeof (x) / sizeof ((x)[0]))


               sizeof does not really need parentheses if it is used
          to determine the size of an object and not of a data type.
          One should, however, enclose macro parameters in
          parentheses. Then things work out for a vector with more
          than one dimension, too:

                  main()
                  {       struct { int a; char b; } v[10][20][30];

                          printf("%d %d %d\n", DIM(v), DIM(v[1]), DIM(v[1][2]));
                  }


               The program produces the values 10, 20 and 30.

               Parentheses should not be necessary in this use of
          sizeof since a vector subscript should have precedence over
          sizeof. At least my copy of the Mark Williams CP/M-86 C
          compiler does not seem to know this...
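
               On current compilers sizeof yields a size_t rather than
          an int, so the same idea might be sketched as follows. The
          struct tag item and the three function names are
          assumptions made for this illustration only. Note that DIM
          works only on a true array: applied to a pointer (for
          instance an array received as a function parameter) it
          compiles but yields a meaningless value.

```c
#include <stddef.h>

#define DIM(x)  (sizeof (x) / sizeof ((x)[0]))

struct item { int a; char b; };         /* hypothetical element type */

static struct item v[10][20][30];

/* Each function reports the number of elements along one dimension. */
size_t dim_outer(void)  { return DIM(v); }
size_t dim_middle(void) { return DIM(v[1]); }
size_t dim_inner(void)  { return DIM(v[1][2]); }
```

          To print such a value with %d, cast it first, e.g.
          printf("%d\n", (int) dim_outer());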

               We can carry these ideas somewhat  further.   The  last
          element of a vector is

                  #define LAST(x) ((x)[DIM(x)-1])


          and the customary for loop is for example

                  #define END(x)  ((x) + DIM(x)-1)

                  int vector[10], * vp;
                          ...
                          for (vp = vector; vp <= END(vector); ++ vp)
                                  ...
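
               A small sketch that exercises LAST and END together;
          the function names sum_vector and last_of_vector are
          hypothetical and serve only to make the macros testable:

```c
#include <stddef.h>

#define DIM(x)  (sizeof (x) / sizeof ((x)[0]))
#define LAST(x) ((x)[DIM(x)-1])
#define END(x)  ((x) + DIM(x)-1)

/* Fills a ten-element vector with 1..10 and returns the sum of its elements. */
int sum_vector(void)
{
        int vector[10], * vp, n = 0, sum = 0;

        for (vp = vector; vp <= END(vector); ++ vp)
                *vp = ++ n;
        for (vp = vector; vp <= END(vector); ++ vp)
                sum += *vp;
        return sum;
}

/* Fills the vector the same way and returns its last element. */
int last_of_vector(void)
{
        int vector[10], i;

        for (i = 0; i < (int) DIM(vector); ++ i)
                vector[i] = i + 1;
        return LAST(vector);
}
```

          Changing the dimension of vector requires no other change:
          both loops and LAST follow the declaration automatically.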





               sizeof is evaluated by the compiler and may therefore be
          used in constant expressions. This can be used to determine
          the length of constant strings in an efficient and flexible
          fashion:

                  #define STRLEN(s) (sizeof (s) - 1)

                  char buf[STRLEN("model") + 1];
                          ...
                          strcpy(buf, "model");


          There is the danger, however, that STRLEN is used for other
          objects, i.e., non-strings, by mistaking it for strlen...
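
               The danger is easy to demonstrate. In the following
          sketch the function names literal_length and
          pointer_pitfall are hypothetical; the second function shows
          what happens when STRLEN is applied to a pointer instead of
          a string constant.

```c
#include <string.h>

#define STRLEN(s) (sizeof (s) - 1)

/* Copies a literal into a buffer sized at compile time; returns its length. */
int literal_length(void)
{
        char buf[STRLEN("model") + 1];  /* 5 characters plus the '\0' */

        strcpy(buf, "model");
        return (int) strlen(buf);
}

/* The pitfall: for a pointer, STRLEN measures the pointer, not the string. */
int pointer_pitfall(void)
{
        const char * p = "a considerably longer string";

        return STRLEN(p) == sizeof (char *) - 1;
}
```

          Some compilers warn about sizeof applied to a pointer in
          this way, but none is obliged to.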

          Trace

               It is well known that a macro call is not recognized in
          a constant string. Less well known, but more useful, is
          perhaps that a macro parameter is recognized and replaced
          even inside a string within the replacement text of a macro
          definition. Rather than

                  printf("variable = %d\n", variable);
                  printf("formula = %f\n", formula);


          we write

                  #define SHOW(val,fmt) fprintf(stderr,"SHOW: val = fmt\n",val)

                          SHOW(variable, %d);
                          SHOW(formula, %f);


          The latter is easier to use and conveys more information
          since val is replaced in the format by the entire macro
          argument.

               A bit of caution is required: if the % operator is used
          within val there will be problems with the format. This can
          be corrected as follows:

                  #define SHOW(val,fmt) fprintf(stderr,"%s = fmt\n", "val",val)
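
               Preprocessors that follow the ISO C standard do not
          replace parameters inside string constants. A sketch of the
          same trace macro for such compilers, not the author's
          original text, can use the # operator (which turns an
          argument into a string) and the concatenation of adjacent
          string constants. It assumes that fmt is passed as a quoted
          string; SHOW_STR and show_demo are hypothetical names added
          so that the result can be inspected.

```c
#include <stdio.h>
#include <string.h>

/* ISO C variant: #val is the argument as a string, fmt is a string literal. */
#define SHOW(val, fmt)  fprintf(stderr, "%s = " fmt "\n", #val, val)

/* Same idea, formatted into a buffer instead of written to stderr. */
#define SHOW_STR(buf, val, fmt) \
        snprintf(buf, sizeof (buf), "%s = " fmt, #val, val)

/* Returns 1 if the traced text is what we expect. */
int show_demo(void)
{
        char line[64];
        int variable = 6;

        SHOW(variable * 7 % 5, "%d");   /* writes the trace line to stderr */
        SHOW_STR(line, variable * 7 % 5, "%d");
        return strcmp(line, "variable * 7 % 5 = 2") == 0;
}
```

          Because the expression text is passed as an argument to %s,
          a % operator within val causes no trouble with the format.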


               A macro can be defined without a replacement text.
          Uses of SHOW can thus easily be eliminated from the compiled
          program altogether. Alternatively we can specify a
          condition:



                  #ifdef  DEBUG
                          char debugflag;
                  #       define  SHOW(val,fmt)   (debugflag && fprintf(...))
                  #else   /* ! DEBUG */
                  #       define  SHOW(val,fmt)   /* null */
                  #endif  /* DEBUG */


               In this example SHOW is always used as a statement and
          not as an expression. Using && rather than if has two
          advantages: this way we do not have to use SHOW as a
          statement, and a use of SHOW does not invite an unintentional
          else...

               debugflag, by the way, should be used as a bit vector,
          e.g.:

                  #define SHOW(level,val,fmt) (debugflag & 1<<(level) && \
                                                 fprintf(...))


          Now we can maintain different sets of trace information at
          levels 0 through 7.
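
               A sketch of the bit vector in use. The fprintf call,
          elided in the text, is filled in here with the ISO C
          stringizing form as an assumption; count_shown is a
          hypothetical function that counts how many traces fire.

```c
#include <stdio.h>

static unsigned char debugflag;         /* one bit per trace level, 0 through 7 */

/* The body of the fprintf call is an assumption; the column elides it. */
#define SHOW(level, val, fmt) \
        ((debugflag & 1 << (level)) && \
         fprintf(stderr, "%s = " fmt "\n", #val, val))

/* Enables levels 0 and 3, then tries levels 0, 1 and 3. */
int count_shown(void)
{
        int x = 42, shown = 0;

        debugflag = 1 << 0 | 1 << 3;
        if (SHOW(0, x, "%d")) ++ shown;
        if (SHOW(1, x, "%d")) ++ shown; /* level 1 is off: nothing printed */
        if (SHOW(3, x, "%d")) ++ shown;
        return shown;
}
```

          Since SHOW is an expression, it can be tested as above or
          simply used as a statement.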

          Global variables

               Of course, you like modular programs, too: with lots
          of sources, a makefile, a central header file, and the
          (feeble) hope that all global declarations really match? You
          like to lint, too?

               The following technique simplifies maintaining global
          variables. The central header file contains roughly the
          following:

                  #ifndef GLOBAL
                  #       define  GLOBAL  extern
                  #endif  /* GLOBAL */

                  GLOBAL  int     global_variable;


          If nothing else is arranged, a variable declared GLOBAL thus
          is declared extern.

               Within exactly one of the source files which include
          the header file we have to take care that the variables
          which were declared extern elsewhere are really defined. In
          the main source file we therefore write:

                  #define GLOBAL  /* to define global variables */
                  #include "definitions.h"






               One can even initialize global variables in this
          context without resorting to the -m flag instructing the
          loader ld to accept multiple definitions:

                  #ifdef  GLOBAL
                  #       define  INIT(x) = x
                  #else   /* ! GLOBAL */
                  #       define  GLOBAL  extern
                  #       define  INIT(x)
                  #endif  /* GLOBAL */

                  GLOBAL  int variable INIT(10);
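
               To see what the main source file ends up with, the
          header text can be placed directly after the #define in a
          single file. This is only a sketch of the defining side;
          the accessor get_variable is a hypothetical addition.

```c
#define GLOBAL                  /* this file plays the role of the main source */

/* What follows would normally sit in the central header file. */
#ifdef  GLOBAL
#       define  INIT(x) = x
#else   /* ! GLOBAL */
#       define  GLOBAL  extern
#       define  INIT(x)
#endif  /* GLOBAL */

GLOBAL  int variable INIT(10);  /* expands to: int variable = 10; */

int get_variable(void)
{
        return variable;
}
```

          Every other source file includes the same header without
          defining GLOBAL first and therefore sees only
          extern int variable;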


               This technique is not very  practical  for  aggregates.
          The following variant is easier to use:

                  #ifdef  GLOBAL
                  #       define  INIT(x) = x
                  #       define  GINIT
                  #else   /* ! GLOBAL */
                  #       define  GLOBAL  extern
                  #       define  INIT(x) ;
                  #       undef   GINIT
                  #endif  /* GLOBAL */

                  GLOBAL  struct { int a; char b; } variable INIT()
                  #ifdef  GINIT
                                  { 10, 'b' };
                  #endif  /* GINIT */


               This method requires that the C preprocessor permits  a
          macro  call  with an empty argument list and that the C com-
          piler does not complain about superfluous semicolons between
          global  declarations.   This  method is admittedly no longer
          very elegant but it has the significant advantage  that  the
          text of central definitions exists only once in all cases.
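
               Where a compiler supporting C99 variadic macros is
          available, aggregates might be handled more tidily. The
          following sketch is an alternative to, not a rendering of,
          the author's method: the commas inside the braces split the
          initializer into several macro arguments, and __VA_ARGS__
          puts them back together. The accessors get_a and get_b are
          hypothetical.

```c
#define GLOBAL                  /* as in the main source file */

#ifdef  GLOBAL
#       define  INIT(...)       = __VA_ARGS__
#else   /* ! GLOBAL */
#       define  GLOBAL  extern
#       define  INIT(...)
#endif  /* GLOBAL */

/* Expands to: struct { int a; char b; } variable = { 10, 'b' }; */
GLOBAL  struct { int a; char b; } variable INIT({ 10, 'b' });

int get_a(void) { return variable.a; }
int get_b(void) { return variable.b; }
```

          The text of the definition still exists only once, and
          neither an empty argument list nor a stray semicolon is
          needed.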

          /bin/lex

          Now you see it...

               lex programs have lots in common with fashions: the
          effect is not always what the pattern promises... If a
          function generated by lex is used as a front end for a
          parser generated by yacc it is sometimes very hard to decide
          where to place the blame for a bug: is there a bug in the
          grammar presented to yacc or are the patterns which were
          processed by lex at fault?

               The following technique permits the construction of a
          source file for lex which is conditionalized so that a
          debugging version can be compiled at any time without any
          changes to the source. In order to test the results of lex,
          all inputs which the parser is to receive later are first
          presented to the debugging version. This version of the
          front end then prints a mnemonic version of the values which
          the parser would receive:

                  %{
                  #ifdef  TRACE

                  #       include "assert.h"

                          main()
                          {       char * cp;

                                  assert(sizeof(int) >= sizeof(char *));
                                  while (cp = (char *) yylex())
                                      printf("%-.10s is \"%s\"\n",cp,yytext);
                          }

                  #       define  token(x)        (int) "x"

                  #else   /* ! TRACE */

                  #       include "y.tab.h"
                  #       define  token(x)        x

                  #endif  /* TRACE */
                  %}

          __________________________
          This technique was developed for the book Introduction
          to Compiler Construction by A. T. Schreiner and H. G.
          Friedman Jr., to be published in January 1985 by
          Prentice-Hall.

               Normally TRACE is undefined and the tokens, i.e., the
          values which are to be returned to the parser, are defined
          in the file y.tab.h generated by yacc as:

                  #define NAME    257
                          ...


          These defined names are used directly in the source
          presented to lex and are returned as a result of the
          function yylex().

               If TRACE is defined, y.tab.h need not yet exist. In
          this case, i.e., in the debugging version, we want to return
          a string as a result of yylex() which is then printed by the
          main() program included in this case. Analyzing the
          debugging output is most easily accomplished if the output
          uses exactly those words which later will appear in y.tab.h,
          i.e., which are a result of %token statements in the source
          presented to yacc.

               We are using the fact that macro parameters are
          replaced within strings in the replacement text of a macro.
          token(x) either returns x itself (to be passed on to yacc),
          or a string "x" for the purposes of TRACE.

          __________________________
          The technique requires that a pointer to a character
          string can be returned in place of an int value. This
          is not possible across all implementations of C, e.g.,
          it is probably not allowed on the 7300 systems. We
          guard against a portability problem using assert().

               The remainder of the lex program is now quite obvious:

                  %%

                  [0-9]+                          return token(NUMBER);
                  [a-z_A-Z][a-z_A-Z0-9]*          return word();
                  [ \t\n]+                        ;
                  .                               return token(yytext[0]);

                  %%

                  struct reserved { char * text; int yylex; } reserved[] = {
                          { "begin", token(BEGIN) },
                          { "end", token(END) },
                          { (char *) 0 } };

                  int word()
                  {       struct reserved * rp;

                          for (rp = reserved; rp->text; ++ rp)
                                  if (strcmp(yytext, rp->text) == 0)
                                          return rp->yylex;
                          return token(NAME);
                  }


               Yes - there should have been a binary search, but we
          are dealing only with the principles...

          /usr/src/main.c

          Argument standards

               Command arguments are always good for surprises.
          Sometimes several options may be combined into one argument;
          sometimes each option must be a separate argument; sometimes
          a parameter value follows as part of the argument; sometimes
          it does not; all of the above; some of the above... ?






               If one consults the sources of certain UNIX  utilities,
          one learns to appreciate the flexibility of C (or the infin-
          ite patience of the C compiler?):  everybody  does  his  own
          thing,  and  most  do it differently in every program!  How-
          ever, it would be so simple to develop a standard:

                  #include <stdio.h>

                  #define show(x) printf("x = %d\n", x)
                  #define USAGE   fputs("cmd [-f] [-v #]\n", stderr), exit(1)

                  main(argc, argv)
                          int argc;
                          char ** argv;
                  {       int f = 0, v = 0;

                          while (--argc > 0 && **++argv == '-')
                          {       switch (*++*argv) {
                                  case 0:                         /* - */
                                          --*argv;
                                          break;
                                  case '-':
                                          if (! (*argv)[1])       /* -- */
                                          {       ++ argv, -- argc;
                                                  break;
                                          }
                                  default:
                                          do
                                          {       switch (**argv) {
                                                  case 'f':         /* -f */
                                                      ++ f;
                                                      continue;
                                                  case 'v':
                                                      if (*++*argv)
                                                          ;        /* -v# */
                                                      else if (--argc > 0)
                                                          ++argv; /* -v # */
                                                      else
                                                          break;
                                                      v = atoi(*argv);
                                                      *argv += strlen(*argv)-1;
                                                      continue;
                                                  }
                                                  USAGE;
                                          } while (*++*argv);
                                          continue;
                                  }
                                  break;
                          }
                          show(f), show(v), show(argc);
                          if (argc) puts(*argv);
                  }







               At show() argc contains the number of arguments which
          have not yet been processed and *argv is the first one of
          these. This argument can be a single - character; in some
          ancient (cat) and almost new (tar) utilities this indicates
          that standard input or output is to be used in place of a
          file argument.

               Flags can be combined at will. If an option requires a
          value, it can follow immediately (and then as rest of the
          argument) or it can be an argument of its own.

               Following a standard proposed in the "USENIX login" an
          option -- serves to terminate processing of the option list.
          Apart from that, options must start with - and they must
          precede other arguments. These rules, however, still do not
          cover all possibilities of pr...
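
               To try these rules without running a command, the
          skeleton can be recast as a function. The following is a
          sketch under assumptions: the names parse_options and
          parse_demo and the rest parameter are hypothetical, usage
          errors return -1 instead of calling exit(), and an empty
          value after -v is rejected, which the skeleton itself does
          not do.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical function form of the skeleton: returns the number of
 * unprocessed arguments (-1 on a usage error) and leaves *rest pointing
 * at the first of them.
 */
int parse_options(int argc, char ** argv, int * f, int * v, char *** rest)
{
        *f = *v = 0;
        while (--argc > 0 && **++argv == '-')
        {       switch (*++*argv) {
                case 0:                         /* - */
                        --*argv;
                        break;
                case '-':
                        if (! (*argv)[1])       /* -- */
                        {       ++ argv, -- argc;
                                break;
                        }
                        /* FALLTHROUGH */
                default:
                        do
                        {       switch (**argv) {
                                case 'f':               /* -f */
                                        ++ *f;
                                        continue;
                                case 'v':
                                        if (*++*argv)
                                                ;       /* -v# */
                                        else if (--argc > 0 && **++argv)
                                                ;       /* -v # */
                                        else
                                                return -1;
                                        *v = atoi(*argv);
                                        *argv += strlen(*argv)-1;
                                        continue;
                                }
                                return -1;      /* unknown option */
                        } while (*++*argv);
                        continue;
                }
                break;
        }
        *rest = argv;
        return argc;
}

/* Parses "cmd -fv 3 file"; returns 1 if everything comes out as expected. */
int parse_demo(void)
{
        char * args[] = { "cmd", "-fv", "3", "file", NULL };
        char ** rest;
        int f, v;

        return parse_options(4, args, &f, &v, &rest) == 1
                && f == 1 && v == 3 && strcmp(*rest, "file") == 0;
}
```

          Note that the function moves the pointers stored in the
          argument vector, just as the skeleton does, so the vector
          itself must be writable; the strings are never modified.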

               The skeleton above is useful but anatomically  somewhat
          terrifying.   The  following  incarnation  is  perhaps  more
          attractive:

                  #include <stdio.h>
                  #include "main.h"

                  #define show(x) printf("x = %d\n", x)
                  #define USAGE   fputs("cmd [-f] [-v #]\n", stderr), exit(1)

                  MAIN
                  {       int f = 0, v = 0;

                          OPT
                          ARG 'f':
                                  ++ f;
                          ARG 'v': PARM
                                  v = atoi(*argv);
                                  NEXTOPT
                          OTHER
                                  USAGE;
                          ENDOPT
                          show(f), show(v), show(argc);
                          if (argc) puts(*argv);
                  }


               The trick of course is concealed in the header file
          main.h: here the macros OPT, ARG, PARM, NEXTOPT, OTHER, and
          ENDOPT must be defined using exactly those texts which were
          given explicitly in the previous example:



                  #define MAIN    main(argc, argv)                          \
                                          int argc;                         \
                                          char ** argv;
                  #define OPT     while (--argc > 0 && **++argv == '-')     \
                                  {       switch (*++*argv) {               \
                                          case 0:                           \
                                                  --*argv;                  \
                                                  break;                    \
                                          case '-':                         \
                                                  if (! (*argv)[1])         \
                                                  {       ++ argv, -- argc; \
                                                          break;            \
                                                  }                         \
                                          default:                          \
                                                  do                        \
                                                  {       switch (**argv) {
                  #define ARG                                     continue; \
                                                          case
                  #define OTHER                                   continue; \
                                                          }
                  #define ENDOPT                  } while (*++*argv);       \
                                                  continue;                 \
                                          }                                 \
                                          break;                            \
                                  }
                  #define PARM if (*++*argv);                               \
                               else if (--argc > 0)++argv; else break;
                  #define NEXTOPT *argv += strlen(*argv)-1;


               The definitions are not exactly beautiful -  especially
          if  they  need  to  be  compacted so that the C preprocessor
          accepts the lengthy replacement texts - but they need to  be
          developed  only once to make the argument standard available
          for all applications.  An application then is  almost  self-
          documenting:

               MAIN is the function header of the main program.

               OPT  starts the loop during which the options are
                    processed.

               ENDOPT completes this loop.

               ARG  within the loop starts the processing of one
                    option; the name of the option (a single
                    character) enclosed in single quotes and a colon
                    must follow.

               PARM follows the option specification if the option has
                    a value parameter. The parameter itself is then
                    available as *argv.

               NEXTOPT is used in particular once such a parameter has
                    been processed to advance to the next command
                    argument.

               OTHER must follow all options; following this, one
                    specifies what should be done if an option could
                    not be recognized. NEXTOPT may be specified in
                    this case, too. The unknown option itself is
                    **argv.

               After the OPT ENDOPT loop argc contains the number of
          command arguments which have not yet been processed and
          *argv is the first such argument. Arbitrarily many
          (different) options ARG can be specified. pr would be
          implemented approximately as follows:



                  MAIN
                  {
                          do
                          {       OPT
                                  ARG 'h': PARM
                                          header = *argv;
                                          NEXTOPT
                                  ARG 'w': PARM
                                          width = atoi(*argv);
                                          NEXTOPT
                                  ARG 'l': PARM
                                          length = atoi(*argv);
                                          NEXTOPT
                                  ARG 't':
                                          tflag = 1;
                                  ARG 's': PARM
                                          delimeter = **argv;
                                          ++*argv;
                                          NEXTOPT
                                  ARG 'm':
                                          mflag = 1;
                                  OTHER
                                          if (isdigit(**argv))
                                                  columns = atoi(*argv), NEXTOPT
                                          else
                                                  USAGE, exit(1);
                                  ENDOPT

                                  if (argc)
                                  {       if (**argv == '+')
                                          {       PARM
                                                  first_page = atoi(*argv);
                                                  continue;
                                          }

                                          dopr(*argv);
                                  }
                                  else
                                          dopr("-");
                          } while (argc > 1);
                  }


          There is a blemish: -columns must be specified as a single
          argument (since - alone refers to standard input).


-- 

Carl Swail      Mail: National Research Council of Canada
		      Building U-66, Montreal Road
		      Ottawa, Ontario, Canada K1A 0R6
		Phone: (613) 998-3408
USENET:
{cornell,uw-beaver}!utcsrgv!dciem!nrcaero!carl
{allegra,decvax,duke,floyd,ihnp4,linus}!utzoo!dciem!nrcaero!carl

dave@lsuc.UUCP (David Sherman) (02/22/85)

From Axel Schreiner's UNIX Clinic:
||               One can even initialize global variables in this
||          context without resorting to the -m flag instructing the
||          loader ld to accept multiple definitions:

Question: is there any cost associated with using the -m flag?
I use -m as a matter of course.

Aside from the occasional P-E dependency such as this
reference, I think the UNIX Clinic is excellent general
fodder for UNIX programmers, and warrants posting to
net.lang.c.

Dave Sherman
-- 
{utzoo pesnta nrcaero utcs hcr}!lsuc!dave
{allegra decvax ihnp4 linus}!utcsri!lsuc!dave