[net.micro.cpm] TOPS20 style parsing

mem@sii.UUCP (Mark Mallett) (05/27/85)

Greetings.

About a year ago, I wrote a set of C routines to implement TOPS-20 style
parsing on my CP/M system.  I have used these routines for a number of
programs, such as a reminder program and software which composes my
bulletin board system here in NH.  With that, I am now convinced that
the routines work (at least to the point where I can use them), and have
put them together as a kit.  What follows is the document for the
library.

If you have comments, please send me mail.  If there is enough interest,
I shall post the sources.  I hope that if this happens, someone will tell
me whether the net.micro.cpm or the net.sources area is appropriate; I am
sending this message off to both.

Cheers,
Mark Mallett
decvax!sii!mem   or   ittvax!sii!mem




                                   C O M N D

                    A TOPS-20 style command parsing library
                            for personal computers




             Documentation  and  source  code Copyright (C) 1985 by Mark
        E. Mallett;  permission is granted to distribute  this  document
        and the  code  indiscriminately.  Please leave credits in place,
        and add your own as appropriate.




                                 This Document



             This document contains the following sections:


         o Document overview (this here section)

         o Introduction and history

         o Functional overview

         o How to write programs using the subroutine library

         o How to make the library work on your system




                           Introduction and History



        This document describes  the  COMND  subroutine  package  for  C
        programmers.    COMND   is   a   subroutine  library  to  effect
        consistent parsing of user input, and in general is well  suited
        for verb-argument   style   command  interfaces.    The  library
        provides a consistent  user  interface  as  well  as  a  program
        interface  which,  I believe, could well remain unchanged if the
        parsing library were re-written to support  different  interface
        requirements (such as menu interaction).

             The   COMND  interface  is  based  on  the  TOPS-20  model.
        TOPS-20 is an operating system  which  is/was  used  by  Digital
        Equipment Corporation  on  their  PDP-20  computer.  TOPS-20 was
        based on TENEX, written by BBN (I  think,  I  think).    TOPS-20
        COMND  is much more robust and consistent than the library which
        this document describes;  this library being intended for  small
        computer  applications,  it  provides  the  most  commonly  used
        functions.

             This library was written on a Z-80 system  running  Digital
        Research   Corporation's   CP/M  operating  system  version  3.0
        (CPM+).  I have also compiled and  tried  it  on  a  VAX  11/780
        running VMS.    It  is completely written in the C language, and
        contains only a few operating system specific elements.

             The COMND JSYS section of the TOPS-20 Monitor Calls  manual
        is probably a good thing to read.

             Please note:   while there are a few unimplemented sections
        of this library, I felt that it was nevertheless  worthwhile  to
        submit  it  to  public  domain since it is usable for almost all
        general command parsing and since the  call  interface  is  well
        defined.   I  have  used this library extensively since sometime
        in 1984.




                              Functional Overview




             The COMND subroutine library  provides  a  command-oriented
        user  interface  which is consistent at the programmer level and
        at the  user  level.    At  the  program  level,  it  gives   an
        algorithmically  controlled  parsing  flow,  where a call to the
        library exists  for  each  field  or  choice  of  fields  to  be
        parsed.

             At the user level, the interface provides:


         o Command prompting.

         o Consistent  command  line  editing.  The user may use editing
           keys to erase the last character or word,  and  to  echo  the
           current input line and prompt.

         o Input  abbreviation  and  defaulting.    The  user  may  type
           abbreviations of  keywords,  or  may  type  nothing  to  have
           defaults applied.

         o Incremental  help.    By  pressing  a  known  key  (usually a
           question mark), the user  can  find  out  what  choices  s/he
           has.

         o Guide  strings.    Parenthesized guide words are shown at the
           users option.

         o Command completion.  Where the subroutine library  can  judge
           what  the  succesful  completion  of  a portion of user input
           will be, the user can elect to have this input completed  and
           shown automatically.




                            Using the COMND Library



             While  you  read  this part of the document, you might want
        to look at the  sample  program  named  TEST.C  which  has  been
        included with  this  package.   It is an over-commented guide to
        the use of the COMND library.

             Any module which makes use of this  library  shall  include
        the definition   file  named  "comnd.h".    This  file  contains
        definitions  which   are   necessary   to   the   caller-library
        interface.   Mnemonics  (structures  and constants) mentioned in
        relation to this interface are defined in this file.

             The philosophy of parsing with the COMND library is that  a
        command  line  is  typed,  the  program  inspects  it,  then the
        program acts on  the  directions  given  in  that  line.    This
        process is  repeated  until  the  program  finishes.   The COMND
        library assists the user in typing  the  command  line  and  the
        program in  inspecting  it.    Acting  on  it  is left up to the
        calling program.

             The typing and parsing of fields in  the  command  line  go
        essentially hand-in-hand   with   this   library.    The  single
        subroutine COMND() is used to effect all parsing.  This  routine
        is  called  for  each  element  of  the input line to be parsed.
        Parsing is done according to a current  parse  state,  which  is
        maintained  in  a  parameter  block  passed  between  caller and
        library.   The  state  block  contains  the  following  sort  of
        information (described in detail later):


         o What to use for a prompt string.

         o  Addresses  of  scratch  buffers  for  user  input  and  atom
           storage.

         o How much the user has entered.

         o How much of the line the program has parsed.




             An important thing to note is that the  indexes  (how  much
        entered and  parsed)  are  both  variable.    The program begins
        parsing of the input line upon a break signal by the user  (such
        as the  typing  of  a carriage return, question mark, etc).  The
        user may then resume typing  and  erase  characters  back  to  a
        point BEFORE  that  already  parsed.   It is very important that
        the program does not take any action on  what  has  been  parsed
        until  the  line  has  been completely processed, otherwise that
        action could be undesired.

             Since the user may back up the command  input  to  a  point
        before  that  already  processed  by  the application program, a
        mechanism must be provided to backup the program to the  correct
        point.   Rather  than going to the point backed up to, the COMND
        library  expects  the  application  program  to  return  to  the
        beginning of  the  line,  and start again.  The user's input has
        remained in the command line buffer, and the library  will  take
        care  of  buffering  the rest of the input when that parse point
        is again reached.  However, this means  that  there  must  be  a
        method  of  communicating  to  the  calling  program  that  this
        "reparse" is  necessary.    Actually  there  are   two   methods
        provided, as follows:


         o  Each  call  to  the command parsing routine COMND() yields a
           result code.  The result may indicate that a reparse  has  to
           take place.    The  program  shall  then back up to the point
           where the parse of the line began, and start again.

         o The application program may specify the address of  a  setjmp
           buffer which  identifies  the reparse point.  (Note setjmp is
           a facility provided as part of  most  standard  C  libraries.
           It  allows  you  to  mark a point in the procedure flow [call
           frame,  registers,  and  whatever  else  is  involved  in   a
           context],  and  return to that point from another part of the
           program as if control  had  never  proceeded.    If  you  are
           unfamiliar  with  this  facility,  you  might  want to find a
           description in your C manual.) It is  up  to  the  caller  to
           setup the setjmp environment at the reparse point.

        In  either case, the reparse point (the point at which the parse
        will be restarted if necessary) is the point at which the  first
        element of  the  command  line  is  parsed.    This is after the
        initialization call which starts every parse.



        Every call to the COMND() subroutine involves two arguments:   a
        command  state block, in which is kept track of the parse state,
        and a command function  block,  which  describes  what  sort  of
        thing to  parse  next.    The  command  state  block  is given a
        structure called "CSBs", and  a  typedef  called  "CSB".    Each




        element  of  the structure is named with a form "CSB_xxx", where
        "xxx" is  representative  of  the  element's   purpose.      The
        following  are  the  elements of the command state block, in the
        order that they appear in the structure.


         o CSB_PFL is a BYTE.  This contains flags which are set by  the
           caller  to  indicate  specifics  of  the  command processing.
           These flags are:


            o _CFNEC:  Do not echo user input.

            o _CFRAI:  Convert lowercase input to uppercase.


         o CSB_RFL, a BYTE value, contains flags which are kept  by  the
           library in  the  performance  of the parse.  Generally, these
           flags  are  of  no  interest  to  the  caller   since   their
           information  can  be  gleaned  from  the  result  code of the
           COMND() call.  However, they are:


            o _CFNOP:  No  parse.    Nothing  matched,  i.e.,  an  error
              occured.

            o _CFESC:  Field terminated by escape.

            o _CFEOC:  Field terminated by CR.

            o _CFRPT:  Reparse required.

            o _CRSWT:  Switch ended with colon.

            o _CFPFE:  Previous field terminated with escape.


         o  CSB_RSB  is  the  address  of a setjmp buffer describing the
           environment at  the  reparse  point.    If  this   value   is
           non-NULL,   then  if  a  reparse  is  required,  a  longjmp()
           operation is performed using this setjmp buffer.

         o CSB_INP is the address  of  the  input-character  routine  to
           use.   If this value is non-NULL, then this routine is called
           to get each character of input.  No line editing  or  special
           interactive  characters are recognized in this mode, since it
           is assumed that this will be  used  for  file  input.    Note
           especially:   this  facility  is not yet implemented, however
           the definition is provided for future expansion.  Thou  shalt
           always leave this NULL, or write the facility thyself.

         o  CSB_OUT is the inverse correspondent to the previous element




           (CSB_INP).  It is the address of a routine to process  output
           from the  command  library.    Please  see the warning in the
           CSB_INP description about not being implemented.

         o CSB_PMT is the address  of  the  prompt  string  to  use  for
           command parsing.      The   command  library  takes  care  of
           prompting, so make sure this is filled in.

         o CSB_BUF is the address of the buffer to put user  input  into
           as s/he is typing it in.

         o  CSB_BSZ,  an int, is the number of bytes which can be stored
           in CSB_BUF;  i.e., it is the buffer size.

         o CSB_ABF is the address of an atom buffer.  Some (if not  all)
           parsing   functions   involve   extracting   some  number  of
           characters from the input buffer and interpreting  or  simply
           returning this  extracted  string.   This buffer is necessary
           for those operations.  It should probably be as large as  the
           input buffer (CSB_BUF), but it is really up to you.

         o  CSB_ASZ,  an  int,  is the number of characters which can be
           stored in CSB_ABF;  i.e., it is the size of that buffer.

           ** Note ** CSB elements from here to the end do not  have  to
           be initialized  by  the  calling  program.   They are used to
           store state information and are initialized  as  required  by
           the library.

         o CSB_PRS,  an  int,  contains  the  parse  index.  This is the
           point in the command buffer up  to  which  parsing  has  been
           achieved.

         o  CSB_FLN,  an  int,  is  the  filled  length  of  the command
           buffer.  This is the number of  characters  which  have  been
           typed by the user.

         o CSB_RCD,  an int, is a result code of the parse.  This is the
           same value which is returned as the  result  of  the  COMND()
           procedure call.

         o  CSB_RVL is a union which is used to contain either an int or
           a long value.  The names of the union  elements  are:    _INT
           for  int,  _ADR  for  address (note that a typecast should be
           used for proper address assignment).  This  element  contains
           a  value  returned  from  some  parse  functions which return
           values which are single values.  For example, if  an  integer
           is parsed, its value is returned here.

         o  CSB_CFB is the address of a command function block for which
           a parse was successful.  This is significant in  cases  where
           there  are  alternative  possible interpretations of the next




           command element.




        The parse of each element in a command line  involves,  as  well
        as  the  Command  State Block just described, a Command Function
        Block which identifies the sort of thing to  be  parsed.    This
        block  is  defined  in  a  structure  named  "CFBs", which has a
        corresponding typedef named "CFB".  Elements of the  CFB,  named
        "CFB_xxx",  are  as  follows  (in  the  order they appear in the
        structure):



         o CFB_FNC, a BYTE, is the function  code.    This  defines  the
           function to  be  performed.    The function codes are listed,
           and their actions described, a little later.

         o CFB_FLG, a BYTE, contains flags which  the  caller  specifies
           to the  library.    These  are  very significant, and in most
           cases affect the presentation to the user.    The  flag  bits
           are:


            o _CFHPP:    A  help  string has been supplied and should be
              given when the user types the help character ("?").

            o _CFDPP:  A default string has been supplied, and shall  be
              used  if  the  user  does  not type anything at this point
              (typing  nothing  means  typing  a  return  or  requesting
              command completion).     Note  that  this  flag  (and  the
              default string) is ONLY significant for the CFB passed  in
              the  call  to  the COMND() routine, and not for any others
              referenced as alternatives by that CFB.

            o _CFSDH:  The default help message should be  supressed  if
              the user   types  the  help  character  ("?").    This  is
              normally  used  in  conjunction  with  the  _CFHPP   flag.
              However,  if  this  flag  is present and the _CFHPP is not
              selected, then the help operation is  inhibited,  and  the
              help  character becomes insignificant (just like any other
              character).

            o _CFCC:    A  character  characteristic  table   has   been
              provided.   A  CC table identifies which characters may be
              part of the element being recognized.  Not  all  functions
              support  this  table  (for example, it does not make sense
              to  re-specify  which  characters  may   compose   decimal
              numbers).   This table also specifies which characters are
              break characters, causing the  parser  to  "wake  up"  the
              calling program  when  one  of them is typed.  If this bit




              is not set (as is usually the case), a  default  table  is
              associated according to the function code.

            o _CFDTD:    For  parsing  date and time, specifies that the
              date should be parsed.

            o _CFDTT:  For parsing date and  time,  specifies  that  the
              time should be parsed.


         o  CFB_CFB  is  the address of another CFB which may be invoked
           if the user input does not satisfy this CFB.    CFBs  may  be
           chained in  this  manner  at  will.  Recognize, however, that
           the ORDER of the chain plays an important part in  how  input
           is handled,  particularly  in  disambiguation of input.  Note
           also that only the  first  CFB  of  the  chain  is  used  for
           specifying  a  default  string  and  CC  table  (for  command
           wake-up).

           CFB chaining is a very important part of  parsing  with  this
           library.

         o  CFB_DAT  is  defined  as a long, since it is used to contain
           address or  int  values.    It  should  be   referenced   via
           typecast.   It  is  not  defined  as  a  union  because it is
           inconvenient or impossible to initialize  unions  at  compile
           time  with  most  (all?)  C  compilers, and initialization of
           these blocks at runtime  is  not  desirable.    This  element
           contains  data  used  in  parsing  of  a field in the command
           line.  For  instance,  in  parsing  an  integer,  the  caller
           specifies the default radix of the integer here.

         o  CFB_HLP  is  the  address  of a caller-supplied help string.
           This is only significant if the flag bit  _CFHPP  is  set  in
           the CFB_FLG byte.

         o  CFB_DEF  is the address of a caller-supplied default string.
           This is only significant if the flag bit  _CFDPP  is  set  in
           the  CFB_FLG  byte,  and  only  for  the first CFB in the CFB
           chain.

         o CFB_CC is the address of a character  characteristics  table.
           This  is only significant if the flag bit _CFCC is set in the
           CFB_FLG byte.  This is the address of a 16-word  table,  each
           word  containing  16  bits  which  are interpreted as 8 2-bit
           characteristic entries.      The   most   significant    bits
           correspond to  the lower ASCII values, etc.  The 2-bit binary
           value has the following meaning, per character:


            o 00:  Character may  not  be  part  of  the  element  being
              parsed.




            o 01:    Character  may be part of the element only if it is
              not the first character of that element.

            o 02:  Character may be part of the element.

            o 03:    Character  may  not  be  part   of   the   element;
              furthermore,  when  it  is  typed, it will case parsing to
              begin immediately (a wake-up character).






             The function code in the  CFB_FC  element  of  the  command
        function  block  specifies  the  operation  to  be  performed on
        behalf of that function block.  Functions are described now.



        CFB function _CMINI:  Initialize

             Every  parse  of  a  command  line  must  begin   with   an
        initialization call.    This  tells the command library to reset
        its indexes, that the user must be prompted, etc.  There may  be
        NO  other  CFBs  chained  to this one, because if they are, they
        are ignored.

             The reparse point is the point right after this call.    If
        the  setjmp  method  is used, then the setjmp environment should
        be defined here.  After the reparse  point,  any  variables  etc
        which  may  be  the  victims  of  parsing side-effects should be
        initialized.



        CFB function _CMKEY:  Keyword parse

             _CMKEY parses a keyword from a given  list.    The  CFB_DAT
        element  of the function block should point to a table of string
        pointers, ending with a NULL pointer.  The  user  may  type  any
        unique  abbreviation  of  a  keyword,  and may use completion to
        fill out the rest of a known match.  The address of the  pointer
        to  the  matching  string  is returned in the CSB_RVL element of
        the command state block.  The value  is  returned  this  way  so
        that  the  index  can  be  easily  calculated, and because it is
        consistent  with   the   general   keyword   parsing   mechanism
        (_CMGSK).




             The  incremental  help  associated  with keyword parsing is
        somewhat special.  The default  help  string  is  "Keyword,  one
        of:"  followed  by  a  list  of  keywords  which  match anything
        already typed.  If a help string has  been  supplied  (indicated
        by  _CFHPP) and no suppression of the default help is specified,
        then the  initial  part  ("Keyword,  ")  is  replaced  with  the
        supplied help  string  and the help is otherwise the same.  If a
        help  string  has  been  supplied  and  the  default  has   been
        supressed, then the given help string is presented unaltered.



        CFB function _CMNUM:  number

             This parses  a  number.  The caller supplies a radix in the
        CFB_DAT element of the function block.   The  number  parsed  is
        returned  (as  an  int)  in  the  CSB_RVL  element  of the state
        block.



        CFB function _CMNOI:  guide word string

             This function parses a guide  word  string  (noise  words).
        Guide  words  appear  between  significant  parts of the command
        line, if they are in parentheses.    They  do  not  have  to  be
        typed, but  if  they  are, they must match what is expected.  If
        the previous field  ended  with  command  completion,  then  the
        guide words are shown automatically by the parser.

             An  interesting  use  of  guide  word strings is to provide
        alternate sets with the command chaining  feature.    The  parse
        (and  program) flow can be altered depending on which string was
        matched.



        CFB function _CMCFM:  confirmation

             A confirmation is a carriage return.    The  caller  should
        parse  a  confirmation  as the last thing before processing what
        was parsed.  Since carriage  return  is  by  default  a  wake-up
        character,  requiring  a  confirmation will (if you don't change
        this wake-up attribute) require  that  the  parse  be  completed
        with no  extra  characters  typed.    A parse with this function
        code returns only a status.




        CFB function _CMGSK:  General storage keyword

             This call provides for parsing of one of a set of  keywords
        which are  not  arranged  in  a  table.    Often,  keywords  are
        actually stored in a file or in  a  linked  list.    The  caller
        fills  in the CFB_DAT element of the command function block with
        the address of a  structure  named  CGKs  (typedef  CGK),  which
        contains the following elements:


         o CGK_BAS:   A base address to give to the fetch routine.  Does
           not matter what  this  is,  as  long  as  the  fetch  routine
           understands it.

         o CFK_CFR:   The  address  of  a  keyword  fetch  routine.  The
           routine is called with the CGK_BAS value, and the address  of
           the pointer  to  the  previous  keyword.    It is expected to
           return the address of the pointer to  the  next  keyword,  or
           with  the  first  one  if  the  passed value for the previous
           pointer is NULL.

                When this function completes  successfully,  it  returns
           the  address  of  the  pointer  to  the string in the CSB_RVL
           element in  the  command  state  block.    Please   see   the
           description  of the _CMKEY function code for a description of
           help and other processing.




        CFB function _CMSWI:  Parse a switch.

             This is functionally equivalent to _CMKEY,  and  exists  to
        fill a  need  for switch parsing.  Basically it is a placeholder
        for an unimplemented function.




        CFB function _CMTXT:  Rest of line

             This function parses the text  to  the  end  of  the  line.
        Note  that  this  does  not  parse  the trailing break character
        (i.e. the carriage return).  The text is returned  in  the  atom
        buffer  which  is  defined  (by  the  caller) by the CSB_ABF and
        CSB_ASZ elements of the command state block.



        CFB function _CMTOK:  token

             This function will parse an exact  match  of  a  particular
        token.   A  token  is  a  string of characters, whose address is
        supplied by the caller in the CFB_DAT  element  of  the  command
        function block.    This  function  is  mainly useful for parsing
        such things as commas and other separators, especially where  it
        is one  of  several  alternative parse functions.  It returns no
        value other than its status.



        CFB function _CMUQS:  unquoted string

             This function parses an unquoted string, consisting of  any
        characters other  than  spaces,  tabs, slashes, or commas.  This
        set may of course be changed by  supplying  a  CC  table.    The
        unquoted  string  is returned in the atom buffer associated with
        the command state block.



        CFB function _CMDAT:  parse date/time

             This function parses  a  date  and/or  time.    The  caller
        specifies,  via  flag  bits  in  the CFB_FLG byte of the command
        function block (as identified above) which  of  date,  time,  or
        both, are  to  be parsed.  The date and time are returned as the
        first two ints in the atom buffer which is associated  with  the
        command state   block.    Note  that  both  date  and  time  are
        returned, regardless of which were requested.




             Note further that this routine is not fully implemented  as
        of this writing.




                           Calling the COMND library


             All  that  you need to know to use the above information is
        how to call the  command  library.    Basically,  there  is  one
        support routine:  COMND().  It is used like this:

                 status = COMND (csbp, cfbp);

             Here,  "csbp"  is  the  address of the command state block,
        and "cfbp" is the address of the command function  block.    The
        COMND()  routine  returns  an  int status value, which is one of
        the following:


         o _CROK:   The  call  succeeded;    a  requested  function  was
           performed.   The  address  of  the matching function block is
           returned in the CSB_CFB element of the command  state  block,
           and other information is returned as described above.

         o _CRNOP:  The call did not succeed;  nothing matched.

         o _CRRPT:   The call did not succeed because the user took back
           some of what had already been parsed.    In  other  words,  a
           reparse  is  required,  and  your program must back up to the
           reparse point.  Note that if  you  specify  a  setjmp  buffer
           address  in  the  CSB_RSB element of the command state block,
           you will never see this value because the COMND library  will
           execute a longjmp() operation using that setjmp buffer.

         o _CRIFC:    The  call  failed  because you provided an invalid
           function code in the command function block (or in one  which
           is chained to it).  You have made a programming error.

         o _CRBOF:   Buffer  overflow.   The atom buffer is too small to
           contain the parsed field.

         o _CRBAS:  Invalid radix for number parse.

         o _CRAGN:  You should not see this code.  It is reserved for  a
           support-mode call to the subroutine library.




                         Installing the COMND library



             This  part of the document describes the modules which come
        with the COMND library kit, and what you might have to  look  at
        if  the  code does not instantly work on your system (which will
        probably be the case if your system is not the same kind as  the
        one which you got it from).

             The files which come in the COMND kit are as follows:


         o  COMND.R  -  Source for this document, in a form suitable for
           the public domain formatting program called "roff4".

         o COMND.DOC - This document.

         o MEM.H - A file of my (Mark  Mallett)  definitions  which  are
           used by the code in the command subroutine library.

         o COMND.H - Command library interface definitions.

         o COMNDI.H - Command library implementation definitions.

         o COMND.C  -  Primary  module  of  the COMND library.  Contains
           user input buffering and various library support routines.

         o  CMDPF1.C  -  First  module  of  parse  function   processing
           routines.

         o  CMDPF2.C  -  Second  module  of  parse  function  processing
           routines.

         o CMDPFD.C - Contains the date/time  parse  function  routines.
           This  is  included  in  a  separate  module so that it can be
           replaced with  a  stub,  since  few  programs  (that  I  have
           written,  anyway)  use  this  function, and it does take up a
           bit of code.

         o CMDPSD.C - A stub for the date/time parsing functions.   This
           can  be  linked  with  programs which do not actually use the
           date/time parse function.

         o CMDOSS.CPM - Operating system specific code which  works  for
           CP/M.   This  is  provided  as a model for the routines which
           you will have to write for your system.

         o CMDDTM.CPM - Date/time support routines for  version  3.0  of
           CP/M.   This  is a module containing routines to get the date
           and time from the  operating  system,  and  to  encode/decode
           these values  to and from internal form.  This is provided as




           a model;  you will probably have to  rewrite  them  for  your
           system.