[comp.lang.fortran] Lexical analysis tricks.

steve@oakhill.UUCP (steve) (11/12/88)


In article <11112@cup.portal.com>, PLS@cup.portal.com (Paul L Schauble) writes:
> The tricks in a Fortran lexical analyzer are indeed interesting. What I'd
> like to know is: Are these documented anywhere? Surely with all of those

When I wrote the lexical analyzer for F77 Fortran, I could find no such
bag of tricks.  I should add that the reason I got pulled into the
company was that I had a good background in front-end compiler
theory.  Most of the tricks become self-evident in the heat of writing
a lexer; others took some imagination.  This is in no way a complete list
of all the tricks, just the ones I can come up with off the top of my head.

1) Read the whole statement at once into a buffer.  This solves
   any complications arising from continuation lines and embedded
   comments.  Also, since only a limited number of continuation lines
   is allowed, you can pre-allocate the buffer (Note: if they
   have extra continuation lines, then they should be willing to
   pay the price when they fill the buffer.  They probably will not
   fill up the buffer anyway, because of reason 2.)

2) As you read in the line, throw out embedded spaces.  Since they are
   not significant, they only get in the way.  This actually gives some
   spare room in your buffer.  You need to be careful not to throw out
   spaces in quoted strings and Hollerith constants; people get mad when
   you do (since Holleriths are not part of the core 77 standard, you
   could make this job easier if you don't care about supporting that
   feature).
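A rough sketch of points 1 and 2 (in Python rather than anything
period-accurate, with names of my own invention): the copy loop skips
blanks unless it is inside a quoted string.  Hollerith counting is left
out for brevity, as is doubled-quote escaping.

```python
def strip_blanks(body):
    """Copy a statement body into the buffer, dropping blanks that
    fall outside quoted strings.  Simplified sketch: no Hollerith
    constants, no doubled-quote ('') handling."""
    out = []
    in_quote = False
    for ch in body:
        if ch == "'":
            in_quote = not in_quote
        if ch == ' ' and not in_quote:
            continue            # insignificant blank: throw it out
        out.append(ch)
    return ''.join(out)
```

Run over each continuation line's body in turn, this also packs the
statement into less buffer space than the raw source occupied.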

3) Treat each line as a token sequence with an END OF STATEMENT (EOS)
   token at the end.  This will help with parsing and error catching.
   Also sending an EOS token will help the lexer know when it can read
   the next line.

4) Since you need to read the first 6 columns to tell whether the next
   statement is a continuation, store these away for future use (for the
   label, or if you allow special statements denoted by special first
   characters).  Before reading the next line, check these to see if you
   have an initial token (statement label) or a special action (e.g.
   ignoring a debug line, though you still have to read off the full
   line).
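A minimal sketch of that column-1-to-6 scan, assuming the standard
fixed-form rules ('C' or '*' in column 1 is a comment, anything but
blank or '0' in column 6 is a continuation, columns 1-5 hold the
label).  Vendor extensions like 'D' debug lines are omitted; the
function name is my own.

```python
def scan_prefix(line):
    """Classify a fixed-form line from its first six columns and
    extract the statement label, if any."""
    # Column 1 'C' (or '*') marks a comment line.
    if line[:1] in ('C', 'c', '*'):
        return ('COMMENT', None)
    prefix = line[:6].ljust(6)
    # Any character but blank or '0' in column 6 marks a continuation.
    if prefix[5] not in (' ', '0'):
        return ('CONTINUATION', None)
    label = prefix[:5].strip()
    if label.isdigit():
        return ('INITIAL', int(label))
    return ('INITIAL', None)
```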

5) Before you begin parsing, figure out what type of line you have; this
   is where most of the tricks occur.

   A) When reading a line, keep track of the parenthesis count and
      whether an equal sign occurs outside all parentheses.  If
      one does, the statement is an assignment or a DO statement.
      If an equal sign does occur at this level, a comma occurs
      at this level after the equal sign, and the first two letters
      are 'DO', then you have a DO statement.
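The equal-sign/parenthesis trick can be sketched like this (Python,
names my own; it assumes blanks are already stripped per point 2, and
quote handling is simplified -- logical IF statements would need
separate treatment not shown here):

```python
def classify(stmt):
    """Distinguish DO statements and assignments from everything else
    using the parenthesis count and the '=' / ',' positions."""
    depth = 0
    eq_outside = False       # '=' seen outside all parentheses
    comma_after_eq = False   # ',' at depth 0 after that '='
    in_quote = False
    for ch in stmt:
        if in_quote:
            in_quote = ch != "'"   # simplified: no doubled quotes
        elif ch == "'":
            in_quote = True
        elif ch == '(':
            depth += 1
        elif ch == ')':
            depth -= 1
        elif ch == '=' and depth == 0:
            eq_outside = True
        elif ch == ',' and depth == 0 and eq_outside:
            comma_after_eq = True
    if eq_outside and comma_after_eq and stmt.startswith('DO'):
        return 'DO'
    if eq_outside:
        return 'ASSIGNMENT'
    return 'OTHER'
```

Note how DO10I=1,10 comes out as a DO statement while DO10I=10 is an
assignment to the variable DO10I -- exactly the ambiguity the comma
test resolves.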

   B) For all other statements, the initial character sequence can be
      checked against a table for statementness.  Some statements
      need some pre-parsing, though.  REAL FUNCTION A(34) and REAL
      FUNCTION A(B) look the same until you find what is in the
      parentheses (assuming you allow long variable names).  The first
      declares the array FUNCTIONA to be of size 34; the second is the
      initial statement of a function A with parameter B.  Note, you
      cannot decide what type of statement you are looking for from the
      current phase of the program (declaration phase, DATA statement
      phase, executable statement phase, etc.), since both of the above
      statements could occur as the first statement of a file.
      Unfortunately I do not have the list of these available here.

   C) Anything preparsed should be put on a lex-queue for future
      reference.  No need to reinvent the wheel in the same program.
      If you are putting pre-parsed quantities on the queue, also
      preparse the ASSIGN statement up to the TO (the only occurrence
      of the TO keyword, and the only keyword appearing after the
      initial keyword sequence).  With this done, all unread tokens
      can be read by a simple lexer, and appear to be what they are.

   D) The CASE= directives in I/O statements should be treated as
      keywords.  They are checked for in the following fashion: in an
      I/O statement, at the first parenthesis turn on a flag to search
      for them.  Once the flag is on, check each identifier for a
      following equals sign.  If it has one, look it up in a table.
      Turn the flag off at the closing parenthesis.  The flag is
      necessary: CASE= is valid in other constructs, including the
      implied DO of the READ and WRITE.
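That flag mechanism might look like this sketch, using a few of the
genuine F77 I/O control specifiers (UNIT=, FMT=, ERR=, END=, REC=,
IOSTAT=) as the keyword table; the function name and structure are my
own, and the statement is assumed blanks-stripped:

```python
F77_IO_KEYWORDS = {'UNIT', 'FMT', 'ERR', 'END', 'REC', 'IOSTAT'}

def control_keywords(stmt):
    """Flag keyword= specifiers, but only inside the first
    parenthesized group (the I/O control list)."""
    found = []
    start = stmt.find('(')
    if start < 0:
        return found
    depth = 0
    token = ''
    for ch in stmt[start:]:
        if ch == '(':
            depth += 1          # flag is on while depth >= 1
            token = ''
        elif ch == ')':
            depth -= 1
            if depth == 0:
                break           # flag goes off at the closing paren
            token = ''
        elif ch.isalnum():
            token += ch
        elif ch == '=':
            if depth == 1 and token.upper() in F77_IO_KEYWORDS:
                found.append(token.upper())
            token = ''
        else:
            token = ''
    return found
```

An I= inside the implied DO of READ(5,100)(A(I),I=1,10) never gets
looked up, because the scan stops at the close of the control list.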

   E) Treat labels as integers at lex time.  At parse time it becomes
      apparent what they are.

So the next token is derived like this

  if (new statement) 
      if (initial token)
	  keep new statement flag on
	  if (token is label)
	      return label
          else if (special action)
	      do special action
      else
          Read in the new statement
          Determine what the line is (find the initial token, put other
              tokens on the lex-queue).
          Return the first token on the line.
  else (if there is something on the lex-queue)
      Pop it off the lex-queue and return it.
  else (if anything else to read in buffer)
      Lex off the next token and return it.
  else
      Return an EOS token

                   enough from this mooncalf - Steven
----------------------------------------------------------------------------
These opinions aren't necessarily Motorola's or Remora's - but I'd like to
think we share some common views.
----------------------------------------------------------------------------
Steven R Weintraub                        cs.utexas.edu!oakhill!devsys!steve
Motorola Inc.  Austin, Texas 
(512) 440-3023 (office) (512) 453-6953 (home)
----------------------------------------------------------------------------