prohaska@coors.DEC (07/04/85)
Xref: Previuosly submitted SDCL (structured DCL) compiler to net.sources
Here is the NROFF/TROFF source for the project handout I used when I had
my UNIX class work on the SDCL project. It explains the design and imple-
mentation details of SDCL. The state diagram I refer to in the section on
lexical analysis is a graph; I cannot include that. The terminals recognized
are so few that the lexical analyser can be handcoded. Look up the paper
in Software-Practice and Experience I referenced in designing the lexical 
analyser (the complete citation would appear on page 4 when you run this
file through NROFF/TROFF.)
If you donot have NROFF/TROFF available, you will need to replace NROFF/TROFF
commands with RUNOFF, TeX, SCRIBE or whatever text formatter you have access to.
Most of the NROFF/TROFF commands are from the ms macro package; their function
can be inferred easily. On particular note: the sequence "\e", i.e, backslash
followed by lower case e is how one gets a literal backslash using NROFF/TROFF.
    Sohail Aslam
       ...decvax!decwrl!dec-rhea!dec-pulsar!sprohaska
-------------------- CUT HERE -----------------------------------------------
.nr LL 7i
.nr PO .75i
.OH ''SDCL''
.EH ''SDCL''
.EF ''-%-''
.OF ''-%-''
.ND
.LP
.sp 2
.ce 
.cu
CS409 Spring 1985 Semester Project
.sp 1
.PP
We can always work on specific programs that illustrate 
the usage of a language in a programming environment. I chose to have you
write a compiler (preprocessor) for structured DCL (Digital Control Language)
for the following reasons:
.IP a.
You will use C to develop a significantly sized program.
.IP b.
The methods used in developing the lexical analyser, the recursive descent
parser, and the code generator are applicable in a number of other compiler
type applications.
.IP c.
You would develop a very useful tool, which, I hope, you will cherish.
.PP
The command language in the VAX/VMS environment is called DCL. It provides
fairly elaborate facilties for executing commands. One can write command
procedures to execute a set of commands. The grave deficiency however, is the
fact that in writing DCL command procedures, the only control struture 
provided is the 
.sp 1
.ti +5
IF condition THEN command
.sp 1
If one wants to execute more that one command based on the condition, one
has to use GOTO's (yugh!). For example, a command procedure to swap the
contents of two files in DCL could be coded as follows:
.DS
$ IF (( p1 .eqs. "" ) .or. ( p2 .eqs. "" )) then goto usage
$ copy 'p1' tmpfile
$ copy 'p2' 'p1'
$ copy tmpfile 'p2'
$ exit
$ usage: write sys$output "Usage: swap file1 file2"
$ exit
.DE
p1 and p2 are first and second args to the command procedure. 'p1' and 'p2'
are akin to UNIX shell's $1 and $2, i.e, the values of the parameters. 
The construct \fBusage:\fP is label that can be targeted by \fBgoto\fP's
(yugh!).
We need not get into any details of DCL. The point to note is that we had to
resolve to using GOTO's to get the task done. What we want to do is to write
a program that would allow one to write the same command procedure as
follows:
.DS
if (( p1 .nes. "" ) .and. ( p2 .nes. "" )) {
    copy 'p1' tmpfile
    copy 'p2' 'p1'
    copy tmpfile 'p2'
}
else
    write sys$output "Usage: swap file1 file2"
.DE
I think you can easily infer our goal. We want an extension to DCL which
will provide such control structures. Others we may want are the while and
the for loops, break and next statements, and more if we become adventurous.
We are really providing a small programming language. The compiler will then
do a simple translation from the structured DCL to the form which VMS DCL
recognizes.
.PP
One convenient way to describe a programming language is the Backus-Naur
Form (BNF), which is a formal specification of the grammer of a language,
that is, the set of rules by which a legal program in the langauge is
written or recognized. Given a grammer, one can write a program that will
analyse or 
.ul
parse
programs in that language. Fortunately, the grammer for s-DCL (the name is
still debatable) is small:
.KS
.DS
program   : statement
	  | program statement
statement : \fBif (\fP condition \fB)\fP statement
          | \fBif (\fP condition \fB)\fP statement \fBelse\fP statement
	  | \fBwhile (\fP condition \fB)\fP statement
	  | \fBdo\fP statement \fBwhile (\fP condition \fB)\fP 
	  | \fBfor (\fP intialize \fB;\fP condition \fB;\fP reinitialze \fB)\fP
	         statement
	  | \fBbreak\fP
	  | \fBnext\fP
	  | \fB{\fP program \fB}\fP
	  | \fBother\fP
.DE
.KE
The first two lines say that a program is a statement, or a program followed
by a statement. In other words, a program consists of one or more statements.
A statement in turn is one of a handful of constructs; the vertical bar |
indicates a choice of alternatives. Most statements are straightforward
enough, standing for an occurence of the particular keyword. For example, a 
statement can consist of the keyword \fBif\fP followed by a parenthesized
\fBcondition\fP and a \fBstatement\fP. (The definition is recursive, as is the 
definition of \fBprogram\fP.) A group of statements in braces can be used
anywhere a single statement can, becuase of the rule
.sp 1
.DS
statement : { program }
.DE
.sp 1
This is the so-called \fBcompound\fP statement. \fBbreak\fP causes an exit
from one level of a \fBwhile\fP or \fBfor\fP loop. \fBnext\fP causes the 
next iteration of the loop to begin. In a \fBwhile\fP it goes immediately
to the \fBcondition\fP part; in a \fBfor\fP, it goes to the \fBreinitialize\fP
step.
.PP
The last grammetical type is an \fBother\fP, which is anything that wasn't
recognized as many of the preceding types. This category actually encompasses
most of the DCL. For example, the statement
.DS
copy 'p1' tmpfile
.DE
is nothing recognizable by s-DCL, and is thus an \fBother\fP.
.PP
Type \fBother\fP is an important simplification, for it frees s-DCL from
having to know very much about DCL. If a statement is encountered which
does not begin with one of the keywords (or a left brace), it
.ul
must
be an \fBother\fP, and no real processing is needed on it, other than emitting 
it with a $ preceeding it. The price of this simplication is that the error
detecting abilities of s-DCL are not as good as they might be with a more
comprehensive grammer. This is not a serious drawback, however, since we
are translating into DCL, and DCL is perfectly capable of detecting any
syntax errors that escape the preprocessor.
.PP
In principle, associated with each rule of the grammer is a \fBsemantic
action\fP which states what is to be done when that particular construction
is recognized in the program being translated. In s-DCL, the semantic
actions are usually very simple, involving reformatting the incoming
text and occasionally interspersing \fBif\fP's, \fBgoto\fP's and labels
to translate the control flow statements.
.PP
The preprocessor is organized as follows. The top level is a controlling
routine called the
.ul
parser,
so called because it controls by analysing (parsing) the grammatically
structure of the input it sees. For instance, when an 
.ul
if
is seen the parser calls a routine that handles
.ul
if
statements. That routine in turn isolates the 
.ul
condition
part and generates the test, a semantic action. The parser must also
remember that an
.ul
if
has been seen so that when the end of the
.ul
statement
part is reached, the correct terminating code for an
.ul
if
can be produced. This may include dealing with an
.ul
else
if it is prsent. Furthermore, all this is inherently recursive, as the 
BNF grammar indicates, because constructions can be nested as in
.sp 1
.DS
.KS
for( i = 1; i .le. n; i = i + 1 )
    for( j = 1; j .le. n; j = j + 1 )
	if( p1 .eqs. "copy" )
	    append 'i' 'p2'
.KE
.DE
.PP
At the beginning of each statement, the parser calls a "lexical analysis"
routine to classify it into one of the types specified in the grammer. The
lexical routine returns the first token of the statement, which will determine
the statement type. When the statement has been classified, the parser calls
the appropriate code generation routine.
.PP
We will begin by describing the the lexical analyser, since it is essentially
independent of everything else. Then we can present parsing and code
generation more easily.
.NH 
LEXICAL ANALYSIS
.PP
The purpose of the this stage is to provide a procedure that returns the next
token and its classification (tokencode) from the input.
Create a function called 
.ul
lex
which would begin as follows
.sp 1
.DS
lex(token)
char *token;
{ 
   ....
   ....
   return(tokencode);
}
.DE
The formal parameter "token" will be loaded with the actual string that
makes up a token, e.g, "if", "else", "*", "34587" etc. The value returned
by the function "lex" will the tokencode for the token. These tokencode
classify the input into catagories that are of interest to the parser.
In your case, this classification consists of ID, integers, singlechar,
comment, whitespace, newline, EOF and string.
Some of the ID's (identifiers) are special: they are keywords e.g, "if",
"else", "while", "for", "next" and "break". So if an identifier is found,
you will need check whether it is a keyword or not. The classfication
"singlechar" also emcompasses a number of single character tokens, e.g
"(", "{", "*", "+", ".", "," ";" etc. If one of these is found, you will
again need to check whether it happens to be one of the special one character
tokens e.g, "(", ")", "{", "}", "#" etc.
.PP
A very convenient scheme to implement a table driven lexical analyser
appeared in the following paper
.DS
Dedourek J. M., Gujar U. G., "Scanner Design", Software-Practice
and Experience, Vol. 10., 959-972 (1980)
.DE
The authors of this paper present a step by step method for developing
a lexical anaylser once a state diagram is at hand. Use the method
outlined in the paper.
Here are some
guidelines for you to follow, if you wish, when you begin to write the
"lex" function.
.PP
You may be tempted to define a new enumerated type that contains the
character classes and use this to declare your two dimensional array.
The only problem is that in C array indicies can only be integers. Unlike
pascal, you 
.ul
cannot 
use enumerated types to do things like
.DS
enum charclass { letter, digit, onechar, slash, 
		 whitespace, newline, eof, dqoute, star }
enum charclass nextstate[0..12][letter..star];
.DE
.PP
Assuming that you got your tables developed by hand, here's what "lex"
might look like.
.sp 1
.DS
/*
lex -- return token and tokencode from standard input
The lex routine is table driven. The two tables "nextstate" 
and "output" were developed by hand. Because no other 
routine needs to know what's in the tables, they are 
declared "static".
*/
/* 
first, defines to avoid using numbers for charclasses 
*/
#define  LETTER     0  /* [A-Za-z_$] (notice _ and $ are 
			  considered to be characters */
#define  DIGIT      1  /* [0-9] */
#define  ONECHAR    2  /* (){}[]'.;:#%?\e */
#define  SLASH      3  /* /  */
#define  STAR       4  /* * */
#define  WHITESPACE 5  /* blank,tab,non-printing chars, except
			  NEWLINE.
			*/
#define  EOL        6  /* \en */
#define  DQUOTE     7  /* "  */
#define  ENDFILE    8  /* don't use EOF */
/* 
declare and initialize the tables here. Use the correct size, the
following is just an example
*/
static int nextstate[14][8] = {
    { 0, 1, 4, ...................... , 12 },   /* first row */
    { 1, 1, 4, ...................... , 13 },   /* second row */
	     .
	     .
	     .
    { 2, 2, 4, ...................... , 12 }   /* last row  */
};
static int output[14][8] = {
    ....
    ....
};
/*
Use defines to set up token codes. It may better to put the 
following defines in a file, say "tcodes.h" and include  it
here because you would need the codes in your parsing routine
also.
So either
*/
#define  IF      0  /* keywords first, if */
#define  ELSE    1  
#define  WHILE   2
#define  FOR     3
#define  BREAK   4
#define  NEXT    5
#define  DO      6
#define  ID      7  /* if identifier is not a keyword */
#define  INTEGER 8
/* special single characters */
#define  OPAREN  9
#define  CPAREN  10
#define  OBRACE  11
#define  CBRACE  12
#define  SCHAR   13  /* single character otherwise */
#define  COMMENT 14
#define  WSPACE  15  /* blanks,tab,non-printing characters */
#define  NEWLINE 16
#define  FILEEND 17  /* can't use EOF */
#define  STRING  18  /* "...." */
/* or, if codes are in file "tcodes.h" */
#include "tcodes.h"
/*
variables that need to retain their values across calls to lex.
*/
static int state = 0;
static int nextchar;
lex(token)
char *token;
{
    int i, out;
    /* other declarations as needed */
    /* here's the psuedo code. See the flow chart on p967 */
    /*
      i = 0
      token[i] = '\e0'
      loop
	  if state != 0 then
	      if i < MAXTOKENLEN then
		  token[i] = nextchar
		  i = i + 1
	      endif
	  endif
	  nextchar = getchar()
	  class    = findclass(nextchar)
	  out = output[state][class]
	  state = nextstate[state][class]
	  exit loop if out == 0
      endloop
      token[i] = '\e0'   /* mark end of string */
      /* out contains the tokencode. If it is ID, 
	 check for keywords by searching table 
	 of keywords.
      */
      if out == ID then
	  out = iskeyword( out, token )
      else if out == SCHAR   /* single character */
	  out = isspecialonechar( out, token )
      endif
      return( out )   /* here's the token code */
} /* end of lex */
/* your keyword lookup routine and single character checking 
routines go here 
*/
.DE
.NH
ASSIGNMENT 1
.PP
Once you have your lexical part, test it out. As the first part of the
project assignment, write a driver that calls your lex function. Feed it
input from 
.ul
/users/sohail/hwk/bun.vms. 
Your driver will be set up as follows
.DS
main()       /* test of my lex */
{
    int tokecode;
    char token[81];    /* or whatever maximum you choose */
    extern int lex();  /* because lex is on a separate
			  file, you need this declaration
		       */
    while ( (tokencode = lex( token ) ) != FILEEND )
	printf( "%3d:  %s\en", tokencode, token )
}
.DE
Submit your driver routine, your lex files and the output as the first
part of the assignment. 
.NH
PARSER
.PP
Given the BNF grammar, you will write a
.cu
recursive descent
parser that will parse the input language (s-dcl) and emit DCL. First let
me describe how such a parser is developed; I'll come to code-generation
later.
.PP
A few notes about terminolgy. The grammar consists of \fBproduction rules\fP,
which essentially says that the left hand side is made up by gathering the
stuff on the right hand side, e.g
.DS
statement : \fBif (\fP condition \fB)\fP statement
.DE
This is a grammar rule that says that a \fBstatement\fP is made up of
\fBif\fP, followed by open-paren, followed by a condition, followed by
a close-paren followed by a statement. The components of a grammar rule
can be divided into two catagories: \fBterminals\fP and \fBnon-terminals\fP.
Things that are atomic, e.g keywords like \fBif, else, while, next\fP etc are
\fBterminals\fP because these cannot be broken down any further. Things like
\fBstatement, program, condition\fP are called \fBnon-terminals\fP because
they consists of further components, and this breakdown is indicated by a
grammar rule. What the parser has to do is to gather up terminals by calling
lex, and attempt to determine whether these terminals (tokens) can be grouped
to match a grammar rule. The highest non-terminal that is ultimately matched
is \fBprogram\fP. A recursive descent parser really begins with the highest
level and works it way down. The actual coding works as follows: for each
\fBnon-terminal\fP, you write a procedure. So in our case, you'll need 
procedures 
.ul
program,
.ul
statement
.ul
condition
.ul
initialize
.ul
reinitialize
and
.ul
other.
These procedures will be invoked at various stages based upon the 
\fBterminals\fP that are seen. Let's assume that we write another
function called 
.ul
scan
that calls lex. Scan will check to see if the token returned is a \fBCOMMENT,
WSPACE\fP or \fBNEWLINE\fP. If so, it will call lex again until it returns
a token that is not one of the above. Such a function comes in handy when
your parser is looking for keywords or special characters. The function lex
will still be available to the parser. Given this, here's how your parser will
look like. I have not put in statements that will generate the actual code.
Once you see the structure of the parser, you'll immediately realize how and
where the actual code generation will take place.
.DS
#include <stdio.h>
#include "tcodes.h"      
/*
Because a number of routines will want tokencode and token,
make them global.
*/
int tokencode;
char token[133];
scan()       
    /* 
    call lex and return a token that is not COMMENT,
    WSPACE or NEWLINE.
    */
    {
	do {
	    tokencode = lex( token );
	} while ( tokencode == COMMENT || tokencode == WSPACE 
		||tokencode == NEWLINE );
    }
main()
    /*
    This is the the top level routine. Because the rotuines
    it needs are either ahead in the file, or somewhere else,
    it need to have extern declarations.
    */
    {
	extern int statement();
	/* 
	keep going until EOF but first, get the 
	first non-blank token 
	*/
	scan();
	while( tokencode != FILEEND )
	    statement();
    }
statement()
    /*
    based on keyword seen or not seen, invoke the appropriate
    routine to process a statement.
    */
    {
	extern int ifstmt(), whilestmt(), dowhilestmt();
	extern int forstmt();
	extern int breakstmt(), nextstmt(), program();
	extern other();
	switch (tokencode) {
	case IF: 
	    ifstatement();  /* process rest of if-stmt */
            break;
	case WHILE:
	    whilestmt();
	    break;
	case DO;
	    dowhilestmt();
	    break;
	case FOR:
	    forstmt();
	    break;
	case BREAK;
	    breakstmt();  /* this would be tricky */
	    break;
	case NEXT:
	    nextstmt();
	case OBRACE:
	    program();  /* process { stmt stmt ... stmt } */
	    break;
	
	default:
	   otherstmt();
	   break;
        }
        if (tokencode != FILEEND)
	    return(true);       /* still more to come */
    
        return(false);   /* no more statements */
    }
ifstmt()
    /*
    process an if stmt. Look for an else part too before 
    returning
    */
    {
	extern int condition(), statement();
	scan();   /* get the next token */
	if( tokencode == OPAREN ){
	    /*
	    skip over OPAREN so conditon() gets the
	    next token
	    */
	    scan();
	    condition();    /* process the condition part */
	}
	else   /* missing "(", try anyway */
	    condition();
	
	/*
	The tokencode should be CPAREN here.
	*/
	if( tokencode == CPAREN){
	    scan();   /* it was, skip it */
	    /* here's a recursive call */
	    statement();
	}
	/*
	check for an else part
	*/
	if (tokencode == ELSE ){
	    scan();       /* get next token */
	    statement();  /* more recursion */
	}
	
	return;
    }
.DE
.bp
.NH
Assignment 2
.PP
It is easy to see that the parser is nothing more than the grammar rules
coded in C. For the second part of the project I want you to write the
recursive descent parser using the grammar previously specified in the
handout. Don't worry about error checking
or code generation. Pattern your functions similar to the examples in the
previous handout. When any of your function recognizes a statement, have
it print out a message indicating so. For example, right before the closing
curly brace that marks the end of the "ifstmt()" function of the first
handout, such a message can be produced with:
.DS
printf("Parsed an if statement\en");
.DE
.PP
There are a few consideration that you will have to keep in mind when
you code you parser. 
.IP 1.
When parsing a statement that belongs to the
category \fBother\fP, the function you invoke will need to gather tokens
by calling 
.ul
scan
or
.ul
lex
until a \fBnewline\fP. We will have the s-dcl 
programs follow the syntax that a newline will mark the end of a statement. 
The C code may look like this:
.DS
other()
    /*
    Parse statement that is none of the special statements.
    Newline will mark the end. Note that when we will come 
    to code generation, the blanks or tabs seen will
    be important. We can't just eat them up. Otherwise 
    statements like 
	 copy 'p1' 'p2'
    which is legal DCL, will come out as
	 copy'p1''p2'
    which is not legal DCL.
    */
    {
	extern int lex();
	while( tokencode != NEWLINE && 
	       tokencode != FILEEND )
	       tokencode = lex(token);
	/* 
	skip over the NEWLINE that made us stop.
	*/
	scan();
	printf("Parsed and other statement\en");
    }
.DE
.IP 2.
The function 
.ul
program()
is responsible for parsing compound statements. A compound statement
according to the grammar is statements enclosed in balanced curly
braces. This lends itself to the arrangement, as far as the function
.ul
program()
is concerned, as follows:
.DS
program()
    /*
    Parse a compound statement. When this function is 
    invoked the token is a  {. Fire up a loop that 
    terminates until the closing brace that matches the 
    opening brace is seen. Notice that if the compound 
    statement contains another compound statement, 
    recursion will take care of it.
    */
    {
	/* skip over the { */
	scan();
	while( tokencode != CBRACE ){
	    if( tokencode == FILEEND ){
		printf("Fatal error. Premature EOF\en");
		/* the c library function exit forces
		   immediate abort of the program.
		*/
		exit(1);  
	    }
	    /*
	    otherwise, call stmt().
	    */
	    stmt();
	}
	/*
	skip over the closing curly brace that stopped the
	while loop.
	*/
	scan();
    }
.DE
.bp
.NH
Assignment 3\ --\ Code\ Generation
.PP
If your recursive descent parser can recognize the statements and classify
them as an \fBif, while, for, other\fP etc, a few additions can now be made
to generate code. Remember that VMS DCL does not have all the control
structures your parser can recognize. Its job now is to take these control
structures and by using \fBIF-THEN-GOTO\fP construct, which DCL does 
recognize, translate the s-dcl code into DCL code.
In this assignment, you will implement the code generator. This also happens
to be the last phase of the project. 
Turn in a listing of the entire program. This includes the
lexical analyzer. Make sure that you organize the program in a modular
fashion. C supports independent compilation; I expect that you would put
this facility to use. 
The data for your parser is in "/users/sohail/hwk/bun.vms". Turn in the
generated code.
.NH
Code Generation Rules
.NH 1
\fBif\fP statement
.PP
The translation of
.DS
if ( condition ) statement
.DE
is essentially
.DS
if ( condition is not true ) go around statement
.DE
Thus when an \fBif\fP is encountered, we must
.DS
isolate the \fBcondition\fP part
generate and save some unique label L
output " \fBif( .not. (\fP condition )) \fBthen goto L\fP
.DE
In DCL, \fB.not.\fP inverts the truth value of the condition. When we get to
the end of the statement that follows the \fBif\fP, there are two 
possibilities. If there is no \fBelse\fP following, we need output only
.DS
$ L:
.DE
If an \fBelse\fP follows, however, we must generate another label \fBL1\fP
and output
.DS
$ goto L1
$ L:
.DE
to branch around the \fBelse\fP part, and then, after whatever statement 
follows the \fBelse\fP,
.DS
$ L1:
.DE
to terminate the \fBif-else\fP construction. In summary the code generation
for
.DS
\fBif( \fP condition \fB)\fP statement
.DE
is
.DS
\fB$ if(.not.( \fPcondition \fB)) then goto L\fP
$ statement
\fB$ L:\fP
.DE
and for
.DS
\fBif( \fP condition \fB)\fP statement1 \fBelse\fP statement2
.DE
is
.DS
\fB$ if(.not.( \fPcondition \fB)) then goto L\fP
$ statement1
\fB$ goto L+1\fP
\fB$ L:\fP
$ statement2
\fB$ L+1:\fP
.DE
In a DCL command procedure, every statement must begin with a $ sign.
Notice that this is opposite of what shell command procedures require.
That is why every line we generate, we must output a $ as the first
thing. If a line does not begin with a $ sign, DCL takes that to mean the
this line is a 
.ul
continuation
of the previous line. You will run across cases where the generated code
may turn out to be too long to fit on a 80-character line. It will better
to break up the line into two or more lines. To indicate continuation,
each line will need to end with minus sign "-" and the next line which
contains the continuation must not begin with a $ sign. So the following
will be taken as a single statement by DCL.
.DS
$ if( .not. (p1 .eqs. "" .and. -
p2 .eqs. ""))-
then goto 23002
.DE
In DCL, labels are specified according the following regular expression
.DS
[a-zA-Z0-9_$][a-zA-Z0-9_$]*:
.DE
that is, a label can consist of a string of letters or digits or underscore
or the dollar sign followed by a colon. It is the colon that differentiates
a label from other entities. We need to generate unique labels when we do
the code generation. To do so, the best choice is to use labels that are
made up of integers, starting at 23000 or so. For a new label, all we
need to do is increment the current value of the label and output this
incremented value as a string of digits, appended with a colon. I will come
to the details in a little while. First, let's finish the code generation
scheme.
.NH 1
\fBwhile\fP statement
.PP
The code for
.DS
\fBwhile( \fPcondition \fB)\fP statement
.DE
is
.DS
\fB$ L: if(.not.(\fP condition \fB)) then goto L+1\fP
$ statement
\fB$ goto L
$ L+1:\fP
.DE
\fBL+1\fP means the value of the label here is \fBone plus\fP the
value held by L. If the \fBwhile\fP contains a \fBnext\fP statement,
it would merely goto L. L+1 will serve as the target for \fBbreak\fP
statement.
.NH 1
\fBdo while\fP statement
.PP
The code for
.DS
\fBdo\fP statement \fBwhile (\fP condition \fB)\fP 
.DE
is
.DS
$ \fBL:\fP
$ statement
$ \fBL+1: if (\fP condition \fB) then goto L\fP
$ \fBL+2:\fP
.DE
\fBL+1\fP will serve as the target for any \fBnext\fP statment. \fBL+2\fP
will the target for any \fBbreak\fP statement.
.NH 1
\fBfor\fP statement
.PP
The code for
.DS
\fBfor( \fPinitialize\fB;\fP condition\fB;\fP reinitialize\fB )\fP
    statement
.DE
is
.DS
$ initialize
\fB$ L: if(.not.(\fP condition \fB)) then goto L+2\fP
$ statement
$ \fBL+1:\fP reinitialize
$ \fBgoto L
$ L+2:\fP
.DE
As with C, the intialize and reinitialize parts can only be a single
statement. Compound statements are not allowed.
The reason for generating L+1 is that if the statement part includes
a \fBnext\fP statement, it would be translated into \fBgoto L+1\fP, i.e,
L+1 is the target for any \fBnext\fP statement. A \fBbreak\fP statement
will generate a \fBgoto L+2\fP.
Note that the \fBfor\fP the initialize, condition and reinitialize parts
are all optional. The following is an infinite loop.
.DS
for( ; ; )
    stmt
.DE
In which case the generated code would simply be
.DS
$ L:
$ stmt
$ L+1:
$ goto L
$ L+2:
.DE
Notice that all the labels needed for a \fBfor\fP statment are still
generated because the body of the loop could include \fBbreak\fP and
\fBnext\fP statements. However, the initialize part, the "if(.not.(.." 
and the reinitialize portions are missing. A simple way to determine 
whether any of these parts are missing is as follows: after the open
paren, get a non blank token. If it is semicolon, then don't emit any
initialize statement. Then function
.ul
condition
is normally invoked but before calling condition, check what the
next non-blank token is. If it is another semicolon, donot call
condition and neither generate the "if(.not.((" before condition
nor generate the "then goto L+2" after the condition.
.NH 1
Code Generation example
.PP
The following illustrates how the translation scheme will work.
.DS
/* this s-dcl program is an example */
if (p1 .eq. p2){
    if (p1 .eqs. "greet")
	write sys$output "hello world"
    count = 20
    while (count .gt. 0)
	write 'count'
}
else
    for (count = 30; count .gt. 0; count=count-1){
	if (count .ge. 10)
	    next  /* goto reinit part */
	write sys$output "Count ''count'"
    }
.DE
Here is what should be generated by your compiler.
.DS
$ if(.not.(p1.eq.p2)) then goto 23000
$ if(.not.(p1.eqs."greet")) then goto 23002
$ write sys$output "hello world"
$ 23002:
$ count = 20
$ 23003: if(.not.(count.gt.0)) then goto 23004
$ write 'count'
$ goto 23003
$ 23004:
$ goto 23001
$ 23000:
$ count = 30
$ 23005: if(.not.(count.gt.0)) then goto 23007
$ if(.not.(count.ge.10)) then goto 23008
$ goto 23006
$ 23008:
$ write sys$output "Count ''count'"
$ 23006: count=count-1
$ goto 23005
$ 23007:
$ 23001:
.DE
How you place the labels and the statements that follow is a matter of
choice. Keep in mind that every line generated must begin with a $ sign
and that the label must terminate with a colon. If the generated line is
too long to fit in 80 columns, put a minus "-" at the end and continue on
the next line of output \fBwithout a $\fP.
.NH 1
\fBOther \fPstatement
.PP
If a s-dcl line does not belong to any of the above constructs, it must
\fBother\fP, i.e, it holds no special meaning to your compiler. In this
case all you need to do is emit a $ sign followed
by the rest of the tokens found until the end of the line is 
seen. Few considerations you must keep in mind. 
Whitespaces are significant
when you find a line that belongs to the \fBother\fP class. So call 
.ul
lex
and \fBnot\fP
.ul
scan
to get the tokens until end of line. If you look at the example code I gave
you in a previous handout for the function 
.ul
other
you'll notice that I call 
.ul
lex.
One liberty you may wish to take is that multiple blanks or tabs may be
compressed into a single blank or tab. So before emitting the token string,
you could give token to a function, say compress(token), that would
compress token into a string containing a single blank or tab.
If the token happens to a comment, do not emit it.
.PP
Sometimes, a need may arise where one may want to pass a line through your
compiler 
.ul
untouched.
To do so, such a line will begin with a pound "#" sign. Surely this line
of input will first be classified as \fBother\fP. In your function
.ul
other
check to see if the tokencode happens to be POUND. If so, eat up the
pound sign and emit the rest of without touching, i.e, no $ sign at the
beginning, no blank of tab compression. Just call lex repeatedly and
output whatever token is received until end of line.
.NH
Handling \fBnext\fP and \fBbreak\fP
.PP
When a \fBbreak\fP or a \fBnext\fP statement is seen, you need to generate
the right goto statements that will transfer control to the correct
point. The catch is: how do find out whether you are within the body of
a loop (while, for) and how far deep? If for example, a while statement is
nested within another while statement, then a \fBbreak\fP statement will
cause exit from the inner while loop. The \fBbreak\fP statement will get
translated into a goto statement. But where is this goto going to go? The point
is that along with code generation, you will need to write code that
remembers all the loops that were seen and the labels associated with the
loops. Any time a \fBbreak\fP or a \fBnext\fP statement is seen, this
code could be called upon to determine whether you are in a loop and what
labels have been generated for the loop. A convenient way to do this is
to use a
.ul
stack
that stacks the loop type and the associated label. Here is how it may
work. Every time you see a while or a for loop, stack it with the label
that preceeds the condition
"if(.not.(.."
In the example above, such labels are 23003 and 23005. If while parsing
the statement that forms the body of the loop you see a \fBnext\fP or
\fBbreak\fP, peek at the top element of the stack. If there is one then
look at the associated label. Let's call this label L. In case of
a while loop, the next statement will imply that you generate 
"$ goto L" and break will generate "$ goto L+1". If the stack
element is a for loop then next will generate "$ goto L+1" and break
will generate "$ goto L+2". Notice that you will peek at the top element
not pop it. When you parsed the body of the loop, pop the top element of
the stack.
.PP
The function 
.ul
whilestmt()
could be coded as follows.
.DS
whilestmt()
    {
        extern int stmt(), push(), pop()
        extern int genlab(), errmsg();
        extern int emitlabel(), emitstring();
	int lab, looptype, clabel, junk;
	scan();  /* skip the key word */
	lab = genlab();  /* get a unique label */
	/*
	  The target of the goto after the condition
	  part will be label lab+1. Need to increment
	  the next available label from genlab() such
	  lab+1 is also used up.
	*/
	junk = genlab();
	/* 
	  first char on the line has to be $
	*/
	emitstring("$ ");
	/*
	  emitlab will output "xxxxx: " where xxxxx is the
	  value passed thru lab.
	*/
	emitlabel( lab );
	emitstring(" if(.not.(" ); /* the condition part */
	/*
	  get the condition part within parenthesis.
	*/
	if (tokencode == OPAREN )
	    scan();
	else
	    errmsg("Missing ( in while condition" );
	condition();
	/*
	  the token should a close paren at this point.
	*/
	if (tokencode == CPAREN ){
	    scan();   /* skip it */
	else
	    errmsg("Missing ) following condition in while");
.DE
.DS
	/* emit "then goto lab+1" any way */
	emitstring( "))then goto " );
	emitlabel( lab+1 );
	emitstring("\en");  /* newline to finish off line */
	/* push a new entry onto the loopstack for this loop */
	push( WHILE, lab );
	/* and now parse the loop body */
	stmt();
	/* things at the bottom of the loop */
	emitstring( "$ goto "); emitlabel( lab ); 
	emitstring( "\en" ); emitstring("$ ");
	emitlabel ( lab+1 ); emitstring("\en");
	/* pop the stack. Notice that the addresses are being
	   passed so that pop() can return values through its
	   parameters.
        */
	pop( &looptype, &clabel ); /* values not used here */
    }
.DE
.PP
Develop a separate module that implements the stack routines. You would
need 
.ul
push,
.ul
pop,
and
.ul
peek.
There are a number of ways to define the data structure for the stack.
However, I am going to ask you to set up the stack as singly-linked list.
Here are the types and variable definitions that will do the job.
.DS
struct nodetype {
    int looptype;      /* WHILE or FOR */
    int label;         /* the value of associated label */
    struct nodetype *prev; /* pointer to the previous node */
};
/*
 keep stack private global. top=0 means empty stack
*/
static struct nodetype *top = 0;
/*
 The following defines a macro the yields the size of
 structure "nodetype".
*/
#define NODESIZE  sizeof( struct nodetype )
.DE
The function push and pop can be coded as follows.
.DS
push( ltype, looplab )
    int ltype, looplab;
    {
	struct nodetype *ptr;
	extern char *malloc();
	/*
	 just like pascal's NEW procedure, need to
	 allocate storage for the node and have "ptr"
	 point to it. Notice that malloc returns pointer
	 to char. Need to coerce into a pointer to
	 struct nodetype.
	*/
	ptr = (struct nodetype *)malloc( NODESIZE );
	if( ptr == 0 ){
	    errmsg("Out of memory in stack push - abort");
	    exit(1);
	}
	/*
	 store values and set up the links.
	*/
	ptr->looptype = ltype;
	ptr->label    = looplab;
	ptr->prev     = top;
	top           = ptr;  /* the new top elt */
    }
pop( pltype, plab )
    int *pltype, *plab;
    {
	struct nodetype *ptr;
	if ( top == 0 ){  /* empty stack */
	    errmsg("Attempt to pop empty stack" );
	    return( 0 );  /* return false */
	}
	*pltype = top->looptype;
	*plab   = top->label;
	ptr = top; /* save current top for deallocation */
	top = top->prev; /* the new top elt */
	free( (char *)ptr );  /* free up memory */
	return( 1 );  /* return true */
    }
.DE
I am assuming that you can write the function "errmsg" which will print
out its argument followed by newline. Make sure that you understand all the
casting that is going on to convert pointers to the correct type. The
function 
.ul
malloc
and
.ul
free
are part of the standard C library.  Malloc's argument is an 
.ul
unsigned int
which indicates the number of bytes you want allocated. It returns a
char pointer to the first byte. If these bytes are going to used for other
than storing characters (a character takes up a byte) you will have to
coerce the pointer returned. The function 
.ul
free
takes a character pointer and frees the memory pointed. 
.NH 1
Continuation in s-dcl code
.PP
We should provide a facility in s-dcl such that source code lines can be
continued on more than one line by using a continuation character. The
scheme used in C pre-processor, for example, is that if a line ends with
a backslash (\e), then the next line is taken to be part of the current
line. Have your parser follow the same convention, i.e, ignore the newline
character at the end of a line of input if the last token on the line is
a backslash. The use of this feature might be something like this.
.DS
/* example of a long s-dcl line */
if( p1 .eq. p2 ){
    mail/subject="news on ada compiler for vax"\e
	 user3$:[engr.thesis.ada]news.txt\e
	 opus::vaxa::clipper::sysmanager
}
.DE
The source code line that begins with "mail" will be classified as
\fBother\fP. The code to be generated is simply the source line
copied out to output (with the $ of course) but because the token
immediately before the newline is a backslash, your parser will simply
retrieve the newline and discard it. 
.PP
Please make the addition to your lex routine to return backslash as a
separate token instead of claiming that it is just one of the single
character token.
.NH 1
Output Routines
.PP
You noticed that I used calls to routines
.ul
emitlabel
and
.ul
emitstring
to generate the output code. Having a set of routines to generate the
actual output is always beneficial in that all the nitty gritty details
of output formatting can be hidden away. The parser does not need to worry
about things like whether a line of output has gone past 80th column, how
an integer label will be put as a string of characters with a colon appended.
Create a module with a set of output routines. The arrangement best suited
for such a module will be to have the output lines collect character at a
time in char array and a pointer that acts as an index into this array.
Here is how the module may be coded.
.DS
/*
    OutputRoutines. This module exports routines that handle the
    actual text that is destined to be the output code.
*/
#define MAXCOL   80
#define MAXBUF   MAXCOL+1    /* one more for NULL */
static char outbuf[MAXBUF]   /* output lines collected here */
static int  outp = (-1);     /* last position filled in outbuf */
outdone()
    /*
    outdone flushes outbuf and resets outp to -1. It is
    called at the end of various statements, and is also
    called by outch to flush a line prior to starting a 
    continuation. outdone is the only routine that actually
    produces output.
    */
    {
	outbuf[++outp] = '\en';
	outbuf[++outp] = '\e0';
	printf("%s", outbuf );
	outp = (-1);
    }
outch(c)
    char c;
    /*
    outch enters characters into outbuf. It is responsible 
    for handling continuation lines according to DCL
    conventions. If the character c happens to newline,
    outch calls outdone to flush the current line.
    */
    {
	if( c == '\en' )
	    outdone();
	else if( outp == MAXCOL-2 ){  /* 0..78 filled */
	    outbuf[++outp] = '-';     /* DCL continuation */
	    outdone();
	    /*
	    start the next line without the $ sign.
	    */
	    outbuf[++outp] = c;
	}
	else outbuf[++outp] = c;
    }
emitstring(s)
    char *s;
    /*
    output the entire string s by repeated calls to
    outch.
    */
    {
	char c;
	while( (c = *s++ ) )
	    outch(c);
    }
emitlabel( lab )
    int lab;
    /*
    convert integer label to a string of digits, append
    a colon and output the label.
    */
    {
	char s[8]; /* 5 for label, 1 for :, 1 for blank and
		      1 for NULL */
	/*
	call itoa to convert lab to string. We had this
	routine as an example in one of the lectures.
	It is listed on page 60 of the C book by K&R.
	*/
	iota( lab, s );
	/*
	append a colon, blank, and then null terminate.
	*/
	s[5] = ':';
	s[6] = ' ';
	s[7] = '\e0';
	emitstring( s );   /* here goes the label */
    }
.DE
.PP
When emitting a quoted string, it may be desirable that the string is not
broken across the output line. Write a special function
.ul
emitqstring
that will be called to handle quoted strings only. It will check to see
if the quoted string will fit on the current line of output. If not, it
will append "-" to the outbuf array, flush it out through outdone, and
put the quoted string afresh in the character buffer.
.bp
.PP
Here is a listing of "bun.vms" that you will use as input to your
program. You can invent your own input when you are testing your
program.
.DS
/*  Bun -- VMS DCL command procedure to bundle files into     */
/*         distribution package which can then be unbundled   */
/*         using UNIX shell. The output will be placed on the */
/*         on the file given as the arg to this procedure     */
if( p1 .eqs "" ){
    write sys$output\e
    	"Usage: bundle outfile (outfile will receive bundle)"
    exit    /* DCL exit */
}
/* if the file exists, open it, otherwise create it */
open/write/err=out_err fout 'p1'
exist := "TRUE"
out_err:
if( exist .nes. "TRUE" ){
    create 'p1'
    open/write/err=give_up fout 'p1'
}
q := "'"
for( rc = 0; ; ){    /* no condition, no reinit */
    inquire infile "File? "
    if( infile .eqs. "" )
	break        /* time to wrapup */
    open/read/err=infile_err inf 'infile'
    write fout "echo ''infile' 1>&2"
    write fout "cat >''infile' <<''q'END OF ''infile'''q'"
    rc = rc + 2  
    done = 0
    while( done .eq. 0 ){
	read/end=eof inf line
	write       fout line
	rc = rc + 1
    }
.DE
.DS
    eof: close inf
    write fout "END OF ''infile'"
    rc = rc + 1
    next
    /*
     come here if trouble opening 'infile'
    */
    infile_err: write sys$output \e
		   "error opening ''infile'"
}
if( rc .gt. 0 ){
    write sys$output "''rc' records written to ''p1'"
    close fout
}
else
    write sys$output "0 records written out"
exit
.DE