[comp.compilers] Parsing Cobol with yacc and lex

kgs@dvncnms.cnms.dev.unisys.com (Kenneth G. Salter) (05/15/91)

This is my first post.  I'd like to comment on a response by Esmond Pitt
to a post by Carlos E. Galarce.  While building a COBOL compiler is a
large task, prospects are not as bleak as one might think.

In article <9105060154.AA04463@bohra.cpg.oz.au>, ejp@bohra.cpg.oz.au (Esmond Pitt) writes:
> Cobol is neither regular, context-free, nor LR(k) for any k. This makes
> use of lex and yacc highly problematic. Discussion follows.

Ignoring semantics and focusing just on parsing, I think that the
Identification, Environment, and Data divisions are regular, and the
Procedure division is LR(1).

> Lex: At least four scanning modes are required.
> In addition to the default (normal) mode you need:

>     (a) Comment-entry mode. Comment-entries in the Identification Division
>     (AUTHOR etc) have their own lexical rules. ...

>     (b) PICTURE mode. X(120) is either a single PICTURE-string token or
>     4 tokens representing an indexed identifier, depending on context.
>     PICTURE mode is triggered by a preceding PIC(TURE)?(IS)?).

Lex should categorize the words it scans with enough granularity that
yacc is not confused.  Candidate tokens are:  DN_only, PIC_only,
DN_or_PIC, INT_or_PIC, etc.  Lex should recognize these tokens
regardless of context.  Context is a parser function.  Yacc will 
need productions of the form:
	DataName   : DN_only  | DN_or_PIC ;
	PIC_string : PIC_only | DN_or_PIC | INT_or_PIC ;

>     (c) DECIMAL-POINT IS COMMA mode. This phrase changes the format of
>     numeric literals.

Rather than try to hard-code the definition of a number with a regular
expression, a word that looks like a number can trigger a lex function
that scans carefully, being sensitive to whether DECIMAL-POINT IS COMMA
has appeared.

> The whole lexical process is greatly complicated by the rules for
> continued identifiers, numeric literals and alpha literals.  Also, you
> have to lexically ignore sequence-number area and the area to the right of
> margin R (which, incidentally, is undefined except by universal
> agreement).

You could front-end lex with a filter that erases columns 1-6 and 73-80,
and that replaces comment lines by blank lines.  Or, you might write a
input() function for lex to do the above and to merge continued lines into
a single line so that that the parser need not deal with continuation.
This input() might also solve the column sensitivity problem by prefixing
non-blank text appearing in Areas A and C with distinguished characters.
All of these has to be done in a way that does not result in loss of the
actual source line numbers.

> The rules about Area A (indentation) are formally unnecessary except when
> looking for the end of a comment-entry. You still need to enforce these
> rules, however ...

See above.

> COPY REPLACING and REPLACE have their own significant peculiarities.

> Yacc: Cobol is not LR(k) for any fixed k because of:
>     (a) the WITH DATA phrase
>     (b) the NOT {AT END/INVALID KEY/ON OVERFLOW/SIZE ERROR} phrases,
> both of which are scope terminators requiring arbitrary lookahead, ...

These phrases are no more complex than IF ... THEN ... ELSE, and not
much more complicated than parenthesized expressions.

>     (c) the syntax of abbreviated combined relational conditions.
>     In one form of these the noise-word IS becomes syntactically
>     significant, contrary to one of the stated objectives of Cobol-85.

> Other problems:
> 
>     (a) Yes, the grammar is enormous. Cobol-85 has over 400 reserved words.

>     (b) The syntax for the I/O statements (READ, REWRITE, WRITE,
>     DELETE, ...) is dependent on the ACCESS MODE of the file named.
>     Depending on the access mode, either an INVALID KEY or an AT END
>     phrase is the legal syntactic continuation.  This gets important in
>     COBOL-85 with the arbitrary nesting allowed; otherwise your parser
>     will tie e.g.  an INVALID KEY phrase to the closest READ statement
>     instead of the one it really belongs to, and completely mess up the
>     syntactic scope.

Good point.  I'd have missed this.  For initial efforts, I'd ignore it,
reasoning that the COBOL programmer can avoid the problem by delimiting
each READ with an END-READ.  Long term, you'd need yacc actions to
straighten out the parse tree as soon as yacc reduces INVALID KEY, etc.

I haven't taken the time to address all the issues, just the easier ones.
Hope this helps.
--
Kenneth G. Salter, Unisys Corporation

-- 
Send compilers articles to compilers@iecc.cambridge.ma.us or
{ima | spdcc | world}!iecc!compilers.  Meta-mail to compilers-request.

ejp@bohra.cpg.oz.au (Esmond Pitt) (05/17/91)

In article <91-05-082@iecc.cambridge.ma.us> kgs@dvncnms.cnms.dev.unisys.com (Kenneth G. Salter) writes:
[following-up my posting on COBOL]

> Lex should categorize the words it scans with enough granularity that
> yacc is not confused. Candidate tokens are: DN_only, PIC_only,
> DN_or_PIC, INT_or_PIC, etc. Lex should recognize these tokens
> regardless of context. Context is a parser function. Yacc will 
> need productions of the form:
> 	DataName   : DN_only  | DN_or_PIC ;
> 	PIC_string : PIC_only | DN_or_PIC | INT_or_PIC ;

If I have understood correctly, this would mean that the DN_or_PIC rule
has to unpick the single token returned by yylex() into the dataname part,
the left brace, the index, and the right brace, and then parse the result,
all without benefit of lex, yacc, clergy, ... You can do it all right, but
the result is not exactly lex+yacc parsing.

It's more like keeping a dog and barking yourself. I prefer the scanner to
scan and the parser to parse. My own solution to this was to have the
parser switch Lex's start states whenever a picture-string is expected.

>> Yacc: Cobol is not LR(k) for any fixed k because of:
>>     (a) the WITH DATA phrase
>>     (b) the NOT {AT END/INVALID KEY/ON OVERFLOW/SIZE ERROR} phrases,
>
> These phrases are no more complex than IF ... THEN ... ELSE, and not
> much more complicated than parenthesized expressions.

They are not _complicated_ at all, but they all contain optional
noisewords, and they all require lookahead of > 1 token beyond the RHS of
a production. You get reduce/reduce conflicts.

The statement that COBOL is not LR(1) because of these elements was made
to me by a member of the ANSI X3-23.1985 committee. I don't have details
on-line; maybe I'll be able to followup with these later.

There are solutions to these problems; the task has been accomplished
several times. I've done it myself.

My point was that the task is quite a bit more complex than just sitting
down and hacking out a lex & yacc script as you might expect, and as the
comp.compilers monthly message used to say. Most of this is because the
basic structures of the language date from before 1960, i.e. before
compiler theory really got going, and the subsequent revisions to the
standard have not really addressed compiler-theoretic issues.

- -- 
Esmond Pitt, Computer Power Group
ejp@bohra.cpg.oz
-- 
Send compilers articles to compilers@iecc.cambridge.ma.us or
{ima | spdcc | world}!iecc!compilers.  Meta-mail to compilers-request.