kgs@dvncnms.cnms.dev.unisys.com (Kenneth G. Salter) (05/15/91)
This is my first post.  I'd like to comment on a response by Esmond Pitt
to a post by Carlos E. Galarce.  While building a COBOL compiler is a
large task, prospects are not as bleak as one might think.

In article <9105060154.AA04463@bohra.cpg.oz.au>, ejp@bohra.cpg.oz.au
(Esmond Pitt) writes:

> Cobol is neither regular, context-free, nor LR(k) for any k.  This makes
> use of lex and yacc highly problematic.  Discussion follows.

Ignoring semantics and focusing just on parsing, I think that the
Identification, Environment, and Data divisions are regular, and the
Procedure division is LR(1).

> Lex: At least four scanning modes are required.
> In addition to the default (normal) mode you need:
> (a) Comment-entry mode.  Comment-entries in the Identification Division
>     (AUTHOR etc) have their own lexical rules. ...
> (b) PICTURE mode.  X(120) is either a single PICTURE-string token or
>     4 tokens representing an indexed identifier, depending on context.
>     PICTURE mode is triggered by a preceding PIC(TURE)?(IS)?).

Lex should categorize the words it scans with enough granularity that
yacc is not confused.  Candidate tokens are: DN_only, PIC_only,
DN_or_PIC, INT_or_PIC, etc.  Lex should recognize these tokens
regardless of context.  Context is a parser function.  Yacc will need
productions of the form:

    DataName   : DN_only | DN_or_PIC ;
    PIC_string : PIC_only | DN_or_PIC | INT_or_PIC ;

> (c) DECIMAL-POINT IS COMMA mode.  This phrase changes the format of
>     numeric literals.

Rather than try to hard-code the definition of a number with a regular
expression, a word that looks like a number can trigger a lex function
that scans carefully, being sensitive to whether DECIMAL-POINT IS COMMA
has appeared.

> The whole lexical process is greatly complicated by the rules for
> continued identifiers, numeric literals and alpha literals.  Also, you
> have to lexically ignore the sequence-number area and the area to the
> right of margin R (which, incidentally, is undefined except by
> universal agreement).
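As a rough illustration of the "lex function that scans carefully" idea, here is a minimal C sketch.  The function name, signature, and the decimal_is_comma flag are all hypothetical, not from any real compiler; the point is only that one careful hand-written check replaces two hard-coded regular expressions, with the SPECIAL-NAMES clause deciding which character acts as the decimal point:

```c
#include <ctype.h>
#include <string.h>

/* Hypothetical sketch: once the scanner has seen a word that looks like
 * a number, hand it to a careful checker instead of baking two regular
 * expressions into lex.  decimal_is_comma reflects whether the
 * DECIMAL-POINT IS COMMA clause has appeared. */
int is_numeric_literal(const char *word, int decimal_is_comma)
{
    char point = decimal_is_comma ? ',' : '.';   /* the decimal point */
    int saw_digit = 0, saw_point = 0;
    const char *p = word;

    if (*p == '+' || *p == '-')                  /* optional sign */
        p++;
    for (; *p; p++) {
        if (isdigit((unsigned char)*p))
            saw_digit = 1;
        else if (*p == point && !saw_point)
            saw_point = 1;
        else
            return 0;            /* anything else: not a numeric literal */
    }
    /* a trailing decimal point is not allowed: it would end a sentence */
    return saw_digit && word[strlen(word) - 1] != point;
}
```

With the flag set, "1,5" is a literal and "1.5" is not; with it clear, the reverse holds.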
You could front-end lex with a filter that erases columns 1-6 and 73-80,
and that replaces comment lines with blank lines.  Or, you might write
an input() function for lex to do the above and to merge continued lines
into a single line, so that the parser need not deal with continuation.
This input() might also solve the column-sensitivity problem by
prefixing non-blank text appearing in Areas A and C with distinguished
characters.  All of this has to be done in a way that does not lose the
actual source line numbers.

> The rules about Area A (indentation) are formally unnecessary except
> when looking for the end of a comment-entry.  You still need to
> enforce these rules, however ...

See above.

> COPY REPLACING and REPLACE have their own significant peculiarities.

> Yacc: Cobol is not LR(k) for any fixed k because of:
> (a) the WITH DATA phrase
> (b) the NOT {AT END/INVALID KEY/ON OVERFLOW/SIZE ERROR} phrases,
> both of which are scope terminators requiring arbitrary lookahead, ...

These phrases are no more complex than IF ... THEN ... ELSE, and not
much more complicated than parenthesized expressions.

> (c) the syntax of abbreviated combined relational conditions.
> In one form of these the noise-word IS becomes syntactically
> significant, contrary to one of the stated objectives of Cobol-85.

> Other problems:
>
> (a) Yes, the grammar is enormous.  Cobol-85 has over 400 reserved words.
> (b) The syntax for the I/O statements (READ, REWRITE, WRITE,
> DELETE, ...) is dependent on the ACCESS MODE of the file named.
> Depending on the access mode, either an INVALID KEY or an AT END
> phrase is the legal syntactic continuation.  This gets important in
> COBOL-85 with the arbitrary nesting allowed; otherwise your parser
> will tie e.g. an INVALID KEY phrase to the closest READ statement
> instead of the one it really belongs to, and completely mess up the
> syntactic scope.

Good point.  I'd have missed this.
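The column-erasing filter is simple enough to sketch in C.  This is an illustrative, hypothetical routine (the function name and the exact comment-indicator handling are mine, not from the post): it blanks the sequence-number area and the area right of margin R, turns comment lines (indicator '*' or '/' in column 7) into all-blank lines, and preserves line numbers by emitting exactly one line per input line:

```c
#include <string.h>

/* Hypothetical front-end filter sketch: one 80-column card image in,
 * the same image out with columns 1-6 and 73-80 blanked, and comment
 * lines (indicator '*' or '/' in column 7) reduced to all blanks.
 * Source line numbers survive because nothing is added or deleted. */
void erase_fixed_areas(char *card)   /* up to 80 chars, NUL-terminated */
{
    size_t len = strlen(card);
    size_t i;

    if (len >= 7 && (card[6] == '*' || card[6] == '/')) {
        memset(card, ' ', len);      /* whole comment line -> blanks */
        return;
    }
    for (i = 0; i < len && i < 6; i++)   /* sequence-number area */
        card[i] = ' ';
    for (i = 72; i < len; i++)           /* right of margin R */
        card[i] = ' ';
}
```

Merging continuation lines and marking Area A text would layer on top of this, inside the same input() replacement.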
For initial efforts, I'd ignore it, reasoning that the COBOL programmer
can avoid the problem by delimiting each READ with an END-READ.  Long
term, you'd need yacc actions to straighten out the parse tree as soon
as yacc reduces INVALID KEY, etc.

I haven't taken the time to address all the issues, just the easier
ones.  Hope this helps.
--
Kenneth G. Salter, Unisys Corporation
--
Send compilers articles to compilers@iecc.cambridge.ma.us or
{ima | spdcc | world}!iecc!compilers.  Meta-mail to compilers-request.
ejp@bohra.cpg.oz.au (Esmond Pitt) (05/17/91)
In article <91-05-082@iecc.cambridge.ma.us> kgs@dvncnms.cnms.dev.unisys.com
(Kenneth G. Salter) writes:

[following up my posting on COBOL]

> Lex should categorize the words it scans with enough granularity that
> yacc is not confused.  Candidate tokens are: DN_only, PIC_only,
> DN_or_PIC, INT_or_PIC, etc.  Lex should recognize these tokens
> regardless of context.  Context is a parser function.  Yacc will
> need productions of the form:
>     DataName   : DN_only | DN_or_PIC ;
>     PIC_string : PIC_only | DN_or_PIC | INT_or_PIC ;

If I have understood correctly, this would mean that the DN_or_PIC rule
has to unpick the single token returned by yylex() into the dataname
part, the left brace, the index, and the right brace, and then parse the
result, all without benefit of lex, yacc, clergy, ...  You can do it all
right, but the result is not exactly lex+yacc parsing.  It's more like
keeping a dog and barking yourself.  I prefer the scanner to scan and
the parser to parse.  My own solution to this was to have the parser
switch lex's start states whenever a picture-string is expected.

>> Yacc: Cobol is not LR(k) for any fixed k because of:
>> (a) the WITH DATA phrase
>> (b) the NOT {AT END/INVALID KEY/ON OVERFLOW/SIZE ERROR} phrases,
>
> These phrases are no more complex than IF ... THEN ... ELSE, and not
> much more complicated than parenthesized expressions.

They are not _complicated_ at all, but they all contain optional
noisewords, and they all require lookahead of more than one token beyond
the RHS of a production.  You get reduce/reduce conflicts.  The
statement that COBOL is not LR(1) because of these elements was made to
me by a member of the ANSI X3-23.1985 committee.  I don't have details
on-line; maybe I'll be able to follow up with these later.

There are solutions to these problems; the task has been accomplished
several times.  I've done it myself.
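The start-state trick can be sketched in plain C.  Everything here is hypothetical scaffolding (a real implementation would use lex's %x states and BEGIN, with the parser's semantic actions doing the switching); the sketch just shows the essential feedback loop: a flag the parser flips after reducing PIC(TURE)? (IS)?, which changes how the scanner tokenizes the very same characters:

```c
#include <ctype.h>
#include <string.h>

/* Hypothetical sketch of parser-driven scanner modes.  picture_mode
 * plays the role of a lex start state: the parser sets it when a
 * picture-string is expected, and the scanner's reading of the same
 * characters changes.  In normal mode "X(120)" splits into the four
 * tokens of an indexed identifier; in PICTURE mode it is one token. */
int picture_mode = 0;        /* flipped by the parser, read by the scanner */

/* Count the tokens the scanner would emit for one word.  (A real
 * scanner would return the tokens themselves, one per yylex() call.) */
int scan_token_count(const char *text)
{
    if (picture_mode)
        return 1;                        /* one PICTURE-string token */

    int count = 0;
    const char *p = text;
    while (*p) {
        if (*p == '(' || *p == ')') {    /* parenthesis tokens */
            count++;
            p++;
        } else {                         /* a name or an integer */
            while (*p && *p != '(' && *p != ')')
                p++;
            count++;
        }
    }
    return count;
}
```

With the flag clear, "X(120)" scans as X, (, 120, ); with it set, as a single picture-string — which is exactly why the scanner alone cannot decide, and the parser has to tell it.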
My point was that the task is quite a bit more complex than just sitting
down and hacking out a lex & yacc script, as you might expect and as the
comp.compilers monthly message used to say.  Most of this is because the
basic structures of the language date from before 1960, i.e. before
compiler theory really got going, and the subsequent revisions to the
standard have not really addressed compiler-theoretic issues.
--
Esmond Pitt, Computer Power Group
ejp@bohra.cpg.oz
--
Send compilers articles to compilers@iecc.cambridge.ma.us or
{ima | spdcc | world}!iecc!compilers.  Meta-mail to compilers-request.