cs326ad@ux1.cso.uiuc.edu (02/24/90)
Here's an interesting question that I pose to people who are familiar with
lex and/or flex. I am working on a compiler project, and am attempting to do
the work at home instead of on the net. The work is being done using
Zortech's C++, but I am having a small problem as follows:

In (f)lex definitions, we define whitespace as

	ws	[ \t\n]

That is, ws is any one of _blank_, tab, or newline. If lex runs across this
in the input file, then it takes an appropriate action (in my case, none).

Now, when porting this code to C++ Zortech style, the lex.yy.c program
compiles fine (with a few minor changes). However, the executable does NOT
recognize the \n the same as the executable on the net, i.e.:

If we input 1234 to the lex executable on the net, I get the output

	INT 1234

However, on the Zortech implementation, for the same input, I get

	INT 1

Doing a little debugging on the Zortech code, I came to realize that each
line ended with \r\n, carriage return plus newline. Fine. But for some
reason the Zortech implementation does not catch the newline at the end of
the line, like it does on the net. In addition, I compiled the lex code with
Microsoft 5.01, and the program worked fine. Same code, different compiler.
There just seems to be an odd quirk with the Zortech implementation of eoln
(at least in this case), since all the lex code does is scan for certain
tokens appearing in the input, and when it matches a token, it performs a
given action.

So, the bottom line question is why, with the same code from the same lex
file, I get two different executables from two different compilers. I hope
that this explanation is clear enough. It is kind of a tough problem to
explain. Any helpful insight would be appreciated.

===========================================================================
=                                                                         =
=   Mark Allender                        University of Illinois           =
=   cs326ad@ux1.cso.uiuc.edu            Urbana, Illinois                  =
=   ma20449@uxa.cso.uiuc.edu                                              =
=                                                                         =
===========================================================================
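[A likely workaround, assuming the Zortech runtime is passing the carriage
return through to the scanner rather than stripping it (which is what the
\r\n observation suggests): widen the whitespace class so that \r is matched
too. A minimal sketch of the amended lex definition:

	ws	[ \t\r\n]

	%%

	{ws}+	;	/* ignore blanks, tabs, CR, and LF alike */

With [ \t\n] alone, an unstripped \r matches no rule and falls through to
lex's default action, which may interact badly with the runtime's line
buffering.]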
bright@Data-IO.COM (Walter Bright) (02/25/90)
In article <6300008@ux1.cso.uiuc.edu> cs326ad@ux1.cso.uiuc.edu writes:
<Now, when porting this code to C++ zortech style, the lex.yy.c program compiles
<fine (with a few minor changes.) However, the executable does NOT recognize
<the \n the same as the executable on the net. i.e.:
<If we input: 1234 to the lex executable on the net, I get the output
<INT 1234
<However, on the zortech implementation, for the same input, I get
<INT 1
<Doing a little debugging on the zortech code, I came to realize that the
<eoln was ended with a \r\n for carriage return newline. Fine. But
<for some reason the zortech implementation does not catch the newline at
<the end of the line, like it does on the net.
I'm afraid I don't understand. Is the *compiler* not working right or the
*run-time library*? Please reduce the problem to a few lines of code
which exhibit the problem. I know very little about lex/flex.

csu025@umaxc.weeg.uiowa.edu (Kent Williams) (02/27/90)
>I'm afraid I don't understand. Is the *compiler* not working right or the
>*run-time library*? Please reduce the problem to a few lines of code
>which exhibit the problem. I know very little about lex/flex.
                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
                            !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

Yipes! If you don't mind my asking, do you USE any automated compiler
writing tools? Maybe I got into the game a little later than you did, but a
compiler author who knows little about lex is a little like an auto
mechanic who doesn't know about socket wrenches -- I've no doubt he can do
his job, but isn't he working a little bit too hard?
meissner@osf.org (Michael Meissner) (02/27/90)
In article <741@ns-mx.uiowa.edu> csu025@umaxc.weeg.uiowa.edu (Kent Williams) writes:

| >I'm afraid I don't understand. Is the *compiler* not working right or the
| >*run-time library*? Please reduce the problem to a few lines of code
| >which exhibit the problem. I know very little about lex/flex.
|                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|                             !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
|
| Yipes! If you don't mind my asking, do you USE any automated compiler
| writing tools? Maybe I got into the game a little later than you did, but
| a compiler author who knows little about lex is a little like an auto
| mechanic who doesn't know about socket wrenches -- I've no doubt he can
| do his job, but isn't he working a little bit too hard?

I have supported three compilers in the last 10 years (DG/L, Data General
C, and now GNU C), and none of them use any automatic tools for lexing.

First of all, UNIX lex is too general purpose, and too slow, to be used in
a production compiler (yes, I know flex fixes at least some of the speed
problems). You don't need backtracking, or even generalized regular
expressions, to lex most computer languages -- just simple one-character
lookahead.

Second of all, lexing is fairly simple. It took me a day or so to write
from scratch the lexer for Data General C. It would have taken more time
to come up to speed with lex (ignoring the problem that we didn't have lex
at the time) than to just write the code. IMHO, if you think writing a
lexer is too hard, you have no business writing compilers.

--
Michael Meissner	email: meissner@osf.org		phone: 617-621-8861
Open Software Foundation, 11 Cambridge Center, Cambridge, MA

Catproof is an oxymoron, Childproof is nearly so
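[To make the "simple one-character lookahead" claim concrete, here is a
minimal sketch of such a hand-written scanner in C. The token set and
buffer size are invented for illustration; this is not the Data General C
lexer.

	#include <ctype.h>
	#include <stdio.h>

	enum token { TOK_EOF, TOK_INT, TOK_IDENT, TOK_OTHER };

	static int lookahead = ' ';     /* the single character of lookahead */
	static char lexeme[256];        /* text of the token just scanned */

	enum token gettoken(FILE *fp)
	{
	    int i = 0;

	    /* skip whitespace (note \r, for the benefit of DOS text files) */
	    while (lookahead == ' '  || lookahead == '\t' ||
	           lookahead == '\n' || lookahead == '\r')
	        lookahead = getc(fp);

	    if (lookahead == EOF)
	        return TOK_EOF;

	    if (isdigit(lookahead)) {           /* integer: a run of digits */
	        do {
	            lexeme[i++] = (char)lookahead;
	            lookahead = getc(fp);
	        } while (isdigit(lookahead));
	        lexeme[i] = '\0';
	        return TOK_INT;
	    }

	    if (isalpha(lookahead) || lookahead == '_') {   /* identifier */
	        do {
	            lexeme[i++] = (char)lookahead;
	            lookahead = getc(fp);
	        } while (isalnum(lookahead) || lookahead == '_');
	        lexeme[i] = '\0';
	        return TOK_IDENT;
	    }

	    lexeme[0] = (char)lookahead;        /* punctuation, operators, etc. */
	    lexeme[1] = '\0';
	    lookahead = getc(fp);
	    return TOK_OTHER;
	}

Note how the whole design rests on that one pending character: every branch
reads one character past its token and leaves it in `lookahead' for the
next call.]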
maxsmith@athena.mit.edu (Samuel M Druker) (02/27/90)
Lex-generated scanners are slower than hand-crafted ones. Doubly so for
yacc and its output, and yacc itself can't produce a C++ parser -- something
to do with the language not conforming to LALR(1) and all that. If you've
only got a specialized compiler market, especially a cutthroat one like the
DOS world, lex and yacc aren't vital.

Samuel Druker
voss@sunb6.cs.uiuc.edu (03/01/90)
> maxsmith@athena.mit.edu writes:
> Lex-generated scanners are slower than hand-crafted ones. Doubly so for
> yacc and its output, and yacc itself can't produce a C++ parser, something
> to do with the language not conforming to LALR(1) and all that.

I saw that comment about C++ not being LALR(1) in the g++ distribution, and
at the time believed it. However, my biggest thrill at OOPSLA '89 was
sitting at the same table as Bjarne Stroustrup for a few hours drinking
free beer. He said that he IS USING A LALR(1) GRAMMAR for C++ 2.0. He
appeared sober & honest, therefore I assume that C++ is LALR(1), but
non-trivial.

=============================================================================

First, let me make it clear that I am by no means an expert at writing
compilers (<grin> yet). I am a graduate student in CS, currently taking my
first (much overdue) compiler course. We are using the text
_Compilers_Principles_Techniques_and_Tools_ by Aho, Sethi, and Ullman. The
programming portion of the class involves using lex/flex and
yacc/byacc/bison to implement a subset of Pascal. Now my questions:

> meissner@osf.org
> Second of all, lexing is fairly simple. It took me a day or so to
> write from scratch the lexer for Data General C. It would have taken
> more time to come up to speed with lex (ignoring the problem that we
> didn't have lex at the time) than to just write the code. IMHO, if you
> think writing a lexer is too hard, you have no business writing
> compilers.

No, I don't think writing a lexer is hard. I did it many times before I
even heard of lex (though I didn't know I was writing "lexers"). However,
I can now write and debug a lexical analyzer in half a day max using flex,
and anyone knowing lex/flex could easily maintain my code. Therefore I
doubt I will ever write another "lexer" by hand.

I think the surprise is that a "commercial compiler writer" would not be
familiar with flex/lex. If you don't mind my asking, when did you study
compilers, and what text did you use?

==============================================================================

> schmidt@zola.ics.uci.edu writes:
> Actually, GNU C uses a reserved word recognizer function generated
> automatically by my GPERF program. But I understand your general
> point.

then goes on to give performance figures showing GPERF is much faster than
flex/lex. I am not familiar with GPERF. What did you give up to get the
speed advantage? Is GPERF available via anonymous ftp? Where?

thanks,
Bill Voss
Email: billvoss@uiuc.edu or voss@cs.uiuc.edu
schmidt@zola.ics.uci.edu (Doug Schmidt) (03/01/90)
There are a number of confusions in your post. Let me see if I can shed
some light on this.

In article <24800002@sunb6>, voss@sunb6 writes:
>I saw that comment about C++ not being LALR(1) in the g++ distribution,
>and at the time believed it. However, my biggest thrill at OOPSLA '89
>was sitting at the same table as Bjarne Stroustrup for a few hours drinking
>free beer. He said that he IS USING A LALR(1) GRAMMAR for C++ 2.0.
>He appeared sober & honest, therefore I assume that C++ is LALR(1), but
>non-trivial.

Your statement confuses an LALR(1) language (which C++ is *not*) with a
particular set of tools used to recognize a language. What Bjarne probably
meant (judging from the cfront 2.0 sources) is that he was using the
LALR(1) parser generated by yacc to recognize C++. What you probably don't
know is that cfront 2.0 also utilizes an infinite-lookahead lexical
analyzer (hand-coded, BTW) that enables it to handle the non-LALR(1) parts
of the language, e.g. disambiguating between certain type-decls and
expressions.

Incidentally, Jim Roskind (ileaf!io!bronx!jar@EDDIE.MIT.EDU) has released a
beta version of his yacc-based grammar that accepts C++ 2.0 syntax
*without* using an infinite-lookahead lexer. As far as I know this grammar
has not yet been incorporated in any production-quality C++ compiler or
translator. Comments, Jim?

> No, I don't think writing a lexer is hard. I did it many times
>before I even heard of lex (though I didn't know I was writing "lexers").
>However, I can now write and debug a lexical analyzer in half a day max
>using flex, and anyone knowing lex/flex could easily maintain my code.

Have you ever tried writing a f?lex scanner for full ANSI C? There are
some surprising subtleties. Henry Spencer posted one to the net a while
back. Let me know if you'd like a copy to peruse. It's got `easy to
maintain' regular expressions like:

L?\'([^'\\\n]|\\(['"?\\abfnrtv]|[0-7]{1,3}|[xX][0-9a-fA-F]+))+\'
L?\"([^"\\\n]|\\(['"?\\abfnrtv\n]|[0-7]{1,3}|[xX][0-9a-fA-F]+))*\"

>Therefore I doubt I will ever write another "lexer" by hand.

Well, if you don't end up working with language processing tools then the
point is moot... ;-) On the other hand, what if you write portable programs
for systems that lack f?lex? What if you are trying to write a fast lexer
in order to gain market share for your product? f?lex are extremely useful
for certain purposes, but there are often pragmatic reasons why they aren't
used in every circumstance.

> I think the surprise is that a "commercial compiler writer"
>would not be familiar with flex/lex.

Why is that surprising? f?lex scanners are *not* generally used for
production (i.e., commercial) compilers. And certainly not in a highly
competitive market like MS-DOS!

> I am not familiar with GPERF. What did you give up to
>get the speed advantage?

The trade-off is generality versus specificity. GPERF is a great solution
for a particular (and limited) problem domain, i.e., implementing
recognizers for static search sets. It doesn't provide anywhere *near* the
flexibility of f?lex. De gustibus non disputandum est...

>Is GPERF available via anonymous ftp? Where?

You bet! (glad you asked... ;-))
The C++ and C version of GPERF is available via anonymous ftp from
ics.uci.edu (128.195.1.1) in the ~ftp/pub directory:

----------------------------------------
-rw-r--r--  1 ftp        120483 Feb 15 23:30 gperf-2.3.tar.Z
-rw-r--r--  1 ftp         97187 Nov 11 14:15 cperf-2.1.tar.Z
----------------------------------------

I'm also presenting a paper on GPERF at the upcoming USENIX C++ Conference
in San Francisco. A preliminary draft is also available via ftp from
ics.uci.edu:

----------------------------------------
-rw-r--r--  1 ftp         43820 Feb 18 17:10 gperf.tex
----------------------------------------

I'd be interested in any comments people might have (there's still time to
make revisions!)

    Doug
--
The official language of San Marcos is Swedish. All boys   | schmidt@ics.uci.edu
under the age of sixteen years old are now sixteen years   | office (714) 856-4043
old. Underwear must be changed every half hour. It will    +----------------------
be worn on the outside, so we can check. -- `Bananas' by Woody Allen
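[To give a flavor of what a "static search set" recognizer looks like,
here is a hedged sketch of a GPERF input file for a few reserved words,
processed with something like `gperf -t keywords.gperf'. The format shown
is from memory of later gperf releases and may differ in detail from 2.3;
the token values are invented.

	%{
	#include <string.h>	/* strcmp(), used by the generated code */
	%}
	struct keyword { const char *name; int token; };
	%%
	auto,     1
	break,    2
	case,     3
	char,     4
	const,    5
	continue, 6

From this, gperf emits a perfect-hash function and a lookup routine,
in_word_set(str, len), that returns the matching struct keyword entry, or
0 if str is not in the set -- one probe, no searching, which is where the
speed over a general-purpose scanner comes from.]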
jbuck@henry.berkeley.edu (Joe Buck) (03/01/90)
In article <25EC5CBF.26673@paris.ics.uci.edu> schmidt@zola.ics.uci.edu (Doug Schmidt) writes:
>> No, I don't think writing a lexer is hard. I did it many times
>>before I even heard of lex (though I didn't know I was writing "lexers").
>>However, I can now write and debug a lexical analyzer in half a day max
>>using flex, and anyone knowing lex/flex could easily maintain my code.
>Have you ever tried writing a f?lex scanner for full ANSI C? There
>are some surprising subtleties. Henry Spencer posted one to the net a
>while back. Let me know if you'd like a copy to peruse. It's got
>`easy to maintain' regular expressions like:
>L?\'([^'\\\n]|\\(['"?\\abfnrtv]|[0-7]{1,3}|[xX][0-9a-fA-F]+))+\'
>L?\"([^"\\\n]|\\(['"?\\abfnrtv\n]|[0-7]{1,3}|[xX][0-9a-fA-F]+))*\"

This is as ugly as it is because Henry failed to use definitions. Any
experienced user of flex would define a symbol to represent letters, a
symbol to represent numbers, a symbol to recognize backslash escapes, etc.,
and build the above expressions out of those symbols. Only someone trying
to slam the lex approach would write the above monstrosities.

>>Therefore I doubt I will ever write another "lexer" by hand.
>Well, if you don't end up working with language processing tools then
>the point is moot... ;-) On the other hand, what if you write portable
>programs for systems that lack f?lex?

Do you have C? Flex generates C code, and is written in C, and is freely
redistributable.

> What if you are trying to write
>a fast lexer in order to gain market share for your product?

You certainly wouldn't want to use lex then, but flex is far faster, and
other lexer generators are faster still. Check out Van Jacobson's paper
from the 1986 Winter Usenix, where he produced a lexer generator as fast
(almost) as cat, and got the inner loop down to one 68000 instruction per
character processed (anyone know if this version is ever going to be
public?). Beat that with a hand-written lexer.

Granted, Vern Paxson's flex program is not this good, but many people got
turned off to lexer generators because lex is so horrible. flex is far
superior to lex. The "f" stands for "fast", not "free". For that reason,
writing "f?lex" is unfair to flex, lumping it in with a very inefficient
program.
--
-- Joe Buck	jbuck@janus.berkeley.edu	{uunet,ucbvax}!janus.berkeley.edu!jbuck
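[For the curious, the decomposition Joe describes might look something
like this -- a sketch only; the definition names O, H, and ESC are
invented, and the result recognizes the same two literals as Henry's
originals:

	O	[0-7]
	H	[0-9a-fA-F]
	ESC	\\(['"?\\abfnrtv]|{O}{1,3}|[xX]{H}+)

	%%

	L?\'([^'\\\n]|{ESC})+\'		{ /* character constant */ }
	L?\"([^"\\\n]|{ESC}|\\\n)*\"	{ /* string literal; the extra \\\n
					     allows backslash-newline
					     continuation inside strings */ }

Same machine, same behavior, but each piece can now be read and checked in
isolation.]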
rfg@ics.uci.edu (Ronald Guilmette) (03/01/90)
In article <25EC5CBF.26673@paris.ics.uci.edu> schmidt@zola.ics.uci.edu (Doug Schmidt) writes:
>There are a number of confusions in your post. Let me see if I can
>shed some light on this.
>
>In article <24800002@sunb6>, voss@sunb6 writes:
>>I saw that comment about C++ not being LALR(1) in the g++ distribution,
>>and at the time believed it. However, my biggest thrill at OOPSLA '89
>>was sitting at the same table as Bjarne Stroustrup for a few hours drinking
>>free beer. He said that he IS USING A LALR(1) GRAMMAR for C++ 2.0.
>>He appeared sober & honest, therefore I assume that C++ is LALR(1), but
>>non-trivial.
>
>Your statement confuses an LALR(1) language (which C++ is *not*) with a
>particular set of tools used to recognize a language.

Doug, I'm afraid that I have to point out that *your* statement confuses an
LALR(1) language with the patently false assertions that some people have
made vis-a-vis the supposed non-LALR(1)'ness of C++.

As was pointed out to me a couple of years ago by a fellow who really
should know something about the subject (i.e. Tom Pennello), virtually any
language can be considered to be trivially LALR(1), in the sense that the
grammar can always be reduced to V*, which will accept all legal inputs in
the language. Rejection of illegal inputs with such a grammar would
necessarily be strictly a matter for semantic analysis.

Now it is quite obvious that the use of V* as a grammar to guide an LALR(1)
parser would not allow the resulting parser to detect many sorts of errors
which we would normally think of as being "syntactic" errors (like, say,
none), but the division between syntactic and semantic errors is an
artificial one, and such a grammar, when mashed into parsing tables and
combined with a general LALR(1) parser driver, would in fact successfully
(if trivially) parse C++, thus clearly indicating that C++ *is* LALR(1).

The fact of the matter is that C++ is only asserted to be non-LALR(1) by
persons who want to see more constructs in the language be disambiguated
via easy-to-implement LALR(1) syntax rules rather than via much harder to
implement semantic rules.

There is nothing wrong with such a goal. In fact I would say that this is a
very important goal. Anybody who has ever had to build a recognizer for C
or C++ or FORTRAN or Ada will certainly tell you that it is one hell of a
lot easier to disambiguate things via context-free syntax rules than it is
to do the same job via context-sensitive semantic rules. In particular,
typedefs have kept many C compiler writers awake at nights thinking about
the torturous things they do to feed information backwards from semantic
analysis so that it can be subsequently considered by their (more
convenient and more regular) LALR(1) syntax analyzers.

So the goal is good. What is *not* good is muddling the issue with
meaningless catch phrases like "C++ is not LALR(1)". In order to move the
discussion onto a higher plane, let's try to be accurate and just say that
it would be nice if an LALR(1) grammar for C++ could allow for much more
disambiguation via easy-to-implement LALR(1) syntax rules.

>What Bjarne probably meant (judging from the cfront 2.0
>sources) is that he was using the LALR(1) parser generated by yacc to
>recognize C++.

Free advice: Second-guessing Bjarne's intended meaning based on sketchy
second-hand accounts is probably inadvisable.

// Ron Guilmette (rfg@ics.uci.edu)
// C++ Entomologist
// Motto: If it sticks, force it. If it breaks, it needed replacing anyway.
bright@Data-IO.COM (Walter Bright) (03/02/90)
In article <741@ns-mx.uiowa.edu> csu025@umaxc.weeg.uiowa.edu.UUCP (Kent Williams) writes:
<<I know very little about lex/flex.
<If you don't mind my asking, do you USE any automated compiler writing
<tools?
In a word, no. lex is good for a lot of applications, but where maximum
performance is required, a hand-built scanner is necessary. It continually
surprises people that Zortech C compiles as fast as Turbo C, even though
ZTC is a 2 pass compiler and TC is 1 pass! The secret to speed is
in the scanner (not the symbol table!).
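[For the curious, one common way hand-built scanners get that kind of
speed -- a sketch of the general technique, not necessarily what ZTC does:
classify bytes through a 256-entry table and scan straight out of an
in-memory buffer, so the inner loop is a load, a table index, and a
compare per character, with no per-character function calls.

	enum { C_WHITE, C_DIGIT, C_ALPHA, C_OTHER };
	static unsigned char cclass[256];       /* one entry per byte value */

	static void init_cclass(void)
	{
	    int c;
	    for (c = 0; c < 256; c++)        cclass[c] = C_OTHER;
	    cclass[' '] = cclass['\t'] = cclass['\r'] = cclass['\n'] = C_WHITE;
	    for (c = '0'; c <= '9'; c++)     cclass[c] = C_DIGIT;
	    for (c = 'a'; c <= 'z'; c++)     cclass[c] = C_ALPHA;
	    for (c = 'A'; c <= 'Z'; c++)     cclass[c] = C_ALPHA;
	    cclass['_'] = C_ALPHA;
	}

	/* Scan an identifier starting at p; the whole inner loop is one
	   table lookup and one compare per character. */
	static char *scan_ident(char *p)
	{
	    while (cclass[(unsigned char)*p] == C_ALPHA ||
	           cclass[(unsigned char)*p] == C_DIGIT)
	        p++;
	    return p;       /* first character past the identifier */
	}
]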
Other less significant reasons are:
1. When I started, in 1983, no lexes were available for MSDOS.
2. I don't care for copyright problems. I don't want to hire a lawyer
to interpret the licensing agreement.
3. The compiler has been occasionally ported to other platforms for
debugging purposes. Being constrained by the availability of a
particular version of lex is unacceptable.
4. It turns out that it's *very* useful to have complete control over
the compilation process from source code to EXE. Depending on
some other vendor's program which generates a critical section of
the compiler causes problems if the source to it is not available.
(See the cfront people with their problems with identifier length
in MSC!)
So-called 'compiler compilers' are also not terribly useful to me. They
are more appropriately called 'parser generators'. The parsing code itself
is a minor part of a real compiler. The 'compiler compiler' still requires
you to write symbol table, semantic analysis, optimization, and code
generation, which is over 95% of the compiler.
For programs where scanning and parsing are most of the job, like in
a tool like indent, lex and yacc are highly useful. Those tools are
also very nice for prototyping a new language, because it's easy to
drastically change the syntax of what you're compiling.

shankar@hpclscu.HP.COM (Shankar Unni) (03/02/90)
> I saw that comment about C++ not being LALR(1) in the g++ distribution,
> and at the time believed it. However, my biggest thrill at OOPSLA '89
> was sitting at the same table as Bjarne Stroustrup for a few hours drinking
> free beer. He said that he IS USING A LALR(1) GRAMMAR for C++ 2.0.
> He appeared sober & honest, therefore I assume that C++ is LALR(1), but
> non-trivial.

No, it becomes LALR(1) only by a trick: a feedback mechanism with (a) a
lexer that consults a symbol table to see if an identifier is really a
typedef name, and (b) an ambiguity resolver called by the parser to handle
any remaining ambiguities. The whole thing is done by adding another layer
between the lexer and the parser, with these smarts built in.

Using the classical model of a table-driven scanner and parser, it is
impossible to write an LALR(1) description for C++ (or, for that matter,
C). It could be LR(2), though; I can't be sure.
----
Shankar Unni.
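[A minimal sketch of the feedback trick being described, assuming a
yacc-style parser with distinct IDENTIFIER and TYPENAME tokens; the names
and the fixed-size table here are invented for illustration:

	#include <string.h>

	#define IDENTIFIER 257   /* illustrative token codes; in a yacc  */
	#define TYPENAME   258   /* grammar these come from %token lines */

	extern char yytext[];       /* lexeme just scanned by the raw scanner */
	extern int  raw_lex(void);  /* the underlying table-driven scanner    */

	/* Trivial symbol table of typedef names, filled in by the
	   parser's semantic actions when it reduces a typedef
	   declaration (scoping is omitted here for brevity). */
	static const char *typedefs[1024];
	static int ntypedefs;

	void note_typedef(const char *name) { typedefs[ntypedefs++] = name; }

	static int is_typedef_name(const char *name)
	{
	    int i;
	    for (i = 0; i < ntypedefs; i++)
	        if (strcmp(typedefs[i], name) == 0)
	            return 1;
	    return 0;
	}

	/* The "extra layer" between scanner and parser: reclassify an
	   identifier as a type name when the symbol table says so. */
	int yylex(void)
	{
	    int tok = raw_lex();
	    if (tok == IDENTIFIER && is_typedef_name(yytext))
	        return TYPENAME;
	    return tok;
	}

The feedback loop is the crux: the parser's actions call note_typedef(),
which changes how the layer classifies later input.]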
bs@alice.UUCP (Bjarne Stroustrup) (03/02/90)
Ron Guilmette (our resident C++ Entomologist) writes:

> So the goal is good. What is *not* good is muddling the issue with meaningless
> catch phrases like "C++ is not LALR(1)". In order to move the discussion onto
> a higher plane, let's try to be accurate and just say that it would be nice
> if an LALR(1) grammar for C++ could allow for much more disambiguation via
> easy-to-implement LALR(1) syntax rules.
>
> >What Bjarne probably meant (judging from the cfront 2.0
> >sources) is that he was using the LALR(1) parser generated by yacc to
> >recognize C++.

Exactly. Plus an oversmart ``lexical'' analyser.

> Free advice: Second-guessing Bjarne's intended meaning based on sketchy
> second-hand accounts is probably inadvisable.

Very nice advice.

There can be no simple LALR(1) parser for C++. The reason is examples like:

	// let T be some type

	f()
	{
		T (*p)[7];		// declaration or expression?
					// declaration!

		T (*p)[7]->m = 7;	// declaration or expression?
					// expression!
	}

The rule is that anything that (syntactically) looks like a declaration is
a declaration (this ensures C compatibility). What isn't might be an
expression. This rule is simple, but requires a backtracking parser to cope
with ``cute'' examples like the ones above.

Had I had 20/20 foresight I would probably have chosen an LALR(1) grammar
for C++. However, I did not (and still don't) consider syntax issues to be
anywhere near as important as issues of type checking and access control,
so I underestimated people's desire/need for syntax-based solutions.
ned@pebbles.cad.mcc.com (Ned Nowotny) (03/05/90)
In article <10541@alice.UUCP> bs@alice.UUCP (Bjarne Stroustrup) writes:
>Had I had 20/20 foresight I would probably have chosen an LALR(1) grammar
>for C++. However, I did not (and still don't) consider syntax issues to be
>anywhere near as important as issues of type checking and access control,
>so I underestimated people's desire/need for syntax-based solutions.

Given the imprecise nature of the C++ specification, I am more than a
little disappointed that the language does not have a grammar expressible
in BNF form that will accept all legal strings and reject all illegal
strings. (That, by the way, is what I consider to be the test of whether a
given language is specifiable by a given grammar. While it may be true that
some LALR(1) grammar will accept all legal C++ strings, the fact that the
grammar must also accept illegal strings makes the claim that C++ is
somehow an LALR(1) language something less than useful.)

For example, when trying to resolve whether bugs in my code are the result
of an incorrect understanding of the language on my part or a flaw in a
given compiler, I would like to turn to an authoritative source to
determine whether a given string I have written is even a legal construct.
Unfortunately, the grammars given in the various reference manuals are not
helpful, precisely because they are not accompanied by a formal
specification of the "smarts" built into the lexical analyzer. In fact,
such specifications would not be as precise as an LALR(1) grammar in any
case. The current (and, quite probably, permanent) situation is of little
assistance to compiler writers and users of C++ alike.

(For example, consider const friend functions: Everything I have read,
including Lippman and the Release 2.0 Reference Manual, SEEMS to indicate
that only member functions may be declared const functions. However, Keith
Gorlen, for one, has declared friend functions to be const in his NIH Class
Library code. This appears to be a reasonable thing to do. The grammar
given in the 2.0 reference manual does not appear to disallow this, there
is no semantic specification provided which disallows it, the cfront 2.0
compiler which is the current reference implementation apparently accepts
it, and it appears on its face to be a reasonable thing to do. A precise
LALR(1) grammar would, at least, disambiguate the current legality of the
construct while greater minds considered its usefulness and hazards.)

Ned Nowotny, MCC CAD Program, Box 200195, Austin, TX  78720  Ph: (512) 338-3715
ARPA: ned@mcc.com   UUCP: ...!cs.utexas.edu!milano!cadillac!ned
-------------------------------------------------------------------------------
"We have ways to make you scream." - Intel advertisement in the June 1989 DDJ.
rfg@ics.uci.edu (Ronald Guilmette) (03/10/90)
In article <10541@alice.UUCP> bs@alice.UUCP (Bjarne Stroustrup) writes:
>... There can be no simple LALR(1) parser for C++.

Gosh darn it, Bjarne! Here I am trying to dispel this inexact notion, and
here you are perpetuating it!

Of course in any *practical* sense, you are correct. Any real compiler or
translator will probably try to use LALR(1) grammar rules to parse and
disambiguate as much of the language as possible, but don't you agree that
V* qualifies as an LALR(1) grammar and that this "grammar" accepts all
valid sentences in C++?

>Had I had 20/20 foresight I would probably have chosen an LALR(1) grammar
>for C++.

Don't you mean to say that you would have designed a language in which more
cases could be determined to be either legal or illegal via a "simple"
LALR(1) parse (without lexical tricks)? Isn't that more accurate?

// Ron Guilmette (rfg@ics.uci.edu)
// C++ Entomologist
// Motto: If it sticks, force it. If it breaks, it needed replacing anyway.
rfg@ics.uci.edu (Ronald Guilmette) (03/10/90)
In article <6553@cadillac.CAD.MCC.COM> ned@MCC.COM (Ned Nowotny) writes:
>
>Given the imprecise nature of the C++ specification, I am more than a
>little disappointed that the language does not have a grammar expressible
>in BNF form that will accept all legal strings and reject all illegal
>strings.

As Joe English also points out, this is utter nonsense. In all of the
useful languages that I know of, the correctness of statements is based
upon both syntax *and* semantics. Neither one can be used in isolation to
decide on the correctness of statements.

>... While it may be
>true that some LALR(1) grammar will accept all legal C++ strings, the fact
>that the grammar must also accept illegal strings makes the claim that
>C++ is somehow an LALR(1) language something less than useful.

I agree that in practice nobody actually is stupid enough to use V* as
their LALR(1) grammar for parsing C++ or other languages, but just because
the statement "C++ is LALR(1)" may not be "useful" does not mean that it is
not 100% accurate. In fact, I believe that it is.

// Ron Guilmette (rfg@ics.uci.edu)
// C++ Entomologist
// Motto: If it sticks, force it. If it breaks, it needed replacing anyway.
sakkinen@tukki.jyu.fi (Markku Sakkinen) (03/14/90)
In article <25F8C92A.9349@paris.ics.uci.edu> rfg@ics.uci.edu (Ronald Guilmette) writes:
>In article <6553@cadillac.CAD.MCC.COM> ned@MCC.COM (Ned Nowotny) writes:
>>
>>Given the imprecise nature of the C++ specification, I am more than a
>>little disappointed that the language does not have a grammar expressible
>>in BNF form that will accept all legal strings and reject all illegal
>>strings.
>
>As Joe English also points out, this is utter nonsense. [...]
>
>>... While it may be
>>true that some LALR(1) grammar will accept all legal C++ strings, the fact
>>that the grammar must also accept illegal strings makes the claim that
>>C++ is somehow an LALR(1) language something less than useful.
>
>I agree that in practice nobody actually is stupid enough to use V* as
>their LALR(1) grammar for parsing C++ or other languages, but just because
>the statement "C++ is LALR(1)" may not be "useful" does not mean that it is
>not 100% accurate. In fact, I believe that it is.

Sorry, Mr. Guilmette, I think it is _you_ who are repeating the same utter
nonsense too many times in this group. Please stand back for a while and
reflect on how vacuous any grammatical classification of languages becomes
if one takes your attitude. Alternatively, go post your opinion to the
comp.theory group (or comp.compilers) and get your knuckles rapped by real
experts.

Otherwise, you seem to be one of the most prolific and useful contributors
in this group.

Markku Sakkinen
Department of Computer Science
University of Jyvaskyla (a's with umlauts)
Seminaarinkatu 15
SF-40100 Jyvaskyla (umlauts again)
Finland
SAKKINEN@FINJYU.bitnet (alternative network address)
rfg@ics.uci.edu (Ronald Guilmette) (03/21/90)
In article <3742@tukki.jyu.fi> sakkinen@jytko.jyu.fi (Markku Sakkinen) writes:
>In article <25F8C92A.9349@paris.ics.uci.edu> rfg@ics.uci.edu (Ronald Guilmette) writes:
>>In article <6553@cadillac.CAD.MCC.COM> ned@MCC.COM (Ned Nowotny) writes:
>>>
>>>Given the imprecise nature of the C++ specification, I am more than a
>>>little disappointed that the language does not have a grammar expressible
>>>in BNF form that will accept all legal strings and reject all illegal
>>>strings.
>>
>>As Joe English also points out, this is utter nonsense. [...]
>>
>>>... While it may be
>>>true that some LALR(1) grammar will accept all legal C++ strings, the fact
>>>that the grammar must also accept illegal strings makes the claim that
>>>C++ is somehow an LALR(1) language something less than useful.
>>
>>I agree that in practice nobody actually is stupid enough to use V* as
>>their LALR(1) grammar for parsing C++ or other languages, but just because
>>the statement "C++ is LALR(1)" may not be "useful" does not mean that it is
>>not 100% accurate. In fact, I believe that it is.
>
>Sorry, Mr. Guilmette, I think it is _you_ who are repeating the same
>utter nonsense too many times in this group. Please stand back for
>a while and reflect on how vacuous any grammatical classification
>of languages becomes if one takes your attitude. Alternatively,
>go post your opinion to the comp.theory group (or comp.compilers)
>and get your knuckles rapped by real experts.

I believe that I failed to make my point clearly enough.

I used the (admittedly extreme) example of V* as a grammar for C++ just to
make it quite evident that a C++ grammar could be LALR(1). I only did this
for the sake of argument.

As should be demonstrated by Jim Roskind's grammar, you need not devolve
the grammar all the way down to V* to get it to be LALR(1). Quite the
contrary. If you are willing to do just a few more things with semantic
analysis rather than syntax analysis (relative to what current C++
implementors seem to be doing), then you can in fact have a useful and
practical LALR(1) grammar for C++.

That was my point. I just stated it badly.

// Ron Guilmette (rfg@ics.uci.edu)
// C++ Entomologist
// Motto: If it sticks, force it. If it breaks, it needed replacing anyway.
pcg@odin.cs.aber.ac.uk (Piercarlo Grandi) (03/23/90)
In article <26073795.10139@paris.ics.uci.edu> rfg@ics.uci.edu (Ronald Guilmette) writes:

   If you are willing to do just a few more things with semantic analysis
   rather than syntax analysis (relative to what current C++ implementors
   seem to be doing), then you can in fact have a useful and practical
   LALR(1) grammar for C++. That was my point. I just stated it badly.

Sorry Ron, but we really still disagree on a question of terminology here;
an LALR(1) grammar belongs to the context-free class. If you need semantic
information to parse C++ (and by all means you need it), then your
intelligent lexer is also a translator from a context-sensitive C++ to
another language that is indeed LALR(1), BUT IS NO LONGER C++.

In particular, this language has typedef names as tokens/keywords for the
scopes to which they apply; in a sense, when you define a class name in C++
you are creating a new scoped reserved word, the class name, and certain
phrases can only be parsed if you know that the given string is such a
token.

Your point is that you can 'have a useful and practical LALR(1) PARSER for
A C++ COMPILER'; such a *parser*, however, does not parse C++ itself, but a
C++ derivative, generated by the intelligent lexer. C++ itself does not
admit a context-free *grammar*, much less an LALR(1) one. I could say that
C++ is a *family* of LALR(1) languages, each of which differs from the
others by a different set of typedef names in different scopes, and an
intelligent lexer filters out the particular language belonging to this
family and passes it to the LALR(1) parser.

Typedef names do not turn Classic C itself into a context-sensitive
language, but only by the slimmest of margins. Incidentally, you can easily
see that the requirement for a leading 'struct' or 'enum' or 'union' before
a tag in a declaration was designed to help the parser be LL(1). Too bad
that 'typedef' was later introduced as a *storage class* and not as a new
type of tag, e.g. not something like 'mode <name> { <abstract declarator> }
....', which could be likened to a one-field struct.

In C the problem does not occur, because you can distinguish declarations
that include a typedef name from a statement in a context-free way: you
need a '=' in an initialization, and you cannot intermix declarations with
statements. The '=' for initialization is particularly crucial; it used to
be that Classic C did not require it (as in 'int i 3;'), but it was
introduced later on because, as Ritchie put it, certain things looked so
much like procedure declarations that the parser got confused (as in
'int j 1; int k (j)'). Stroustrup has reintroduced the old bugaboo. Those
that do not know history... :-)

Note that this aspect of C++, that implies dynamic augmentation of the set
of keywords for a given scope, is strongly reminiscent of Algol 68 and its
(in)famous ambiguity between mode and operator denotations, and I think
that Stroustrup was more familiar with the fine details of Algol 68 than of
C at some point in his life. In particular, while C++ surely does not admit
a context-free LALR(1) grammar, it does admit description with a two-level
grammar (which is context sensitive), Algol 68 style, and the second-level
grammar can well be LALR(1) (those that do know two-level grammars please
excuse the poetic license here :->).
--
Piercarlo "Peter" Grandi           | ARPA: pcg%cs.aber.ac.uk@nsfnet-relay.ac.uk
Dept of CS, UCW Aberystwyth        | UUCP: ...!mcvax!ukc!aber-cs!pcg
Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@cs.aber.ac.uk
rfg@ics.uci.edu (Ronald Guilmette) (03/24/90)
In article <PCG.90Mar23120003@odin.cs.aber.ac.uk> pcg@odin.cs.aber.ac.uk (Piercarlo Grandi) writes:
>
>In article <26073795.10139@paris.ics.uci.edu> rfg@ics.uci.edu (Ronald
>Guilmette) writes:
>
>   If you are willing to do just a few more things with semantic analysis
>   rather than syntax analysis (relative to what current C++ implementors
>   seem to be doing), then you can in fact have a useful and practical
>   LALR(1) grammar for C++. That was my point. I just stated it badly.
>
>Sorry Ron, but we really still disagree on a question of terminology
>here; an LALR(1) grammar belongs to the context-free class. If you need
>semantic information to parse C++ (and by all means you need it), then
>your intelligent lexer is also a translator from a context-sensitive
>C++ to another language that is indeed LALR(1), BUT IS NO LONGER C++.
>
>In particular, this language has typedef names as tokens/keywords for the

That (I claim) is a design decision on the part of the implementor. (And we
are now getting down to the meat of my point.)

I fully admit that C and C++ are easier to "compile" if you have separate
token types for "identifier" and for "type name", but is this absolutely
mandatory? I think not.

Without having such distinctions made at the lexical level (on the basis of
feedback from semantic analysis), you will of course end up with
excessively general non-terminals like (for instance):

	declarative_or_executable_statement:

whose corresponding rules will allow stuff like:

	F ( Y ) = Z;

even in cases where (semantically) the context demands that F be treated as
the name of a function, but (as I said before) that just means that you
gotta do more work in the semantic analysis to properly detect and flag
such errors.

(Footnote: Actually, that may be a bad example, because I think that even
if "F" is the name of a function, the above statement may be legal, e.g. in
cases where F returns a reference.)

// Ron Guilmette (rfg@ics.uci.edu)
// C++ Entomologist
// Motto: If it sticks, force it. If it breaks, it needed replacing anyway.
pcg@rupert.cs.aber.ac.uk (Piercarlo Grandi) (03/28/90)
In article <260A9B1D.17399@paris.ics.uci.edu> rfg@ics.uci.edu (Ronald Guilmette) writes:

   I fully admit that C and C++ are easier to "compile" if you have separate
   token types for "identifier" and for "type name", but is this absolutely
   mandatory? I think not. Without having such distinctions made at the
   lexical level (on the basis of feedback from semantic analysis), you
   will of course end up with excessively general non-terminals like (for
   instance):

	declarative_or_executable_statement:

But this is the meat of _my_ argument: this is no longer a grammar for C++,
it is a grammar for a family of languages of which C++ is a member, and you
use semantic (context-dependent) rules to translate/disambiguate it to a
C++ grammar.

This is a fine implementation technique; you parse a more general language,
and rely on non-grammar rules to narrow it down. Whether the *language*
admits a context-free, LALR(1) grammar (it does not) is one issue; whether
the *compiler* can make use of an LALR(1) parser generator plus semantic
analysis (it can) is an issue of how to engineer a parser for a non-context-
free language using a mixed bag of tools.

Essentially, using an LALR(1) parser generator aided by semantic analysis
is hand-crafting an attribute grammar parser.
--
Piercarlo "Peter" Grandi           | ARPA: pcg%cs.aber.ac.uk@nsfnet-relay.ac.uk
Dept of CS, UCW Aberystwyth        | UUCP: ...!mcvax!ukc!aber-cs!pcg
Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@cs.aber.ac.uk