[comp.lang.c++] zortech problem with lex

cs326ad@ux1.cso.uiuc.edu (02/24/90)

Here's an interesting question that I pose to people who are familiar
with lex and/or flex.

I am working on a compiler project, and am attempting to do the work at
home instead of on the net.  The work is being done using Zortech's C++, but
I am having a small problem as follows:

In (f)lex definitions, we define whitespace as

ws	[ \t\n]

That is, ws is any one of _blank_, tab, or newline.  If lex runs across this
in the input file, then it takes an appropriate action (in my case, none.)
Now, when porting this code to C++ zortech style, the lex.yy.c program compiles
fine (with a few minor changes.)  However, the executable does NOT recognize
the \n the same as the executable on the net.  i.e.:

If we input:   1234     to the lex executable on the net, I get the output

INT	1234

However, on the zortech implementation, for the same input, I get

INT	1

Doing a little debugging on the zortech code, I came to realize that each
line ended with \r\n (carriage return, newline).  Fine.  But
for some reason the zortech implementation does not catch the newline at
the end of the line, like it does on the net.
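For what it's worth, the usual culprit on MS-DOS is that the scanner's input
is read in binary mode, so the '\r' of each "\r\n" pair reaches the matcher
and fails to match [ \t\n].  Two common workarounds are to add \r to the
whitespace class (ws	[ \t\r\n]) or to filter the input before the scanner
sees it.  A minimal sketch of the latter (crlf_input is a hypothetical name,
not part of any lex runtime):

```c
#include <stdio.h>

/* Sketch (not Zortech's actual library): an input wrapper that folds
   DOS "\r\n" line endings into a single '\n' so a lex-generated
   scanner whose whitespace class is [ \t\n] never sees the '\r'. */
int crlf_input(FILE *fp)
{
    int c = getc(fp);
    if (c == '\r') {                /* swallow the CR of a CRLF pair */
        int next = getc(fp);
        if (next == '\n')
            return '\n';
        if (next != EOF)
            ungetc(next, fp);
        return '\r';                /* a lone CR passes through unchanged */
    }
    return c;
}
```

Calling this in place of getc() in the scanner's input routine should make
the DOS and UNIX executables see identical input streams.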

In addition, I compiled the lex code with Microsoft 5.01, and the program
worked fine.  Same code, different compiler.  There just seems to be an odd
quirk with the zortech implementation of eoln (at least in this case) since
all the lex code does is scan for certain tokens appearing in the input, and
when it matches a token, it performs a given action.

So, the bottom-line question is why, from the same code generated from the
same lex file, I get two executables that behave differently under two
different compilers.

I hope that this explanation is clear enough.  It is kind of a tough problem
to explain.  Any helpful insight would be appreciated.



===========================================================================
=                                                                         =
=    Mark Allender                               University of Illinois   =
=    cs326ad@ux1.cso.uiuc.edu                    Urbana, Illinois         =
=    ma20449@uxa.cso.uiuc.edu                                             =
=                                                                         =
===========================================================================

bright@Data-IO.COM (Walter Bright) (02/25/90)

In article <6300008@ux1.cso.uiuc.edu> cs326ad@ux1.cso.uiuc.edu writes:
<Now, when porting this code to C++ zortech style, the lex.yy.c program compiles
<fine (with a few minor changes.)  However, the executable does NOT recognize
<the \n the same as the executable on the net.  i.e.:
<If we input:   1234     to the lex executable on the net, I get the output
<INT	1234
<However, on the zortech implementation, for the same input, I get
<INT	1
<Doing a little debugging on the zortech code, I came to realize that each
<line ended with \r\n (carriage return, newline).  Fine.  But
<for some reason the zortech implementation does not catch the newline at
<the end of the line, like it does on the net.

I'm afraid I don't understand. Is the *compiler* not working right or the
*run-time library*? Please reduce the problem to a few lines of code
which exhibit the problem. I know very little about lex/flex.

csu025@umaxc.weeg.uiowa.edu (Kent Williams) (02/27/90)

>I'm afraid I don't understand. Is the *compiler* not working right or the
>*run-time library*? Please reduce the problem to a few lines of code
>which exhibit the problem. I know very little about lex/flex.
                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
                            !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

Yipes!
If you don't mind my asking, do you USE any automated compiler writing
tools? Maybe I got into the game a little later than you did, but a
compiler author who knows little about lex is a little like an auto mechanic
who doesn't know about socket wrenches -- I've no doubt he can do his
job, but isn't he working a little bit too hard?

meissner@osf.org (Michael Meissner) (02/27/90)

In article <741@ns-mx.uiowa.edu> csu025@umaxc.weeg.uiowa.edu (Kent
Williams) writes:

| >I'm afraid I don't understand. Is the *compiler* not working right or the
| >*run-time library*? Please reduce the problem to a few lines of code
| >which exhibit the problem. I know very little about lex/flex.
|                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|                             !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
| 
| Yipes!
| If you don't mind my asking, do you USE any automated compiler writing
| tools? Maybe I got into the game a little later than you did, but a
| compiler author who knows little about lex is a little like an auto mechanic
| who doesn't know about socket wrenches -- I've no doubt he can do his
| job, but isn't he working a little bit too hard?

I have supported three compilers in the last 10 years (DG/L, Data
General C, and now GNU C), and none of them use any automatic tools
for lexing.

First of all, UNIX lex is too general-purpose, and too slow, to be used
in a production compiler (yes, I know flex fixes at least some of the
speed problems).  For most computer languages you don't need
backtracking, or even generalized regular expressions, to lex;
simple one-character lookahead suffices.

Second of all, lex-ing is fairly simple.  It took me a day or so to
write the lexer for Data General C from scratch.  It would have taken
more time to come up to speed with lex (ignoring the problem that we
didn't have lex at the time) than to just write the code.  IMHO if you
think writing a lexer is too hard, you have no business writing
compilers.
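A minimal sketch of that style of hand-written, one-character-lookahead
lexer (a toy for illustration, not Data General C's actual code; the token
names are hypothetical):

```c
#include <ctype.h>

/* Toy hand-written lexer: classify tokens using only one character of
   lookahead, the style described above. */
enum token { TOK_EOF, TOK_INT, TOK_IDENT, TOK_PUNCT };

/* Scan one token starting at *pp, advance *pp past it, return its kind. */
enum token next_token(const char **pp)
{
    const char *p = *pp;
    while (*p == ' ' || *p == '\t' || *p == '\n')
        p++;                                  /* skip whitespace */
    if (*p == '\0') {
        *pp = p;
        return TOK_EOF;
    }
    if (isdigit((unsigned char)*p)) {         /* integer literal */
        while (isdigit((unsigned char)*p))
            p++;
        *pp = p;
        return TOK_INT;
    }
    if (isalpha((unsigned char)*p) || *p == '_') {   /* identifier */
        while (isalnum((unsigned char)*p) || *p == '_')
            p++;
        *pp = p;
        return TOK_IDENT;
    }
    *pp = p + 1;                              /* single-char punctuation */
    return TOK_PUNCT;
}
```

Each branch commits after examining exactly one character, which is the
point: no backtracking and no generated tables are needed for this class
of language.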
--
Michael Meissner	email: meissner@osf.org		phone: 617-621-8861
Open Software Foundation, 11 Cambridge Center, Cambridge, MA

Catproof is an oxymoron, Childproof is nearly so

maxsmith@athena.mit.edu (Samuel M Druker) (02/27/90)

Lex-generated scanners are slower than hand-crafted ones.  Doubly so for
yacc and its output, and yacc itself can't produce a C++ parser, something
to do with the language not conforming to LALR(1) and all that.

If you've only got a specialized compiler market, especially a cutthroat one
like the DOS world, lex and yacc aren't vital.

Samuel Druker

voss@sunb6.cs.uiuc.edu (03/01/90)

> maxsmith@athena.mit.edu writes:
> Lex-generated scanners are slower than hand-crafted ones.  Doubly so for
> yacc and its output, and yacc itself can't produce a C++ parser, something
> to do with the language not conforming to LALR(1) and all that.

I saw that comment about C++ not being LALR(1) in the g++ distribution,
and at the time believed it.  However, my biggest thrill at OOPSLA '89
was sitting at the same table as Bjarne Stroustrup for a few hours drinking
free beer.  He said that he IS USING A LALR(1) GRAMMAR for C++ 2.0.
He appeared sober & honest, therefore I assume that C++ is LALR(1), but 
non-trivial.
=============================================================================
	First, let me make it clear that I am by no means an expert
at writing compilers (<grin> yet).  I am a graduate student in CS, 
currently taking my first (much overdue) compiler course.  We are using 
the text _Compilers_Principles_Techniques_and_Tools_ by Aho, Sethi, and Ullman.
The programming portion of the class involves using lex/flex and 
yacc/byacc/bison to implement a subset of Pascal.  Now my questions:

> meissner@osf.org
> Second of all, lex-ing is fairly simple.  It took me a day or so, to
> write from scratch the lexer for Data General C.  It would have taken
> more time to come up to speed with lex (ignoring the problem that we
> didn't have lex at the time) than to just write the code.  IMHO if you
> think writing a lexer is too hard, you have no business writing
> compilers.

	No, I don't think writing a lexer is hard.  I did it many times
before I even heard of lex (though I didn't know I was writing "lexers.")
However, I can now write and debug a lexical analyzer in half a day max
using flex, and anyone knowing lex/flex could easily maintain my code.
Therefore I doubt I will ever write another "lexer" by hand.

	I think the surprise is that a "commercial compiler writer"
would not be familiar with flex/lex.  If you don't mind my asking,
when did you study compilers, and what text did you use?

==============================================================================
> schmidt@zola.ics.uci.edu writes:
> Actually, GNU C uses a reserved word recognizer function generated
> automatically by my GPERF program.  But I understand your general
> point.
then goes on to give performance figures showing GPERF is much faster
than flex/lex.

	I am not familiar with GPERF.  What did you give up to
get the speed advantage?  Is GPERF available via anonymous ftp? Where?

						thanks,
						Bill Voss
Email: billvoss@uiuc.edu or voss@cs.uiuc.edu

schmidt@zola.ics.uci.edu (Doug Schmidt) (03/01/90)

There are a number of confusions in your post.  Let me see if I can
shed some light on this.

In article <24800002@sunb6>, voss@sunb6 writes:
>I saw that comment about C++ not being LALR(1) in the g++ distribution,
>and at the time believed it.  However, my biggest thrill at OOPSLA '89
>was sitting at the same table as Bjarne Stroustrup for a few hours drinking
>free beer.  He said that he IS USING A LALR(1) GRAMMAR for C++ 2.0.
>He appeared sober & honest, therefore I assume that C++ is LALR(1), but
>non-trivial.

Your statement conflates an LALR(1) language (which C++ is *not*) with
a particular set of tools used to recognize a language.  What Bjarne
probably meant (judging from the cfront 2.0 sources) is that he was
using the LALR(1) parser generated by yacc to recognize C++.

What you probably don't know is that cfront 2.0 also utilizes an
infinite-lookahead lexical analyzer (hand-coded, BTW) that enables it
to handle the non-LALR(1) parts of the language, e.g.  disambiguating
between certain type-decls and expressions.

Incidentally, Jim Roskind (ileaf!io!bronx!jar@EDDIE.MIT.EDU) has
released a beta version of his yacc-based grammar that accepts C++ 2.0
syntax *without* using an infinite lookahead lexer.  As far as I know
this grammar has not yet been incorporated in any production-quality
C++ compiler or translator.  Comments Jim?

>	No, I don't think writing a lexer is hard.  I did it many times
>before I even heard of lex (though I didn't know I was writing "lexers.")
>However, I can now write and debug a lexical analyzer in half a day max
>using flex, and anyone knowing lex/flex could easily maintain my code.

Have you ever tried writing a f?lex scanner for full ANSI C?  There
are some surprising subtleties.  Henry Spencer posted one to the net a
while back.  Let me know if you'd like a copy to peruse.  It's got
`easy to maintain' regular expressions like:

L?\'([^'\\\n]|\\(['"?\\abfnrtv]|[0-7]{1,3}|[xX][0-9a-fA-F]+))+\'
L?\"([^"\\\n]|\\(['"?\\abfnrtv\n]|[0-7]{1,3}|[xX][0-9a-fA-F]+))*\"

>Therefore I doubt I will ever write another "lexer" by hand.

Well, if you don't end up working with language processing tools then
the point is moot... ;-) On the other hand, what if you write portable
programs for systems that lack f?lex?  What if you are trying to write
a fast lexer in order to gain market share for your product?  f?lex
are extremely useful for certain purposes, but there are often
pragmatic reasons why they aren't used in every circumstance.

>	I think the surprise is that a "commercial compiler writer"
>would not be familiar with flex/lex.

Why is that surprising?  f?lex scanners are *not* generally used for
production (i.e., commercial) compilers.  And certainly not in a
highly competitive market like MS-DOS!

>	I am not familiar with GPERF.  What did you give up to
>get the speed advantage?

The trade-off is generality versus specificity.  GPERF is a great
solution for a particular (and limited) problem domain, i.e.,
implementing recognizers for static search sets.  It doesn't provide
anywhere *near* the flexibility of f?lex.  De gustibus non disputandum
est...
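To give the flavor of a static-search-set recognizer, here is a hand-made
toy in the spirit of what GPERF generates (this is not GPERF's actual
output, and the keyword set is made up): key each keyword by its length
plus a small value assigned to its first character, chosen so every
keyword lands in a distinct slot, then confirm with a single strcmp.

```c
#include <string.h>

/* Separates "int" and "for", the only two keywords of equal length. */
static int char_val(char c)
{
    return c == 'f' ? 1 : 0;
}

/* Perfect-hash-style recognizer for a fixed set of five keywords:
   key = strlen(s) + char_val(s[0]) maps each keyword to a unique slot. */
int is_keyword(const char *s)
{
    static const char *slot[] = {
        0, 0, "if", "int", "for", "while", "return"
    };
    size_t len = strlen(s);
    if (len < 2 || len > 6)
        return 0;                     /* no keyword has this length */
    size_t key = len + char_val(s[0]);
    return key <= 6 && slot[key] != 0 && strcmp(s, slot[key]) == 0;
}
```

The speed comes from doing no search at all: one arithmetic key, one table
lookup, one string compare.  The cost is exactly the inflexibility Doug
describes; add a keyword and the key function must be recomputed.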

>Is GPERF available via anonymous ftp? Where?

You bet! (glad you asked... ;-)).  The C++ and C version of GPERF is
available via anonymous ftp from ics.uci.edu (128.195.1.1) in the
~ftp/pub directory:

----------------------------------------
-rw-r--r--  1 ftp        120483 Feb 15 23:30 gperf-2.3.tar.Z
-rw-r--r--  1 ftp         97187 Nov 11 14:15 cperf-2.1.tar.Z
----------------------------------------

I'm also presenting a paper on GPERF at the upcoming USENIX C++
Conference in San Francisco.  A preliminary draft is also available
via ftp from ics.uci.edu:

----------------------------------------
-rw-r--r--  1 ftp         43820 Feb 18 17:10 gperf.tex
----------------------------------------

I'd be interested in any comments people might have (there's still
time to make revisions!)

Doug
--
The official language of San Marcos is Swedish.  All boys | schmidt@ics.uci.edu
under the age of sixteen years old are now sixteen years  | office (714) 856-4043
old. Underwear must be changed every half hour.  It will  +----------------------
be worn on the outside, so we can check.   -- `Bananas' by Woody Allen

jbuck@henry.berkeley.edu (Joe Buck) (03/01/90)

In article <25EC5CBF.26673@paris.ics.uci.edu> schmidt@zola.ics.uci.edu (Doug Schmidt) writes:
>>	No, I don't think writing a lexer is hard.  I did it many times
>>before I even heard of lex (though I didn't know I was writing "lexers.")
>>However, I can now write and debug a lexical analyzer in half a day max
>>using flex, and anyone knowing lex/flex could easily maintain my code.

>Have you ever tried writing a f?lex scanner for full ANSI C?  There
>are some surprising subtleties.  Henry Spencer posted one to the net a
>while back.  Let me know if you'd like a copy to peruse.  It's got
>`easy to maintain' regular expressions like:

>L?\'([^'\\\n]|\\(['"?\\abfnrtv]|[0-7]{1,3}|[xX][0-9a-fA-F]+))+\'
>L?\"([^"\\\n]|\\(['"?\\abfnrtv\n]|[0-7]{1,3}|[xX][0-9a-fA-F]+))*\"

This is as ugly as it is because Henry failed to use definitions.
Any experienced user of flex would define a symbol to represent
letters, a symbol to represent numbers, a symbol to recognize
backslash escapes, etc, and build the above expressions out of the
symbols.  Only someone trying to slam the lex approach would write
the above monstrosities.
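For illustration, the character-constant pattern might be factored with
flex definitions roughly like this (a sketch, not Henry's actual scanner;
CHAR_CONST is a hypothetical token name):

```lex
oct	[0-7]
hex	[0-9a-fA-F]
esc	\\(['"?\\abfnrtv]|{oct}{1,3}|[xX]{hex}+)

%%
L?\'([^'\\\n]|{esc})+\'		{ return CHAR_CONST; }
```

Each {name} expands in place, so the rule reads close to the grammar in
the ANSI draft while generating the same automaton.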

>>Therefore I doubt I will ever write another "lexer" by hand.
>Well, if you don't end up working with language processing tools then
>the point is moot... ;-) On the other hand, what if you write portable
>programs for systems that lack f?lex?

Do you have C?  Flex generates C code, and is written in C, and is
freely redistributable.
>  What if you are trying to write
>a fast lexer in order to gain market share for your product?

You certainly wouldn't want to use lex then, but flex is far faster,
and other lexer generators are faster still.  Check out Van Jacobson's
paper from the 1986 Winter USENIX, where he produced a lexer generator
(almost) as fast as cat, and got the inner loop down to one 68000 instruction
per character processed (anyone know if this version is ever going
to be made public?).  Beat that with a hand-written lexer.
Granted, Vern Paxson's flex program is not this good, but many people
got turned off to lexer generators because lex is so horrible.  flex
is far superior to lex.  The "f" stands for "fast", not "free".  For that
reason, writing "f?lex" is unfair to flex, lumping it in with a very
inefficient program.
--
--
Joe Buck
jbuck@janus.berkeley.edu	 {uunet,ucbvax}!janus.berkeley.edu!jbuck	

rfg@ics.uci.edu (Ronald Guilmette) (03/01/90)

In article <25EC5CBF.26673@paris.ics.uci.edu> schmidt@zola.ics.uci.edu (Doug Schmidt) writes:
>There are a number of confusions in your post.  Let me see if I can
>shed some light on this.
>
>In article <24800002@sunb6>, voss@sunb6 writes:
>>I saw that comment about C++ not being LALR(1) in the g++ distribution,
>>and at the time believed it.  However, my biggest thrill at OOPSLA '89
>>was sitting at the same table as Bjarne Stroustrup for a few hours drinking
>>free beer.  He said that he IS USING A LALR(1) GRAMMAR for C++ 2.0.
>>He appeared sober & honest, therefore I assume that C++ is LALR(1), but
>>non-trivial.
>
>Your statement confuses the difference between an LALR(1) language
>(which C++ is *not*) and a particular set of tools used to recognize a
>language.

Doug,

I'm afraid that I have to point out that *your* statement confuses the
notion of an LALR(1) language with the patently false assertions that some
people have made vis-a-vis the supposed non-LALR(1)-ness of C++.

As was pointed out to me a couple of years ago by a fellow who really should
know something about the subject (i.e., Tom Pennello), virtually any language
can be considered trivially LALR(1), in the sense that the grammar can
always be reduced to V*, which will accept all legal inputs in the language.
Rejection of illegal inputs with such a grammar would necessarily be strictly
a matter for semantic analysis.

Now it is quite obvious that the use of V* as a grammar to guide an LALR(1)
parser would not allow the resulting parser to detect many sorts of errors
which we would normally think of as "syntactic" errors (in fact, none of
them), but the division between syntactic and semantic errors is an
artificial one, and such a grammar, when mashed into parsing tables and
combined with a general LALR(1) parser driver, would in fact successfully
(if trivially) parse C++, thus clearly indicating that C++ *is* LALR(1).

The fact of the matter is that C++ is only asserted to be non-LALR(1) by
persons who want to see more constructs in the language disambiguated
via easy-to-implement LALR(1) syntax rules rather than via much
harder-to-implement semantic rules.

There is nothing wrong with such a goal.  In fact I would say that this is
a very important goal.  Anybody who has ever had to build a recognizer for
C or C++ or FORTRAN or Ada will certainly tell you that it is one hell
of a lot easier to disambiguate things via context free syntax rules than
it is to do the same job via context sensitive semantic rules.  In particular,
typedefs have kept many C compiler writers awake at nights thinking about
the torturous things they do to feed information backwards from semantic
analysis so that it can subsequently be considered by their (more convenient
and more regular) LALR(1) syntax analyzers.

So the goal is good.  What is *not* good is muddling the issue with meaningless
catch phrases like "C++ is not LALR(1)".  In order to move the discussion onto
a higher plane, let's try to be accurate and just say that it would be nice
if an LALR(1) grammar for C++ could allow for much more disambiguation via
easy-to-implement LALR(1) syntax rules.

>What Bjarne probably meant (judging from the cfront 2.0
>sources) is that he was using the LALR(1) parser generated by yacc to
>recognize C++.

Free advice:  Second guessing Bjarne's intended meaning based on sketchy
second-hand accounts is probably inadvisable.


// Ron Guilmette (rfg@ics.uci.edu)
// C++ Entomologist
// Motto:  If it sticks, force it.  If it breaks, it needed replacing anyway.

bright@Data-IO.COM (Walter Bright) (03/02/90)

In article <741@ns-mx.uiowa.edu> csu025@umaxc.weeg.uiowa.edu.UUCP (Kent Williams) writes:
<<I know very little about lex/flex.
<If you don't mind my asking, do you USE any automated compiler writing
<tools?

In a word, no. lex is good for a lot of applications, but where maximum
performance is required, a hand-built scanner is necessary. It continually
surprises people that Zortech C compiles as fast as Turbo C, even though
ZTC is a 2 pass compiler and TC is 1 pass! The secret to speed is
in the scanner (not the symbol table!).

Other less significant reasons are:
1. When I started, in 1983, no lexes were available for MSDOS.
2. I don't care for copyright problems. I don't want to hire a lawyer
   to interpret the licensing agreement.
3. The compiler has been occasionally ported to other platforms for
   debugging purposes. Being constrained by the availability of a
   particular version of lex is unacceptable.
4. It turns out that it's *very* useful to have complete control over
   the compilation process from source code to EXE. Depending on
   some other vendor's program which generates a critical section of
   the compiler causes problems if the source to it is not available.
   (See the cfront people with their problems with identifier length
   in MSC!)

So-called 'compiler compilers' are also not terribly useful to me. They
are more appropriately called 'parser generators'. The parsing code itself
is a minor part of a real compiler. The 'compiler compiler' still requires
you to write symbol table, semantic analysis, optimization, and code
generation, which is over 95% of the compiler.

For programs where scanning and parsing are most of the job, like in
a tool like indent, lex and yacc are highly useful. Those tools are
also very nice for prototyping a new language, because it's easy to
drastically change the syntax of what you're compiling.

shankar@hpclscu.HP.COM (Shankar Unni) (03/02/90)

> I saw that comment about C++ not being LALR(1) in the g++ distribution,
> and at the time believed it.  However, my biggest thrill at OOPSLA '89
> was sitting at the same table as Bjarne Stroustrup for a few hours drinking
> free beer.  He said that he IS USING A LALR(1) GRAMMAR for C++ 2.0.
> He appeared sober & honest, therefore I assume that C++ is LALR(1), but 
> non-trivial.

No, it becomes LALR(1) only by tricks:

   (a) a feedback mechanism by which the lexer consults the symbol table
       to see whether an identifier is really a typedef name, and
   (b) an ambiguity resolver, called by the parser, to handle any
       remaining ambiguities.

The whole thing is done by adding another layer between the lexer and the
parser, with these smarts built in.

Using the classical model of a table-driven scanner and parser, it is
impossible to write an LALR(1) description for C++ (or, for that matter, C).
It could be LR(2), though; I can't be sure.
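The feedback trick in (a) can be sketched in a few lines of C (a toy: the
token codes and the fixed table here are hypothetical, and a real compiler
would consult a scoped symbol table maintained during parsing):

```c
#include <string.h>

/* Toy sketch of the lexer-feedback trick: before handing an identifier
   to the parser, the scanner asks the symbol table whether that name is
   currently a typedef, and returns a different token code if so. */
enum { IDENTIFIER, TYPE_NAME };

static const char *typedef_names[] = { "T", "size_t", 0 };  /* toy table */

static int is_typedef_name(const char *name)
{
    for (int i = 0; typedef_names[i] != 0; i++)
        if (strcmp(name, typedef_names[i]) == 0)
            return 1;
    return 0;
}

/* Called by the scanner for every identifier it recognizes. */
int classify(const char *ident)
{
    return is_typedef_name(ident) ? TYPE_NAME : IDENTIFIER;
}
```

With TYPE_NAME and IDENTIFIER as distinct terminals, the yacc grammar can
stay LALR(1) even though the underlying language is not.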
----
Shankar Unni.

bs@alice.UUCP (Bjarne Stroustrup) (03/02/90)

Ron Guilmette (our resident C++ Entomologist) writes:

 > So the goal is good.  What is *not* good is muddling the issue with meaningless
 > catch phrases like "C++ is not LALR(1)".  In order to move the discussion onto
 > a higher plane, let's try to be accurate and just say that it would be nice
 > if an LALR(1) grammar for C++ could allow for much more disambiguation via
 > easy-to-implement LALR(1) syntax rules.

 > >What Bjarne probably meant (judging from the cfront 2.0
 > >sources) is that he was using the LALR(1) parser generated by yacc to
 > >recognize C++.

Exactly. Plus an oversmart ``lexical'' analyser.

 > Free advice:  Second guessing Bjarne's intended meaning based on sketchy
 > second-hand accounts is probably inadvisable.

Very nice advice. There can be no simple LALR(1) parser for C++.
The reason is examples like:

	// let T be some type

	f()
	{
		T (*p)[7];	// declaration or expression?
				// declaration!
		
		T (*p)[7]->m = 7;	// declaration or expression?
					// expression!

	}

The rule is that anything that (syntactically) looks like a declaration
is a declaration (this ensures C compatibility). What isn't might be an
expression.

This rule is simple, but requires a backtracking parser to cope with
``cute'' examples like the ones above.

Had I had 20/20 foresight I would probably have chosen an LALR(1) grammar
for C++. However, I did not (and still don't) consider syntax issues to be
anywhere near as important as issues of type checking and access control
so I underestimated people's desire/need for syntax based solutions.

ned@pebbles.cad.mcc.com (Ned Nowotny) (03/05/90)

In article <10541@alice.UUCP> bs@alice.UUCP (Bjarne Stroustrup) writes:
>Had I had 20/20 foresight I would probably have chosen an LALR(1) grammar
>for C++. However, I did not (and still don't) consider syntax issues to be
>anywhere near as important as issues of type checking and access control
>so I underestimated people's desire/need for syntax based solutions.

Given the imprecise nature of the C++ specification, I am more than a
little disappointed that the language does not have a grammar expressible
in BNF form that will accept all legal strings and reject all illegal
strings.  (That, by the way, is what I consider to be the test of whether
a given language is specifiable by a given grammar.  While it may be
true that some LALR(1) grammar will accept all legal C++ strings, the fact
that the grammar must also accept illegal strings makes the claim that
C++ is somehow a LALR(1) language something less than useful.)  For example,
when trying to resolve whether bugs in my code are the result of an incorrect
understanding of the language on my part or a flaw in a given compiler, I would
like to turn to an authoritative source to determine whether a given string
I have written is even a legal construct.  Unfortunately, the grammars given
in the various reference manuals are not helpful, precisely because they are
not accompanied by a formal specification of the "smarts" built into the
lexical analyzer.  In fact, the specifications would not be as precise as
a LALR(1) grammar in any case.  The current (and, quite probably, permanent)
situation is of little assistance to compiler writers and users of C++ alike.

(For example, consider const friend functions:  Everything I have read,
including Lippman and the Release 2.0 Reference manual, SEEM to indicate
that only member functions may be declared const functions.  However,
Keith Gorlen, for one, has declared friend functions to be const in his
NIH Class Library code.  This appears to be a reasonable thing to do.  The
grammar given in the 2.0 reference manual does not appear to disallow this,
there is no semantic specification provided which disallows it, the Cfront
2.0 compiler which is the current reference implementation apparently
accepts it, and it appears on its face to be a reasonable thing to do.  A
precise LALR(1) grammar would, at least, disambiguate the current legality
of the construct while greater minds considered its usefulness and hazards.)

Ned Nowotny, MCC CAD Program, Box 200195, Austin, TX  78720  Ph: (512) 338-3715
ARPA: ned@mcc.com                   UUCP: ...!cs.utexas.edu!milano!cadillac!ned
-------------------------------------------------------------------------------
"We have ways to make you scream." - Intel advertisement in the June 1989 DDJ.

rfg@ics.uci.edu (Ronald Guilmette) (03/10/90)

In article <10541@alice.UUCP> bs@alice.UUCP (Bjarne Stroustrup) writes:
>... There can be no simple LALR(1) parser for C++.

Gosh darn it Bjarne!  Here I am trying to dispel this inexact notion,
and here you are perpetuating it!

Of course in any *practical* sense, you are correct.  Any real compiler
or translator will probably try to use LALR(1) grammar rules to parse
and disambiguate as much of that language as possible, but don't you
agree that V* qualifies as an LALR(1) grammar and that this "grammar"
accepts all valid sentences in C++?

>Had I had 20/20 foresight I would probably have chosen an LALR(1) grammar
>for C++.

Don't you mean to say that you would have designed a language in which
more cases could be determined to be either legal or illegal via a "simple"
LALR(1) parse (without lexical tricks)?  Isn't that more accurate?


// Ron Guilmette (rfg@ics.uci.edu)
// C++ Entomologist
// Motto:  If it sticks, force it.  If it breaks, it needed replacing anyway.

rfg@ics.uci.edu (Ronald Guilmette) (03/10/90)

In article <6553@cadillac.CAD.MCC.COM> ned@MCC.COM (Ned Nowotny) writes:
>
>Given the imprecise nature of the C++ specification, I am more than a
>little disappointed that the language does not have a grammar expressible
>in BNF form that will accept all legal strings and reject all illegal
>strings.

As Joe English also points out, this is utter nonsense.  In all of the
useful languages that I know of, the correctness of statements is based
upon both syntax *and* semantics.  Neither one can be used in isolation
to decide on the correctness of statements.

>... While it may be
>true that some LALR(1) grammar will accept all legal C++ strings, the fact
>that the grammar must also accept illegal strings makes the claim that
>C++ is somehow a LALR(1) language something less than useful.

I agree that in practice, nobody actually is stupid enough to use V* as their
LALR(1) grammar for parsing C++ or other languages, but just because the
statement "C++ is LALR(1)" may not be "useful" does not mean that it is
not 100% accurate.  In fact, I believe that it is.


// Ron Guilmette (rfg@ics.uci.edu)
// C++ Entomologist
// Motto:  If it sticks, force it.  If it breaks, it needed replacing anyway.

sakkinen@tukki.jyu.fi (Markku Sakkinen) (03/14/90)

In article <25F8C92A.9349@paris.ics.uci.edu> rfg@ics.uci.edu (Ronald Guilmette) writes:
>In article <6553@cadillac.CAD.MCC.COM> ned@MCC.COM (Ned Nowotny) writes:
>>
>>Given the imprecise nature of the C++ specification, I am more than a
>>little disappointed that the language does not have a grammar expressible
>>in BNF form that will accept all legal strings and reject all illegal
>>strings.
>
>As Joe English also points out, this is utter nonsense. [...]
>>... While it may be
>>true that some LALR(1) grammar will accept all legal C++ strings, the fact
>>that the grammar must also accept illegal strings makes the claim that
>>C++ is somehow a LALR(1) language something less than useful.
>
>I agree that in practice, nobody actually is stupid enough to use V* as their
>LALR(1) grammar for parsing C++ or other languages, but just because the
>statement "C++ is LALR(1)" may not be "useful" does not mean that it is
>not 100% accurate.  In fact, I believe that it is.

Sorry, Mr. Guilmette, I think it is _you_ who are repeating the same
utter nonsense too many times in this group. Please stand back for
a while and reflect on how vacuous any grammatical classification
of languages becomes if one takes your attitude. Alternatively,
go post your opinion to the comp.theory group (or comp.compilers)
and get your knuckles rapped by real experts.

Otherwise, you seem to be one of the most prolific and useful
contributors in this group.

Markku Sakkinen
Department of Computer Science
University of Jyvaskyla (a's with umlauts)
Seminaarinkatu 15
SF-40100 Jyvaskyla (umlauts again)
Finland
          SAKKINEN@FINJYU.bitnet (alternative network address)

rfg@ics.uci.edu (Ronald Guilmette) (03/21/90)

In article <3742@tukki.jyu.fi> sakkinen@jytko.jyu.fi (Markku Sakkinen) writes:
>In article <25F8C92A.9349@paris.ics.uci.edu> rfg@ics.uci.edu (Ronald Guilmette) writes:
>>In article <6553@cadillac.CAD.MCC.COM> ned@MCC.COM (Ned Nowotny) writes:
>>>
>>>Given the imprecise nature of the C++ specification, I am more than a
>>>little disappointed that the language does not have a grammar expressible
>>>in BNF form that will accept all legal strings and reject all illegal
>>>strings.
>>
>>As Joe English also points out, this is utter nonsense. [...]
>>>... While it may be
>>>true that some LALR(1) grammar will accept all legal C++ strings, the fact
>>>that the grammar must also accept illegal strings makes the claim that
>>>C++ is somehow a LALR(1) language something less than useful.
>>
>>I agree that in practice, nobody actually is stupid enough to use V* as their
>>LALR(1) grammar for parsing C++ or other languages, but just because the
>>statement "C++ is LALR(1)" may not be "useful" does not mean that it is
>>not 100% accurate.  In fact, I believe that it is.
>
>Sorry, Mr. Guilmette, I think it is _you_ who are repeating the same
>utter nonsense too many times in this group. Please stand back for
>a while and reflect on how vacuous any grammatical classification
>of languages becomes if one takes your attitude. Alternatively,
>go post your opinion to the comp.theory group (or comp.compilers)
>and get your knuckles rapped by real experts.

I believe that I failed to make my point clearly enough.  I used the
(admittedly extreme) example of V* as a grammar for C++ just to make
it quite evident that a C++ grammar could be LALR(1).

I only did this for the sake of argument.  As Jim Roskind's grammar
should demonstrate, you need not devolve the grammar all the
way down to V* to make it LALR(1).  Quite the contrary.
If you are willing to do just a few more things with semantic analysis
rather than syntax analysis (relative to what current C++ implementors
seem to be doing) then you can in fact have a useful and practical LALR(1)
grammar for C++.  That was my point.  I just stated it badly.


// Ron Guilmette (rfg@ics.uci.edu)
// C++ Entomologist
// Motto:  If it sticks, force it.  If it breaks, it needed replacing anyway.

pcg@odin.cs.aber.ac.uk (Piercarlo Grandi) (03/23/90)

In article <26073795.10139@paris.ics.uci.edu> rfg@ics.uci.edu (Ronald
Guilmette) writes:

   If you are willing to do just a few more things with semantic analysis
   rather than syntax analysis (relative to what current C++ implementors
   seem to be doing) then you can in fact have a useful and practical LALR(1)
   grammar for C++.  That was my point.  I just stated it badly.

Sorry Ron, but we really still disagree on a question of terminology
here; a LALR(1) grammar belongs to the context-free class. If you need
semantic information to parse C++ (and you certainly do), then
your intelligent lexer is also a translator from a context-sensitive
C++ to another language that is indeed LALR(1), BUT IS NO LONGER C++.

In particular, this language has typedef names as tokens/keywords for
the scopes to which they apply: in a sense, when you define a class name
in C++ you are creating a new scoped reserved word, and certain phrases
can only be parsed if you know that a given string is such a token.

Your point is that you can 'have a useful and practical LALR(1) PARSER
for A C++ COMPILER'; such a *parser* however does not parse C++ itself,
but a C++ derivative, generated by the intelligent lexer. C++ itself
does not admit a context free *grammar*, much less a LALR(1) one.

I could say that C++ is a *family* of LALR(1) languages, each of which
differs from the others by a different set of typedef names in different
scopes, and an intelligent lexer filters out the particular language
belonging to this family and passes it to the LALR(1) parser.

Typedef names fail to turn Classic C itself into a context-sensitive
language by only the slimmest of margins. Incidentally, you can easily see
that the requirement for a leading 'struct' or 'enum' or 'union' before
a tag in a declaration was designed to help the parser be LL(1). Too bad
that 'typedef' was later introduced as a *storage class* and not as a new
type of tag, e.g. not something like 'mode <name> { <abstract
declarator> } ....', which could be likened to a one-field struct.

In C the problem does not occur: you can distinguish a declaration
that includes a typedef name from a statement in a context-free way,
because an initialization requires an '=' and you cannot intermix
declarations with statements.

The '=' for initialization is particularly crucial; it used to be that
Classic C did not require it (as in 'int i 3;'), but it was introduced
later on because, as Ritchie put it, certain things looked so much like
procedure declarations that the parser got confused (as in 'int j 1; int
k (j)').

Stroustrup has reintroduced the old bugaboo. Those that do not know
history... :-). Note that this aspect of C++, that implies dynamic
augmentation of the set of keywords for a given scope, is strongly
reminiscent of Algol 68, and its (in)famous ambiguity between mode and
operator denotations, and I think that Stroustrup was more familiar with
the fine details of Algol 68 than of C at some point in his life.

In particular, while C++ surely does not admit a context-free LALR(1)
grammar, it does admit description with a two-level grammar (which is
context sensitive), Algol 68 style, and the second-level grammar can
well be LALR(1) (those who do know two-level grammars, please excuse the
poetic license here :->).
--
Piercarlo "Peter" Grandi           | ARPA: pcg%cs.aber.ac.uk@nsfnet-relay.ac.uk
Dept of CS, UCW Aberystwyth        | UUCP: ...!mcvax!ukc!aber-cs!pcg
Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@cs.aber.ac.uk

rfg@ics.uci.edu (Ronald Guilmette) (03/24/90)

In article <PCG.90Mar23120003@odin.cs.aber.ac.uk> pcg@odin.cs.aber.ac.uk (Piercarlo Grandi) writes:
>
>In article <26073795.10139@paris.ics.uci.edu> rfg@ics.uci.edu (Ronald
>Guilmette) writes:
>
>   If you are willing to do just a few more things with semantic analysis
>   rather than syntax analysis (relative to what current C++ implementors
>   seem to be doing) then you can in fact have a useful and practical LALR(1)
>   grammar for C++.  That was my point.  I just stated it badly.
>
>Sorry Ron, but we really still disagree on a question of terminology
>here; a LALR(1) grammar belongs to the context free class. If you need
>semantic information to parse C++ (and by all means you need it) then
>your intelligent lexer is also a translator from a context sensitive
>C++ to another language that is indeed LALR(1), BUT IS NO LONGER C++.
>
>In particular this language has typedef names as tokens/keywords for the

That (I claim) is a design decision on the part of the implementor.  (And
we are now getting down to the meat of my point.)

I fully admit that C and C++ are easier to "compile" if you have separate
token types for "identifier" and for "type name", but is this absolutely
mandatory?  I think not.  Without having such distinctions made at the
lexical level (or on the basis of feedback from semantic analysis), you will
of course end up with excessively general non-terminals like (for instance):

	declarative_or_executable_statement:

whose corresponding rules will allow stuff like:

	F ( Y ) = Z;

even in cases where (semantically) the context demands that F be treated
as the name of a function, but (as I said before) that just means that
you gotta do more work in the semantic analysis to properly detect and
flag such errors.

(Footnote: Actually, that may be a bad example, because I think that even
if "F" is the name of a function, the above statement may be legal, e.g.
in cases where F returns a reference.)


// Ron Guilmette (rfg@ics.uci.edu)
// C++ Entomologist
// Motto:  If it sticks, force it.  If it breaks, it needed replacing anyway.

pcg@rupert.cs.aber.ac.uk (Piercarlo Grandi) (03/28/90)

In article <260A9B1D.17399@paris.ics.uci.edu> rfg@ics.uci.edu (Ronald
Guilmette) writes:

   I fully admit that C and C++ are easier to "compile" if you have separate
   token types for "identifier" and for "type name", but is this absolutely
   mandatory?  I think not.  Without having such distinctions made at the
   lexical level (or on the basis of feedback from semantic analysis), you will
   of course end up with excessively general non-terminals like (for instance):

	   declarative_or_executable_statement:

But this is the meat of _my_ argument: this is no longer a grammar for
C++, it is a grammar for a family of languages of which C++ is a member,
and you use semantic (context-dependent) rules to translate/disambiguate
it down to C++ itself. This is a fine implementation technique; you parse
a more general language, and rely on non-grammar rules to narrow it
down.

Whether the *language* admits a context free, LALR(1) grammar (it does
not) is one issue, whether the *compiler* can make use of a LALR(1)
parser generator plus semantic analysis (it can) is an issue of how to
engineer a parser for a non context free language using a mixed bag of
tools. Essentially, using a LALR(1) parser generator aided by semantic
analysis is hand crafting an attribute grammar parser.
--
Piercarlo "Peter" Grandi           | ARPA: pcg%cs.aber.ac.uk@nsfnet-relay.ac.uk
Dept of CS, UCW Aberystwyth        | UUCP: ...!mcvax!ukc!aber-cs!pcg
Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@cs.aber.ac.uk