acha@CS.CMU.EDU (Anurag Acharya) (01/12/91)
In article <1991Jan11.100048.3121@odin.diku.dk> sestoft@diku.dk (Peter Sestoft) writes:
> The main difference between Miranda (TM) and Lazy ML is the syntax.
> In Miranda the program lay-out has semantics: the off-side rule
> determines where an expression ends, whereas (Lazy) ML has more
> Pascal-like conventions, using semicolons and similar punctuation. (The
> off-side rule was suggested by Peter Landin in the sixties).
What is the justification for this "off-side" rule ? The idea of whitespace
having semantics is a potential source of inscrutable bugs and, frankly
speaking, seems to go against the grain of modern programming language
design. The concrete syntax of such a language would no longer be context-free,
let alone LR(1)/LL(1). In fact, I am hard pressed to conceptualize an
efficient tokenizing algorithm for such languages.
Since this rule has also been incorporated into the design of Haskell, it isn't
just historical and I am sure the designers of Haskell must have had some
reason(s) for this decision. I would like to know what benefit(s) this
rule provides a language designer to trade off against the parsing
inconvenience and inelegance.
anurag
rh@smds.UUCP (Richard Harter) (01/13/91)
In article <ACHA.91Jan11151418@DRAVIDO.CS.CMU.EDU>, acha@CS.CMU.EDU (Anurag Acharya) writes:
> In article <1991Jan11.100048.3121@odin.diku.dk> sestoft@diku.dk (Peter Sestoft) writes:
> The main difference between Miranda (TM) and Lazy ML is the syntax.
> In Miranda the program lay-out has semantics: the off-side rule
> determines where an expression ends, whereas (Lazy) ML has more
> Pascal-like conventions, using semicolons and similar punctuation. (The
> off-side rule was suggested by Peter Landin in the sixties).
> What is the justification for this "off-side" rule ? The idea of whitespace
> having semantics is a potential source of inscrutable bugs and, frankly
> speaking, seems to go against the grain of modern programming language
> design.

Could someone post a summary explanation of the off-side rule, please?

I am working with a language called Lakota which, like ICON, uses
indentation for delimiting blocks, so the idea of layout being
meaningful is familiar. However this is a syntactic use of white space
rather than a semantic use, at least as far as I can see, and is not
particularly novel. I would also be interested in hearing what the
purported advantages of the off-side rule are.
--
Richard Harter, Software Maintenance and Development Systems, Inc.
Net address: jjmhome!smds!rh    Phone: 508-369-7398
US Mail: SMDS Inc., PO Box 555, Concord MA 01742
This sentence no verb. This sentence short. This signature done.
kinnersley@kuhub.cc.ukans.edu (Bill Kinnersley) (01/14/91)
In article <ACHA.91Jan11151418@DRAVIDO.CS.CMU.EDU>, acha@CS.CMU.EDU (Anurag Acharya) writes:
:
: What is the justification for this "off-side" rule ? The idea of whitespace
: having semantics is a potential source of inscrutable bugs and, frankly
: speaking, seems to go against the grain of modern programming language
: design. The concrete syntax of such a language would no longer be context-free,
: let alone LR(1)/LL(1). In fact, I am hard pressed to conceptualize an
: efficient tokenizing algorithm for such languages.
:
: Since this rule has also been incorporated into the design of Haskell, it isn't
: just historical and I am sure the designers of Haskell must have had some
: reason(s) for this decision. I would like to know what benefit(s) this
: rule provides a language designer to trade off against the parsing
: inconvenience and inelegance ?
:
I don't think it's really as bad an idea as you make it sound.
At least two other languages use it, ABC and occam.
Certainly indentation makes a program more readable, and once you've
indented correctly the brackets and braces are unnecessary clutter.
What's easier to spot, a missing parenthesis or a missing indentation?
It's a lexical problem not a parsing problem, and it's simple. At every
newline you have LEX count the leading whitespace, and emit an indent or
an outdent every time it changes.
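In outline, that lexer-level scheme looks like this (a minimal sketch in Python rather than LEX; the token names and the list representation are invented for the example):

```python
def layout_tokens(source):
    """Emit INDENT/OUTDENT tokens from leading whitespace, as sketched above.

    Tabs are assumed to be expanded to spaces already; token names are
    illustrative, not from any particular compiler.
    """
    tokens = []
    levels = [0]                      # stack of active indentation columns
    for line in source.splitlines():
        if not line.strip():          # blank lines carry no layout information
            continue
        col = len(line) - len(line.lstrip(' '))
        if col > levels[-1]:          # deeper than before: open a block
            levels.append(col)
            tokens.append('INDENT')
        while col < levels[-1]:       # shallower: close blocks until we match
            levels.pop()
            tokens.append('OUTDENT')
        tokens.append(('LINE', line.strip()))
    while len(levels) > 1:            # close any blocks still open at EOF
        levels.pop()
        tokens.append('OUTDENT')
    return tokens
```

The parser downstream then sees balanced INDENT/OUTDENT pairs and never has to look at whitespace itself.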
--Bill Kinnersley
djohnson@beowulf.ucsd.edu (Darin Johnson) (01/14/91)
>In article <ACHA.91Jan11151418@DRAVIDO.CS.CMU.EDU>, acha@CS.CMU.EDU (Anurag Acharya) writes:
>: What is the justification for this "off-side" rule ? The idea of whitespace
>: having semantics is a potential source of inscrutable bugs and, frankly
>: speaking, seems to go against the grain of modern programming language
>: design. The concrete syntax of such a language would no longer be
>: context-free, let alone LR(1)/LL(1). In fact, I am hard pressed to
>: conceptualize an efficient tokenizing algorithm for such languages.

When you look at how strict and unforgiving Occam is towards spacing,
the off-side rule is rather benign. However, there is a major "gotcha"
in both languages - tabs count as one character. And unfortunately,
back when I used vi, tabs would be inserted automatically and occam
would give some meaningless error message that took hours to figure
out.

As far as justifying this, it adds to readability. One of the "goals"
of functional programming languages is to have programs look like
formal mathematical functions. Block constructs detract from the
readability somewhat, especially if you must add begin/end because {},
(), etc. are already used. Probably a big reason is that it looks nice
esthetically, without lots of filler tokens and symbols.

Parsing isn't that bad at all. In fact, the lexical analyzer can
handle it all, and the parser need know nothing about spacing. The
method I used in a project was to insert a "begin" token whenever the
beginning of a block was found (like just after =), and also push onto
a stack the current position in the line. Then whenever reading a new
line, if the first symbol was less than the position saved on the top
of the stack, "end"s were inserted into the token stream. It was very
simple to add, and the parser always saw begin/end markers and was
kept happy.
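A minimal sketch of that begin/end trick (in Python, not the original project's code; tokenization is simplified to whitespace-split words, and blocks are assumed to open only after "="):

```python
def insert_blocks(lines):
    """Sketch of the token-stream trick described above: emit "begin" where a
    block opens (here, right after "="), remember the opening line's column,
    and emit "end" whenever a new line starts at or left of the innermost
    open block."""
    out, stack = [], []
    for line in lines:
        col = len(line) - len(line.lstrip(' '))
        while stack and col <= stack[-1]:   # new line is off-side: close blocks
            stack.pop()
            out.append('end')
        for word in line.split():           # crude tokenization for the sketch
            out.append(word)
            if word == '=':                 # block body starts after "="
                out.append('begin')
                stack.append(col)
    while stack:                            # close blocks still open at EOF
        stack.pop()
        out.append('end')
    return out
```

As in the post, the parser only ever sees explicit begin/end markers.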
[Of course, you have to go to reading line by line, but this was done
anyway so error messages had line numbers.]
--
Darin Johnson
djohnson@ucsd.edu
farrell@batserver.cs.uq.oz.au (Friendless) (01/14/91)
In <ACHA.91Jan11151418@DRAVIDO.CS.CMU.EDU> acha@CS.CMU.EDU (Anurag Acharya) writes:
>What is the justification for this "off-side" rule ? The idea of whitespace
>having semantics is a potential source of inscrutable bugs and, frankly
>speaking, seems to go against the grain of modern programming language
>design.

It is unusual, but it works. Since I am obsessed with correct
indentation, it rarely causes me problems - usually only when I have
changed an identifier name on the first line of a definition, which
moves the text and mucks up the lining up of something below it.
However, in those cases Miranda says "Unexpected token: OFFSIDE" and I
know what to fix. I find Miranda a delight to program in.

  John
schwartz@groucho.cs.psu.edu (Scott Schwartz) (01/14/91)
farrell@batserver.cs.uq.oz.au (Friendless) writes:
> I find Miranda a delight to program in.
The recent posting on matrix operations did seem to have more Miranda
examples than ML or other ones. Is Miranda really more popular?
kh@cs.glasgow.ac.uk (Kevin Hammond) (01/15/91)
In article <293@smds.UUCP> rh@smds.UUCP (Richard Harter) writes:
>In article <ACHA.91Jan11151418@DRAVIDO.CS.CMU.EDU>, acha@CS.CMU.EDU (Anurag Acharya) writes:
>> What is the justification for this "off-side" rule ? The idea of whitespace
>> having semantics is a potential source of inscrutable bugs and, frankly
>> speaking, seems to go against the grain of modern programming language
>> design.

This is to some extent true of Miranda, but it is less true of Haskell
(also mentioned in this thread). The advantage of the approach is that
programmers usually treat white-space as if it were significant (vide
Pascal programs laid out a line at a time). Capitalising on this
tendency helps improve legibility and conciseness, while permitting a
certain flexibility in layout. Compare:

    f 0 = 1
    f n = n * f (n-1)

    main = print(f 100)

with:

    module Main(main) where {
    f 0 = 1;
    f n = n * f (n-1);
    main = print(f 100)
    }

The improvement is rather better with local definitions and longer
programs (I'm using Haskell here; Miranda modules have shorter
headers).

>Could some one post a summary explanation of the off-side rule, please?

Simply put, in Miranda all text to the right of a defining operator
("where" or "=") after a newline comprises part of the previous line.
Other newlines act as separators. This applies recursively, so you
can build up levels of nested definitions structured by their layout.
This allows:

    || Example 1
    f x = g + h
          where
          g = fac x
          h = x * 2

    || Example 2
    f (x:xs)
      y
      z
      = x +
       y + z
    g = 100

etc. (The latter is an example of how *not* to use the off-side
rule!)

Note that Haskell's layout rule is *not identical* to the Miranda
off-side rule.
In particular:

 i) It is possible to turn off layout at will;
 ii) The layout rules do not use "=" as a defining point;
 iii) Alignment of definitions is significant.

(ii) is particularly important -- I find my Miranda programs regularly
vanish off the RHS of the page (because of the need to insert a fixed
amount of space), but this never happens with Haskell. It is also
easier to avoid off-side errors in Haskell (to the extent that these
almost never occur, even when porting free-form programs). As
illustration:

    f x = g + h
     where
      g = fac x
      h = x * 2

is legal Haskell, but not legal Miranda (the "where" is off-side with
respect to the first "=").

The Haskell layout rule is fully described in the Haskell report
(section 1.5, page 3). I'll post it if there's enough interest (5-6
paragraphs including explanation).

Kevin
--
This Signature Intentionally Left Blank.    E-mail: kh@cs.glasgow.ac.uk
thorinn@diku.dk (Lars Henrik Mathiesen) (01/16/91)
kh@cs.glasgow.ac.uk (Kevin Hammond) writes:
>In article <293@smds.UUCP> rh@smds.UUCP (Richard Harter) writes:
>>Could some one post a summary explanation of the off-side rule, please?

>Simply put, in Miranda all text to the right of a defining operator
>("where" or "=") after a newline comprises part of the previous line.
>Other newlines act as separators. This applies recursively, so you
>can build up levels of nested definitions structured by their layout.

This is not quite correct. The off-side rule in Miranda applies to
certain nonterminals in the syntax, such as: each alternative on the
right-hand side of a variable or function definition, other right-hand
sides (type definitions, aliases, and specifications), and bodies of
module inclusion directives. The definition list after "where" does
not create an ``offside level'' in itself; it's included in that of
the last alternative of the right-hand side. Also, the critical
indentation is determined by the first token produced by the
nonterminal, not by the token before it.

>This allows:
>    || Example 2
>    f (x:xs)
>      y
>      z
>      = x +
>       y + z

Therefore, this example gives a "syntax error - unexpected token
OFFSIDE". The last line should be indented to the level of the "x" in
the line above, not just past the "=".

>Note that Haskell's layout rule is *not identical* to the Miranda
>off-side rule. In particular:
> i) It is possible to turn off layout at will;
> ii) The layout rules do not use "=" as a defining point;
> iii) Alignment of definitions is significant
>(ii) is particularly important -- I find my Miranda programs regularly
>vanish off the RHS of the page (because of the need to insert a fixed
>amount of space), but this never happens with Haskell.

(ii) is incorrect, and the wild indentation can be avoided with an
alternative style:

    function of six rather longwindedly named arguments =
        right hand side
        where
        some other function = is defined

You can also break before the "=", at the cost of a tabstop.
I usually do that, especially when there are several alternatives:

    foo a b c
        = bar a , if b = c
        = baz (bar b ++ bar c) kvetch , otherwise
          where
          kvetch = burble a b c

It's not pretty, but it always works (even if you modify the left
hand side later) and the indentation only depends on nesting level.

The other two points are valid, and very much in Haskell's favor.
--
Lars Mathiesen, DIKU, U of Copenhagen, Denmark    [uunet!]mcsun!diku!thorinn
Institute of Datalogy -- we're scientists, not engineers.    thorinn@diku.dk
acha@CS.CMU.EDU (Anurag Acharya) (01/18/91)
Thanks for all the info on the "off-side rule". As far as I could make
out, the rule is motivated by a desire to make the syntax of
programming languages look as close to that of mathematical functions
as possible. As the post on Haskell mentioned, the layout is
preprocessed into a set of tokens in a phase between the lexer and the
parser. Such functionality would be better placed in an editor rather
than in the compiler. We already have the "electric-this",
"electric-that" modes in GNU Emacs. It should be a simple job to
extend these modes to include the "off-side rule" if so desired. Why
complicate the language definition with such inelegant constructs?

Of course, if one is indeed going to depend on the programmers to do
the indentation right, one might as well expect them to put in the
delimiters. A point that is made in defence of the "off-side rule" is
that the indentation, if "properly" done, renders the delimiters
redundant -- so we should do away with them. This is, however, useful
redundancy -- in that it allows the compiler to detect any inadvertent
errors that might occur.

anurag
spot@CS.CMU.EDU (Scott Draves) (01/18/91)
While it's true that one can use indentation to replace grouping, I
think this is quite dangerous. Many programs feel free to change the
spacing in a program. Emacs will often convert between tabs and
spaces. Some implementations of cpp convert tabs to spaces. Unix make
pays attention to whitespace, and it is one of the "features" most
regretted by the designers.

I don't think the fact that it can be turned off helps; one could
still receive someone else's code that uses it, and then you're hosed.

It's just plain gross.

just my $1/50
--
                    IBM
Scott Draves        Intel
spot@cs.cmu.edu     Microsoft
rh@smds.UUCP (Richard Harter) (01/19/91)
In article <SPOT.91Jan18074828@WOOZLE.GRAPHICS.CS.CMU.EDU>, spot@CS.CMU.EDU (Scott Draves) writes:
> While it's true that one can use indentation to replace grouping, i
> think this is quite dangerous. Many programs feel free to change the
> spacing in a program. emacs will often convert between tabs and
> spaces. some implementations of cpp converts tabs to spaces.

One has to distinguish between the off-side rule and the use of
indentation for grouping. The off-side rule is, from what I can see, a
potential source of petty little disasters. On the other hand, using
indentation as the delimiter for grouping is IMHO *the right thing to
do* in block structured languages. The standard practice of using
begin/end delimiters, coupled with using indentation (standard
programming practice), violates the single control point principle. It
is a regular source of annoying little bugs, the "dangling else" being
the most notorious example. The argument that many programs change
spacing is not to the point; programs that change source code text in
an unspecified language using arbitrary assumptions are, to put it
mildly, not of professional quality.

> i don't think the fact that it can be turned off helps, one could
> still receive someone else's code that uses it, and then you're hosed.

This is actually a good point. Language features which can be turned
on or off according to one's taste are bad features. One of the most
notorious abuses of cpp is to use macros to make C look like some
other language.

> it's just plain gross.

May I suggest in turn that the Algol tradition of treating programs as
streams of text to be processed by machines, rather than as text to be
read and written by programmers, is a fundamental conceptual error. It
is my impression that a reaction such as "it's just plain gross" is
just another variant of "This is the way I learned to do it, so it
must be the only way to do it." No doubt this is unfair on my part, so
accept my apologies in advance.

However, having worked with languages that used begin/end and cousins
and with languages that used indentation for block structure, I can
say I find the latter much more natural, once one overcomes one's
initial habits.
--
Richard Harter, Software Maintenance and Development Systems, Inc.
Net address: jjmhome!smds!rh    Phone: 508-369-7398
US Mail: SMDS Inc., PO Box 555, Concord MA 01742
This sentence no verb. This sentence short. This signature done.
sansom@cs.glasgow.ac.uk (Patrick Sansom) (01/21/91)
aipdc@castle.ed.ac.uk (Paul Crowley) writes:
>I am both attracted to and repelled from the off-side rule - it provides
>potential both for more readable code and for horrendously intractable
>bugs.
>My question is: why not write the _language_ so that white space is not
>significant (except as a separator) but write a _filter_ which turns
>"off-side" code into "non-off-side" code. That way, people who don't
>like the rule never have to think about it - foreign code can be pushed
>through the filter and then edited.
>A filter the other way might be nice but wouldn't be essential.
>...
>\/ o\ Paul Crowley aipdc@uk.ac.ed.castle
>/\__/ Trust me, I know what I'm doing.

Yes, I agree with this _filter_ idea. My honours project in 1989
looked at initial Haskell specifications and included implementing the
formatting rules in a pre-processor / filter which inserted brackets
{ } and semi-colons where appropriate. The modified source was then
given to the compiler. In fact, the layout rule in the Haskell report
is actually specified as a translation on source code.

Having the pre-processor as a separate entity allows any source code
to be processed before distribution to remove any possibility of
ambiguity.
--
Patrick Sansom                     | E-Mail: sansom@cs.glasgow.ac.uk
Computing Science Research Student | University of Glasgow
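A toy version of such a filter might look like this (illustrative Python, not the project's code; it is a drastic simplification of the Haskell layout translation, opening a block at every new, deeper indentation column rather than only after layout keywords):

```python
def layout_filter(src):
    """Sketch of a layout pre-processor as described above: insert "{", ";"
    and "}" from indentation alone, so a layout-free parser can read the
    result. The real Haskell rule (report section 1.5) is more subtle."""
    out, stack = [], []
    for line in src.splitlines():
        if not line.strip():
            continue
        col = len(line) - len(line.lstrip(' '))
        if not stack or col > stack[-1]:     # deeper: open an explicit block
            stack.append(col)
            out.append('{ ' + line.strip())
            continue
        while stack and col < stack[-1]:     # shallower: close blocks
            stack.pop()
            out.append('}')
        out.append('; ' + line.strip())      # same level: separator
    out.extend('}' for _ in stack)           # close blocks open at EOF
    return '\n'.join(out)
```

Run over two top-level equations, it produces the braced-and-semicoloned form a layout-free grammar expects.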
db@cs.ed.ac.uk (Dave Berry) (01/22/91)
In article <SPOT.91Jan18074828@WOOZLE.GRAPHICS.CS.CMU.EDU> spot@CS.CMU.EDU (Scott Draves) writes:
>While it's true that one can use indentation to replace grouping, i
>think this is quite dangerous. Many programs feel free to change the
>spacing in a program. emacs will often convert between tabs and
>spaces. some implementations of cpp converts tabs to spaces.

I really don't see this as a problem. All the compiler has to do is to
convert tabs back to spaces before applying the off-side rule. Many
compilers do this anyway, to generate error messages that contain the
position of the erroneous character(s). The fact that Miranda
apparently doesn't do this is to my mind a design error in Miranda.

Of course things get more complicated if people set tabs at different
intervals than the standard every 8th position. But this can be dealt
with, for example by a special declaration to declare tab settings.
--
Dave Berry, LFCS, Edinburgh Uni.    db%lfcs.ed.ac.uk@nsfnet-relay.ac.uk

"The Berlin Wall; the border between East and West Germany. It's very
narrow."  "*Was* very narrow. Get your tenses right."
 -- Douglas Adams, THHGTTG.
dtarditi@CS.CMU.EDU (David Tarditi) (01/23/91)
I am astonished to find people defending the off-side rule. Parsing is
a solved problem. Why do designers of semantically "clean" languages
insist on botching syntax?

We could argue forever about whether making white-space significant is
in good taste. Suffice it to say that most real programming languages
in widespread use do not make white-space significant. Maybe they just
have not seen the light of "conciseness and legibility", or perhaps
from previous experience they wanted to get rid of endless, annoying
bugs which are difficult to find. It is bad enough having to match up
delimiters in long programs without having to worry about indentation.

A significant argument against the off-side rule is that it makes
languages difficult to parse. The modern approach to parsing is to
define the concrete syntax of a language using BNF and regular
expressions. The BNF for the language should be defined so that the
language can be parsed using an LALR(1) parser. This allows the
compiler writer to use software tools which generate efficient parsers
and lexers, and get on with the more important part of the work
(generating correct, efficient code). The Lex and Yacc Unix(tm) tools
exemplify the approach.

The concrete syntax of languages using the off-side rule cannot be
defined using context-free grammars or regular expressions. The
reason, simply put, is that these languages possess only a limited
ability to "count". Consider the following language (where c^k denotes
k occurrences of character c):

    { s | there exists k such that s = c^k x c^k y c^k z }

This, I argue, is a simple example of a program using the off-side
rule. Let c be a space character. Then we have k spaces, followed by
some phrase, followed by k spaces, etc. I claim that this language is
not context-free. I don't offer a proof of this here, but an
application of the pumping lemma for context-free languages should
suffice.

Thus, it is impossible to parse this language using modern techniques
unless we resort to hacks. Suggested hacks include adding a
preprocessor specifically to deal with the off-side rule or altering
the lexer to count spaces.

I am amazed that people ignore all the work that has been done on
parsing. The formalization of syntax is considered by most people to
be a significant accomplishment: the formalism is useful to
programmers using the languages and compiler writers alike.

IMHO, people designing concrete syntax for languages should check that
the concrete syntax is LALR(1), which ensures that mechanical tools
for parser generation can be used if they exist for whatever language
you are writing your compiler in. This is quite simple to do: all you
need to do is write a grammar which runs through the Unix(tm) Yacc
tool without any ambiguities.

David Tarditi
tarditi@cs.cmu.edu
--
David Tarditi                      School of Computer Science
tarditi@cs.cmu.edu                 Carnegie Mellon University
                                   Pittsburgh, PA 15213
eerke@cs.kun.nl (Eerke Boiten) (01/23/91)
In article <DTARDITI.91Jan22185718@LAMBDA.ERGO.CS.CMU.EDU> dtarditi@CS.CMU.EDU (David Tarditi) writes:
>A significant argument against the off-side rule is that it makes
>languages difficult to parse. The modern approach to parsing is to define
>the concrete syntax of a language using BNF and regular expressions.
>The BNF for the language should be defined so that the language can be
>parsed using an LALR(1) parser. [..]

This is only important for *prototyping* implementations of functional
languages in environments where no parser generators other than YACC
are available. Anyone who has implemented a functional language, care
to comment?

>The concrete syntax of languages using the off-side rule cannot be defined
>using context-free grammars or regular expressions. The reason, simply put,
>is that these languages possess only a limited ability to "count".
>Consider the following language (where c^k denotes k occurrences of
>character c):
>    { s | there exists k such that s = c^k x c^k y c^k z }
>This, I argue, is a simple example of a program using the off-side
>rule. [c=space] I claim that this language is not context-free. [..]
>Thus, it is impossible to parse this language using modern techniques
>unless we resort to hacks. Suggested hacks include adding a preprocessor
>specifically to deal with the off-side rule or altering the lexer to count
>spaces.

Two comments:

1. The language above *is* context-free for limited k (let's say 80,
   for instance - more than 80 chars per line is bad programming
   style, certainly in nice succinct functional languages).

2. Now I should remark that most programming languages are not
   *context-free* anyway (identifiers!), maybe?

The most obvious solution for off-side rule languages is already given
(and is no less of a hack than symbol tables are): count the spaces in
the lexer. The only real problem I can imagine with such languages is
with programs that generate code in them.

>IMHO, people designing concrete syntax for languages should check that
>the concrete syntax is LALR(1) [..] This is quite simple to do:
>all you need to do is write a grammar which runs through the Unix(tm)
>Yacc tool without any ambiguities.

Testing for LALR(1) is simple: either it works and you're finished, or
your grammar is ambiguous and you have to rewrite it anyway, or it
isn't LALR(1) and not ambiguous either, and you're in trouble. Even if
you know "all" about LALR(1), you still need an enormous amount of
specific intuition to be able to eliminate LALR(1) problems.

(Quoted from earlier on:)
>I am amazed that people ignore all the work that has been done on parsing.

Me too. LALR(1) isn't exactly the hottest thing in parsing, is it?
--
Eerke Boiten
Department of Informatics (STOP Project), K.U.Nijmegen
Toernooiveld, 6525 ED Nijmegen, The Netherlands
Tel. +31-80-652236.    Email: eerke@cs.kun.nl
bothner@sevenlayer.cs.wisc.edu (Per Bothner) (01/24/91)
>All this to save a few characters?

The point, I think, is not to save a few characters, but to save a few
lines (and saving lines is useful, because it allows you to get more
of a program on the screen). Consider C indentation rules:

    if (test)
    {
      statement1;
      statement2;
    }

Note the final line. This is needed for the final delimiter, but
contains no useful information. It wastes cathode rays/paper. It puts
extra white space in a useless and ugly place. You could use a
Lisp-style of formatting, where the final delimiter is on the last
"real" line:

    if (test)
    {
      statement1;
      statement2;}

But now the last statement is different from the other(s), which is
esthetically ugly and makes updating error-prone.

I find it hard to take seriously people's concerns about tabs. A
program with tabs whose intervals are not every 8th character is
already non-portable (it will look a mess when viewed on a normal
system), and it is in any case an easy problem to fix in your editor,
in your communication software, or with an explicit filter.

So, I believe it may be reasonable to make syntactic use of
indentation. The specifics of Haskell's particular off-side rule are a
different issue. I suggest that the rule should apply to the relative
indentation of lines, not syntactic units. That is, I would prefer
something like:

    if (test)
        statement1
        statement2

instead of:

    if (test) statement1
              statement2
--
--Per Bothner bothner@cs.wisc.edu
Computer Sciences Dept, U. of Wisconsin-Madison
brian@comp.vuw.ac.nz (Brian Boutel) (01/24/91)
A number of people have criticised layout rules. For example, in
article <DTARDITI.91Jan22185718@LAMBDA.ERGO.CS.CMU.EDU>,
dtarditi@CS.CMU.EDU (David Tarditi) writes:
|> I am astonished to find people defending the off-side rule. Parsing
|> is a solved problem. Why do designers of semantically "clean"
|> languages insist on botching syntax ?
|>
|> We could argue forever about whether making white-space significant
|> is in good taste. Suffice it to say that most real programming
|> languages in widespread use do not make white-space significant.
|> Maybe they just have not seen the light of "conciseness and
|> legibility", or perhaps from previous experience they wanted to get
|> rid of endless, annoying bugs which are difficult to find. It is bad
|> enough having to match up delimiters in long programs without having
|> to worry about indentation.

Just a reminder that Haskell is *defined* without reference to any
off-side rule. Grouping uses { and }, and ; is used for separation. It
is entirely practicable and reasonable to write Haskell programs this
way. But some people like the "mathematical" appearance of programs
written in languages with layout conventions, and they like the
uncluttered appearance of programs without lots of explicit
delimiters. For this reason, the Haskell Report includes an algorithm
for converting programs in this style to the Haskell of the
definition. I personally think that the delimited style should always
be used for program transfer, but the Haskell report provides for
portable programs in layout style by defining the tab conventions that
should be assumed when calculating indentations.

|> A significant argument against the off-side rule is that it makes
|> languages difficult to parse. The modern approach to parsing is to
|> define the concrete syntax of a language using BNF and regular
|> expressions. The BNF for the language should be defined so that the
|> language can be parsed using an LALR(1) parser. This allows the
|> compiler writer to use software tools which generate efficient
|> parsers and lexers, and get on with the more important part of the
|> work (generating correct, efficient code). The Lex and Yacc Unix(tm)
|> tools exemplify the approach.

And these tools are, of course, used in implementing Haskell.

|> The concrete syntax of languages using the off-side rule cannot be
|> defined using context-free grammars or regular expressions. The
|> reason, simply put, is that these languages possess only a limited
|> ability to "count". Consider the following language (where c^k
|> denotes k occurrences of character c):
|>
|>    { s | there exists k such that s = c^k x c^k y c^k z }
|>
|> This, I argue, is a simple example of a program using the off-side
|> rule. Let c be a space character. Then we have k spaces, followed
|> by some phrase, followed by k spaces, etc. I claim that this
|> language is not context-free. I don't offer a proof of this here,
|> but an application of the pumping lemma for context-free languages
|> should suffice.
|>
|> Thus, it is impossible to parse this language using modern
|> techniques unless we resort to hacks. Suggested hacks include adding
|> a preprocessor specifically to deal with the off-side rule or
|> altering the lexer to count spaces.

Even such common language features as requiring declarations to appear
before uses make languages not context-free, and if dealing with this
by means of symbol table processing is "resorting to hacks", then most
language processors are in trouble.

|> I am amazed that people ignore all the work that has been done on
|> parsing. The formalization of syntax is considered by most people
|> to be a significant accomplishment: the formalism is useful to
|> programmers using the languages and compiler writers alike.
|>
|> IMHO, people designing concrete syntax for languages should check
|> that the concrete syntax is LALR(1), which ensures that mechanical
|> tools for parser generation can be used if they exist for whatever
|> language you are writing your compiler in. This is quite simple to
|> do: all you need to do is write a grammar which runs through the
|> Unix(tm) Yacc tool without any ambiguities.

There is a big difference between "ignoring all the work that has been
done" and using the tools plus other methods when the tools do not go
far enough. Let me give another example from Haskell. Most people give
ambiguous grammars to Yacc and provide additional disambiguating
information, for example, associativity and precedence info for infix
operators. A problem with this is that you cannot in Yacc specify
operators of different associativities at the same precedence, and yet
this may be a necessary feature (e.g. when the operators are defined
by different people in separate modules and imported into the same
module). Haskell defines what is legal and how it is to be parsed by
requiring that infix expressions be fully parenthesised except in a
list of cases where normal precedence and associativity considerations
make this unnecessary. Now this is still context-free, but since the
rules cannot be expressed using the Yacc %left and %right syntax, the
grammar needed to describe it is very large. So what should be done?

a) Change the language?
b) Live with the inefficiencies of the big parser that is generated
   by the big grammar.
c) Recognise that Yacc is not a good way to solve this particular
   problem, and use a mixed strategy of LALR(1) grammar/Yacc for the
   rest of the language and a precedence grammar/parser for infix
   expressions?

The art is to find the right tools for the task at hand, not to reject
all tasks for which your favourite tools are not ideal.
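Option (c) is easy to prototype: a precedence-climbing parser driven by a fixity table covers exactly the part Yacc's %left/%right cannot. The sketch below is illustrative Python, not any actual Haskell implementation; it handles atoms and binary operators only, and does not check the equal-precedence, mixed-associativity error case Haskell requires:

```python
def parse_infix(tokens, fixity):
    """Precedence-climbing parser for infix expressions, driven by a fixity
    table like the one built from user infix declarations. `fixity` maps
    operator -> (precedence, 'left' | 'right'); `tokens` is already lexed,
    and operands are atoms in this sketch (no parentheses)."""
    pos = 0

    def next_tok():
        return tokens[pos] if pos < len(tokens) else None

    def parse(min_prec):
        nonlocal pos
        lhs = tokens[pos]                    # an operand atom
        pos += 1
        while True:
            op = next_tok()
            if op not in fixity:             # end of input or not an operator
                return lhs
            prec, assoc = fixity[op]
            if prec < min_prec:              # binds too loosely for this level
                return lhs
            pos += 1
            # left-assoc: right operand must bind strictly tighter
            rhs = parse(prec + 1 if assoc == 'left' else prec)
            lhs = (op, lhs, rhs)

    return parse(0)
```

Because the fixity table is ordinary data, operators imported from different modules can carry whatever precedence and associativity their declarations gave them, with no grammar regeneration.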
--brian -- Internet: brian@comp.vuw.ac.nz Postal: Brian Boutel, Computer Science Dept, Victoria University of Wellington, PO Box 600, Wellington, New Zealand Phone: +64 4 721000
jrk@information-systems.east-anglia.ac.uk (Richard Kennaway CMP RA) (01/24/91)
In <2676@wn1.sci.kun.nl> eerke@cs.kun.nl (Eerke Boiten) writes:
>In article <DTARDITI.91Jan22185718@LAMBDA.ERGO.CS.CMU.EDU> dtarditi@CS.CMU.EDU (David Tarditi) writes:
>>I am amazed that people ignore all the work that has been done on parsing.
>Me too. LALR(1) isn't exactly the hottest thing in parsing, is it?

On the other hand, YACC is there. Is any comparable tool readily
available that embodies more up-to-date parsing technology?
--
Richard Kennaway    SYS, University of East Anglia, Norwich, U.K.
Internet: jrk@sys.uea.ac.uk    uucp: ...mcsun!ukc!uea-sys!jrk
"Gentlemen do not discuss concrete syntax."  -- Stefan Sokolowski
kh@cs.glasgow.ac.uk (Kevin Hammond) (01/24/91)
In article <2676@wn1.sci.kun.nl> eerke@cs.kun.nl (Eerke Boiten) writes:
>In article <DTARDITI.91Jan22185718@LAMBDA.ERGO.CS.CMU.EDU> dtarditi@CS.CMU.EDU (David Tarditi) writes:
>>A significant argument against the off-side rule is that it makes
>>languages difficult to parse. The modern approach to parsing is to define
>>the concrete syntax of a language using BNF and regular expressions.
>>The BNF for the language should be defined so that the language can be
>>parsed using an LALR(1) parser. [..]
>
>This is only important for *prototyping* implementations of functional
>languages in environments where no other parser generators than YACC are
>available. Anyone who implemented a functional language, care to comment?

For what it's worth, the Haskell off-side rule requires very few
modifications to a yacc parser (one extra rule, I think). This assumes
that the lexer is counting spaces. The lexer is rather more
complicated, but can still be written using lex/C without too much
trouble. The only real problems are keeping track of space in strings
(which may extend across lines in Haskell) and remembering to adjust
the space count when backtracking.

Kevin
--
This Signature Intentionally Left Blank.    E-mail: kh@cs.glasgow.ac.uk