[comp.lang.functional] "Off-side rule"

acha@CS.CMU.EDU (Anurag Acharya) (01/12/91)

In article <1991Jan11.100048.3121@odin.diku.dk> sestoft@diku.dk (Peter Sestoft) writes:
   The main difference between Miranda (TM) and Lazy ML is the syntax.
   In Miranda the program lay-out has semantics: the off-side rule
   determines where an expression ends, whereas (Lazy) ML has more
   Pascal-like conventions, using semicolons and similar punctuation.  (The
   off-side rule was suggested by Peter Landin in the sixties).  

What is the justification for this "off-side" rule ? The idea of whitespace 
having semantics is a potential source of inscrutable bugs and, frankly 
speaking, seems to go against the grain of modern programming language
design. The concrete syntax of such a language would no longer be context-free,
let alone LR(1)/LL(1). In fact, I am hard pressed to conceptualize an
efficient tokenizing algorithm for such languages. 

Since this rule has also been incorporated into the design of Haskell, it isn't
just historical, and I am sure the designers of Haskell must have had some
reason(s) for this decision. What benefit(s) does this rule provide a
language designer to trade off against the parsing inconvenience and
inelegance ? 


anurag

rh@smds.UUCP (Richard Harter) (01/13/91)

In article <ACHA.91Jan11151418@DRAVIDO.CS.CMU.EDU>, acha@CS.CMU.EDU (Anurag Acharya) writes:
> In article <1991Jan11.100048.3121@odin.diku.dk> sestoft@diku.dk (Peter Sestoft) writes:
>    The main difference between Miranda (TM) and Lazy ML is the syntax.
>    In Miranda the program lay-out has semantics: the off-side rule
>    determines where an expression ends, whereas (Lazy) ML has more
>    Pascal-like conventions, using semicolons and similar punctuation.  (The
>    off-side rule was suggested by Peter Landin in the sixties).  

> What is the justification for this "off-side" rule ? The idea of whitespace 
> having semantics is a potential source of inscrutable bugs and, frankly 
> speaking, seems to go against the grain of modern programming language
> design.

Could someone post a summary explanation of the off-side rule, please?
I am working with a language called Lakota which, like ICON, uses indentation
for delimiting blocks, so the idea of layout being meaningful is familiar.
However this is a syntactical use of white space rather than a semantic
use, at least as far as I can see, and is not particularly novel.
I would also be interested in hearing what the purported advantages of
the off-side rule are.

-- 
Richard Harter, Software Maintenance and Development Systems, Inc.
Net address: jjmhome!smds!rh Phone: 508-369-7398 
US Mail: SMDS Inc., PO Box 555, Concord MA 01742
This sentence no verb.  This sentence short.  This signature done.

kinnersley@kuhub.cc.ukans.edu (Bill Kinnersley) (01/14/91)

In article <ACHA.91Jan11151418@DRAVIDO.CS.CMU.EDU>, acha@CS.CMU.EDU (Anurag Acharya) writes:
: 
: What is the justification for this "off-side" rule ? The idea of whitespace 
: having semantics is a potential source of inscrutable bugs and, frankly 
: speaking, seems to go against the grain of modern programming language
: design. The concrete syntax of such a language would no longer be context-free,
: let alone LR(1)/LL(1). In fact, I am hard pressed to conceptualize an
: efficient tokenizing algorithm for such languages. 
: 
: Since this rule has also been incorporated into the design of Haskell, it isn't
: just historical and I am sure the designers of Haskell must have had some
: reason(s) for this decision. I would like to know what benefit(s) does this
: rule provide a language designer to trade off against the parsing 
: inconvenience and inelegance ? 
: 
I don't think it's really as bad an idea as you make it sound.
At least two other languages use it, ABC and occam.

Certainly indentation makes a program more readable, and once you've
indented correctly the brackets and braces are unnecessary clutter.
What's easier to spot, a missing parenthesis or a missing indentation?

It's a lexical problem not a parsing problem, and it's simple.  At every
newline you have LEX count the leading whitespace, and emit an indent or
an outdent every time it changes.
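
That scheme can be sketched as follows (a minimal, hypothetical Python
rendering, not taken from any real Miranda or Haskell lexer): count each
line's leading spaces and emit an INDENT or OUTDENT pseudo-token whenever
the count changes, so the parser downstream sees explicit delimiters.

```python
def layout_tokens(source):
    """Yield ('INDENT', col) / ('OUTDENT', col) / ('LINE', text) tokens."""
    prev = 0
    for line in source.splitlines():
        if not line.strip():        # blank lines carry no layout information
            continue
        col = len(line) - len(line.lstrip(' '))
        if col > prev:
            yield ('INDENT', col)
        elif col < prev:
            yield ('OUTDENT', col)
        prev = col
        yield ('LINE', line.strip())
```

Note that emitting a single OUTDENT per change collapses a multi-level
dedent into one token; a production lexer would keep a stack of
indentation levels and emit one OUTDENT per level closed.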

--Bill Kinnersley

djohnson@beowulf.ucsd.edu (Darin Johnson) (01/14/91)

>In article <ACHA.91Jan11151418@DRAVIDO.CS.CMU.EDU>, acha@CS.CMU.EDU (Anurag Acharya) writes:
>: 
>: What is the justification for this "off-side" rule ? The idea of whitespace 
>: having semantics is a potential source of inscrutable bugs and, frankly 
>: speaking, seems to go against the grain of modern programming language
>: design. The concrete syntax of such a language would no longer be 
>: context-free,
>: let alone LR(1)/LL(1). In fact, I am hard pressed to conceptualize an
>: efficient tokenizing algorithm for such languages. 

When you look at how strict and unforgiving Occam is towards spacing,
the off-side rule is rather benign.  However, there is the major "gotcha"
in both languages - tabs count as one character.  And unfortunately, back
when I used vi, tabs would be inserted automatically and occam would give
some meaningless error message that took hours to figure out.

As far as justifying this, it adds to readability.  One of the "goals" of
functional programming languages is to be able to have programs look
like formal mathematical functions.  Block constructs detract from
the readability somewhat, especially if you must add begin/end because
{}, (), etc are already used.  Probably a big reason is that it looks
nice esthetically, without lots of filler tokens and symbols.

Parsing isn't that bad at all.  In fact, the lexical analyzer can handle it
all, and the parser need know nothing about spacing.  The method I used in a
project was to insert a "begin" token whenever the beginning of a block was
found (like just after =), and also push onto a stack the current position
in the line.  Then whenever reading a new line, if the first symbol was
less than the position saved on the top of the stack, "end"'s were inserted
into the token stream.  It was very simple to add, and the parser always saw
begin/end markers and was kept happy.  [of course, you have to go to reading
line by line, but this was done anyway so error messages had line numbers]
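
The stack method above can be sketched like this (a hypothetical Python
rendering; a line ending in "=" stands in for the real block-opening test,
which the original project presumably performed on tokens):

```python
def insert_delimiters(lines):
    """Turn layout into explicit begin/end markers for the parser."""
    out, stack = [], []
    for line in lines:
        col = len(line) - len(line.lstrip(' '))
        # Pop the stack, emitting an 'end' per level, while this line
        # starts to the left of the saved block position.
        while stack and col < stack[-1]:
            stack.pop()
            out.append('end')
        out.append(line.strip())
        if line.rstrip().endswith('='):   # block opener (assumption)
            out.append('begin')
            stack.append(col + 1)         # body must be indented further
    while stack:                          # close anything still open at EOF
        stack.pop()
        out.append('end')
    return out
```

The parser then always sees balanced begin/end markers, exactly as
described, and never needs to know about spacing itself.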
-- 
Darin Johnson
djohnson@ucsd.edu

farrell@batserver.cs.uq.oz.au (Friendless) (01/14/91)

In <ACHA.91Jan11151418@DRAVIDO.CS.CMU.EDU> acha@CS.CMU.EDU (Anurag Acharya) writes:
>What is the justification for this "off-side" rule ? The idea of whitespace 
>having semantics is a potential source of inscrutable bugs and, frankly 
>speaking, seems to go against the grain of modern programming language
>design. 

  It is unusual, but it works. Since I am obsessed with correct
indentation, it rarely causes me problems - usually only when I have changed
an identifier name on the first line of a definition, which moves the text and
mucks up the lining up of something below it. However, in those cases Miranda
says "Unexpected token: OFFSIDE" and I know what to fix. I find Miranda a
delight to program in.


John

schwartz@groucho.cs.psu.edu (Scott Schwartz) (01/14/91)

farrell@batserver.cs.uq.oz.au (Friendless) writes:
   I find Miranda a delight to program in.

The recent posting on matrix operations did seem to have more Miranda
examples than ML or other ones.  Is Miranda really more popular?

kh@cs.glasgow.ac.uk (Kevin Hammond) (01/15/91)

In article <293@smds.UUCP> rh@smds.UUCP (Richard Harter) writes:
>In article <ACHA.91Jan11151418@DRAVIDO.CS.CMU.EDU>, acha@CS.CMU.EDU (Anurag Acharya) writes:
>> What is the justification for this "off-side" rule ? The idea of whitespace 
>> having semantics is a potential source of inscrutable bugs and, frankly 
>> speaking, seems to go against the grain of modern programming language
>> design.

This is to some extent true of Miranda, but it is less true of Haskell
(also mentioned in this thread).  

The advantage of the approach is that programmers usually treat
white-space as if it were significant (vide Pascal programs laid out a
line at a time).  Capitalising on this tendency helps improve
legibility and conciseness, while permitting a certain flexibility in
layout.  Compare:

	f 0 = 1
	f n = n * f (n-1)

	main = print(f 100)

with:

	module Main(main) where
	{ 
	  f 0 = 1;
	  f n = n * f (n-1);

	  main = print(f 100)
	}

The improvement is rather better with local definitions and longer
programs (I'm using Haskell here, Miranda modules have shorter
headers).

>Could someone post a summary explanation of the off-side rule, please?

Simply put, in Miranda all text to the right of a defining operator
("where" or "=") after a newline comprises part of the previous line.
Other newlines act as separators.  This applies recursively, so you
can build up levels of nested definitions structured by their layout.
This allows:

	|| Example 1
	f x = g + h
	      where g = fac x
                    h = x * 2


	|| Example 2
	 f (x:xs)
	   y
             z
          =  x + 
           y + z

	g = 100

etc.  (The latter is an example of how *not* to use the off-side rule!)


Note that Haskell's layout rule is *not identical* to the Miranda
off-side rule.  In particular:

	i)	It is possible to turn off layout at will;
	ii)	The layout rules do not use "=" as a defining point;
	iii)	Alignment of definitions is significant

(ii) is particularly important -- I find my Miranda programs regularly
vanish off the RHS of the page (because of the need to insert a fixed
amount of space), but this never happens with Haskell.  It is also
easier to avoid off-side errors in Haskell (to the extent that these
almost never occur, even when porting free-form programs).  As illustration:

	f x = g + h
         where
	   g = fac x
	   h = x * 2

is legal Haskell, but not legal Miranda (the "where" is off-side with
respect to the first "=").

The Haskell layout rule is fully described in the Haskell report
(section 1.5, page 3).  I'll post it if there's enough interest 
(5-6 paragraphs including explanation).

Kevin




-- 
This Signature Intentionally Left Blank.	E-mail: kh@cs.glasgow.ac.uk

thorinn@diku.dk (Lars Henrik Mathiesen) (01/16/91)

kh@cs.glasgow.ac.uk (Kevin Hammond) writes:
>In article <293@smds.UUCP> rh@smds.UUCP (Richard Harter) writes:
>>Could someone post a summary explanation of the off-side rule, please?

>Simply put, in Miranda all text to the right of a defining operator
>("where" or "=") after a newline comprises part of the previous line.
>Other newlines act as separators.  This applies recursively, so you
>can build up levels of nested definitions structured by their layout.

This is not quite correct. The offside rule in Miranda applies to
certain nonterminals in the syntax such as: each alternative on the
righthand side of a variable or function definition, other righthand
sides (type definitions, aliases, and specifications), bodies of
module inclusion directives. The definition list after "where" does
not create an ``offside level'' in itself; it's included in that of the
last alternative of the righthand side.

Also, the critical indentation is determined by the first token
produced by the nonterminal, not by the token before it.

>This allows:

>	|| Example 2
>	 f (x:xs)
>	   y
>             z
>          =  x + 
>           y + z

Therefore, this example gives a "syntax error - unexpected token
OFFSIDE". The last line should be indented to the level of the "x" in
the line above, not just past the "=".

>Note that Haskell's layout rule is *not identical* to the Miranda
>off-side rule.  In particular:

>	i)	It is possible to turn off layout at will;
>	ii)	The layout rules do not use "=" as a defining point;
>	iii)	Alignment of definitions is significant

>(ii) is particularly important -- I find my Miranda programs regularly
>vanish off the RHS of the page (because of the need to insert a fixed
>amount of space), but this never happens with Haskell.

(ii) is incorrect, and the wild indentation can be avoided with an
alternative style:

	function of six rather longwindedly named arguments =
		right hand side
		where some other function =
			is defined

You can also break before the "=", at the cost of a tabstop. I usually
do that, especially when there are several alternatives:

	foo a b c
	  = bar a		, if b = c
	  = baz (bar b ++ bar c)
		kvetch		, otherwise
	    where
	      kvetch
		= burble a b c

It's not pretty, but it always works (even if you modify the left hand
side later) and the indentation only depends on nesting level.

The other two points are valid, and very much in Haskell's favor.

--
Lars Mathiesen, DIKU, U of Copenhagen, Denmark      [uunet!]mcsun!diku!thorinn
Institute of Datalogy -- we're scientists, not engineers.      thorinn@diku.dk

acha@CS.CMU.EDU (Anurag Acharya) (01/18/91)

Thanks for all the info on the "off-side rule". As far as I could make out,
the rule is motivated by a desire to make the syntax of programming languages
look as close to that of mathematical functions as possible. As the post
on Haskell mentioned, the layout is preprocessed into a set of tokens
in a phase between the lexer and the parser.
Such functionality would be better placed in an editor rather than in the 
compiler. We already have the "electric-this", "electric-that" modes in 
gnu-emacs. It should be a simple job to extend these modes to include the 
"off-side rule" if so desired. Why complicate the language definition
with such inelegant constructs ?

Of course, if one is indeed going to depend on the programmers to do the 
indentation right, one might expect them to put in the delimiters. A point
that is made in the defence of the "off-side rule" is that the indentation,
if "properly" done, renders the delimiters redundant -- so we should do away 
with them. This is, however, useful redundancy -- in that it allows the
compiler to detect any inadvertent errors that might occur.

anurag

spot@CS.CMU.EDU (Scott Draves) (01/18/91)

While it's true that one can use indentation to replace grouping, i
think this is quite dangerous.  Many programs feel free to change the
spacing in a program.  emacs will often convert between tabs and
spaces.  some implementations of cpp convert tabs to spaces.

unix make pays attention to whitespace, and it is one of the
"features" most regretted by the designers.

i don't think the fact that it can be turned off helps, one could
still receive someone else's code that uses it, and then you're hosed.

it's just plain gross.

just my $1/50
--

			IBM
Scott Draves		Intel
spot@cs.cmu.edu		Microsoft

rh@smds.UUCP (Richard Harter) (01/19/91)

In article <SPOT.91Jan18074828@WOOZLE.GRAPHICS.CS.CMU.EDU>, spot@CS.CMU.EDU (Scott Draves) writes:

> While it's true that one can use indentation to replace grouping, i
> think this is quite dangerous.  Many programs feel free to change the
> spacing in a program.  emacs will often convert between tabs and
> spaces.  some implementations of cpp convert tabs to spaces.

One has to distinguish between the off-side rule and the use of indentation
for grouping.  The off-side rule is, from what I can see, a potential
source of petty little disasters.  On the other hand, using indentation as
the delimiter for grouping is IMHO *the right thing to do* in block 
structured languages.

The standard practice of using begin/end delimiters, coupled with using
indentation (standard programming practice) violates the single control
point principle.  It is a regular source of annoying little bugs, the
"dangling else" being the most notorious example.

The argument that many programs change spacing is not to the point;
programs that change source code text in an unspecified language using
arbitrary assumptions are, to put it mildly, not of professional quality.

> i don't think the fact that it can be turned off helps, one could
> still receive someone else's code that uses it, and then you're hosed.

This is actually a good point.  Language features which can be "turned
on or off" according to one's taste are bad features.  One of the most
notorious abuses of cpp is to use macros to make C look like some
other language.

> it's just plain gross.

May I suggest in turn that the Algol tradition of treating programs as
streams of text to be processed by machines, rather than as text to be
read and written by programmers is a fundamental conceptual error.   It
is my impression that a reaction such as "it's just plain gross" is just
another variant of "This is the way I learned to do it, so it must be
the only way to do it."  No doubt this is unfair on my part, so accept
my apologies in advance.  However, having worked with languages that
used begin/end and cousins and with languages that used indentation for
block structure, I can say I find the latter much more natural, once
one overcomes one's initial habits.
-- 
Richard Harter, Software Maintenance and Development Systems, Inc.
Net address: jjmhome!smds!rh Phone: 508-369-7398 
US Mail: SMDS Inc., PO Box 555, Concord MA 01742
This sentence no verb.  This sentence short.  This signature done.

sansom@cs.glasgow.ac.uk (Patrick Sansom) (01/21/91)

aipdc@castle.ed.ac.uk (Paul Crowley) writes:

>I am both attracted to and repelled from the off-side rule - it provides
>potential both for more readable code and for horrendously intractable
>bugs.

>My question is: why not write the _language_ so that white space is not
>significant (except as a separator) but write a _filter_ which turns
>"off-side" code into "non-off-side" code.  That way, people who don't
>like the rule never have to think about it - foreign code can be pushed
>through the filter and then edited.

>A filter the other way might be nice but wouldn't be essential.

>...

>\/ o\ Paul Crowley aipdc@uk.ac.ed.castle
>/\__/ Trust me, I know what I'm doing.


Yes I agree with this _filter_ idea. My honours project in 1989 was
looking at initial Haskell specifications and included implementing
the formatting rules in a pre-processor / filter which inserted
brackets { } and semi-colons where appropriate. The modified source
was then given to the compiler.

In fact, the layout rule in the Haskell report is actually specified
as a translation on source code.

Having the pre-processor as a separate entity allows any source code
to be processed before distribution to remove any possibility of
ambiguity.

-- 
Patrick Sansom                      | E-Mail:    sansom@cs.glasgow.ac.uk
Computing Science Research Student  | 
University of Glasgow               |

db@cs.ed.ac.uk (Dave Berry) (01/22/91)

In article <SPOT.91Jan18074828@WOOZLE.GRAPHICS.CS.CMU.EDU> spot@CS.CMU.EDU (Scott Draves) writes:
>
>While it's true that one can use indentation to replace grouping, i
>think this is quite dangerous.  Many programs feel free to change the
>spacing in a program.  emacs will often convert between tabs and
>spaces.  some implementations of cpp convert tabs to spaces.

I really don't see this as a problem.  All the compiler has to do is to
convert tabs back to spaces before applying the off-side rule.  Many
compilers do this anyway to generate error messages that contain the
position of the erroneous character(s).  The fact that Miranda apparently
doesn't do this is to my mind a design error in Miranda.

Of course things get more complicated if people set tabs at different
intervals than the standard every 8th position.  But this can be dealt
with, for example by a special declaration to declare tab settings.
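
The normalisation Dave describes amounts to a one-liner; as a sketch in
Python (where `str.expandtabs` performs the standard advance-to-the-next-
multiple-of-the-tabstop expansion):

```python
def detab(line, tabstop=8):
    """Expand tabs to spaces before the layout pass ever sees the text.

    This enforces the usual "tab = next 8th column" display convention
    rather than the surprising "tab = one character" reading.
    """
    return line.expandtabs(tabstop)
```

A declared non-standard tab setting would simply be passed in as the
`tabstop` argument.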

--
 Dave Berry, LFCS, Edinburgh Uni.      db%lfcs.ed.ac.uk@nsfnet-relay.ac.uk

"The Berlin Wall; the border between East and West Germany.  It's very narrow."
"*Was* very narrow.  Get your tenses right."	-- Douglas Adams, THHGTTG.

dtarditi@CS.CMU.EDU (David Tarditi) (01/23/91)

I am astonished to find people defending the off-side rule.  Parsing
is a solved problem.  Why do designers of semantically "clean" languages
insist on botching syntax ?

We could argue forever about whether making white-space significant 
is in good taste.  Suffice it to say that most real programming languages
in widespread use do not make white-space significant.  Maybe they just
have not seen the light of "conciseness and legibility", or perhaps
from previous experience they wanted to get rid of endless, annoying
bugs which are difficult to find.   It is bad enough having to match up
delimiters in long programs without having to worry about indentation.

A significant argument against the off-side rule is that it makes 
languages difficult to parse.  The modern approach to parsing is to define
the concrete syntax of a language using BNF and regular expressions.
The BNF for the language should be defined so that the language can
parsed using an LALR(1) parser.  This allows the compiler writer to use
software tools which generate efficient parsers and lexers, and get on with
the more important part of the work (generating correct, efficient code).
The Lex and Yacc Unix(tm) tools exemplify the approach.

The concrete syntax of languages using the off-side rule cannot be defined
using context-free languages or regular expressions. The reason, simply put,
is that these languages possess only a limited ability to "count".
Consider the following language (where c^k denotes k occurrences of
character c):

       { s | there exists k such that s= c^kxc^kyc^kz}

This, I argue, is a simple example of a program using the off-side
rule.  Let c be a space character.  Then we have k spaces, followed
by some phrase, followed by k spaces, etc.  I claim that this language
is not context-free. I don't offer a proof of this here, but an application
of the pumping lemma for context-free languages should suffice.
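
(A sketch of that pumping argument, filled in here as standard textbook
reasoning rather than quoted from anywhere; Greek letters are used for the
decomposition to avoid clashing with the terminals x, y, z:

	Let $L = \{\, c^k x\, c^k y\, c^k z \mid k \ge 0 \,\}$ and suppose
	$L$ were context-free with pumping length $p$.  Take
	$s = c^p x\, c^p y\, c^p z \in L$ and write
	$s = \alpha\beta\gamma\delta\varepsilon$ with
	$|\beta\gamma\delta| \le p$ and $|\beta\delta| \ge 1$.  Since each
	block $c^p$ has length $p$, the window $\beta\gamma\delta$ can
	overlap at most two of the three blocks.  If $\beta$ or $\delta$
	contains $x$ or $y$, pumping duplicates a marker; otherwise pumping
	changes the length of at most two blocks, leaving the three counts
	unequal.  Either way
	$\alpha\beta^2\gamma\delta^2\varepsilon \notin L$, so $L$ is not
	context-free.)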

Thus, it is impossible to parse this language using modern techniques
unless we resort to hacks.  Suggested hacks include adding a preprocessor
specifically to deal with off-side rule or altering the lexer to count
spaces.

I am amazed that people ignore all the work that has been done on
parsing.  The formalization of syntax is considered by most people
to be a significant accomplishment: the formalism is useful to programmers
using the languages and compiler writers alike.

IMHO, people designing concrete syntax for languages should check that 
the concrete syntax is LALR(1), which ensures that mechanical tools for parser
generation can be used if they exist for whatever language you are
writing your compiler in.  This is quite simple to do: all you need to do
is write a grammar which runs through the Unix(tm) Yacc tool without any
ambiguities.
   
   David Tarditi
   tarditi@cs.cmu.edu
--

   David Tarditi                School of Computer Science
   tarditi@cs.cmu.edu           Carnegie Mellon University
	                        Pittsburgh, PA 15213

	                              

eerke@cs.kun.nl (Eerke Boiten) (01/23/91)

In article <DTARDITI.91Jan22185718@LAMBDA.ERGO.CS.CMU.EDU> dtarditi@CS.CMU.EDU (David Tarditi) writes:
>A significant argument against the off-side rule is that it makes 
>languages difficult to parse.  The modern approach to parsing is to define
>the concrete syntax of a language using BNF and regular expressions.
>The BNF for the language should be defined so that the language can be
>parsed using an LALR(1) parser. [..]

This is only important for *prototyping* implementations of functional
languages in environments where no other parser generators than YACC are
available. Anyone who implemented a functional language, care to comment?

>The concrete syntax of languages using the off-side rule cannot be defined
>using context-free languages or regular expressions. The reason, simply put,
>is that these languages possess only a limited ability to "count".
>Consider the following language (where c^k denotes k occurrences of
>character c):
>       { s | there exists k such that s= c^kxc^kyc^kz}
>This, I argue, is a simple example of a program using the off-side
>rule. [c=space] I claim that this language is not context-free. [..]
>Thus, it is impossible to parse this language using modern techniques
>unless we resort to hacks.  Suggested hacks include adding a preprocessor
>specifically to deal with off-side rule or altering the lexer to count
>spaces.

Two comments:
1. The language above *is* context-free for limited k (let's say 80, for
instance - >80 chars per line is bad programming style, certainly in nice
succinct functional languages)
2. Most programming languages are not *context-free* anyway
(identifiers!). The most obvious solution for off-side rule
languages has already been given (and is no less of a hack than symbol
tables are): count the spaces in the lexer.
The only real problem I can imagine with such languages is with programs that
generate code in them.

>IMHO, people designing concrete syntax for languages should check that 
>the concrete syntax is LALR(1) [..] This is quite simple to do:
>all you need to do is write a grammar which runs through the Unix(tm)
>Yacc tool without any ambiguities.

Testing for LALR(1) is simple: either it works and you're finished, or
your grammar is ambiguous and you have to rewrite it anyway, or it isn't
LALR(1) and not ambiguous either, and you're in trouble. Even if you know
"all" about LALR(1), you still have to have an enormous amount of
specific intuition to be able to eliminate LALR(1) problems.

(Quoted from earlier on:)
>I am amazed that people ignore all the work that has been done on parsing.

Me too. LALR(1) isn't exactly the hottest thing in parsing, is it?

--
Eerke Boiten
Department of Informatics (STOP Project), K.U.Nijmegen
Toernooiveld, 6525 ED Nijmegen, The Netherlands
Tel. +31-80-652236.	Email: eerke@cs.kun.nl

bothner@sevenlayer.cs.wisc.edu (Per Bothner) (01/24/91)

>All this to save a few characters?

The point, I think, is not to save a few characters,
but to save a few lines (and saving lines is useful,
because it allows you to get more of a program
on the screen).

Consider C indentation rules:
	if (test) {
	    statement1;
	    statement2;
	}

Note the final line. This is needed for the final delimiter,
but contains no useful information. It wastes cathode rays/paper.
It puts extra white space in a useless and ugly place.

You could use a Lisp-style of formatting, where the final
delimiter is on the last "real" line:
	if (test) {
	    statement1;
	    statement2;}
But now the last statement is different from the other(s),
which is esthetically ugly and makes updating error-prone.

I find it hard to take seriously people's concerns about
tabs. A program with tabs whose intervals are not every
8th character is already non-portable (it will look a mess
when viewed on a normal system), and it is in any case
an easy problem to fix in your editor, in your
communication software, or with an explicit filter.

So, I believe it may be reasonable to make syntactic use of
indentation. The specifics of Haskell's particular off-side
rule is a different issue. I suggest that the rule
should be relative indentation of lines, not syntactic units.
That is, I would prefer something like:
	if (test)
	    statement1
	    statement2
instead of:
	if (test) statement1
	          statement2
-- 
	--Per Bothner
bothner@cs.wisc.edu Computer Sciences Dept, U. of Wisconsin-Madison

brian@comp.vuw.ac.nz (Brian Boutel) (01/24/91)

A number of people have criticised layout rules. For example,

In article <DTARDITI.91Jan22185718@LAMBDA.ERGO.CS.CMU.EDU>,
dtarditi@CS.CMU.EDU (David Tarditi) writes:
|> 
|> I am astonished to find people defending the off-side rule.  Parsing
|> is a solved problem.  Why do designers of semantically "clean"
|> languages
|> insist on botching syntax ?
|> 
|> We could argue forever about whether making white-space significant 
|> is in good taste.  Suffice it to say that most real programming
|> languages
|> in widespread use do not make white-space significant.  Maybe they
|> just
|> have not seen the light of "conciseness and legibility", or perhaps
|> from previous experience they wanted to get rid of endless, annoying
|> bugs which are difficult to find.   It is bad enough having to match
|> up
|> delimiters in long programs without having to worry about
|> indentation.
|>

Just a reminder that Haskell is *defined* without reference to any
offside rule. Grouping uses { and }, and ; is used for separation. 

It is entirely practicable and reasonable to write Haskell programs this way.
But some people like the "mathematical" appearance of programs written
in languages with layout conventions, and they like the uncluttered
appearance of programs without lots of explicit delimiters. For this
reason, the Haskell Report includes an algorithm for converting programs
in this style to the Haskell of the definition. 

I personally think that the delimited style should always be used for
program transfer, but the Haskell report provides for portable programs
in layout style by defining the tab conventions that should be assumed
when calculating indentations.


 
|> A significant argument against the off-side rule is that it makes 
|> languages difficult to parse.  The modern approach to parsing is to define
|> the concrete syntax of a language using BNF and regular expressions.
|> The BNF for the language should be defined so that the language can be
|> parsed using an LALR(1) parser.  This allows the compiler writer to use
|> software tools which generate efficient parsers and lexers, and get on with
|> the more important part of the work (generating correct, efficient code).
|> The Lex and Yacc Unix(tm) tools exemplify the approach.
|> 

And these tools are, of course, used in implementing Haskell.


|> The concrete syntax of languages using the off-side rule cannot be defined
|> using context-free languages or regular expressions. The reason, simply put,
|> is that these languages possess only a limited ability to "count".
|> Consider the following language (where c^k denotes k occurrences of
|> character c):
|> 
|>        { s | there exists k such that s= c^kxc^kyc^kz}
|> 
|> This, I argue, is a simple example of a program using the off-side
|> rule.  Let c be a space character.  Then we have k spaces, followed
|> by some phrase, followed by k spaces, etc.  I claim that this language
|> is not context-free. I don't offer a proof of this here, but an application
|> of the pumping lemma for context-free languages should suffice.
|> 
|> Thus, it is impossible to parse this language using modern techniques
|> unless we resort to hacks.  Suggested hacks include adding a preprocessor
|> specifically to deal with off-side rule or altering the lexer to count
|> spaces.
|> 

Even such common language features as requiring declarations to appear
before uses make languages not context-free, and if dealing with this by
means of symbol table processing is "resorting to hacks", then most
language processors are in trouble.




|> I am amazed that people ignore all the work that has been done on
|> parsing.  The formalization of syntax is considered by most people
|> to be a significant accomplishment: the formalism is useful to programmers
|> using the languages and compiler writers alike.
|> 
|> IMHO, people designing concrete syntax for languages should check that 
|> the concrete syntax is LALR(1), which ensures that mechanical tools
|> for parser
|> generation can be used if they exist for whatever language you are
|> writing your compiler in.  This is quite simple to do: all you need
|> to do
|> is write a grammar which runs through the Unix(tm) Yacc tool without
|> any
|> ambiguities.
|>    

There is a big difference between "ignoring all the work that has been
done"  and using the tools plus other methods when the tools do not go
far enough. 

Let me give another example from Haskell. Most people give ambiguous
grammars to Yacc and provide additional disambiguating information, for
example, associativity and precedence info for infix operators. A
problem with this is that you cannot in Yacc specify operators of
different associativities at the same precedence, and yet this may be a
necessary feature (e.g. when the operators are defined by different
people in separate modules and imported into the same module).

Haskell defines what is legal and how it is to be parsed by requiring
that infix expressions be fully parenthesised except in a list of cases
where normal precedence and associativity considerations make this
unnecessary. Now this is still context free, but since the rules cannot
be expressed using the Yacc %left and %right syntax, the grammar needed
to describe it is very large. 

So what should be done? 
a) Change the language? 
b) Live with the inefficiencies of the big parser that is generated by
the big grammar.
c) Recognise that Yacc is not a good way to solve this particular
problem and use a mixed strategy of LALR(1) grammar/Yacc for the rest of
the language and a precedence grammar/parser for infix expressions? 

The art is to find the right tools for the task at hand, not to reject
all tasks for which your favourite tools are not ideal.

--brian
-- 
Internet: brian@comp.vuw.ac.nz
Postal: Brian Boutel, Computer Science Dept, Victoria University of Wellington,
        PO Box 600, Wellington, New Zealand
Phone: +64 4 721000

jrk@information-systems.east-anglia.ac.uk (Richard Kennaway CMP RA) (01/24/91)

In <2676@wn1.sci.kun.nl> eerke@cs.kun.nl (Eerke Boiten) writes:

>In article <DTARDITI.91Jan22185718@LAMBDA.ERGO.CS.CMU.EDU> dtarditi@CS.CMU.EDU (David Tarditi) writes:
>>I am amazed that people ignore all the work that has been done on parsing.

>Me too. LALR(1) isn't exactly the hottest thing in parsing, is it?

On the other hand, YACC is there.  Is any comparable tool readily
available that embodies more up-to-date parsing technology?

--
Richard Kennaway          SYS, University of East Anglia, Norwich, U.K.
Internet:  jrk@sys.uea.ac.uk		uucp:  ...mcsun!ukc!uea-sys!jrk

"Gentlemen do not discuss concrete syntax."  -- Stefan Sokolowski

kh@cs.glasgow.ac.uk (Kevin Hammond) (01/24/91)

In article <2676@wn1.sci.kun.nl> eerke@cs.kun.nl (Eerke Boiten) writes:
>In article <DTARDITI.91Jan22185718@LAMBDA.ERGO.CS.CMU.EDU> dtarditi@CS.CMU.EDU (David Tarditi) writes:
>>A significant argument against the off-side rule is that it makes 
>>languages difficult to parse.  The modern approach to parsing is to define
>>the concrete syntax of a language using BNF and regular expressions.
>>The BNF for the language should be defined so that the language can be
>>parsed using an LALR(1) parser. [..]
>
>This is only important for *prototyping* implementations of functional
>languages in environments where no other parser generators than YACC are
>available. Anyone who implemented a functional language, care to comment?

For what it's worth, the Haskell off-side rule requires very few
modifications to a yacc parser (one extra rule, I think).  This
assumes that the lexer is counting spaces.  The lexer is rather more
complicated, but can still be written using lex/C without too much
trouble.  The only real problems are keeping track of space in strings
(which may extend across lines in Haskell) and remembering to
adjust the space count when backtracking.

Kevin

-- 
This Signature Intentionally Left Blank.	E-mail: kh@cs.glasgow.ac.uk