[comp.lang.misc] Indentation for parsing

norvell@utcsri.UUCP (Theodore Stevens Norvell) (04/28/88)

In article <3925@killer.UUCP> elg@killer.UUCP (Eric Green) writes:
[By the way, are you the Eric Green who used to live in Halifax?]
>I have used a language that uses indentation as its block denotation (Promal).
>It gets ugly, folks. That, instead of the formalism objections, should be what
>[...] Free-form languages simply offer too many advantages to go back to
>fixed-format languages (COBOL!!! GARGH!).

David Gast writes
>   It might also be possible, to use the Grammar Rule Notation, a la
>   Prolog to describe the indentation levels.  I do not know (because
>   I have not really considered it). The GNR is more powerful than
>   a CFG.  An extra argument to the grammar rules could indicate the
>   level of indentation.
>

The discussion about using indentation to indicate program structure is
interesting to me as I have designed a language which does exactly this
and written a compiler for it.  My feeling is that the parser should
read programs (approximately) the same way that people do.  In particular
the most important information used by people to understand the control
structure of programs -- namely the indentation structure -- should
not be simply thrown out by the compiler.

Many of the objections raised by people turn out not to be serious.
First, the lack of formal notation is not serious. David Gast is
right, a simple Definite Clause Grammar captures it in a fairly
readable way.

Let me give an example.  I am assuming that tokens are records with at
least two fields, one to identify the kind of token and one with the
column that the first character of the token appeared in. The
end-of-file token always starts in column 0.  A special nonterminal
called lookahead(T) recognizes the empty string iff T is the next
token in the input string.

statement(Indentation) --> [token(name, Indentation)],
			   [token( := , _)],
			   expression.
statement(Indentation) --> [token(while, Indentation)],
			   expression,
			   block(Indentation).
statement(Indentation) --> [token(if, Indentation)],
			   expression,
			   block(Indentation),
			   moreif(Indentation).

moreif(Indentation) --> [token(else, Indentation)],
			block(Indentation).
moreif(Indentation) --> [token(elsif, Indentation)],
			expression,
			block(Indentation),
			moreif(Indentation).
moreif(Indentation) --> lookahead(token(_, Ind)), {Ind <= Indentation}.

block(Indentation) --> statement(Ind), {Ind > Indentation},
		       block(Indentation).
block(Indentation) --> lookahead(token(_, Ind)), {Ind <= Indentation}.

program --> block(0), [token(end-of-file,0)].

This says (i) that each statement in the body of a while loop must be
indented more than the word "while", (ii) that each statement in an
alternative of an if statement must be indented more than the
word "if", and (iii) that the words "else" and "elsif" must come directly
beneath the corresponding "if".  (The last restriction is stronger than
need be, but solves the else attachment problem cleanly and provides some
redundancy required for error-detection/recovery).

It is not quite a "free form" language, but it allows any style at all
which would be considered "readable" by most people.  It allows statements
to be split across lines; it doesn't need statement separators or
terminators;  it allows more than one statement per line.

It is no burden to the lexer to produce the column number of each token,
as it should do so anyway for accurate error reporting.

It is easy to translate such a grammar into a recursive decent parser
or an explicit-stack top-down parser.  It is probably not hard to
translate it into a table driven top-down parser (akin to LL(k)).
It is likely that there are bottom-up techniques too, but I haven't
thought about it.

The point of all this, is that many of the objections to languages
with significant indentation raised are due to misconceptions
(misconceptions that are understandable, given that most people
haven't used such languages or have used languages, like Promal,
with lots of other semi-related restrictions.)

There are still problems with such a language.  One is readability:
It requires a lot of indentation to make it easy, e.g.,  to match
else's to if's.  Another is writablity: you must be careful not
to accidentally over-indent the lines which (you think) follow an if
or while statement, or they won't.  A third is error recovery:
but that is always tricky.  A solution to all of these is to
make the language less free form by insisting that all 
statements in the same block start at the same indentation level.

Finally, such specification and parsing techniques are useful even
in a compiler (or lint) for a "conventional" free form language.
They can be used to warn the programmer that his indentation
structure does not match his syntactic structure and they can
be used to guide error recovery.

Theodore Norvell (poor earnest academician)
BITNET,CSNET:   norvell@csri.toronto.edu
CDNNET:         norvell@csri.toronto.cdn
UUCP:           {alegra,cornell,decvax,linus,utzoo}!utcsri!norvell

nevin1@ihlpf.ATT.COM (00704a-Liber) (05/04/88)

In article <5992@utcsri.UUCP> norvell@utcsri.UUCP (Theodore Stevens Norvell) writes:

>The discussion about using indentation to indicate program structure is
>interesting to me as I have designed a language which does exactly this
>and written a compiler for it.  My feeling is that the parser should
>read programs (approximately) the same way that people do.  In particular
>the most important information used by people to understand the control
>structure of programs -- namely the indentation structure -- should
>not be simply thrown out by the compiler.

This kind of stuff is better enforced by language-sensitive editors and
pretty-printers, not compilers.  I don't mind if there is some lint-type
program that gives me warnings if I make an indentation mistake, but I
don't want the compiler choking on it.  It is a hard job to define what
'readability' is, and I really don't want to slow my compiler down by
putting some unnecessary restrictions on indentation.  And without
language-sensitive editors and/or pretty-printers, it is very hard always
getting the indentation right.
-- 
 _ __			NEVIN J. LIBER	..!ihnp4!ihlpf!nevin1	(312) 510-6194
' )  )				"The secret compartment of my ring I fill
 /  / _ , __o  ____		 with an Underdog super-energy pill."
/  (_</_\/ <__/ / <_	These are solely MY opinions, not AT&T's, blah blah blah

carroll@snail.CS.UIUC.EDU (05/05/88)

/* Written  6:44 pm  May  3, 1988 by nevin1@ihlpf.ATT.COM in snail:comp.lang.misc */
/* ---------- "Re: Indentation for parsing (was Re" ---------- */
( ... )  And without
language-sensitive editors and/or pretty-printers, it is very hard always
getting the indentation right.
 _ __			NEVIN J. LIBER	..!ihnp4!ihlpf!nevin1	(312) 510-6194
/* End of text from snail:comp.lang.misc */

	Interesting. I have no problem getting my indentation correct
all the time, and I use standard vi, which is not language sensitive.
I wouldn't like an indentation driven compiler, however, because that
would enforce someone else's style on me, and readability is in the
eye of the reader.

Alan M. Carroll		amc@woodshop.cs.uiuc.edu	carroll@s.cs.uiuc.edu
Grad Student (TA) / U of Ill - Urbana ...{ihnp4,convex}!uiucdcs!woodshop!amc
	"Too many fools who don't think twice
	 Too many ways to pay the price"  - AP & EW

dg@lakart.UUCP (David Goodenough) (05/07/88)

From article <4625@ihlpf.ATT.COM>, by nevin1@ihlpf.ATT.COM (00704a-Liber):
> In article <5992@utcsri.UUCP> norvell@utcsri.UUCP (Theodore Stevens Norvell) writes:
>>The discussion about using indentation to indicate program structure is
>>interesting to me as I have designed a language which does exactly this
>>and written a compiler for it. .....
> 
> This kind of stuff is better enforced by language-sensitive editors and
> pretty-printers, not compilers. .....

And then there's always OCCAM

> ..... And without
> language-sensitive editors and/or pretty-printers, it is very hard always
> getting the indentation right.

This last comment almost deserves a flame. What is needed to produce
correctly indented source code with any editor (be it vi, VEDIT or whatever)
is *_DISCIPLINE_*. In 1980 at university, I decided on a coding style:

	if (test)
	{
	    statement;
	    statement;
	    for (xyz; abc; foo)
		statement;
	}

And I stick to it militantly - 8 years worth of it. It is now so ingrained
that I do it as a reflex. ALWAYS produces readable code, and I can do a

'egrep "[{}]" \!*'

to look for lines that contain {s and }s, and spot mismatced {}s in a twinkle.
This is one of my objections to the other style:

	if (test) {
	    statement; .....

It is too easy to mismatch, as you have to hunt for the {

I've got my asbestos suit ready for the inevitable flames :-) :-) :-)
-- 
	dg@lakart.UUCP - David Goodenough		+---+
							| +-+-+
	....... !harvard!adelie!cfisun!lakart!dg	+-+-+ |
						  	  +---+

nevin1@ihlpf.ATT.COM (00704a-Liber) (05/24/88)

In article <94@lakart.UUCP> dg@lakart.UUCP (David Goodenough) writes:

>This last comment almost deserves a flame. What is needed to produce
>correctly indented source code with any editor (be it vi, VEDIT or whatever)
>is *_DISCIPLINE_*.

[stuff deleted about Mr. Goodenough's coding style]

I agree that, for *new*, *original* source code, discipline is adequate
for producing correctly indented source code.

However, much of my job entails maintaining old code where I was not the
original programmer.  It is in this code where a smart editor would be
valuable.  I would like to be able to look at another person's code with my
indentation style, make the changes I need to make, and save it in such a
way so that if someone else looks at the source it appears to them in their
style.  All too often unreadability can be attributed to too many people
modifying the same pieces of code using different styles (although if a
routine has to be changed many times it probably should be rewritten, but
that is a different issue).  It should be possible to separate things like
indentation style from coding.

Another thing:  Suppose you have a little routine which you decide to put
inside an 'if' clause.  Usually this would require indenting each line
manually.  With a smart editor, this should happen automatically.
-- 
 _ __			NEVIN J. LIBER	..!ihnp4!ihlpf!nevin1	(312) 510-6194
' )  )				"The secret compartment of my ring I fill
 /  / _ , __o  ____		 with an Underdog super-energy pill."
/  (_</_\/ <__/ / <_	These are solely MY opinions, not AT&T's, blah blah blah