mcdaniel@adi.com (Tim McDaniel) (08/29/90)
The idea of "low-rent syntax" (avoiding semicolons and other "noise tokens") sounds appealing, but it turns out to be difficult in practice. I'm having trouble working out some issues.

I'm designing a simple language, much like a simple shell. The language has assignment statements and the datatype "list of strings". The syntax I'd like is, e.g.,
    a = 3
    b = 1 2 (a) 4 5
"a" has a one-element value, "3", while "b" has 5 strings in its value: "1", "2", "3", "4", "5". Whitespace separates words, as in REXX or a UNIX shell.

I'd like to provide string concatenation. The syntax I like is just abutment without whitespace in between: in
    c = 12(a)45
"c"'s value would be a one-element list, with "12345" as the only element. The Bourne-shell analogue is
    c=12${a}45

But there are other syntactic structures in the language, and I'd like to use a lex- or flex-like lexer with a bison-like grammar. If I have just the terminals EOL (end of line), LPAREN, RPAREN, ASSIGN, and TEXT, I can't distinguish
    c = 12(a)45
from
    c = 12 (a) 45
because the lexer would return
    TEXT ASSIGN TEXT LPAREN TEXT RPAREN TEXT EOL
in both cases.

Here are my ideas:

- The lexer returns WS for whitespace. However, my grammar would get WS cropping up all over, as in
      null ::=
      ws ::= null | ws WS
      word ::= TEXT | LPAREN TEXT RPAREN
      list ::= word | list ws word
      assignment ::= TEXT ws ASSIGN ws list ws EOL
  This seems ugly.

- The lexer returns CONCAT as the "implicit concatenation" operator. If the previous token and the next one are not separated by whitespace, return CONCAT as the current token, and return the next source token at the next call instead. This seems kludgy. A problem is that the lexer may not be able to tell if the next token would be whitespace. In the statement
      a = b//comment stuff
  after returning TEXT for "b", the lexer can only see "/" -- or can a flex lexer have more lookahead? It doesn't know that it's a start-of-comment. One workaround is to force a comment-start to be surrounded by whitespace:
      a = b // comment stuff
  which is better-looking anyway.

- The user has to explicitly enter a concatenation operator. Unfortunately, that makes one more character "special", so it has to be quoted. It also clutters the statement:
      c = 12:(a):45
  or
      c = 12 : (a) : 45
  looks messier.

- Using a hand-coded lexer or parser is not an option in my shop, alas.

Another problem I've had with low-rent syntax is how to tell the lexer/parser to continue a source line. Two approaches:

- Use "&" or "," or some other character as a "continue this line" indicator:
      a = 1 2 3 4 5 &      // comment text
          6 7 8 9 10 11
  But that makes another character special, and I'd like to avoid that, because I'd like to keep "&" and "," available for future use.

- "\" followed by newline is removed, as in C. However, what does this mean?
      a = 1 2 3 \// comment
  Or what about
      a = 1 2 3 // comment\
  Does it continue the line? Does it continue the comment? (The latter is gross:
      a = 1 2 3 4 // comment\
      5 6 7 8
  would silently comment out the second line!) Don't allow a line with a line comment to be continued?

Any other ideas? Which looks best? Low-rent syntax is a nice idea, but it's got some subtle problems in certain cases, no?
--
Tim McDaniel
Applied Dynamics Int'l.; Ann Arbor, Michigan, USA
Work phone: +1 313 973 1300    Home phone: +1 313 677 4386
Internet: mcdaniel@adi.com    UUCP: {uunet,sharkey}!amara!mcdaniel
--
Send compilers articles to compilers@esegue.segue.boston.ma.us
{ima | spdcc | world}!esegue. Meta-mail to compilers-request@esegue.
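The CONCAT idea from the post above can be tried out in a few lines. What follows is only a minimal sketch in Python, not a flex specification: the token pattern, the `tokenize` function, and the rule that a word may end in TEXT or RPAREN and begin with TEXT or LPAREN are all assumptions made for illustration. It ignores comments entirely, so it does not address the lookahead question raised for "a = b//comment stuff".

```python
import re

# Hypothetical token pattern: TEXT is any run of characters that is not
# whitespace, a parenthesis, or "=".
TOKEN_RE = re.compile(
    r"(?P<WS>[ \t]+)"
    r"|(?P<LPAREN>\()"
    r"|(?P<RPAREN>\))"
    r"|(?P<ASSIGN>=)"
    r"|(?P<TEXT>[^\s()=]+)"
)

def tokenize(line):
    """Return token kinds for one line, inserting CONCAT between abutting words."""
    tokens = []
    prev = None  # kind of the previous match, including WS
    for m in TOKEN_RE.finditer(line):
        kind = m.lastgroup
        if kind != "WS":
            # Assumption: a word ends with TEXT or RPAREN and starts with
            # TEXT or LPAREN; abutting words get an implicit CONCAT.
            if prev in ("TEXT", "RPAREN") and kind in ("TEXT", "LPAREN"):
                tokens.append("CONCAT")
            tokens.append(kind)
        prev = kind
    tokens.append("EOL")
    return tokens

print(" ".join(tokenize("c = 12(a)45")))
print(" ".join(tokenize("c = 12 (a) 45")))
```

With this sketch the two statements no longer produce the same token stream: the abutted form gains two CONCAT tokens, while the spaced form yields the original TEXT ASSIGN TEXT LPAREN TEXT RPAREN TEXT EOL sequence.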
adamsf@turing.cs.rpi.edu (Frank Adams) (08/31/90)
In article <MCDANIEL.90Aug28144647@dolphin.adi.com> mcdaniel@adi.com (Tim McDaniel) writes:
>Another problem I've had with low-rent syntax is how to tell the
>lexer/parser to continue a source line.
>
>...
>
>- "\" followed by newline is removed, as in C.  However what does this
>  mean?
>      a = 1 2 3 \// comment
>  ? Or what about
>      a = 1 2 3 // comment\

I have thought about this problem. My conclusion is that the continuation character should go on the *next* line (like FORTRAN!). You can choose between putting the continuation character as the first character of the line, or as the first non-blank character.

And, yes,
    a = 1 2 3 // comment
    \ 4 5 6
would continue the statement, not the comment.
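A minimal sketch of the next-line continuation rule, in Python, may make the behaviour concrete. The function name `join_continued` and its parameters are hypothetical, the "first non-blank character" variant is the one chosen, and the comment stripping is deliberately naive (it assumes "//" never appears inside a word or quoted string).

```python
def join_continued(lines, comment="//", cont="\\"):
    """Join physical lines into statements.

    A line whose first non-blank character is `cont` continues the
    previous statement (the FORTRAN-like rule described above).
    """
    statements = []
    for raw in lines:
        # Drop the line comment first, so a comment can never swallow
        # or be confused with a continuation marker.
        code = raw.split(comment, 1)[0].rstrip()
        stripped = code.lstrip()
        if stripped.startswith(cont) and statements:
            # Continuation: append the rest of this line to the statement
            # already in progress.
            statements[-1] += " " + stripped[len(cont):].strip()
        else:
            statements.append(code)
    return statements

source = [
    "a = 1 2 3 // comment",
    "\\ 4 5 6",
    "b = 7",
]
print(join_continued(source))
```

Because the marker sits at the start of the following line, the comment on the first line has already ended by the time the continuation is seen, so the question of whether the comment or the statement is being continued does not arise.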
ok@goanna.cs.rmit.OZ.AU (Richard A. O'Keefe) (08/31/90)
In article <MCDANIEL.90Aug28144647@dolphin.adi.com>, mcdaniel@adi.com (Tim McDaniel) writes:
> But there are other syntactic structures in the language, and I'd like
> to use a lex- or flex-like lexer with a bison-like grammar.  If I have
> just the terminals EOL (end of line), LPAREN, RPAREN, ASSIGN, and
> TEXT, I can't distinguish
>     c = 12(a)45
> from
>     c = 12 (a) 45
> because the lexer would return
>     TEXT ASSIGN TEXT LPAREN TEXT RPAREN TEXT EOL
> in both cases.

Frankly, I think this is more than somewhat ugly. Having used SNOBOL, it is obvious to me that "c = 12 (a) 45" is concatenating the strings "12", the value of a, and "45". AWK uses the SNOBOL convention here; try the AWK program
    BEGIN { a = "--" }
    END   { print 12 (a) 45 }
so a lot of UNIX hackers may be very surprised by your syntax.

If you want to distinguish "12(a)45" from "12 (a) 45", surely the simplest way is to make the brackets different:
    /{LAYOUT}(/    --> LEFT_PLAIN
    /){LAYOUT}/    --> RIGHT_PLAIN
    /(/            --> LEFT_CONCAT
    /)/            --> RIGHT_CONCAT
I haven't bothered to check how exactly you would say this in Lex, but the "longest-match" rule would make it work. You would hallucinate a newline at the beginning of the file and treat newline as layout, so that
    (c) = 12
would be tokenised as
    LEFT_PLAIN TEXT RIGHT_PLAIN ASSIGN TEXT EOL

> ? Does it continue the line?  Does it continue the comment?
> (The latter is gross:
>     a = 1 2 3 4 // comment\
>     5 6 7 8
> would silently comment out the second line!)

The Bourne shell equivalent of this (use # instead of //) _neither_ continues the line _nor_ continues the comment (the \ is swallowed by the comment). The ANSI C rule about \<newline> is that those two characters disappear very early in processing; my interpretation is that if // were added to ANSI C it would have to comment out the next line when used like this.

Surely the simplest rule would be to make your end-of-line comment marker a single character (and it would be consistent with most UNIX tools if that character were '#') and then you could easily handle \#<comment><newline> as if it were \<newline>.
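The four bracket tokens can be prototyped without Lex. The following is a rough Python sketch under stated assumptions: `tokenize` and the `LAYOUT` and `SPECIAL` constants are hypothetical names, and instead of consuming the neighbouring layout as the patterns above would, it merely peeks at the character before "(" and after ")", which keeps the final newline available as EOL. A newline is pretended at the start of the line, as suggested above.

```python
LAYOUT = " \t\n"
SPECIAL = LAYOUT + "()="

def tokenize(line):
    """Tokenise one line, distinguishing plain brackets from concatenating ones."""
    # Pretend there is a newline before the line, and make sure one ends it.
    text = "\n" + line.rstrip("\n") + "\n"
    tokens = []
    i = 1
    while i < len(text):
        ch = text[i]
        if ch == "(":
            # "(" preceded by layout starts a separate word.
            tokens.append("LEFT_PLAIN" if text[i - 1] in LAYOUT else "LEFT_CONCAT")
            i += 1
        elif ch == ")":
            # ")" followed by layout ends a separate word.
            tokens.append("RIGHT_PLAIN" if text[i + 1] in LAYOUT else "RIGHT_CONCAT")
            i += 1
        elif ch == "=":
            tokens.append("ASSIGN")
            i += 1
        elif ch == "\n":
            tokens.append("EOL")
            i += 1
        elif ch in " \t":
            i += 1  # layout between tokens is skipped
        else:
            # TEXT: a run of characters up to the next layout or special character.
            while text[i] not in SPECIAL:
                i += 1
            tokens.append("TEXT")
    return tokens

print(" ".join(tokenize("(c) = 12")))
print(" ".join(tokenize("c = 12(a)45")))
print(" ".join(tokenize("c = 12 (a) 45")))
```

The grammar can then treat LEFT_CONCAT and RIGHT_CONCAT as the places where abutment implies concatenation, without any whitespace tokens reaching the parser.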
brnstnd@kramden.acf.nyu.edu (Dan Bernstein) (09/04/90)
In article <=M~%JG&@rpi.edu> adamsf@turing.cs.rpi.edu (Frank Adams) writes:
> In article <MCDANIEL.90Aug28144647@dolphin.adi.com> mcdaniel@adi.com (Tim McDaniel) writes:
[ // comments to end of line; newline used for statement terminator: ]
[ should continuation character be before comments or before newline? ]
[ put continuation character after newline, as in Fortran ]

Better is TeX's intuitive solution. The comment marker always joins the lines around it. You don't need another continuation character.

---Dan
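One possible reading of the TeX-style rule, sketched in Python: a comment marker discards the rest of its line, including the newline, and leading blanks on the following line are skipped (as TeX does after "%"). The function name `join_at_comments` and the choice of "//" as the marker are assumptions for illustration, and the comment detection is naive about quoting.

```python
def join_at_comments(lines, comment="//"):
    """Join physical lines into statements, TeX style.

    A line containing the comment marker is glued to the line after it;
    a line without a comment ends the statement.
    """
    statements = []
    pending = ""  # text carried over from lines that ended in a comment
    for raw in lines:
        head, marker, _ = raw.partition(comment)
        if pending:
            # Leading blanks on a joined line are skipped.
            head = head.lstrip()
        pending += head
        if not marker:
            # No comment on this line, so its newline ends the statement.
            statements.append(pending.rstrip())
            pending = ""
    if pending:
        statements.append(pending.rstrip())
    return statements

source = [
    "a = 1 2 3 4 // comment",
    "    5 6 7 8",
    "b = 9",
]
print(join_at_comments(source))
```

Under this reading, a line with a trailing comment can never be the last line of a statement, which is the price paid for not reserving a separate continuation character.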