[comp.lang.misc] Concrete Syntax in ML

csnjr@its63b.ed.ac.uk (Nick Rothwell) (05/10/88)

From article <7756@mcdchg.UUCP>, by mauny@inria.UUCP (Michel Mauny):
> CAML, a functional programmimg language of the ML family developed
> at INRIA, is now available on SUN-3 and VAX11 under Unix BSD 4.2 and 4.3.
> CAML (an acronym for Categorical Abstract Machine Language) is close
> in functionality to Standard ML (although release 2.5 does not have modules
> yet). It differs mostly in its syntax (closer to the syntax of LCF ML), and
> in its ability to manipulate concrete syntaxes through an interface with Yacc.

Actually, we're working on a parser generator interface for Standard ML as
well, complete with antiquotation (concrete syntaxes) and so on. Our system
doesn't rely on yacc syntax, and doesn't generate a parser in source form
(I don't recall how the CAML parser is generated and loaded). Instead, the
grammar is presented as a data structure (in fact, as an environment with
the various types for terminals, nonterminals and the associated actions),
and the parser generator returns a new environment containing a parsing
function.
   Writing grammars in this way is a little tedious, but the whole idea is
that a simple grammar is written to generate a parser which understands
grammars. Then, full grammars can be bootstrapped into this system by
antiquoting. The strength of the system is that grammars can be quite
flexible, since they are data structures which can be transformed into the
form that a LR parser generator wants. I am (at this very moment) putting the
finishing touches to an interface which lets you specify productions
containing regular expressions. Coupled with the antiquoting mechanism (not
yet implemented) you can say things rather like:

	structure Grammar =
	   struct
	      ...
	      val Gram = << StartSymbol -> Expr
	                    Expr        -> "nil"
	                                 | Expr "::" Expr
	                                 | "[" (Expr / ",")* "]" {Sexy, eh?}
	                                 | ...
	                 >>
	   end

	structure Parser = ParserGen(structure G = Grammar)

	fun parse() = Parser.parse(lexer);

There's a *lot* of polishing still to do, but the basic mechanisms work.

		Nick.
-- 
Nick Rothwell,	Laboratory for Foundations of Computer Science, Edinburgh.
		nick%lfcs.ed.ac.uk@nss.cs.ucl.ac.uk
		<Atlantic Ocean>!mcvax!ukc!lfcs!nick
~~ ~~ ~~ ~~ ~~ ~~ ~~ ~~ ~~ ~~ ~~ ~~ ~~ ~~ ~~ ~~ ~~ ~~ ~~ ~~ ~~ ~~ ~~ ~~
...while the builders of the cages sleep with bullets, bars and stone,
they do not see your road to freedom that you build with flesh and bone.

mauny@inria.UUCP (Michel Mauny) (05/11/88)

>Actually, we're working on a parser generator interface for Standard ML as
>well, complete with antiquotation (concrete syntaxes) and so on. Our system
>doesn't rely on yacc syntax, and doesn't generate a parser in source form
>(I don't recall how the CAML parser is generated and loaded). Instead, the
>grammar is presented as a data structure (in fact, as an environment with
>the various types for terminals, nonterminals and the associated actions),
>and the parser generator returns a new environment containing a parsing
>function.

In CAML, the Yacc interface doesn't take care of the parser produced by
yacc: it is only interested in parsing tables which are used by
the CAML parser itself (and parsing tables of CAML's syntax are produced in
this way).

We plan also to give a better syntax to a grammar definition and to
typecheck (statically when possible) grammar definitions.

- What about "lexer" (are they user-definable, hand-written or mechanically
  produced in lex-style)?
- Do you typecheck grammars?
- Do you have a complete antiquotation mechanism (with escapes to the
  meta-language)?

csnjr@its63b.ed.ac.uk (Nick Rothwell) (05/13/88)

From article <689@inria.UUCP>, by mauny@inria.UUCP (Michel Mauny):
> In CAML, the Yacc interface doesn't take care of the parser produced by
> yacc: it is only interested in parsing tables which are used by
> the CAML parser itself (and parsing tables of CAML's syntax are produced in
> this way).

That's quite a nice solution. I aim to have a single parsing "engine" both
for ML and for any concrete quotations. Different instances of this parser
wil be used with different tables and lexers.

> We plan also to give a better syntax to a grammar definition and to
> typecheck (statically when possible) grammar definitions.
> 
> - What about "lexer" (are they user-definable, hand-written or mechanically
>   produced in lex-style)?

I knocked up a lexer generator a couple of days ago. Here (roughly) is the
interface:

signature LEX_SPEC =
   sig
      type LexClass
      type Terminal
      type Value

      val whiteSpace: LexClass
      val follows: LexClass -> LexClass lib.set

      val lexClass: string -> LexClass

      val eof: Terminal
      val VOID: Value

      val makeTerminal: string -> Terminal * Value
   end

You specify some LexClass type, give a function from characters to LexClasses,
and say which LexClasses can "stick" to others (eg, LexClass = LETTER | DIGIT,
"a" is a LETTER, "0" is a DIGIT, and LETTER can be followed by {LETTER,DIGIT}.
What you get back is a (side-effecting) lexing function

	fun lex: (unit -> string) -> (unit -> Terminal * Value)

This takes a string "generator" as argument (which may read from a file,
or whatever), and returns a terminal (and value) generator. This isn't as
powerful as lex, but serves my purposes for the present.

> - Do you typecheck grammars?

Yes, they're ML data structures, so ML typechecks them. The problem with this
is that every production has to have the same type of attribute, so
attribute functions range over some "union" datatype, and tend to have rather
complex arguments. I have ideas for making this easier.

> - Do you have a complete antiquotation mechanism (with escapes to the
>   meta-language)?

We're implementing a "brute-force" approach in the first instance. The
ML lexer will be able to associate any token (such as "<<") with a
parser (of type instream -> string). When it sees a "<<", it calls the
parser on its current input stream, waits to get the text back, and
then carries on lexing from this. Hence, the concrete syntax parser
must eventually return ML source text. I don't see any reason why a
user-generated lexer for the concrete syntax shouldn't be able to do
something similar, and call the ML parser to get back to the
meta-language. My thoughts are a little unclear on this at the moment.

I'll write a report on all of this in a month or two, when my ideas are more
settled, and send you a copy, if you're interested.

		Nick.
-- 
Nick Rothwell,	Laboratory for Foundations of Computer Science, Edinburgh.
		nick%lfcs.ed.ac.uk@nss.cs.ucl.ac.uk
		<Atlantic Ocean>!mcvax!ukc!lfcs!nick
~~ ~~ ~~ ~~ ~~ ~~ ~~ ~~ ~~ ~~ ~~ ~~ ~~ ~~ ~~ ~~ ~~ ~~ ~~ ~~ ~~ ~~ ~~ ~~
...while the builders of the cages sleep with bullets, bars and stone,
they do not see your road to freedom that you build with flesh and bone.