johnl@ima.UUCP (01/26/87)
Is it possible to use YACC to write a parser for a simple language whose gramatic may dynamically change, namely defining new operators? In this interpreter, new operators (unary or binary) can be added by the user dynamically, e.g. define the binary operator ++ with a certain precedence and then the system can recognize a ++ b. I have no experiences with YACC, I see it in the following way: YACC generates some tables which correspond to the grammar. If a new operator is added to the grammar, it should be possible to dynamically update the tables in order to recognize the new item. If this is not possible, is there any other way? --Micha ARPA: micha%ecrcvax.UUCP@seismo.CSS.gov micha%ecrcvax.UUCP%Germany.CSNET@CSNET-RELAY CSNET: micha%ecrcvax.UUCP@Germany.CSNET UUCP: unido!ecrcvax!micha [I doubt it -- the calculation that produces the parsing tables from the input grammar is pretty complicated. If you only allow people to define new operators that have the same precedence as existing ones, your job is pretty easy since you haven't really changed the syntax, just added new flavors of existing tokens from the point of view of the parser. If you get more general than that, you can add ambiguties and you probably need the full extremely slow glory of Earley's algorithm, which is a long way from the efficient LALR approach of yacc. -John] -- Send compilers articles to ima!compilers or, in a pinch, to Levine@YALE.EDU Plausible paths are { ihnp4 | decvax | cbosgd | harvard | yale | cca}!ima Please send responses to the originator of the message -- I cannot forward mail accidentally sent back to compilers. Meta-mail to ima!compilers-request
johnl@ima.UUCP (01/27/87)
It looks hard. One problem is that the tables that Yacc produces are the result of global analysis of the grammar--single entries in the tables aren't direct reflections of single grammar rules. But dynamically adding operators is easier than dynamic grammars in general, since only a small and fairly well-defined set of parser table entries deal with this particular situation. Unfortunately, after Yacc is done constructing the "theoretical" tables, it compresses them to take up much less room. Compression is likely to make things considerably harder. It sounds to me like a very interesting project. I would suggest that you read the appropriate sections of Aho, Sethi, and Ullman's "Compiler Design" (second edition of the Dragon Book) to find out as much as you can about the theory, and then get a copy of the source for the parser driver that Yacc uses (we've got a copy, I can send it to you), so you can see how the tables are actually stored. This should give you some ideas about what is involved in dynamically modifying the parser. Good luck! Please keep me posted. Dale -- Send compilers articles to ima!compilers or, in a pinch, to Levine@YALE.EDU Plausible paths are { ihnp4 | decvax | cbosgd | harvard | yale | cca}!ima Please send responses to the originator of the message -- I cannot forward mail accidentally sent back to compilers. Meta-mail to ima!compilers-request
johnl@ima.UUCP (02/05/87)
For a simple dynamic grammar in YACC: Define operator tokens as follows: %left OL1 %right OR1 %left OL2 %right OR2 %left OL3 %right OR3 etc. The lexer for 'C' already has to lookup names in the symbol table to see if they are typedef_names or identifiers. You will now lookup operator names also to see what class (OL1, etc.) they are currently. The symbol table is loaded initially with standard operators and precedences. The currently defined class is returned as the lexical token for yacc, and the actual operator name is returned as the value (for semantic analysis). For even more generality, the lexer should recognize three character classes: alpha, whitespace and operator. Name tokens are alpha characters seperated by whitespace or operators. Operator strings are sequences of operators seperated by alpha. Operator strings are divided into tokens by finding the largest left substring which is currently in the symbol table (repeat as required). All tokens are looked up in the symbol table to decide if they are a) reserved names (e.g. register) or operators (e.g. '{'); return appropriate token. b) type names; return TYPENAME token. c) identifiers; return IDENTIFIER token. d) operators (both names and operators!); return class as token. The symbol table is preloaded (from a control file, perhaps) with the standard 'C' reserved words, types, and operator hierarchy. The user can #include custom grammars for various applications: complex arithmetic complex polynomial arithmetic string processing etc. Even reserved words can be redefined! The compiler need only hardwire a minimal subset of reserved words. Operators can be bound to functions or 'C-macros' based on operator name and operand types. Standard types are symbolic and used only for binding purposes. C-macros are machine dependent functions expanded inline by the code-generator, all other operators are function calls (or inline expansions). A function call can be replaced by a C-macro in a specific implementation for efficiency purposes. The first cut of a code generator can do everything except function calls with function calls! The various operators can then be added as C-macros as time permits. (Notice that argument passing for function calls needs to recognize types.) A similar approach will allow the user to define his own control structures. The key performance issue here is the symbol table - it has got to be fast, because it gets consulted by the lexer for every token! Unless the programmer excerises some restraint, a language of this nature will result in cryptic code! Although having the lexer check the symbol table for every token is more expensive than a fixed table driven lexer, notice that the preprocessor has to look up every token anyway. By combining the 'C' preprocessor with lexical analysis, you can provide a dynamic grammar and speed up compilation at the same time! Every token would be looked up by the lexer to determine if it is a) a preprocessor symbol b) a typedef name c) a reserved word d) an identifier e) an operator and apropriate tokens passed to YACC. -- Stuart D. Gathman <..!seismo!dgis!bms-at!stuart> [It's certainly true that you can set up a yacc-based parser quite easily that lets you add new tokens that behave syntactically just like other existing tokens using this scheme. I believe that the intent of the original question, though, was to add whole new productions to the grammar on the fly as is possible with a parser using Earley's algorithm. The consensus seems to be that since the yacc grammar tables are constructed only after an extensive global analysis of the grammar, incremental changes are unlikely to be possible without remaking the tables from scratch. -John] -- Send compilers articles to ima!compilers or, in a pinch, to Levine@YALE.EDU Plausible paths are { ihnp4 | decvax | cbosgd | harvard | yale | cca}!ima Please send responses to the originator of the message -- I cannot forward mail accidentally sent back to compilers. Meta-mail to ima!compilers-request