[comp.compilers] YACC, going the other way

elk@cblpn.att.com (Edwin Lewis King +1 614 860 3394) (04/22/91)

I'm interesting in generating strings that are described by a BNF (OK,
YACC) grammar.

Has anyone actually done this before?  Is there a YACC-ish tool
available for such a thing?

Done anyone have a nice idea for how to do this cleanly and
efficiently?

Thanks!

Ed King
elk@cblpn.att.com
-- 
Send compilers articles to compilers@iecc.cambridge.ma.us or
{ima | spdcc | world}!iecc!compilers.  Meta-mail to compilers-request.

zeil@cs.odu.edu (Steven J. Zeil) (04/23/91)

In article <1991Apr23.140427.5416@iecc.cambridge.ma.us> elk@cblpn.att.com (Edwin Lewis King +1 614 860 3394) writes:
>I'm interesting in generating strings that are described by a BNF (OK,
>YACC) grammar.

I have seen references to this as a means of describing and generating
test data.  This method can be applied to many programs with heavily
structured input data, including programs that are not themselves
compilers/translators.

A reference that comes to mind is Duncan & Hutchison, "Using Atributed
Grammars to Test Designs and Implementations", in the 5th International
Conference on Software Engineering.

                                                  Steve Z
-- 
Send compilers articles to compilers@iecc.cambridge.ma.us or
{ima | spdcc | world}!iecc!compilers.  Meta-mail to compilers-request.

carlton@aldebaran.Berkeley.EDU (Mike Carlton) (04/24/91)

In article <1991Apr23.140427.5416@iecc.cambridge.ma.us> elk@cblpn.att.com (Edwin Lewis King +1 614 860 3394) writes:
>I'm interesting in generating strings that are described by a BNF (OK,
>YACC) grammar.

How proficient are you in Prolog?  Prolog is a very nice environment to
do this sort of thing.

I recently took a bnf specification (written in prolog) of an intermediate 
language an modified it to be a generator for the same language.  The 
specification was an executable program which succeeded iff the input was 
in the language given by the bnf.  It was a simple matter to replace a few
low level test rules such as 
	register(r(I)) :- integer(I). 
with generators such as 
	register(r(0)).
	register(r(1)).

With a suitable set of generators, I was able to generate all interesting
intermediate language statements and then run the output through the backend
for testing.

cheers,
--mike
-- 
Mike Carlton	carlton@cs.berkeley.edu
-- 
Send compilers articles to compilers@iecc.cambridge.ma.us or
{ima | spdcc | world}!iecc!compilers.  Meta-mail to compilers-request.

wunder@hpsdel.sde.hp.com (Walter Underwood) (04/25/91)

   I'm interesting in generating strings that are described by a BNF (OK,
   YACC) grammar.

A program to generate random sequences from a BNF grammar is described
in Chapter 17 of "The Icon Programming Language" by Griswold and
Griswold.  An implementation of that comes with the Icon distribution.
It is called "rsg" for Random Sentence Generation.  I've appended the
documentation from rsg.

Icon is available for anonymous FTP from cs.arizona.edu.

wunder

----------
############################################################################
#
#	Name:	rsg.icn
#
#	Title:	Generate randomly selected sentences from a grammar
#
#	Author:	Ralph E. Griswold
#
#	Date:	June 10, 1988
#
############################################################################
#  
#     This program generates randomly selected strings (``sen-
#  tences'') from a grammar specified by the user.  Grammars are
#  basically context-free and resemble BNF in form, although there
#  are a number of extensions.
#  
#     The program works interactively, allowing the user to build,
#  test, modify, and save grammars. Input to rsg consists of various
#  kinds of specifications, which can be intermixed:
#  
#     Productions define nonterminal symbols in a syntax similar to
#  the rewriting rules of BNF with various alternatives consisting
#  of the concatenation of nonterminal and terminal symbols.  Gen-
#  eration specifications cause the generation of a specified number
#  of sentences from the language defined by a given nonterminal
#  symbol.  Grammar output specifications cause the definition of a
#  specified nonterminal or the entire current grammar to be written
#  to a given file.  Source specifications cause subsequent input to
#  be read from a specified file.
#  
#     In addition, any line beginning with # is considered to be a
#  comment, while any line beginning with = causes the rest of that
#  line to be used subsequently as a prompt to the user whenever rsg
#  is ready for input (there normally is no prompt). A line consist-
#  ing of a single = stops prompting.
#  
#  Productions: Examples of productions are:
#  
#          <expr>::=<term>|<term>+<expr>
#          <term>::=<elem>|<elem>*<term>
#          <elem>::=x|y|z|(<expr>)
#  
#  Productions may occur in any order. The definition for a nonter-
#  minal symbol can be changed by specifying a new production for
#  it.
#  
#     There are a number of special devices to facilitate the defin-
#  ition of grammars, including eight predefined, built-in nontermi-
#  nal symbols:
#     symbol   definition
#     <lb>     <
#     <rb>     >
#     <vb>     |
#     <nl>     newline
#     <>       empty string
#     <&lcase> any single lowercase letter
#     <&ucase> any single uppercase letter
#     <&digit> any single digit
#  
#  In addition, if the string between a < and a > begins and ends
#  with a single quotation mark, it stands for any single character
#  between the quotation marks. For example,
#  
#          <'xyz'>
#  
#  is equivalent to
#  
#          x|y|z
#  
#  Generation Specifications: A generation specification consists of
#  a nonterminal symbol followed by a nonnegative integer. An exam-
#  ple is
#  
#          <expr>10
#  
#  which specifies the generation of 10 <expr>s. If the integer is
#  omitted, it is assumed to be 1. Generated sentences are written
#  to standard output.
#  
#  Grammar Output Specifications: A grammar output specification
#  consists of a nonterminal symbol, followed by ->, followed by a
#  file name. Such a specification causes the current definition of
#  the nonterminal symbol to be written to the given file. If the
#  file is omitted, standard output is assumed. If the nonterminal
#  symbol is omitted, the entire grammar is written out. Thus,
#  
#          ->
#  
#  causes the entire grammar to be written to standard output.
#  
#  Source Specifications: A source specification consists of @ fol-
#  lowed by a file name.  Subsequent input is read from that file.
#  When an end of file is encountered, input reverts to the previous
#  file. Input files can be nested.
#  
#  Options: The following options are available:
#  
#       -s n Set the seed for random generation to n.  The default
#            seed is 0.
#  
#       -l n Terminate generation if the number of symbols remaining
#            to be processed exceeds n. The default is limit is 1000.
#  
#       -t   Trace the generation of sentences. Trace output goes to
#            standard error output.
#  
#  Diagnostics: Syntactically erroneous input lines are noted but
#  are otherwise ignored.  Specifications for a file that cannot be
#  opened are noted and treated as erroneous.
#  
#     If an undefined nonterminal symbol is encountered during gen-
#  eration, an error message that identifies the undefined symbol is
#  produced, followed by the partial sentence generated to that
#  point. Exceeding the limit of symbols remaining to be generated
#  as specified by the -l option is handled similarly.
#  
#  Caveats: Generation may fail to terminate because of a loop in
#  the rewriting rules or, more seriously, because of the progres-
#  sive accumulation of nonterminal symbols. The latter problem can
#  be identified by using the -t option and controlled by using the
#  -l option. The problem often can be circumvented by duplicating
#  alternatives that lead to fewer rather than more nonterminal sym-
#  bols. For example, changing
#  
#          <term>::=<elem>|<elem>*<term>
#  
#  to
#  
#          <term>::=<elem>|<elem>|<elem>*<term>
#  
#  increases the probability of selecting <elem> from 1/2 to 2/3.
#  
#     There are many possible extensions to the program. One of the
#  most useful would be a way to specify the probability of select-
#  ing an alternative.
-- 
Send compilers articles to compilers@iecc.cambridge.ma.us or
{ima | spdcc | world}!iecc!compilers.  Meta-mail to compilers-request.

zvr@ntua.gr (Alexios Zavras) (04/25/91)

In article <1991Apr23.140427.5416@iecc.cambridge.ma.us>, elk@cblpn.att.com (Edwin Lewis King +1 614 860 3394) writes:
> I'm interesting in generating strings that are described by a BNF (OK,
> YACC) grammar.
> Has anyone actually done this before?  Is there a YACC-ish tool
> available for such a thing?

    I've seen this question in the past and I replied by mail, but
since people always ask, our moderator should keep an answer as well.

    Check out kafka, a tool that does exactly what you want !
The source was posted on the net (altough many years ago),
so it must be archived somewhere.

    It is the base on which the games insult, babble and flame are
based: each of them is just a grammar specification.
>From the original README:
> Copyright (c) 1985 Wayne A. Christopher
> Contained herein is the source for kafka, which is a program that
> translates BNF grammars into C programs to generate strings in the
> grammar, and files for three useful kafka programs, insult, flame,
> and babble. I don't have a manual page for kafka because I have
> been too lazy to write one, but if you read the .k files you will
> probably be able to figure out how things work.

    The program is *very* yacc-like (even has kkoutput() functions :-).
Here are some sample lines from flame.k:

<start> =       <flame>;
<flame> =       "I can't believe how " <adjective> " you are." { CH(1,100); } ;
<flame> =       "I firmly believe that" <statement> "." { CH(1,80); } ;
<adjective> =   ignorant { CH(1, 100); } ;
<adjective> =   as <adjective> "as a" <thing> { CH(1, 60); } ;
<thing> =       <adjective> <thing> { CH(1,30); } ;
<thing> =       dog { CH(1,60); } ;
<statement> =   most <group-adj> "people have" <thing> "#s" { CH(1, 60); } ;

etc. etc.

Oh, here's a sample output: :-)
> zvr@phgasos> flame
> I'll bet you think that dogs are nasty. I can't believe how hungry
> like a wolf you are. Furthermore, you jerk, you make me sick.

A very useful program !!!

-- zvr --
Alexios Zavras (-zvr-)
zvr@ntua.gr
zvr@theseas.ntua.gr
-- 
Send compilers articles to compilers@iecc.cambridge.ma.us or
{ima | spdcc | world}!iecc!compilers.  Meta-mail to compilers-request.

jimad@microsoft.UUCP (Jim ADCOCK) (04/27/91)

In article <1991Apr23.140427.5416@iecc.cambridge.ma.us> elk@cblpn.att.com (Edwin Lewis King +1 614 860 3394) writes:
>I'm interesting in generating strings that are described by a BNF (OK,
>YACC) grammar.

I don't know aobut "cleanly and efficiently", but, in theory, it's a lot
easier to generate examples of strings from a grammar than to recognize
them.  Below is a fragment out of a program I wrote to generate "C++" -
like strings [I wrote one version of the program based on Stroustrup's
published grammar, and one version based on Roskind's]

The only mysterious part of the fragment is the switch statements

switch(W0) and
switch(R5)

"W0" for example is a macro that expands to randomly generate a 1/0 int
-- weighted to more likely choose 0 than 1.

"R5" is a macro that expands to one of the numbers 0-5, equally weighted.

The real problem in randomly generating strings like this is that the
production rules tend to form "feedback loops" such that some productions
are much more likely to be produced repeatedly than others.  Hence the
hand generated "tweaks" to try to weight some cases more than others.

Also, remember than generating some strings based on grammars used in C++
compilers is a very very different thing than trying to automatically
generate "C++ code."  Only part of the problem relates to the use of
fed-back type info in C/C++.  Here's some examples of such productions:

....

/***********************/
struct  TSH ;

/***********************/
volatile &  Tq ::
 Tm :: operator
 Tm * const ( ) { auto operator
 Tr ,  ih ; struct  Tr extern typedef
; int  io ( ( signed )
- :: new float * ++ - ( const volatile
 Td
volatile
) short
( ~ +
( volatile ) -- sizeof ++ --
sizeof
sizeof
sizeof
&  TN ( * sizeof
( void short
void

....

----------- a fragment from the producing program:



....

void conditional_expression()
{
	switch(W0)
	{
		case 0: logical_or_expression(); break;
		case 1: 
			logical_or_expression(); 
			Q();
			expression();
			COLON();
			conditional_expression();
			break;
	}
}

void logical_or_expression()
{
	switch(W0)
	{
		case 0: logical_and_expression(); break;
		case 1: 
			logical_or_expression(); 
			OROR();
			logical_and_expression();
			break;
	}
}

void logical_and_expression()
{
	switch(W0)
	{
		case 0: inclusive_or_expression(); break;
		case 1: 
			logical_and_expression();
			ANDAND();
			inclusive_or_expression();
			break;
	}
}

void inclusive_or_expression()
{
	switch(W0)
	{
		case 0: exclusive_or_expression(); break;
		case 1: 
			inclusive_or_expression();
			OR();
			exclusive_or_expression();
			break;
	}
}

void exclusive_or_expression()
{
	switch(W0)
	{
		case 0: and_expression(); break;
		case 1: 
			exclusive_or_expression(); 
			XOR();
			and_expression();
			break;
	}
}

void and_expression()
{
	switch(W0)
	{
		case 0: equality_expression(); break;
		case 1: 
			and_expression(); 
			AND();
			equality_expression();
			break;
	}
}

void equality_expression()
{
	switch(R5)
	{
		case 0: 
		case 3: 
		case 4: 
			relational_expression(); break;
		case 1: 
			equality_expression(); 
			EQEQ();
			relational_expression();
			break;
		case 2: 
			equality_expression(); 
			NOTEQ();
			relational_expression();
			break;
	}
}

....
-- 
Send compilers articles to compilers@iecc.cambridge.ma.us or
{ima | spdcc | world}!iecc!compilers.  Meta-mail to compilers-request.

ressler@cs.cornell.edu (Gene Ressler) (05/01/91)

In article <1991Apr23.140427.5416@iecc.cambridge.ma.us> elk@cblpn.att.com (Edwin Lewis King +1 614 860 3394) writes:
>I'm interesting in generating strings that are described by a BNF (OK,
>YACC) grammar.

Many decidablility results rely on Turing machines enumerating CFLs, so
we'd better be able to do it without random numbers!  I know this may not
be what you want for testing, but I think it's interesting anyway.
Following is a rough hack in Common Lisp (runs under Lucid) that keeps a
queue of sentential forms sorted by length, pulling off the shortest one
to expand next.  `Expand' means replace each non-terminal A by the rhs of
each A-production in all combinations.  If a resulting form has only
non-terminals, print it; otherwise queue it for deeper expansion.  If you
have no epsilon productions, this generates strings of strictly
non-decreasing length.

Gene
----

; Enumerate strings generated by a CFL.
; Warning: Rough and probably buggy code.

(in-package 'user)

(let ((look-ahead nil)
      (stream t)
      (rules (make-hash-table))
      (start nil)
      (q nil))

(defmacro lex ()
  `(setq look-ahead (read stream nil)))

(defun parse (in-stream)
  ; set up
  (clrhash rules)
  (setq stream in-stream)
  (lex)
  ; assume first sym is start sym
  (setq start look-ahead)
  (loop
    ; done when lex returns nil for eof
    (unless look-ahead
      (return))
    ; look ahead is lhs. gather rhs.
    (let ((prod look-ahead)
          (rhs nil))
      (lex)
      (unless (eq look-ahead '->)
        (error "missing ->"))
      (lex)
      (loop
        ; gather up to ! or !! (`or' or end of prod)
        (loop
          (when (member look-ahead '(! !! ->))
	    (return))
          (push look-ahead rhs)
          (lex))
        ; put production in rules hash table indexed by lhs
	; (prod) so each entry is a list of rhs's
        (push (reverse rhs) (gethash prod rules))
        (case look-ahead
	  (!! (lex) (return))
	  (! (lex))
	  (t (error "unexpected ~A" look-ahead)))
	; start new rhs for same lhs (prod)
        (setq rhs nil))))
  ; check for undefined symbols
  (maphash #'(lambda (prod rhss)
	       (declare (ignore prod))
	       (dolist (rhs rhss)
		 (dolist (x rhs)
		   (unless (or (stringp x)
			       (gethash x rules))
		     (error "~A undefined" rhs)))))
           rules)
  ; return start symbol
  start)

; don't put any sentential form longer than this on the queue
(defparameter *cutoff-length* 10)

; insert form in queue sorted ascending by length
(defun enq (s-form)
  (when (< (list-length s-form) *cutoff-length*)
    (setq q (merge 'list
		   (list s-form) q
		     #'(lambda (x y)
		         (< (list-length x)
			    (list-length y)))))))

; get shortest sentential form from queue
(defmacro qpop () `(pop q))

; expand rhs every way that is possible by
; expanding each non-terminal exactly once.
; accumulate result in `sofar'.  when rhs
; is gone, look at `sofar' to see if it's all
; terminals (strings).  if so, print it; if not,
; queue it for deeper expansion.
(defun expand (rhs sofar)
  (cond
    ((null rhs)
       (if (every #'stringp sofar)
         (format t "~&~A"
		 (reduce #'(lambda (x y)
			     (concatenate 'string x y))
			 (reverse sofar)))
	 (enq (reverse sofar))))
    ((stringp (car rhs))
       (expand (cdr rhs) (cons (car rhs) sofar)))
    (t (dolist (rrhs (gethash (car rhs) rules))
	 (expand (cdr rhs) (revappend rrhs sofar))))))

; expand start symbol, iterate (expand)
; until the queue is empty.
(defun enum (start)
  (setq q nil)
  (dolist (rhs (gethash start rules))
    (expand rhs nil))
  (loop
    (let ((rhs (qpop)))
      (unless rhs (return))
      (expand rhs nil))))

) ; end (let (look-ahead ...

; tester assumes a grammar is in file "grammar"
(defun test ()
  (with-open-file (grammar "grammar" :direction :input)
    (enum (parse grammar))))

; a sample grammar.
; if you need epsilon, just say "".

#|

S -> expr
  !!

expr -> term
     !  expr "+" term
     !!

term -> factor
     !  term "*" factor
     !!

factor -> "1"
       !  "a"
       !  "(" expr ")"
       !!

|#
-- 
Send compilers articles to compilers@iecc.cambridge.ma.us or
{ima | spdcc | world}!iecc!compilers.  Meta-mail to compilers-request.