[comp.lang.forth] writing a C interpreter in Forth?

adamk@mit-amt.MEDIA.MIT.EDU (Adam Kao) (01/06/90)

I've been seriously considering extending Forth into a C interpreter.
I should be careful to distinguish this from writing a traditional
interpreter/compiler, such as Dojun Yoshikami and Ian Green seem to
be discussing.  A traditional compiler is usually broken into modules
something like the following:

	a lexical scanner that tokenizes the input
	a grammar/parser that places the tokens in a syntax tree
	a typechecker that looks at nodes in the syntax tree
	a code generator that looks at nodes in the syntax tree

For details see Aho, Sethi, and Ullman's "Dragon Book" (not its real name).

Since Dojun Yoshikami and Ian Green are using these terms, I assume
they are using the standard compiler architecture as a starting point.

But Forth's philosophy is that words do their own work, even compiling
words.  In my opinion this feature is not nearly emphasized enough.
It is truly a unique feature to Forth; the only similar feature I know
of is the continuation in Scheme.

The Forth compiler shows us how to implement traditional control
structures with words that execute themselves, using the current state
of the stacks to record where they are.  There is no monolithic
compiler.  Compiling words "know" what they do.  The traditional
compiler distinction of compiler as procedures versus source code as
data does not exist in Forth.

Therefore to extend Forth into a C interpreter means to use the Forth
philosophy while writing a C compiler, defining words like if, {, goto,
and so on.  Ideally every word would look at the current state
(current expression, unprocessed input stream, etc.) and perform the
correct actions possible up to that point.

There are some interesting issues that I have been thinking about:

1.  The Forth "scanner" is rudimentary.  Essentially it does only
three things: interpret numbers, use whitespace as a separator, and
make up words out of everything else.  C has certain features that
depend on a traditional scanner, most notably that non-alphanumeric
tokens (eg +, ->, {) are automatically separated.  Thus we must have a
traditional scanner.

2.  C uses infix notation.  The Forth-style infix operators I have
seen were all unwieldy or non-intuitive (I would love to be corrected
on this).  Thus, for expressions at the very least we will need a
traditional parser.

3.  Actually it is misleading to try and match C tokens with Forth
words.  Incomplete expressions generally cannot be parsed.  There is a
much better match between C statements and Forth words.  Both specify
complete actions and so can be compiled as a unit.

This leads us to some simple design decisions.  We can view ; and
perhaps { } and , as fundamental separators.  Upon reaching one of
these, our more sophisticated scanner can eliminate whitespace and
form a token list.  We may still be able to associate actions with
each token in this list, since they can now look ahead to the end of
the statement and be guaranteed a complete expression.  Thus we
execute each token, with words like + leaving a reminder on some
stack.

I believe C control structures can be direct copies of Forth control
structures, although I have not thought deeply about it.

This is all preliminary speculation; I have not written a single line
of code yet.  I want to hear ideas and discussion, I want to know if
anyone has tried this before.  The most helpful responses will be
those that start "The right way to do <x> is . . . "

Every reader of this group should know what advantages Forth would
bring to C.  The advantages of C for Forth are somewhat harder to
quantify.  Forth users may benefit from access to existing C code.
Perhaps C programmers could be "lured" into a Forth environment this
way, and then learn about Forth's advantages.  And following the Forth
tradition, I intend to make this interpreter half as big and twice as
fast as any existing C interpreter or compiler! :-)

But after all, the only real justification is that it is an
interesting problem.  I hope it interests you, as well.

Adam

wmb@SUN.COM (Mitch Bradley) (01/07/90)

I wrote a C lexical scanner in Forth.  It was fairly easy and quite
small.  Scanning is easy; parsing is a bit harder.  Probably the
easiest way to do a C parser in Forth would be to take a yacc grammar
of C and run it through yacc to get the state tables.  Then duplicate
the effect of the (quite small) yacc table walking machine in Forth.

Then you have to decide what to do for code generation.  That's where
the real engineering skill comes into play.  You have to worry about
how to implement things like type promotion and storage allocation
and switch statements (decent C compilers automatically generate jump
tables if the cases are densely packed and if else if else ... chains
if the cases are sparse).  Doing a complete job will take a lot of work,
so be prepared for it.

By the way, Tayfun Kocaoglu, who works with me at Sun, did a subset C
compiler in Forth for his master's thesis at University of Florida.
Tayfun wrote a scanner and a code generator, and he used yacc for the
parser.

Mitch

sabbagh@acf5.NYU.EDU (sabbagh) (01/09/90)

In article <9001070938.AA07304@jade.berkeley.edu> Forth Interest Group International List <FIGI-L%SCFVM.bitnet@jade.berkeley.edu> writes:
>I wrote a C lexical scanner in Forth.  It was fairly easy and quite
>small.  Scanning is easy; parsing is a bit harder.  Probably the
>easiest way to do a C parser in Forth would be to take a yacc grammar
>of C and run it through yacc to get the state tables.  Then duplicate
>the effect of the (quite small) yacc table walking machine in Forth.

Why not implement it as recursive descent?  Need to left factor the
grammar, but Forth is inherently suitable, since the word definitions
could be made to look like the grammar.  I wrote a small interpreter
this way a while back.

The C language is well-suited to this approach; the PCC compiler (Steve
Johnson?) used it in parsing.

Hadil G. Sabbagh
E-mail:		sabbagh@csd27.nyu.edu
Voice:		(212) 998-3125
Snail:		Courant Institute of Math. Sci.
		251 Mercer St.
		New York,NY 10012

186,282 miles per second -- it's not just a good idea, it's the law!

wmb@SUN.COM (Mitch Bradley) (01/09/90)

> Why not implement [a C parser] as recursive descent?

No ironclad reason; it just seemed like yacc would be easier, since
a yacc grammar for C is floating around, and yacc parse tables
are compact, easy, and fast to interpret.

Mitch

peter@ficc.uu.net (Peter da Silva) (01/09/90)

> >easiest way to do a C parser in Forth would be to take a yacc grammar
> >of C and run it through yacc to get the state tables.  Then duplicate

> Why not implement it as recursive descent?...

I'll second this, from experience. I was involved in a project about 6
years ago that involved a language parser in Forth. I wrote the parser,
language-dependent editor, and file system in about 2 months. Then the
guy managing the project took exception to the technique I used, which was
recursive-descent, and spent the next 6 months redoing what I'd done using
the yacc/lex method.

The project was subsequently canned, because it went way over schedule.

His biggest problem seemed to be all the time it took him to incorporate
a change in the language, since he'd lost the tight forth test/execute
loop...
-- 
 _--_|\  Peter da Silva. +1 713 274 5180. <peter@ficc.uu.net>.
/      \ Also <peter@ficc.lonestar.org> or <peter@sugar.lonestar.org>
\_.--._/
      v  "Have you hugged your wolf today?" `-_-'

chekmate@athena.mit.edu (Adam Kao) (01/09/90)

I've been thinking more about my original proposal; I need to study
the Dragon Book more closely before I can get more specific.  However,
I do feel that just reimplementing the C yacc table walker in Forth
makes for a more traditional parser architecture than I had in mind.
I was hoping for the easy extensibility that Peter mentioned.  I need
to study the Dragon Book to make sure I understand these and other
options and the tradeoffs.  Maybe I can even think of an option not in
the Dragon Book (well, I can dream, can't I?).

On a separate topic, I'm trying to decide which Forth to use.  I've
played with JForth and I like it a lot, but since this is a Forth
extension (not an application) I hesitate to depend on features
available from only one vendor on one machine (eg JForth's ODE).  It
might be better to use some standard UNIX Forth (memory protection!),
possibly with redistributable extensions like those in Dick Pountain's
book (those ARE redistributable, right?)  Opinions?  Advice?

Finally, I am very interested in the experiences of the people who
have implemented these kinds of things, and I'd love to get my hands
on any of the code you wrote, if that's possible.  Tayfun Kocaoglu's
thesis sounds like exactly what I should be reading -- could you put
me in touch with him, Mitch?

Thank you for your time,
Adam

fred@ubvax.UB.Com (Fred Noon) (01/10/90)

An advantage of recursive descent over LALR(1) tables is that error
diagnostics, recovery, or just doing things "on the fly" can be made
very straight-forward to implement.  Using tables one has to build
error recovery into the grammar; with LALR that can get a little
tricky.  I suspect that it is this, rather than FORTH's user interface,
which would be responsible for turning a 2-month project into a 6-month
project, particularly if it were someone's first go-round with lex and
yacc.  (Not being a production FORTHer I can't speak from experience,
however.)

======================================================================
Fredrik Noon  (ubvax.ub.com!fred)
5290 Garrison Circle; San Jose, CA 95123; USA       tel (408) 365-7167
or, when toiling away at Ungermann-Bass                 (408) 562-5632
======================================================================