[comp.lang.fortran] time to write a compiler

roman@gaudi.ccsf.caltech.edu (Roman Salvador) (11/01/90)

   I would like to find out the time it takes (more or less) to write a
compiler (i.e. a Fortran 90 one). Does anybody have any statistics or
experiences?. And also, about the time for the different parts of it (i.e.
lexical analysis, parsing, error checking, code generation, debugging +
testing).
   Thanks,
                Roman  (internet: roman@gaudi.caltech.edu)
[It took me about a year part-time to write an F77 compiler based on the old
Ritchie PDP-11 C compiler, but with no significant optimization beyond that
needed to compile i=i+1 into one instruction.  I suspect F90 would be
considerably harder.  By far the most work was the I/O system and all the
nit picking to catch all the special cases I forgot. -John]
-- 
Send compilers articles to compilers@esegue.segue.boston.ma.us
{ima | spdcc | world}!esegue.  Meta-mail to compilers-request@esegue.

marcel@vlsi.cs.caltech.edu (Marcel van der Goot) (11/06/90)

In <1990Oct31.180632.3201@nntp-server.caltech.edu> Roman Salvador
(roman@gaudi.ccsf.caltech.edu) asks

>    I would like to find out the time it takes (more or less) to write a
> compiler (i.e. a Fortran 90 one). Does anybody have any statistics or
> experiences?. 

During the past one-and-a-half year (well, actually almost two years already)
I have been working on a C-based programming language for message-passing
multi-computers, ``mcc'' (soon to be released ...). The project involved
the design of the language, writing a compiler for it, and writing a
thesis about it. Unfortunately, I haven't really kept track of the time
the different parts took me (shame on me), so my statistics are rather
imprecise, but I thought my experiences might be of some help.

The following account is rather long. My apologies for that.


* Introduction

The language is an extension of ansi C, with new constructs for
defining, creating, and connecting processes, for sending and receiving
messages, and for selecting communications. C was chosen because
it is relatively small (we thought) and widely used. The compiler
is supposed to be rather machine-independent and produces (standard) C
rather than machine-code, hence no time was spent on building
machine-code generators or code-optimizers --- I assume you have to
actually generate assembly- (or machine-) code.

* My background when I started the project

4 years of undergraduate study in computer science in the Netherlands
(which rather emphasized formal methods, compared with US schools).
Included was a two-term class about compiler construction, one term
of which was spent extending a simple compiler. I was familiar with
lexical scanners and parser generators, but had used different ones
than lex and yacc.
At Caltech I had taken a one-term class about ``logic programming and
compiler construction,'' which was mostly devoted to the (construction
of a) compiler for ``strand'' (a parallel programming language, sort-of
but not quite a logic language).
I guess my specialty is concurrent programming, not compiler writing.

* Prototype

Coming up with the basic ideas for the new language constructs, plus
writing a prototype took me about 2 months. The prototype implemented
only a limited part of the language (e.g., only integers could be
sent between processes), and was mostly intended to demonstrate that
one could indeed compile the language reasonably well into standard C.
The prototype was written using lex and yacc, with most of the program
flow controlled by the code statements included in the grammar. Basically
no type-checking was done, and the output was produced in a rather ad-hoc
fashion.

* The real compiler

I then started to write the real compiler, for the whole language.
I had decided to do all the parsing and type-checking etc., rather than
just some pattern-matching with a cpp-type preprocessor. I had also
decided to do everything from scratch, rather than take, say, gcc, and
hack it. Long discussions about the wisdom of these decisions are possible.
(I have no idea whether there are F90 compilers around that you could
change to suit your needs, or whether you also have to do everything from
scratch.)

My main experiences with this:

- C is much bigger than I thought.
  Parsing C is hard (C is somewhat context-sensitive with respect to type
  names), type-checking is complicated by many automatic conversions and
  special cases.

- Most books on compiler construction don't explain enough.
  Lexical scanning and parsing are very well understood, everyone knows
  how to do it, where to start, what to be careful about, etc. But now
  you have to do type-checking. It sounds so trivial, and is always
  displayed as such, but I soon found that it isn't. With relatively
  large programs it is very easy to choose a data-structure and then
  find out a month later that you don't have enough information available
  to make one particular check for one particular operator.

- Lack of tools.
  We only had lex and yacc. I quickly found that putting all the compiler
  code in the grammar led to inconvenient control-flow, and decided
  to just build a parse-tree. I then wrote my own tool for describing
  operations in terms of this tree (when I have time, I may one day put
  this tool in the public domain). However, I think it is worth quite
  some effort to find out about other compiler writing tools that already
  exist. (And, if you have enough code, writing your own tools pays off.)

- Compiling to C instead of assembly didn't help as much as expected.
  The main reason, I think, is that the extensions all have to do
  with parallel programming, and therefore almost any new construct
  is more complicated to translate than any of the standard C constructs.

Writing this compiler (including the above-mentioned tool) took me
about 7.5 months. Very roughly, I'd say 2.5 months parsing, 3.5 months
type-checking, 1 month for the run-time system (0.5 ?). The compiler
was then used in a class about concurrent programming, which more or less
served to debug it. I'm proud to say that actually very few programming
errors were found. Most of them were in the run-time system, which
isn't very big but involves important things like process scheduling.
The run-time system is not really related to compiler construction, I
don't think you need anything like it for F90 (on the other hand, you need a
library). With about 6000 lines, the compiler is much bigger than anything
I wrote before (considering that I wrote it all, rather than modifying
existing code), and also bigger than I had expected.

* A better compiler

I did find some errors and short-comings in the compiler that were
not easy to fix. Things like inconvenient data structures, with not
quite the right information; sometimes awkward modularization (use of
program-generating tools sometimes makes this worse); etc.
I therefore basically rewrote the compiler, although I copied substantial
parts. This version is supposed to be more or less final. It took (well,
it isn't quite finished yet) me about 5 months, interwoven with about
3 months of writing my master's thesis.


* Advice

(If you conclude from the above that I obviously did everything
completely wrong then probably this advice isn't much worth --- let
me know if that's the case, though.)

- Spend time doing serious planning (that's rather obvious; I knew
  it, but still the sheer size of the project and the program took me
  by surprise).

- Look around for other tools than lexical scanner-generators and
  parser-generators. Quite likely it pays off.

- You can use a prototype to get some idea of what modules, data-structures,
  etc. you need, but write the actual program from scratch (for heated
  discussions about this, write to comp.software-eng).

- Be prepared to rewrite the whole program --- almost certainly you'll
  find at the end, after having used the compiler somewhat, that there
  are some things you'd like to add that don't fit in, or that parts of
  the compiler are needlessly complicated, etc.

- Make detailed documentation, even for parts that only you are going to
  look at. For instance, write down exactly what code you generate for each
  construct, for all different situations. Without such detailed
  descriptions it is hard to convince yourself (or others) that each part
  of the program is correct, and that all the parts fit well together.

- If possible, don't do it alone. Although a program this size can
  be written by a single person (I've been told that real programs
  are at least 10^5 lines --- how do they do that?), I don't think it
  is a good idea. It helps if there is someone with whom you can really
  discuss all the details of the program, rather than just general ideas.
  It helps to get you through slumps when someone else needs your code now.
  I think that with two persons the additional burden of having to
  cooperate shouldn't be too large, and is in fact probably beneficial.

Well, that was quite a lot. I'm sorry that I cannot give a more precise
account of the time spent on each of the parts. Hope it helps.

                                         Marcel van der Goot
                                         marcel@vlsi.cs.caltech.edu
[When writing my F77 compiler, I found the library, particularly the I/O
library, to be as much work as the compiler itself. -John]
-- 
Send compilers articles to compilers@iecc.cambridge.ma.us or
{ima | spdcc | world}!esegue!compilers.  Meta-mail to compilers-request.