[comp.compilers] Time to Write a Compiler

roman@gaudi.ccsf.caltech.edu (Roman Salvador) (11/01/90)

   I would like to find out roughly how long it takes to write a
compiler (e.g., a Fortran 90 one). Does anybody have any statistics or
experiences? Also, how long do the different parts take (e.g.,
lexical analysis, parsing, error checking, code generation, debugging and
testing)?
   Thanks,
                Roman  (internet: roman@gaudi.caltech.edu)
[It took me about a year part-time to write an F77 compiler based on the old
Ritchie PDP-11 C compiler, but with no significant optimization beyond that
needed to compile i=i+1 into one instruction.  I suspect F90 would be
considerably harder.  By far the most work was the I/O system and all the
nit picking to catch all the special cases I forgot. -John]
-- 
Send compilers articles to compilers@esegue.segue.boston.ma.us
{ima | spdcc | world}!esegue.  Meta-mail to compilers-request@esegue.

meissner@osf.org (11/06/90)

| From: roman@gaudi.ccsf.caltech.edu (Roman Salvador)
| 
|    I would like to find out the time it takes (more or less) to write a
| compiler (i.e. a Fortran 90 one).

It depends on lots of different factors, such as:

	the skill level and experience of the person(s) doing the
	work;

	the complexity of the language proper;

	the complexity of the library (if provided);

	the tools available;

	doing 'just' a front end and using an existing back end, or
	writing the whole ball of wax;

	the target machine(s);

	the desired optimization level(s).

As another data point, I wrote the front end for the Data General MV/Eclipse
C compiler from scratch to plug into the common language code generator and
optimizer.  It took about 9 months from the start of the project until I had
a reasonable compiler (and another couple of years until it was rock solid).
I seem to recall the library took about 4-6 months for the first cut, but
since I didn't do much of this work, and we eventually rewrote a lot of it to
tie in with a UNIX emulation library developed elsewhere, I may be off on
this.  I had a free hand in writing the front end, since the previous person
on the project had left before doing much code development and I nuked what
was left.

Some things that I remember about this:

	I had to develop the compiler in PL/1, and it took some time
	to learn the language.

	I had to relearn C, since I had last used the V6 C compiler,
	and the language had changed since then.

	I was still a new programmer (I had graduated about 1-2 years
	previously).

	The parser generator that I used had some severe restrictions
	in the early days, since it was originally run in 16-bit mode,
	and could not handle 16 levels of precedence + the rest of C,
	so I had to do operator precedence in PL/1 rather than in the
	grammar.

	The reference manual (K&R) had some subtle flaws in it (from
	the grammar, you couldn't derive a function that returned a
	pointer to a function and took arguments).

	Convincing the code generator and optimizer to add new features
	that C needed (notably the ?: operator).

	Convincing the debugger to expand the common symbol table
	format, to handle user defined types, and C specific items.

--
Michael Meissner	email: meissner@osf.org		phone: 617-621-8861
Open Software Foundation, 11 Cambridge Center, Cambridge, MA, 02142
-- 
Send compilers articles to compilers@iecc.cambridge.ma.us or
{ima | spdcc | world}!esegue!compilers.  Meta-mail to compilers-request.

holliday@csgrad.cs.vt.edu (Glenn Holliday) (11/06/90)

As a truly weird data point: after completing a design for a
graphics-to-database compiler, the funding org forced us to do the coding in
Cobol.  Yes, it _is_ possible to write a compiler in any language, but
it took us more than a year to get it up and running in Cobol.

Glenn Holliday  holliday@csgrad.cs.vt.edu

marcel@vlsi.cs.caltech.edu (Marcel van der Goot) (11/06/90)

In <1990Oct31.180632.3201@nntp-server.caltech.edu> Roman Salvador
(roman@gaudi.ccsf.caltech.edu) asks

>    I would like to find out the time it takes (more or less) to write a
> compiler (i.e. a Fortran 90 one). Does anybody have any statistics or
> experiences?. 

During the past year and a half (well, actually almost two years already)
I have been working on a C-based programming language for message-passing
multi-computers, ``mcc'' (soon to be released ...). The project involved
the design of the language, writing a compiler for it, and writing a
thesis about it. Unfortunately, I haven't really kept track of the time
the different parts took me (shame on me), so my statistics are rather
imprecise, but I thought my experiences might be of some help.

The following account is rather long. My apologies for that.


* Introduction

The language is an extension of ANSI C, with new constructs for
defining, creating, and connecting processes, for sending and receiving
messages, and for selecting communications. C was chosen because
it is relatively small (we thought) and widely used. The compiler
is supposed to be rather machine-independent and produces (standard) C
rather than machine-code, hence no time was spent on building
machine-code generators or code-optimizers --- I assume you have to
actually generate assembly- (or machine-) code.

* My background when I started the project

4 years of undergraduate study in computer science in the Netherlands
(which rather emphasized formal methods, compared with US schools).
Included was a two-term class about compiler construction, one term
of which was spent extending a simple compiler. I was familiar with
lexical scanners and parser generators, but had used different ones
than lex and yacc.
At Caltech I had taken a one-term class about ``logic programming and
compiler construction,'' which was mostly devoted to the (construction
of a) compiler for ``strand'' (a parallel programming language, sort-of
but not quite a logic language).
I guess my specialty is concurrent programming, not compiler writing.

* Prototype

Coming up with the basic ideas for the new language constructs, plus
writing a prototype took me about 2 months. The prototype implemented
only a limited part of the language (e.g., only integers could be
sent between processes), and was mostly intended to demonstrate that
one could indeed compile the language reasonably well into standard C.
The prototype was written using lex and yacc, with most of the program
flow controlled by the code statements included in the grammar. Basically
no type-checking was done, and the output was produced in a rather ad-hoc
fashion.

* The real compiler

I then started to write the real compiler, for the whole language.
I had decided to do all the parsing and type-checking etc., rather than
just some pattern-matching with a cpp-type preprocessor. I had also
decided to do everything from scratch, rather than take, say, gcc, and
hack it. Long discussions about the wisdom of these decisions are possible.
(I have no idea whether there are F90 compilers around that you could
change to suit your needs, or whether you also have to do everything from
scratch.)

My main experiences with this:

- C is much bigger than I thought.
  Parsing C is hard (C is somewhat context-sensitive with respect to type
  names), type-checking is complicated by many automatic conversions and
  special cases.

- Most books on compiler construction don't explain enough.
  Lexical scanning and parsing are very well understood, everyone knows
  how to do it, where to start, what to be careful about, etc. But now
  you have to do type-checking. It sounds so trivial, and is always
  displayed as such, but I soon found that it isn't. With relatively
  large programs it is very easy to choose a data-structure and then
  find out a month later that you don't have enough information available
  to make one particular check for one particular operator.

- Lack of tools.
  We only had lex and yacc. I quickly found that putting all the compiler
  code in the grammar led to inconvenient control-flow, and decided
  to just build a parse-tree. I then wrote my own tool for describing
  operations in terms of this tree (when I have time, I may one day put
  this tool in the public domain). However, I think it is worth quite
  some effort to find out about other compiler writing tools that already
  exist. (And, if you have enough code, writing your own tools pays off.)

- Compiling to C instead of assembly didn't help as much as expected.
  The main reason, I think, is that the extensions all have to do
  with parallel programming, and therefore almost any new construct
  is more complicated to translate than any of the standard C constructs.

Writing this compiler (including the above-mentioned tool) took me
about 7.5 months. Very roughly, I'd say 2.5 months parsing, 3.5 months
type-checking, 1 month for the run-time system (0.5 ?). The compiler
was then used in a class about concurrent programming, which more or less
served to debug it. I'm proud to say that actually very few programming
errors were found. Most of them were in the run-time system, which
isn't very big but involves important things like process scheduling.
The run-time system is not really related to compiler construction; I
don't think you need anything like it for F90 (on the other hand, you do need
a library). At about 6000 lines, the compiler is much bigger than anything
I wrote before (considering that I wrote it all, rather than modifying
existing code), and also bigger than I had expected.

* A better compiler

I did find some errors and shortcomings in the compiler that were
not easy to fix. Things like inconvenient data structures, with not
quite the right information; sometimes awkward modularization (use of
program-generating tools sometimes makes this worse); etc.
I therefore basically rewrote the compiler, although I copied substantial
parts. This version is supposed to be more or less final. It took (well,
it isn't quite finished yet) me about 5 months, interwoven with about
3 months of writing my master's thesis.


* Advice

(If you conclude from the above that I obviously did everything
completely wrong, then this advice probably isn't worth much --- let
me know if that's the case, though.)

- Spend time doing serious planning (that's rather obvious; I knew
  it, but still the sheer size of the project and the program took me
  by surprise).

- Look around for other tools than lexical scanner-generators and
  parser-generators. Quite likely it pays off.

- You can use a prototype to get some idea of what modules, data-structures,
  etc. you need, but write the actual program from scratch (for heated
  discussions about this, write to comp.software-eng).

- Be prepared to rewrite the whole program --- almost certainly you'll
  find at the end, after having used the compiler somewhat, that there
  are some things you'd like to add that don't fit in, or that parts of
  the compiler are needlessly complicated, etc.

- Make detailed documentation, even for parts that only you are going to
  look at. For instance, write down exactly what code you generate for each
  construct, for all different situations. Without such detailed
  descriptions it is hard to convince yourself (or others) that each part
  of the program is correct, and that all the parts fit well together.

- If possible, don't do it alone. Although a program this size can
  be written by a single person (I've been told that real programs
  are at least 10^5 lines --- how do they do that?), I don't think it
  is a good idea. It helps if there is someone with whom you can really
  discuss all the details of the program, rather than just general ideas.
  It helps to get you through slumps when someone else needs your code now.
  I think that with two persons the additional burden of having to
  cooperate shouldn't be too large, and is in fact probably beneficial.

Well, that was quite a lot. I'm sorry that I cannot give a more precise
account of the time spent on each of the parts. Hope it helps.

                                         Marcel van der Goot
                                         marcel@vlsi.cs.caltech.edu
[When writing my F77 compiler, I found the library, particularly the I/O
library, to be as much work as the compiler itself. -John]

ge@sci.kun.nl (Ge' Weijers) (11/07/90)

>[When writing my F77 compiler, I found the library, particularly the I/O
>library, to be as much work as the compiler itself. -John]

On the subject of runtime support I'd like to point out a very interesting
article on RTS for a functional language by Andrew Appel, in 

	"A Runtime System"
	Andrew W. Appel
	Lisp and Symbolic Computation, no. 3 1990, pp. 343-380

The size especially is impressive: under 2000 lines of C and assembler,
including garbage collection. I/O is done by writing the library in
the functional language itself (Standard ML).

--
Ge' Weijers                                    Internet/UUCP: ge@cs.kun.nl
Faculty of Mathematics and Computer Science,   (uunet.uu.net!cs.kun.nl!ge)
University of Nijmegen, Toernooiveld 1         
6525 ED Nijmegen, the Netherlands              tel. +3180652483 (UTC-2)

bright@nazgul.UUCP (Walter Bright) (11/08/90)

In article <1990Oct31.180632.3201@nntp-server.caltech.edu> roman@gaudi.ccsf.caltech.edu (Roman Salvador) writes:
<   I would like to find out the time it takes (more or less) to write a
<compiler (i.e. a Fortran 90 one). ...

It pretty much depends on if you are writing a research toy or plan to
sell the compiler. If you plan on selling it, triple all times you expect.

From personal experience, I started writing a C compiler in 1982. Two years
later, it was shipped (Datalight C). I have been working on it steadily
ever since. It will never be finished.

A global optimizer takes 1 year to write, and then 1 year to debug out in
the field. Some companies never get it to work reliably :-).

A C compiler is about twice as complex as a Pascal compiler. C++ is twice
as complex as C. Starting from a working C compiler, it takes 1 to 2 years
to add C++ capability to it.

pohua@polaris.crhc.uiuc.edu (Pohua P. Chang) (11/09/90)

I have been involved with the IMPACT-I C compiler project for the past 3+
years.  Exact dates have not been recorded, but here is a very imprecise
estimate.

The front-end, including semantic analysis, took about 4 man-months to code,
plus more than another 6 months to accommodate some
program/system specifics, e.g. initializing a typedef'd struct.

A friend of mine coded a code generator for the DECstation (MIPS-R2000) in
about 6 months. He has spent much of his time on instruction-set specific
optimizations, e.g. constant preloading, convert some compare-branch to
compare-against-zero and to faster branches (beq, bne). Fine-tuning the
register allocator was also a very difficult task, because it required us to
modify the decision component of some code optimizations.

So, if we add up the time to implement the front-end and a code generator,
it takes about 1 man-year to establish a reasonable compiler framework.

Implementing local code optimization is a simple task and may be done in just
a few weeks. Implementing global code optimization is quite time-consuming
because of the need to build a large set of FAST dataflow analysis modules.
If getting the compiler out quickly is the main concern, I would say that you
skip global code optimization and go straight to loop optimizations.  Given
more time, spend some time on machine-dependent code optimizations.  Be
prepared to spend as much time on debugging as on coding the code optimizer.
Debugging assembly code can be a great pain, especially after midnight.  6
man-months is a very conservative estimate of the time needed for
building a naive code optimizer that does not produce embarrassing code.

If your project requires your compiler to produce high quality code, I would
recommend that the register allocator be carefully fine-tuned.  On top of a
graph-coloring algorithm, reuse parameter registers, the frame pointer, spill
registers, and other reserved registers whenever possible.  Also, take
full advantage of the caller/callee-save convention.  Registers are a
limiting resource (at least on the MIPS-R2000).  Many code optimizations have
to be toned down so they don't introduce too many live ranges. If possible,
integrate the constant preloading optimization into the register allocation
algorithm.

In order to achieve a performance level that is close to the best commercial
C compilers, such as the MIPS C compiler, many other implicit costs become
apparent. For example, without a fairly powerful memory disambiguator,
accesses to some global scalar variables and fields cannot be moved out of
loops. For another example, general loop unrolling (non-constant loop bounds)
can introduce some overhead cost, and should be applied only if the number of
iterations is large enough. But, how do we get that information at
compile-time? A code optimizer that is as powerful as that of the MIPS C
compiler would take man-years. 

Other than classical code optimizations, there are some interesting
features that may be useful.
1. function inline expansion. (make sure the register allocator is good enough)
2. integrate a feedback loop in the compiler, in the form of an
	automatic profiler.
3. selective code duplication that allows frequently executed regions
	to be customized.

We have implemented an automatic profiler in three forms: 1. source code
preprocessing, 2. high-level intermediate form, and 3. assembly code level.
They are equally effective.  For a person who totally understands the
structure of the compiler, it takes just weeks to implement the profiler and
integrate it into the compiler. However, modifying the decision components of
the code optimizer to use the profile information, as well as that from loop
analysis, can be tricky.

The time that is required to implement function inline expansion depends on
whether you want an intra-file or an inter-file expander. Implementing an
inter-file function inline expansion optimization is difficult.  The benefit
from function inline expansion for conventional processors is not dramatic.
So don't try this early in your project.

Well, that's my 2 cents worth.

Pohua Paul Chang

beckmann@das.harvard.edu (Gary Beckmann) (11/10/90)

This should maybe go into alt.hack, but since I didn't do it myself
and since I know how long it took I'm sending it to comp.compilers.

This was not a production quality compiler.

I introduced a friend of mine to the wonders of C a few years back.
His company (whose name and location will remain secret--for obvious
reasons as you read on) would not support the buying of a compiler for
his ibm-pc clone.  All he had was basic.  So he wrote a C compiler in
basic.  Took him seven months working on the side.  

He was dissatisfied with the performance of the compiler, so he then
proceeded to write a basic compiler in C!! (My neck still aches from
the head-shaking I did when he informed me of his plans.)  That took
him only three months -- he claimed that the speed was due to his
familiarity with basic and the relative simple flavor of basic with
which he was working.

He was pleased with this "project" and decided to port it all to an
Apple II which he had.  This port was never completed -- I believe he
failed due to the radical differences in basic.

To repeat: this was NOT a production quality compiler (I think he even
had some restrictions on the syntax).

Hope I haven't horrified too many . . .

					Gary Beckmann
					beckmann@das.harvard.edu

Chuck_Lins.SIAC_QMAIL@gateway.qm.apple.com (Chuck Lins) (11/13/90)

I have been working on an Object Oberon compiler in my spare time. While
Object Oberon is a much simpler language than Fortran 90, I have kept track
of the hours invested in this project. First some background and a brief
status report.

Background
----------
The front-end and the shell for a back-end were obtained from ETH Zurich at
a cost of 1000 Swiss francs. They contained a portable scanner/lexical
analyzer, parser, and abstract syntax tree builder. The front-end covers the
language Oberon. Object Oberon is a superset of Oberon.

The back-end shell consists of an abstract syntax tree traversal module and
stubs for calls to a lower-level code generator. The job of the compiler
writer is to fill in these stubs as well as adding a machine dependent
low-level code generator.

The compiler itself is written in Oberon. The target processor is the
Motorola MC68000 family. Target hardware is the Macintosh (tm). Target
programming environment is MPW (Macintosh Programmers Workshop).

Status
------
I started with a copy of the MacOberon system from ETH. This is a standalone
application running on the Macintosh. It contains an Oberon compiler in
object form. Thus the first task was to write a cross-compiler running under
MacOberon but producing object code for MPW. Once this was done, I created
another copy of the compiler source code (based on the cross-compiler). This
version was modified to run under MPW and produce code for MPW. The main
task here was replacing calls to the Oberon environment/libraries with calls
to semantically equivalent or similar MPW libraries.

As of today (11/12/90) I have completed work on the cross-compiler and have a
working compiler running under MPW. The compiler compiles itself under MPW.

How Long it Took
----------------
To date, I have spent 399.5 hours on both the cross-compiler and the native
compiler.  I started on July 24, 1990. After 286 hours of work, on Oct 15 I
got the native compiler to link successfully. After another 44 hours of
debugging I was able to start using the native compiler to compile itself.
35 more hours of debugging and the native compiler was able to compile
itself. Since then I've been using the native compiler to add enhancements
to itself as well as improve its code generation.

What Gave Me Problems
---------------------
It took me quite a while to get the comparison and branching instructions
correct. Mapping the source language operands onto the hardware instructions
was very time-consuming. I remember spending quite a bit of my time in
MacsBug (the machine level debugger) single-stepping machine instructions
from a known 'good' point in the compiler. You had to do this when you
jumped off into space somewhere. I had one nasty bug involving VAR
parameters. Unfortunately, you'd sometimes overwrite part of the return
address (or destroy a pointer) and jump into the twilight zone. The good
part was that I found four other bugs while searching for this one.

Another source of errors was my failure to check for special cases of
variable addressing modes. Specifically things like VAR-parameters and
accessing parameters at outer nesting levels.

What Went Well
--------------
When I did the initial (paper) design, I reduced the 68000 addressing modes
to four categories: immediate operand, register value, memory operand, and
condition code.  Then I created a table for each AST node (e.g. integer ADD,
logical AND, procedure CALL, etc.). Each AST node has (at most) two operands
and one result operand. I then made N rows covering all the operands as
input and output values. I was influenced here by Horspool's ideas given in
"An Alternative to the Graham-Glanville Code-Generation Method", IEEE
Software, May 1987.

An example table for integer addition would be as follows (where
op = operator, res = result, l = left subtree, r = right subtree,
R = register, O = operand, I = immediate, and CC = condition code):

op  res	 l  r
+	R	 R  O
+	R	 O  O
+	R	 R  R
+	R	 O  R
+	R	 R  I
+	R	 I  R
etc.

Each row was then filled in with the assembly language source statements
necessary to produce the proper result. This was where I forgot the special
cases. The generic operand sometimes needed object code to produce a
simpler addressing mode. It was easy to fix once I recognised the problem
(even though I had all the code written for the code generator).

I think using trees worked out well. It was relatively easy to generate
correct code from them. It may not be the most highly optimized object code,
but it's not terrible. At present, the compiler compiles the largest of its
source modules (63K) in about 75 seconds.

Improvements remain in the area of array indexing on parameters. My current
code involves a multiply. I can take advantage of type information to
eliminate most of these. This should yield a good performance improvement.

Now I can concentrate on optimizations, adding the object oriented
extensions, and the other capabilities necessary for a production quality
compiler.

Chuck Lins
lins@apple.com

rst@cs.hull.ac.uk (Rob Turner) (11/15/90)

Chuck Lins writes:

>  [stuff about Object Oberon compiler he has written]
>
>Now I can concentrate on optimizations, adding the object oriented
>extensions, and the other capabilities necessary for a production quality
>compiler.

Does this mean that many of the optimizations that will eventually appear
in the compiler have not been thought about yet? In my experience, most
compilers have been written in this way. That is, they are designed
without consideration of optimizations and/or language enhancements, and
these are then tacked onto the end later, often resulting in a less than
elegant end-product. Maybe if things like optimization were taken into
consideration at the outset, a lot of compilers would be designed
differently.

This is in no way a criticism of Chuck's compiler. It is just a thought.

Rob
--
Robert Turner                      |   rst@cs.hull.ac.uk
Department of Computer Science
University of Hull
Hull HU6 7RX
England