[comp.lang.c++] a new language construct

kearns@cs.columbia.edu (Steve Kearns) (11/06/90)

Adding Units to C++
(c) 1990 by Steven Kearns
============================

This paper introduces a simple language construct that permits future
implementation of precompiled headers and incremental compilation.  At
the same time, it maintains backwards compatibility with existing C++
code.

Computer languages, C++ in particular, are complicated by the
necessity of allowing separate compilation.  Some large
programs require many hours to completely recompile, so separate
compilation is a necessary evil.  

C++ compilers can spend most of their time compiling the same (large)
set of header files, instead of compiling new code.  This has led to a
number of proposals for adding the capability of pre-compiled headers
to existing compilers.  In fact, Microsoft Quick C 2.0 already allows
this.  

In addition, there have been suggestions for implementing incremental
compilation.  (Rational and Saber C both provide interpreters which
simulate incremental compilation.)  Finally, most integrated
programming environments support a "project document" which lists all
of the files that make up a project.  The integrated environments use
this to automate the process of building the program (the "make"
process).  It would be nice if a C++ compiler could accept a single
file listing, in one place, all the files necessary for building a
program.

There are a few positive features arising from separate compilation.
First, it allows one to declare variables and functions visible only
within a file ("static" variables), reducing name-space pollution and
enhancing information hiding.  Second, it allows one to define the
same name in two different ways in two different compilation units.
For example, in one file the typename "foo" may represent an integer,
while in another it may represent a complicated structure.  It is
debatable whether this is a positive feature, as it can lead to very
subtle bugs.

Currently, C++ has a concept of a compilation unit.  This is basically
a file (typically the output of a C++ preprocessor) that can be
compiled without reference to any other file.  Also, a compilation unit
cannot affect any other compilation unit until link time.

The first thing we do is define an explicit syntax for declaring a
compilation unit:

	"unit" {
	#include "foo.c"	
	}

The list of declarations that appears inside the "unit" braces
represents one compilation unit.  ("Unit" is written in quotation marks
to avoid introducing another C++ reserved word, similar to the "asm"
declaration of ANSI C.)  One of the benefits of introducing unit
declarations is that one file can now contain several compilation
units:

	"unit" {
	#include "foo.c"	
	}

	"unit" {
	#include "bar.c"	
	}

But that is not the only benefit!  By allowing unit declarations to
nest, we allow the specification of nested, pre-compiled headers, as
well as incremental compilation.

	#include "global.h"

	typedef int Number;

	Number i = 3;

	struct mystruct {
		char * name;
		int age;
	};

	"unit" {
	#define DEBUG 1
	#include "foo.c"	
	}

	"unit" {
	#include "bar.c"	
	}

The idea is that the declarations outside of the unit declarations
only have to be compiled once, and the results of this compilation can
be shared during the compilation of the units.  Also, changing
something inside a unit, such as the #definition of DEBUG in the first
unit, requires only the recompilation of the first unit.  (Of course,
this assumes that the result of compiling the surrounding declarations
has been saved somewhere.)  As an example of the utility of this idea,
consider the common case of experimenting with one subroutine in a
large file.  Current programming environments recompile the whole file
containing the subroutine each time the subroutine is changed.
However, you can surround the subroutine with a unit declaration, and
future environments will recompile only the subroutine.

We thus allow nested, precompiled headers, and we allow incremental
compilation.  For this to work, the following rules should be
implemented:

(1) All non-unit declarations precede the unit declarations.
If we allowed input like this:

	typedef int Number;
	
	"unit" {
	#include "foo.c"
	}

	typedef struct { Number i; Number j; } fobble;

	"unit" {
	#include "bar.c"	
	}

then programmers would expect the first unit to compile knowing ONLY
the definition of Number, and they would expect the second unit to
compile knowing BOTH the definitions of Number and fobble.  This would
require the system to memorize several different compiler states for
each unit, instead of the single scope necessary if this rule is
followed.

(2) No declarations inside a unit can affect anything outside the unit
(until link time).  This is really just the definition of a
compilation unit.

(3) Preprocessing must respect unit boundaries.  In particular, any
#define or #undef in a unit should not affect compilation outside the
unit.  However, preprocessing commands outside a unit must affect the
unit.  These rules preserve the incremental compilation and
pre-compiled header capabilities of the unit notation.
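
One way to picture rule (3) is a macro table that is saved on entry to a
unit and restored on exit.  The following is only a rough sketch under
that assumption; the class MacroTable and its member names are
hypothetical, not part of the proposal or of any real preprocessor.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical model of a preprocessor's macro table that respects
// unit boundaries: definitions made outside a unit are visible inside
// it, but changes made inside a unit are discarded when it ends.
class MacroTable {
public:
    void define(const std::string& name, const std::string& value) {
        macros_[name] = value;
    }
    void undef(const std::string& name) { macros_.erase(name); }
    bool defined(const std::string& name) const {
        return macros_.count(name) != 0;
    }

    // Called at the opening brace of a unit: remember the outer state.
    void enterUnit() { saved_.push_back(macros_); }

    // Called at the closing brace of a unit: restore the outer state.
    void leaveUnit() {
        assert(!saved_.empty());
        macros_ = saved_.back();
        saved_.pop_back();
    }

private:
    std::map<std::string, std::string> macros_;
    std::vector<std::map<std::string, std::string>> saved_;
};
```

Copying the whole table at each unit boundary is the simplest thing
that works for a sketch; a real implementation would presumably want
something cheaper.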



Implementation with Existing Compilers
========================================

First we will outline how existing compilers can implement unit
declarations.  Consider this sample input:

	typedef int Number;
	
	Number n = 3;

	"unit" {
	#include "foo.c"
	}

	"unit" {
	#include "bar.c"	
	}


The idea is to create one file per unit, containing the declarations
outside the units followed by the contents of that unit.  A compiler
can simply split this input into two files, as follows:

file1:   

	typedef int Number;
	
	Number n = 3;

	#include "foo.c"

file2:

	typedef int Number;
	
	Number n = 3;

	#include "bar.c"	

Compiling file1 and file2 separately, and linking the result, produces
the expected result.  

Actually, the process is more complicated than this, because foo.c and
bar.c may contain unit declarations of their own, so file1 and file2
may have to be recursively split into separate files.  
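
The simple, non-recursive split can be sketched as plain text
processing.  This is a minimal illustration only: the function name
splitUnits is made up for the example, it assumes the marker is spelled
exactly "unit" (with the quotes), it works on text that has not been
preprocessed, and it does not cope with braces inside string literals
or comments.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Returns one output text per top-level "unit" { ... } block in `input`.
// Each output is all of the text outside the units followed by the body
// of one unit.  Nested units are left untouched inside the body.
std::vector<std::string> splitUnits(const std::string& input) {
    const std::string marker = "\"unit\"";
    std::string preamble;             // text outside every unit
    std::vector<std::string> bodies;  // text inside each top-level unit
    std::size_t pos = 0;

    while (true) {
        std::size_t m = input.find(marker, pos);
        std::size_t open = (m == std::string::npos)
                               ? std::string::npos
                               : input.find('{', m + marker.size());
        if (open == std::string::npos) {
            preamble += input.substr(pos);  // no more units
            break;
        }
        preamble += input.substr(pos, m - pos);

        // Scan forward to the brace that closes this unit.
        int depth = 1;
        std::size_t i = open + 1;
        while (i < input.size() && depth > 0) {
            if (input[i] == '{') ++depth;
            else if (input[i] == '}') --depth;
            ++i;
        }
        assert(depth == 0 && "unbalanced braces in unit");

        bodies.push_back(input.substr(open + 1, i - open - 2));
        pos = i;  // continue after the closing brace
    }

    std::vector<std::string> files;
    for (const std::string& body : bodies)
        files.push_back(preamble + body);
    return files;
}
```

Applying the same function again to each returned text would handle
units nested one level deeper, which is the recursive splitting
mentioned above.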

Also, C++ insists that there is exactly one definition for every
object, so the rewrite should actually turn every definition into a
declaration, and add a file3 with the actual definitions:

file1:   

	typedef int Number;
	
	extern Number n;

	#include "foo.c"

file2:

	typedef int Number;
	
	extern Number n;

	#include "bar.c"	

file3:

	typedef int Number;
	
	Number n = 3;


Unfortunately, this seems to require a sophisticated program that can
parse C++ declarations.

Implementation with Smart Compilers
========================================

Now we consider how a sophisticated programming environment might use
the unit declarations.  For clarity assume the following program
schema:

"unit" { // unit A
	NonUnitsA;

	"unit" { // unit B
		NonUnitsB;

		"unit" { // unit C
			NonUnitsC;
		} // end unit C
	} // end unit B

	"unit" { // unit D
		NonUnitsD;

		"unit" { // unit E
			NonUnitsE;
		} // end unit E
	} // end unit D
} // end unit A
		
For each unit in a program, a smart compiler must remember the state
of the compiler and the preprocessor after compiling the top level,
non-unit declarations.  So StateA is the state of the compiler after
processing NonUnitsA, StateB is the state of the compiler after
starting in StateA and compiling NonUnitsB, and so on.

Now assume that something in NonUnitsB is changed.  Then unit B, and
any units inside B, must be recompiled.  The smart compiler must
restore the compiler to StateA, recompile NonUnitsB to recompute
StateB, store StateB, and recursively recompile unit C.
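
The set of units affected by a change can be sketched as a walk over
the unit tree.  The struct Unit and the function names below are
assumptions made for this illustration; saving and restoring the
compiler states themselves is not modelled.

```cpp
#include <string>
#include <vector>

// Hypothetical representation of the nesting of units in the schema.
struct Unit {
    std::string name;
    std::vector<Unit> children;
};

// Appends `u` and every unit nested inside it, outermost first.
void collectSubtree(const Unit& u, std::vector<std::string>& out) {
    out.push_back(u.name);
    for (const Unit& child : u.children)
        collectSubtree(child, out);
}

// Finds the unit named `changed` under `root` and fills `out` with the
// units that must be recompiled: the changed unit and all units inside
// it.  Returns false if no unit of that name exists.
bool unitsToRecompile(const Unit& root, const std::string& changed,
                      std::vector<std::string>& out) {
    if (root.name == changed) {
        collectSubtree(root, out);
        return true;
    }
    for (const Unit& child : root.children)
        if (unitsToRecompile(child, changed, out))
            return true;
    return false;
}

// Builds the A..E nesting used in the schema above.
Unit exampleSchema() {
    return Unit{"A", {Unit{"B", {Unit{"C", {}}}},
                      Unit{"D", {Unit{"E", {}}}}}};
}
```

For a change in NonUnitsB this yields B and C, leaving A, D and E
alone, which matches the description above.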

Deciding which units have to be recompiled is a crucial step.  This
process must be efficient to realize the benefits of incremental
compilation.  Specifically, detecting what has changed in a
compilation unit must be faster than just recompiling the compilation
unit.
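
One cheap way to detect changes, offered here purely as an assumption
and not as part of the proposal, is to keep a fingerprint of each
unit's source text from the last compile and compare it with the
current text.  The names FingerprintCache and changedUnits are
hypothetical.

```cpp
#include <cstddef>
#include <functional>
#include <map>
#include <string>
#include <vector>

// Hypothetical cache: unit name -> fingerprint of its source text as
// of the last successful compile.
using FingerprintCache = std::map<std::string, std::size_t>;

// Returns the names of units whose text differs from what the cache
// remembers, and updates the cache.  `sources` maps each unit name to
// its current source text.  Two different texts could in principle
// share a fingerprint, so a careful tool would also compare the text
// itself or use a stronger hash.
std::vector<std::string> changedUnits(
        const std::map<std::string, std::string>& sources,
        FingerprintCache& cache) {
    std::vector<std::string> changed;
    for (const auto& entry : sources) {
        const std::size_t fp = std::hash<std::string>{}(entry.second);
        const auto it = cache.find(entry.first);
        if (it == cache.end() || it->second != fp) {
            changed.push_back(entry.first);
            cache[entry.first] = fp;
        }
    }
    return changed;
}
```

Hashing the text is linear in its length and needs no parsing, so it
should normally be much cheaper than recompiling the unit.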

Summary
========================

By adding a very simple top level declaration to the C++ language, we
provide a structured way of allowing future programming environments
to provide incremental compilation and precompiled header files.  All
we have done is provide a language construct, the unit declaration,
for what is typically an operating system construct, the file.  Then a
slight generalization allows the advanced features of incremental
compilation and precompiled header files.

williams@umaxc.weeg.uiowa.edu (Kent Williams) (11/06/90)

A fellow I know who is deep into language design refers to C++ as 'COBOL for
the 90s', and I tend to agree -- there's too much syntax as it is, and as
evidenced by this group, too many obscure ways to be bitten by it.  I'm not
that crazy about MI either -- just so you'll know where I'm coming from.

As for the 'Unit' construct, I think that this needs to be NOT a part
of the language, but rather part of the programming environment.  You
can get 99% of the way there by using 'make' and precompiling headers.

One of the cases the 'Unit' construct is employed to fix is where you
want to change one function in a large file.  The easier way to solve
this is NO LARGE FILES.

This is where a good programming environment comes into play.  I would
prefer to see a tool that is capable of managing large numbers of
classes and functions transparently to the user, and smart enough to
recompile only what changes.  The Smalltalk browser comes to mind.

In the real world, I regularly work on programs where 80% of compile
time is spent parsing headers.   Precompiling headers would help this
immensely.  No change to a production compiler is trivial, but this
would be fairly easy, and doesn't involve language changes.  Even
pre-tokenizing could give you a performance win without costing a
large amount of deep thought!



--
             Kent Williams --- williams@umaxc.weeg.uiowa.edu 
"'Is this heaven?' --- 'No, this is Iowa'" - from the movie "Field of Dreams"
"This isn't heaven, ... this is Cleveland" - Harry Allard, in "The Stupids Die"