kearns@cs.columbia.edu (Steve Kearns) (11/06/90)
Adding Units to C++ (c) 1990 by Steven Kearns
============================

This paper introduces a simple language construct that permits future implementation of precompiled headers and incremental compilation. At the same time, it maintains backwards compatibility with existing C++ code.

Computer languages, C++ in particular, are complicated by the necessity of allowing separate compilation. Some large programs require many hours to recompile completely, so separate compilation is a necessary evil.

C++ compilers can spend most of their time compiling the same (large) set of header files, instead of compiling new code. This has led to a number of proposals for adding precompiled headers to existing compilers. In fact, Microsoft Quick C 2.0 already allows this.

In addition, there have been suggestions for implementing incremental compilation. (Rational and Saber C both provide interpreters which simulate incremental compilation.)

Finally, most integrated programming environments support a "project document" which lists all of the files that make up a project. The integrated environments use this to automate the process of building the program (the "make" process). It would be nice if a C++ compiler could accept one file listing, in one place, all the files necessary for building a program.

There are a few positive features arising from separate compilation. First, it allows one to declare variables and functions visible only within a file ("static" variables), reducing name-space pollution and enhancing information hiding. Second, it allows one to define the same name in two different ways in two different compilation units. For example, in one file the typename "foo" may represent an integer, while in another it may represent a complicated structure. It is debatable whether this is a positive feature, as it can lead to very subtle bugs.

Currently, C++ has a concept of a compilation unit.
This is basically a file (typically the output of a C++ preprocessor) that can be compiled without reference to any other file. Also, a compilation unit cannot affect any other compilation unit until link time.

The first thing we do is define an explicit syntax for declaring a compilation unit:

    "unit" {
        #include "foo.c"
    }

The list of declarations that appears inside the "unit" braces represents one compilation unit. ("Unit" is surrounded by quotation marks to avoid introducing another C++ reserved word, similar to the "asm" declaration of ANSI C.)

One of the benefits of introducing unit declarations is that one file can now contain several compilation units:

    "unit" {
        #include "foo.c"
    }
    "unit" {
        #include "bar.c"
    }

But that is not the only benefit! By allowing unit declarations to nest, we allow the specification of nested, pre-compiled headers, as well as incremental compilation.

    #include "global.h"
    typedef int Number;
    Number i = 3;
    struct mystruct {
        char * name;
        int age;
    };
    "unit" {
        #define DEBUG 1
        #include "foo.c"
    }
    "unit" {
        #include "bar.c"
    }

The idea is that the declarations outside of the unit declarations only have to be compiled once, and the results of this compilation can be shared during the compilation of the units. Also, changing something inside a unit, such as the #define of DEBUG in the first unit, requires only the recompilation of the first unit. (Of course, this assumes that the result of compiling the surrounding declarations has been saved somewhere.)

As an example of the utility of this idea, consider the common case of experimenting with one subroutine in a large file. Current programming environments recompile the whole file containing the subroutine each time the subroutine is changed. However, you can surround the subroutine with a unit declaration, and future environments will recompile only the subroutine.

We thus allow nested, precompiled headers, and we allow incremental compilation.
In order for this all to work out, the following rules should be implemented:

(1) All non-unit declarations precede the unit declarations. If we allowed input like this:

    typedef int Number;
    "unit" {
        #include "foo.c"
    }
    typedef struct { Number i; Number j; } fobble;
    "unit" {
        #include "bar.c"
    }

then programmers would expect the first unit to compile knowing ONLY the definition of Number, and they would expect the second unit to compile knowing BOTH the definitions of Number and fobble. This would require the system to memorize several different compiler states for each unit, instead of the single scope necessary if this rule is followed.

(2) No declarations inside a unit can affect anything outside the unit (until link time). This is really just the definition of a compilation unit.

(3) Preprocessing must respect unit boundaries. In particular, any #define or #undef in a unit should not affect compilation outside the unit. However, preprocessing commands outside a unit must affect the unit.

These rules preserve the incremental compilation and pre-compiled header capabilities of the unit notation.

Implementation with Existing Compilers
========================================

First we will outline how existing compilers can implement unit declarations. Consider this sample input:

    typedef int Number;
    Number n = 3;
    "unit" {
        #include "foo.c"
    }
    "unit" {
        #include "bar.c"
    }

The idea is to create a file for each unit, containing the declarations outside the units followed by the contents of that one unit. A compiler can simply split this input into two files, as follows:

    file1:
        typedef int Number;
        Number n = 3;
        #include "foo.c"

    file2:
        typedef int Number;
        Number n = 3;
        #include "bar.c"

Compiling file1 and file2 separately, and linking the result, produces the expected result. Actually, the process is more complicated than this, because foo.c and bar.c may contain unit declarations of their own, so file1 and file2 may have to be recursively split into separate files.
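The naive splitting step described above can be sketched roughly as follows. This is only an illustration under simplifying assumptions, not a definitive implementation: `split_units` is a hypothetical helper name, the scan ignores braces that appear inside string literals or comments, it does not recurse into nested units, and it copies the shared declarations verbatim into each output file.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not part of the proposal): given source text whose
// top level is a run of shared declarations followed by "unit" { ... }
// blocks, return one output file per block, each consisting of the shared
// declarations followed by that block's body.
std::vector<std::string> split_units(const std::string& src) {
    const std::string marker = "\"unit\"";
    std::vector<std::string> files;

    // Rule (1): everything before the first unit is the shared prefix.
    std::size_t pos = src.find(marker);
    const std::string shared = src.substr(0, pos);

    while (pos != std::string::npos) {
        const std::size_t open = src.find('{', pos + marker.size());
        assert(open != std::string::npos && "unit marker without '{'");

        // Scan forward to the brace that closes this unit.
        int depth = 0;
        std::size_t i = open;
        for (; i < src.size(); ++i) {
            if (src[i] == '{') {
                ++depth;
            } else if (src[i] == '}' && --depth == 0) {
                break;
            }
        }
        assert(i < src.size() && "unbalanced braces in unit");

        // Body is the text strictly between the outer braces.
        files.push_back(shared + src.substr(open + 1, i - open - 1));

        // Continue after the closing brace, so nested units are skipped.
        pos = src.find(marker, i + 1);
    }
    return files;
}
```

Called on the sample input above, this returns two strings corresponding to file1 and file2. Any nested unit declarations are left untouched inside the body of their enclosing unit.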
Also, C++ insists that there is exactly one definition for every object, so the rewrite should actually turn every definition into a declaration, and add a file3 with the actual definitions:

    file1:
        typedef int Number;
        extern Number n;
        #include "foo.c"

    file2:
        typedef int Number;
        extern Number n;
        #include "bar.c"

    file3:
        typedef int Number;
        Number n = 3;

Unfortunately, this seems to require a sophisticated program that can parse C++ declarations.

Implementation with Smart Compilers
========================================

Now we consider how a sophisticated programming environment might use the unit declarations. For clarity assume the following program schema:

    "unit" { // unit A
        NonUnitsA;
        "unit" { // unit B
            NonUnitsB;
            "unit" { // unit C
                NonUnitsC;
            } // end unit C
        } // end unit B
        "unit" { // unit D
            NonUnitsD;
            "unit" { // unit E
                NonUnitsE;
            } // end unit E
        } // end unit D
    } // end unit A

For each unit in a program, a smart compiler must remember the state of the compiler and the preprocessor after compiling that unit's top level, non-unit declarations. So StateA is the state of the compiler after processing NonUnitsA, StateB is the state of the compiler after starting in StateA and compiling NonUnitsB, and so on.

Now assume that something in NonUnitsB is changed. Then unit B, and any units inside B, must be recompiled. The smart compiler must restore the compiler to StateA, recompile NonUnitsB to recompute StateB, store StateB, and recursively recompile unit C.

Deciding which units have to be recompiled is a crucial step. This process must be efficient to realize the benefits of incremental compilation. Specifically, detecting what has changed in a compilation unit must be faster than just recompiling the compilation unit.

Summary
========================

By adding a very simple top level declaration to the C++ language, we provide a structured way of allowing future programming environments to provide incremental compilation and precompiled header files.
All we have done is provide a language construct, the unit declaration, for what is typically an operating system construct, the file. Then a slight generalization allows the advanced features of incremental compilation and precompiled header files.
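The recompilation decision described under "Implementation with Smart Compilers" can be sketched as a walk over the tree of units. This is a minimal sketch under stated assumptions: the names `Unit`, `mark_compiled`, and `collect_dirty` are hypothetical, and change detection is done by comparing each unit's non-unit text against a saved copy, which is one possible method that the paper itself does not specify. The sketch only decides which units to recompile; it does not save or restore compiler states.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical model of one unit in the schema: its name, the text of its
// own non-unit declarations, the text seen at the last compile, and the
// units nested directly inside it.
struct Unit {
    std::string name;
    std::string non_units;       // e.g. the text of NonUnitsB
    std::string compiled_text;   // non_units as of the last compile
    std::vector<Unit> children;

    Unit(std::string n, std::string text)
        : name(std::move(n)), non_units(std::move(text)) {}
};

// Record the current text for a unit and everything nested inside it,
// as a compiler might do after a full build.
void mark_compiled(Unit& u) {
    u.compiled_text = u.non_units;
    for (Unit& c : u.children) {
        mark_compiled(c);
    }
}

// Append the names of all units that need recompiling, outermost first.
// A unit is dirty if its own non-unit text changed, or if an enclosing
// unit is dirty, since its starting state (StateA for unit B, and so on)
// is then no longer valid.
void collect_dirty(const Unit& u, bool parent_dirty,
                   std::vector<std::string>& out) {
    assert(!u.name.empty() && "units in this sketch are identified by name");
    const bool dirty = parent_dirty || (u.non_units != u.compiled_text);
    if (dirty) {
        out.push_back(u.name);
    }
    for (const Unit& c : u.children) {
        collect_dirty(c, dirty, out);
    }
}
```

With the A through E schema from the paper, editing NonUnitsB after a full build and then calling `collect_dirty` on unit A with `parent_dirty` set to false yields B followed by C, while A, D, and E are left alone.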
williams@umaxc.weeg.uiowa.edu (Kent Williams) (11/06/90)
A fellow I know who is deep into language design refers to C++ as 'COBOL for the 90s', and I tend to agree -- there's too much syntax as it is, and as evidenced by this group, too many obscure ways to be bitten by it. I'm not that crazy about MI either -- just so you'll know where I'm coming from.

As for the 'Unit' construct, I think that this needs to be NOT a part of the language, but rather part of the programming environment. You can get 99% of the way there by using 'make' and precompiling headers.

One of the cases the 'Unit' construct is employed to fix is where you want to change one function in a large file. The easier way to solve this is NO LARGE FILES.

This is where a good programming environment comes into play. I would prefer to see a tool that is capable of managing large numbers of classes and functions transparently to the user, and smart enough to recompile only what changes. The Smalltalk browser comes to mind.

In the real world, I regularly work on programs where 80% of compile time is spent parsing headers. Precompiling headers would help this immensely. No change to a production compiler is trivial, but this would be fairly easy, and doesn't involve language changes. Even pre-tokenizing could give you a performance win without costing a large amount of deep thought!

--
Kent Williams --- williams@umaxc.weeg.uiowa.edu
"'Is this heaven?' --- 'No, this is Iowa'" - from the movie "Field of Dreams"
"This isn't heaven, ... this is Cleveland" - Harry Allard, in "The Stupids Die"