david@june.cs.washington.edu (David Callahan) (11/14/88)
Managing include files is a nightmare. Last time I was part of a good-sized C project (this also applies to a PL/I project I was on), I could almost never generate a correct sequence of include files and invariably ended up copying a sequence from an existing file. One result was that I was probably including things I didn't need, simply because I was satisfied once the thing compiled. Part of this was my fault (for not spending the time to learn the structure of the system adequately), part was the project's fault (not enough time had been spent structuring and documenting things), and part is the primitive nature of #include. (This was an academic project. Spending resources, personal and team, on library management is a long-term investment that short-sighted research projects sometimes fail to make. Never again! :-)

I observed two competing pressures on include file structure (meaning the content and inter-dependence of include files). Programmers naturally want to split include files as finely as possible so that they make clean logical units and reduce compilation dependencies. This aids understanding of an include file in isolation. Clients (other programmers) naturally want include files merged together so that if they want to manipulate abstract syntax trees (say) they simply include "ast.h" and off they go. Similarly, I want to include "interface" once, not "buttons" and "menus" and ... separately. Of course, there is also the problem that I have to remember "ast.h" or "interface/buttons/....h" to use the system, and if one of those big files changes, the world must be recompiled, unnecessarily in many cases. (I know, I may be preaching to the choir; I just wanted to express my experience with this problem.)

I feel this problem arises because, abstractly, types have scopes just as functions do. Once I define "ast_node" I want that type to have "external" scope. In other modules I would like to say:

	extern struct ast_node;

and that would be the end of it; the system (compiler) would find out what it needed about "ast_node" and I would not need to know specifically where the type is defined. Even more useful for C++, after:

	extern class ast_node;

I not only have the data type, but also its member functions.

We might also want to extend "extern" for functions:

	extern foo();

to omit the type information (foo is defined, so it already has a type). This would be particularly useful if "foo" is overloaded. It might not even be necessary to specify the category of a name:

	extern bar;

looks up all symbols (struct's, functions, variables) with name "bar" and makes them available for use in the current context; the compiler infers from context which "bar" I need/mean. In fact, the compiler could treat any unresolved name implicitly as an extern, but this could lead to some unintended interactions, particularly with overloading, so I don't advocate that. This is also close to the Fortran-8X module mechanism:

	USE BAR

will look up BAR as an external name of "module" type.
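To make this concrete, here is a sketch of what a client module might look like if a compiler supported these forms. The syntax is of course hypothetical, and "ast_node", "make_node" and "walk" are invented names for illustration:

	/* walker.c -- note: no #include of "ast.h" anywhere */

	extern struct ast_node;               /* proposed: the type is found
	                                         in the external dictionary  */
	extern struct ast_node *make_node();  /* ordinary extern, as today   */
	extern walk;                          /* proposed: category (here, a
	                                         function) inferred from use */

	void process()
	{
	    struct ast_node *root = make_node();
	    walk(root);          /* resolved from the imported definition */
	}

The point is that the compiler, not the programmer, consults the dictionary to learn what it needs about "ast_node", so no include sequence has to be reconstructed by hand.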
Three problems must be addressed to support this concept:

	1. Compilation environment
	2. Name space management
	3. Recompilation

1. By "compilation environment" I mean the dictionary of external names available when a module is compiled. This is essentially the same problem as defining include file search paths and should be handled with compiler command line directives (or environment variables or whatever). There are two kinds of context: dictionaries that are sources of name definitions, and a dictionary that will be updated with the definitions of external names defined in the module. It is useful to group the names in a dictionary under a single name so that they can be updated as a group. The command line might look like:

	CC -P<program name>(<module name>) -L<library name> ... f1.c f2.c ...

where the -P option asserts that the listed files are to be compiled in the context of program <program name> and provide (or replace) the definitions grouped under the name <module name>. Additional unresolved names can be found in libraries specified with the -L option. (NOTE: pick your own options if you don't like P and L; I use them here because they are mnemonic.) Observe that -P is similar to a -o loader option, which specifies an executable name, and -L is similar to both a -I include search path specifier and a -l loader command. The major effect of allowing external types is that files that previously did not need to know about each other until link time must now be specified together at compile time. The syntax "program(module)" is taken directly from the "archive(member)" syntax used in "make" and is used here because the database of external definitions could be kept in a file named "program" with the information segmented into named segments.

2. By "name space management" I refer to the need to have local names. "File" local names are available by defining a name and not giving it the "extern" attribute. The problem is that since every library will want to have its own "list" (or whatever), we need a mechanism to limit the scope of an "extern" to a particular library. This will be very dependent on how libraries are constructed and hence very operating system dependent, so I will speak in general terms. The basic task is to take a "program" dictionary of names and "close" it so that all "external" names are bound to definition points inside the library, if there are any, and a subset of the external names are "exported". Only exported names can be used in other programs. When a program is turned into a library, it may not have any unresolved type definitions, though it may have some unimplemented ones. Unresolved types could be allowed, but that would require that the library source be retained so that it could be recompiled when the type bindings are resolved. The conclusion is that we need a file which specifies an "export" list, and a command that takes a program and a library search list, together with an export specification, and creates a new library.

3. By "recompilation" I refer to the need to determine, after the definition of a name has been modified, what other modules must be recompiled. This is an extremely important problem for large programs, and the existence of the external name dictionary allows precise solutions. The assumption is that part of the information the compiler saves in the program dictionary is a dependency list of some kind for each name. Two situations can occur:

a. Module "A" uses a name, and the type of that name changes so that "A" will have a type-fault when recompiled. The programmer should be made aware of this problem but may choose not to resolve it. Why? Because it occurs in a part of the program he knows will not be executed during his next execute cycle. A great solution is to have the debugger simply set break points on all "out of date" functions to protect the programmer from his "knowledge". This requires that the compiler record name usage at the function level, but that does not seem difficult. An interesting prospect would be to allow the programmer to edit, recompile and re-link at one of these break points and continue execution.

b. Module "A" uses a name, and a non-type attribute of the name changes. Examples of non-type attributes are offsets of members within structures and default values of function arguments. These changes require that the compilation process be restarted from the point where these attributes are bound into the program, which could be as late as the link phase. Again, providing a list of names first to the programmer and then to the debugger would allow this process to be done "on demand". I'd be interested in developing a catalog of non-type attributes so that we can explore ways to delay binding them as late as possible.
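To illustrate situation (a) concretely (invented names, and the hypothetical extern-type syntax from above):

	/* Version 1, in the dictionary when module A was compiled: */
	struct point { int x, y; };

	/* Version 2: the definition is later edited and member "x"
	   is renamed, changing the type:                           */
	struct point { int xpos, y; };

	/* Module A still contains this code, which type-faults when
	   recompiled against the new definition:                   */
	extern struct point;                 /* hypothetical syntax */
	int get_x(struct point *p) { return p->x; }  /* no member x */

Because the dictionary records which names each function uses, the debugger can plant break points on exactly the out-of-date functions.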
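Situation (b) is subtler; a C++ default argument value (one of the non-type attributes mentioned above) shows why. Again the names are invented:

	// Version 1, in the dictionary when module A was compiled:
	int open_window(int width = 80);

	// Version 2: no type has changed, only a default value.
	int open_window(int width = 132);

	// A call compiled against version 1,
	//     open_window();
	// already has the constant 80 bound in at the call site, so A
	// must be recompiled even though it still type-checks cleanly.

This is exactly the sort of attribute a catalog could classify by how late its binding can be delayed.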
The basic step is to provide a command that takes a "program" name dictionary and lists the module, function or variable names which are "out of date" and need either editing (to fix type faults) or recompilation. In a UNIX/make environment you could easily make each module a "target" in a makefile and have the output of this "consistency" check formatted as command lines for other tools. For example:

	make -k `verify <program name> | awk -f make_targets.awk`

or

	verify <program name> | awk -f dbx_breakpoints.awk > .dbxinit
	dbx a.out

assuming dbx supports a "set break point at this function" command in its .dbxinit startup file. Or even RCS, assuming a module name can be mapped into an RCS version number:

	verify <program name> | awk -f rcs_checkout.awk > rcs_script
	source rcs_script

Thus this dictionary could be made to fit well into a UNIX environment, and you could run your "source" program through any kind of preprocessor (cpp, m4, ratfor, cfront, tangle, ...) you liked before it hits the compiler.

Comments are encouraged.

David Callahan, Tera Computer Co.
Seattle, WA