[comp.lang.c] "external" types

david@june.cs.washington.edu (David Callahan) (11/14/88)

Managing include files is a nightmare. The last time I was part
of a good-sized C project (this also applies to a PL/I project
I was on), I could almost never generate
a correct sequence of include files and invariably ended
up copying a sequence from an existing file. One result
of this was that I was probably including things I didn't
need, simply because I was satisfied once I got the thing
to compile. Part of this was my fault (for not spending
the time to learn the structure of the system adequately),
part of this was the project's fault (not enough time
had been spent structuring and documenting things)
and part of the fault is the primitive nature of #include.
(This was an academic project. Spending resources, personal
and team, on library management is a long-term
investment that short-sighted research projects sometimes
fail to make. Never again! :-)

I observed two competing pressures on include file
structure (meaning the content and inter-dependence of
include files). Programmers naturally want to split
include files as finely as possible so that they form
clean logical units and reduce compilation dependencies.
This aids understanding of the 
include file in isolation. Clients (other programmers)
naturally want to merge include files together so that
if they want to manipulate abstract syntax trees (say)
they simply include "ast.h" and off they go. 
Similarly, I want to include "interface" once, not
"buttons" and "menus" and ... separately.  Of course,
there is also the problem that I have to remember
"ast.h" or "interface/buttons/....h" to use the system
and if one of those big files changes, the world will
have to be recompiled, unnecessarily in a lot of cases.

(I know I may be preaching to the choir; I just wanted
to express my experience with this problem.)

I feel this problem arises because, abstractly, types have scopes
just like functions do. Once I define "ast_node" I want
that type to have "external" scope. In other
modules I would like to say:
	extern struct ast_node;
and that would be the end of it; the system (compiler) would find
out what it needed about "ast_node" and I would not
need to know specifically where the type is defined. 
Even more useful for C++, after:
	extern class ast_node;
I not only have the data type, but also member functions. We also
might want to extend "extern" for functions:
	extern foo();
to omit the type information (foo is defined, so it is already typed).
This would be particularly useful if "foo" is overloaded.
It might not be necessary to even specify the category of a name:
	extern bar;
looks up all symbols (struct's, functions, variables) with
name "bar" and makes them available for use in the current
context. The compiler infers from context which "bar" 
I need/mean.  In fact, the compiler could treat any unresolved
name implicitly as an extern, but this could lead to some
unintended interactions, particularly with overloading, so I
don't advocate that. This is also close to the Fortran-8X module
mechanism:
	USE  BAR ;
will look up BAR as an external name of "module" type.

Three problems must be addressed to support this concept:
	1. Compilation Environment
	2. Name space Management
	3. Recompilation

1. By "compilation environment" I mean the dictionary of external
names available when a module is compiled. This is essentially
the same problem as defining include file search paths and should
be handled with compiler command line directives (or environment
variables or whatever). There are two kinds of context: dictionaries
that are sources of name definitions and a dictionary that
will be updated with the definitions of external names defined
in the module. It is useful to group the names in a dictionary under
a single name so that they can be updated as a group. The command line
might look like:

	CC -P<program name>(<module name>) -L<library name> ...
		f1.c f2.c ...

where the -P option asserts that the listed files are to be compiled
in the context of program <program name> and provide (or replace)
the definitions grouped under the name <module name>. Additional
unresolved names can be found in libraries specified with
the -L option. (NOTE: pick your own option if you don't like P and L,
I use them here because they are mnemonic.) 

Observe that -P is similar to a -o loader option which
specifies an executable name and -L is similar to both a -I 
include search path specifier and a -l loader command. The major 
effect of allowing external types is that files that previously did
not need to be known to each other until link time must now be
specified together at compile time.

The syntax "program(module)" is directly taken from the 
"archive(member)" syntax used in "make" and is used here because 
the database of external definitions could be kept in a file 
with name "program" with the information segmented into
named segments.

2. By "name space management" I refer to the need to have local
names. "File" local names are available by defining a name and
not giving it the "extern" attribute. The problem is that since
every library will want to have its own "list" (or whatever)
we need a mechanism to limit the scope of an "extern" to a 
particular library. This will be very dependent on how libraries 
are constructed, and hence very operating-system dependent, so I will
speak in general terms. The basic task is to take a "program"
dictionary of names and "close" it so that all "external" names
are bound to their definition points inside the library,
if any exist, and so that a subset of the external names is "exported".
Only exported names can be used in other programs. When a 
program is turned into a library, it may not have any unresolved
type definitions but it may have some unimplemented ones.
Unresolved types could be allowed but that would require
that the library source be retained so that it could be recompiled
when the type bindings are resolved. The conclusion is that
we need a file which specifies an "export" list and a command
that takes a program and library search list together with an export
specification and creates a new library.

3. By "recompilation" I refer to the need to determine, after 
a definition of a name has been modified, what other modules
must be recompiled. This is an extremely important problem for
large programs and the existence of the external
name dictionary allows precise solutions. The assumption is that
part of the information the compiler stores in the program dictionary
is a dependency list of some kind for each name.
There are two situations that can occur:

a. Module "A" uses a name and the type of that name changes
so that "A" will have a type-fault when recompiled. The programmer 
should be made aware of this problem but may choose not to
resolve it. Why? Because the fault occurs in a part of the program he
knows will not be executed during his next execution cycle. A great
solution is to have the debugger simply set break
points on all "out of date" functions to protect the
programmer from his "knowledge". This requires that the compiler
record name usage at the function level, but this does not
seem difficult. An interesting prospect would be to allow 
the programmer to edit, recompile and re-link at one
of these break points and continue execution.

b. Module "A" uses a name and a non-type attribute of the name
changes. Examples of non-type attributes are offsets of
members within structures and default values
of function arguments. These changes require that the compilation
process be restarted from the point where these attributes are bound 
into the program. This could be as late as the link phase.
Again, providing a list of names first to the programmer
and then to the debugger could allow this process to
be done "on demand".

I'd be interested in developing a catalog of non-type
attributes so that we can explore ways to delay binding
them as late as possible.

The basic step is to provide a command that takes a "program"
name dictionary and lists the module, function, or variable
names which are "out of date" and need either editing (to
fix type faults) or recompilation. In a UNIX/make environment
you could easily make each module a "target" in a make file
and have the output of this "consistency" check formatted like
command lines for other tools. For example:

	make -k `verify  <program name> | awk -f make_targets.awk`
or
	verify <program name> | awk -f dbx_breakpoints.awk > .dbxinit
	dbx a.out 
assuming dbx supported a "set break point at this function" command
line option.
or even RCS, assuming a module name can be mapped into
an RCS version number:
	verify <program name> | awk -f rcs_checkout.awk > rcs_script ;
	source rcs_script
Thus this dictionary could be made to fit well into a UNIX
environment, and you could run your "source" program through
any kind of preprocessor (cpp, m4, ratfor, cfront, tangle...)
you liked before it hits the compiler.

Comments are encouraged.

David Callahan, Tera Computer Co.   Seattle, WA