[comp.lang.c] Tools for reading readable code

ADLER1@brandeis.bitnet (01/01/88)

This message may be regarded as a sequel to a message I posted several
months ago entitled "Reading Readable Code" in which I asked for advice
on how to read large C programs. Here is one stage of my thinking about
this question.


I am interested in having a tool which I think others would also find
useful. It is based on the point of view that a C program is a collection
of commands which are extensions to the language C. From this point of
view, one can analyze a program in a certain way and one can make these
new commands available to oneself with a minimum of hacking. Here is how
it works.

Let CF denote the set of all functions which are normally part of the
language C, including those in the libraries. Let P be a C program which
we assume compiles and runs and which will be fixed throughout this discussion.
Let PF denote the set of all functions which are defined in the C program P .
If A is a subset of PF, denote by PF[A] the C program obtained by commenting
out all of the function definitions of functions which are not in A.
Depending on the choice of the subset A, the program PF[A] may or may not run.
I am interested in subsets A such that PF[A] does compile,link and run.
One obvious condition on the subset A in order for this to happen is that
every function called in PF[A] is either in A or is in CF . Let us agree
to call a subset A of PF "closed" if this condition is satisfied. It is
easy to see that there is a topology on PF with respect to which these "closed"
sets are actually closed, and that is one reason for this choice of
nomenclature. But this discussion does not require a knowledge of topology.

Given a subset A of PF, we define the "closure" of A to be the smallest closed
set containing A . Thus, the closure of A is the set B of all functions in PF
which must be included in any working program which contains PF[A]. The closure
of A will be denoted by A# . We also define K(A) to be all of the functions
which are called directly within the bodies of functions in A . Then K(A)
may or may not contain A. Now, for any positive integer n define L(n,A) to
be A if n=1 and to be the union of L(n-1,A) and K(L(n-1,A)) if n > 1 .
Since PF is a finite set, there is some positive integer m such that
L(m,A) = L(m+1,A) . By the "level of A" I mean the smallest such integer m
and I denote the level of A by LEV(A) . It is easy to see that the closure of
A is the same as L(LEV(A),A) .

Having arrived at this notion of the level of a subset A, it is possible to
introduce some natural subsets A of PF which lead to interesting programs
PF[A] . For example, suppose k is a positive integer and we let PF_k denote
the set of all functions f in PF such the the set {f} has level at most k.
Then PF[PF_k] runs and only uses relatively uncomplicated functions.
Take the case k=1, for example. Then PF[PF_1] is a program which contains
all of the functions in PF which call only C functions and themselves.
If one wanted to study PF, one could first play with PF[PF_1] to familiarize
oneself with the level 1 commands. These would also be available to one
in other programs without a lot of hacking. After one mastered the level 1
functions, one could play with PF[PF_2], and so on. Many people do something
like this in a random intuitive way when looking at a program, but this
formulation makes possible to automate the process.

Thus it is desirable to have routines which, given a set A, will do the
following:

(1) edit the file containing P and produce the program PF[A] .
(2) compute the level of A .
(3) compute the closure of A .

It is also desirable to have routines which will find sets such as PF_k .
There are other interesting sets which one could work with. For example,
suppose one defines the dimension of a closed set A to be the largest
nonnegative integer n such that there are n  nonempty closed sets A1, A2,
A3, ... An such that A properly contains A1 , A1 properly contains A2, etc.
A minimal closed set will be called an "atom", so an atom is a closed set
of dimension 0 . One could then play with all of the PF[A] where A was
an atom or one could let A be the union of all of the closed sets of
dimension at most k .

Given a routine to compute the dimension of a closed set and a routine
to collect all closed sets which have given numerical characteristics such
as level and dimension and routines to take unions and intersections and
complements,  one could have great flexibility in configuring the program
PF[A] that one would like to play with and the work of configuring it would
be done by the computer. One of the functions almost certain to be commented
out is main(). This allows one to write a new version of main and it is in
this version of main that one writes programs which use the new
commands which one has found.

If one wanted to go off the deep end, one could forget all about programming
and start computing the homotopy type of the space PF with the topology I
have described above, but let us try to focus on the practical problem
of implementing the routines which I have suggested would be desirable.
Here I run up against the problem that I am really weak at writing C
programs. I am not equal to the task of writing a program which will
make the list PF,  for example. Nor am I capable of writing a routine
which will produce a list of all functions referenced by a given function.
I have ideas for kluges to get around my weakness in this area but even
the kluges lead to difficulties which I can't get around. Here are some
possibilities:

(1) In GNU emacs, there is a nice program called etags.c which produces
    a file called TAGS which contains all of the functions defined in
    the programs under study. Even I can write a program which will
    edit this file and produce the list PF . But I have had a lot of
    trouble getting etags.c to run on non-UNIX systems. On VMS , on the IBM 370,
    and on an IBM PC XT, it produces the file TAGS but it doesn't print the
    functions names (only their positions) and there is some doubt in
    my mind as to whether it is recognizing functions properly. I am not
    competent, apparently, to patch it up so that it works or to analyze
    etags.c by itself without going through the laborious process of
    implementing my suggestions for PF[PF_1] etc by hand.

(2) In GNU CC there is a file called parse.y which contains a yacc (or rather
    bison) grammar for the GNU C compiler. If I understood the workings of
    yacc better, I could modify the file so that its actions would do all
    of the editing required to produce PF[PF_k] or PF[A] for various A .
    Or, it could compute the closure of a set A. These seem like handy
    options for a compiler, but I am not competent to do it myself. The
    only kluge that occurred to for discovering dependence of functions
    was to write a program which would do textual analysis of the error
    messages from the linker when PF[A] is compiled. But even that is
    somewhat beyond me.


If anyone has any such routines or if anyone is interested in writing such
routines or has any other advice, I welcome your correspondence.

Sincerely,

Allan Adler
ADLER1@BRANDEIS.BITNET

g-rh@cca.CCA.COM (Richard Harter) (01/02/88)

[... long article about closure of calls and levels of calls ...]
...
[... requesting information about possible implementations ...]

My company (SMDS Inc.) sells software that includes this sort of 
functionality.  This won't help if you want a public domain set
of tools; however you can build this sort of functionality with
PD tools without too much trouble (:-)).

What you need are two sets of tools -- one to analyze source programs,
and the second to manipulate sets.  I will describe what each set of
tools has to do and then mention fundamental problems with doing this.

Source Analysis Tools Needed:

(1)	A source language scanner that determines, for each file, the
functions contained in the file, and the functions called by each function.

(2)	A source language scanner that determines, for each file, the 
include files included in the file.

One can put together (2) using standard unix tools; we have a C program
to do this.  It is a simple program to write.  The source language scanner
is a bit trickier:  We have programs to do this.  One can also kludge it
together, using standard tools.  However we don't do it "right" and it
is a rather difficult proposition to "do it right".  Some of the problems
are:

(2.a)	The scanner should be able to distinguish between macro calls and
function calls.

(2.b)	The structure of the software in a file can be changed by the
arguments to the preprocesser, i.e. you can pass in "defines" through
the cc command.

(2.c)	Pointers to functions can be passed as data, either through 
globals or through calling sequences.  For example, suppose that an
array of function pointers is global.  Then any routine which can see
this array can alter a pointer in the array.  If routine X executes
function n in the array we cannot tell, via static analysis, what 
routine it will be executing nor can we tell, by examining the file
containing routine X, what functions it might possibly be executing.
There are a lot of wrinkles on this problem and I won't go into it
any further.

(2.d)	One must also take into accunt files included, since include
files can introduce new function references that are not apparent in
the original source.  Again, the actual references induced depend upon
the "ifdefs" used.

(2.e)	There are subtle issues involved with name scoping and duplication.
Function name foo may occur in several different files and be different
functions in different places.


Given the descriptive relationship data one also needs tools to manipulate
sets.  This breaks down into two parts.  The first is the mechanics of
reading the lists of relationships from the scanners and handling the
representation of them.  This is a nontrivial problem for large programs.

The set tools needed are (a) Given a relationship and a set of functions
(files) you need to be able to get the set bearing that relationship to
your original set.  In our jargon you need to be able to say things like

P2 = calls P1

where P1 is a list of functions and P2 is the list of all functions calling
any function in list P1.  You also need to be able to do

P2 = P1 calls

where P2 is the list of all functions called by any function in list P1.
You also need to be able to find unions and  intersections of sets.

A function call closure algorithm runs as follows:

P1 = main
P2 = main calls
while (P2 is not empty) begin
  P2 = P2 - P1
  P1 = P1 + P2
  P2 = P2 calls
  end

In this loop P1 is the closure list (the list of all routines directly or
indirectly called by main) and P2 is the list of all new routines introduced
in the next level down.

Note that this closure algorithm only produces the list of routines directly
or indirectly called by main.  It does not necessarily produce the load list
(the mimimum list of files needed for a complete link and load.)  I.e. we
can't blithely say

F1 = contains P1

where F1 is the list of files containing all of the entry points in list  P1
because a particular file can contain definitions of functions that are not
in list P1.  To produce a minimum closure load list we need an outer loop
which adds the functions which are referenced but not called, add their 
closure to the list, add the files needed for them, until no new files
are introduced.

----

Note that this entire approach does not take globals which are not functions
into account.  An alternative approach in the data analysis phase which
does take globals into account is to work with the symbol tables of the
relocatable object files.  This also works (the set manipulation tools
are the same) but is subject to the objection that it doesn't tell you
where in the source files the globals are referenced.

To do the whole works really right is a non-trivial problem.  The real
problem, as I have indicated, is to extract all of the relationship data,
taking into account the vagaries of the preprocessor, the possible search
paths, the effect on the preprocessor of arguments to cc, and the effects
of the use of pointers to globals.

Hope this is of some help.
-- 

In the fields of Hell where the grass grows high
Are the graves of dreams allowed to die.
	Richard Harter, SMDS  Inc.

ADLER1@brandeis.bitnet (01/02/88)

I would just like to mention an error in my previous posting "Tool for
reading readable code". I claimed that the set PF_k is a closed set.
That is incorrect and here is a counterexample: suppose that f is
a function which calls the functions a,b and c ; suppose that a calls
b, b calls c and c calls only regular C functions. Then the level of
f is 2 but it calls the function a which has level 3. So if PF contains
only f,a,b and c then PF_2 is not closed.

I mention this for the sake of accuracy. I don't think it affects the
validity of my message as a whole. There are many variations on the
definition of PF_k which provide suitable closed sets to play with.

Allan Adler
ADLER1@BRANDEIS.BITNET