[comp.compilers] "Tokenizing" Analyses

pfeiffer@herve.cs.wisc.edu (Phil Pfeiffer) (09/12/90)

In an earlier posting, Kwang-Keun Yi writes:

> [In] "A Flexible Approach to Interprocedural Data flow Analysis
> and Programs with Recursive Data Structures" by Neil Jones and Steven
> Muchnick [POPL '82], they mentioned in the conclusion section that
>
>  "Two areas of investigation that appear fruitful are to consider how
>  the computational complexity of the analyzer varies with the choice
>  of token set and the the feasibility of building an analyzer with
>  the choice of token set as a parameter."
>
> Can somebody help me to find papers on the investigations? 

This is a *big* question.   There have been a series of papers since 1982 that
amount to *specific* extensions of JM82; some of the authors have addressed
complexity in their papers as well.  Here's my stab at providing you with 
references and a (partial, incomplete) survey.

Extensions to JM82 typically vary according to the proposed labeling
technique and the strategy used to approximate arbitrarily large data
structures (I'm using the word "label" in place of the word "token" because I
like it better.)  Jones and Muchnick, as I understand the paper, suggest
labeling every structure x in a store s with a value such as [k] to show that
x was created by the statement "at" program point [k].  Another labeling
scheme that has been proposed tags x with the value ([k],n) to show that x
was the nth object created at [k].  Also, more elaborate labeling schemes
have been proposed that take a program's nesting structure into account --
e.g., if s is labeled ([k],m,n), then s was the nth object created during the
mth iteration of statment [k]'s enclosing loop.  This strategy is mentioned
in a footnote in P. Hudak's "A semantic model of reference counting and its
abstraction" in _Abstract Interpretation of Declarative Languages_ (pp.
45-62)).  A variant of this is mentioned in a forthcoming paper by Francois
Bodin (you should be able to get a copy of this from Dennis Gannon, who's
also at CSRD) and in a 1989 Sigplan paper by Susan Horwitz, Phil Pfeiffer,
and Tom Reps.  The HPR89 paper also describes several other strategies for
using labels to track a program's execution history.  See also the
Chase-Wegman-Zadeck paper in SIGPLAN 90, and the papers referenced in its
related works section (more on CWZ90 below).

Two different approximation mechanisms are discussed in JM82: one for stores,
and a second for the stack.  Many of the analyses that I've seen, however,
simply "flatten" the stack:  i.e., assume that a procedure can return to any 
of the procedures that can call it.  More complex approximations to the stack 
than this are usually avoided because approximations to stores that contain 
dynamically-allocated objects are already quite expensive to compute.  Both 
of the reports that I know of that *do* construct non-trivial approximations 
to the stack were "forced into it" by of the language under consideration.
The one, Uwe Pleban's 1981 thesis (U.Kansas), describes a preexecution 
analysis for a dialect of Scheme that supports downward closures.  The other, 
Jan Stransky's 1988 thesis (University de Paris-Sud, Centre d'Orsay, disser-
tation in French) describes a preexecution analysis for the (dynamically-
scoped language) Lisp.

Different authors have given different estimates of the respective
computational complexities of their analyses.  One of the best references on
the complexity of labeled analyses is a paper in SIGPLAN 1990 by Chase,
Wegman, and Zadeck.  CWZ90 deals specifically with the problem of computing
an approximation to the set of stores that a program can generate, in the
presence of allocation primitives (e.g., cons) and procedures.  The authors
also give a clever algorithm that runs much faster than the standard,
"simple" store-building algorithm when a program doesn't create "too many
edges".  CWZ90 also quotes some complexity results from other people's theses
that you might find helpful.  (Their characterization of Ruggieri's work,
however, is not quite right; Ruggieri, in the final chapter of her thesis,
discusses implementation techniques (such as union-find) that can be used to
improve the running-time of her analysis -- i.e., the one cited in the
report).  I'd also recommend Hendren's work on dag-like and tree-like stores
(there's a reference a paper by her in the Chase-Wegman-Zadeck paper; her
thesis is available from Cornell as technical report 90-1114.  Hendren's work
is actually based on matrix manipulation, but it is equivalent to yet another
variation on store graphs if you stare at it, cross-eyed, for a suitably long
time).

Stransky actually built an analyzer, and experimented with different
parameters for his analyzer.  Stransky's thesis has some empirical
observations on how well the analyzer ran for different labeling techniques
and different "orders" of stack approximations.  A six-word summation of
Stransky's experience with his analyzer: "effective, but very slow and
clumsy".  This is one of the threads that runs through all of the literature
on pointer analysis.

The SIGPLAN paper that Susan Horwitz, Tom Reps, and I wrote also describes a
technique for proving that labels are being used correctly.  (It turns out
that the same concern was brought up in Pleban's thesis, Stransky's thesis,
and, apparently, in Cousot's thesis).  

-- Phil Pfeiffer
UUCP: ...!{harvard,seismo,topaz,akgua,allegra,usbvax}!cs.wisc.edu!pfeiffer
(608) 262-6625
-- 
Send compilers articles to compilers@esegue.segue.boston.ma.us
{ima | spdcc | world}!esegue.  Meta-mail to compilers-request@esegue.