[comp.lang.c++] 02/Summary: Restructuring and Reverse Engineering.

Orville.R..Weyrich@sunbrk.FidoNet.Org (Orville R. Weyrich) (05/25/91)

I have experience with RECODER from Language Technology, Inc.  I used it
on a program that "followed" me when I transferred up from DESC in Dayton, and
was fairly pleased with the results.  My only "complaint", if you will, is that
I had to go back in and "fix up" the result.  I have a certain way I like my
programs to look, and RECODER didn't do it the way I usually do.  The biggest
tangible difference was the complete lack of GO TOs; we are allowed to GO TO
an EXIT, which helps hold down your program size and complexity.

-------------------------------------------------------------------------------
I am working on my Master's project, which involves updating a graphics
package.  Part of the project lies in restructuring.  There are several
kinds of restructuring; deleting GO TOs is certainly one of the main
streams in industry, but my preference is restructuring modules and
variables.

The program I am working on is a one-shot program, i.e. all of the code
is in one file and there is no separate compilation.

For module restructuring I classify the routines and then put them into
several files, each of which I think has its own major data structure
and consistent functionality.  As for the variables, they were declared
globally far more often than necessary; I try to minimize their scopes
and confine the global declarations.
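
As a rough sketch of what I mean by confining globals (the package
itself is not shown here; the routine and variable names below are made
up, and I am assuming a FORTRAN-style code base that shares data through
COMMON blocks):

C     Before: every routine sees every "global" through one big COMMON,
C     even though this routine needs only the scale factors.
      SUBROUTINE drwold(x, y, ix, iy)
      COMMON /global/ xscale, yscale, pencol, debug
      REAL x, y, xscale, yscale
      INTEGER ix, iy, pencol, debug
      ix = x*xscale
      iy = y*yscale
      END

C     After: only the scale factors stay shared, in a named COMMON
C     confined to the plotting file; pencol and debug are no longer
C     visible here at all.
      SUBROUTINE drwnew(x, y, ix, iy)
      COMMON /scale/ xscale, yscale
      REAL x, y, xscale, yscale
      INTEGER ix, iy
      ix = x*xscale
      iy = y*yscale
      END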
------------------------------------------------------------------------------
You did not mention a very important use of program restructuring:
discovering parallelism in programs (yeah, they usually talk about the
old dusty deck of FORTRAN code, but they also mean new programs).  I do
not have a particular reference to give you on this, but any book,
conference, or journal on parallel programming will have plenty of
articles.

On the topic of program restructuring for parallelization there is mixed
success: on one hand, experienced programmers can write code (in a
special version of FORTRAN) that parallelizes very well; on the other
hand, if you write code unaware of the parallelizer (is that a word?)
you get poor performance.
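
A hedged, made-up illustration of the point (it is not taken from any
particular parallelizer's documentation): the first loop below has
completely independent iterations and is exactly what an automatic
parallelizer wants to see, while the second carries a dependence from
one iteration to the next and will not parallelize as written.

      PROGRAM pardem
      INTEGER n, i
      PARAMETER (n=100)
      REAL a(n), b(n), c(n)
      DO i = 1 , n
         b(i) = i
         c(i) = 2*i
      ENDDO
C     Independent iterations: each a(i) depends only on b(i) and c(i),
C     so the iterations can run in any order, or all at once.
      DO i = 1 , n
         a(i) = b(i) + c(i)
      ENDDO
C     Loop-carried dependence: each a(i) needs the a(i-1) computed on
C     the previous trip, so this is a recurrence and defeats automatic
C     parallelization unless it is rewritten (e.g. as a scan).
      a(1) = b(1)
      DO i = 2 , n
         a(i) = a(i-1) + b(i)
      ENDDO
      WRITE (*,*) a(n)
      END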

------------------------------------------------------------------------------
I helped develop a system called ARA for module-level restructuring in Ada 
using AI techniques.  

ARA is implemented in Common Lisp on a TI Explorer using GBB, a blackboard 
architecture development environment.

Our research was concerned with supporting evolution in the composition of 
Ada packages over time, rather than removing GOTOs, but the techniques might 
be interesting to you.

------------------------------------------------------------------------------
Toolpack has tools for restructuring code.  Good points: Toolpack is
not just an isolated tool, but a set of modular tools for doing many
things: restructuring, redoing declarations, rolling/unrolling loops,
and checking for portability.  Bad points: pretty strict FORTRAN 77,
with few extensions supported.  If you cannot get Toolpack to "compile"
your code, none of the other tools will be usable.
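
For anyone who has not met the terms, "unrolling" just means replicating
a loop body to cut loop-control overhead, and "rolling" is the inverse.
A hand-written sketch of the idea (this is not Toolpack output, and the
names are made up):

      PROGRAM unroll
      INTEGER n, i
      PARAMETER (n=16)
      REAL x(n), s1, s2
      DO i = 1 , n
         x(i) = i
      ENDDO
C     Rolled form: one add and one loop test per element.
      s1 = 0.0
      DO i = 1 , n
         s1 = s1 + x(i)
      ENDDO
C     Unrolled by four (assumes n is a multiple of 4): the same sum with
C     a quarter of the loop-control overhead.
      s2 = 0.0
      DO i = 1 , n , 4
         s2 = s2 + x(i) + x(i+1) + x(i+2) + x(i+3)
      ENDDO
      WRITE (*,*) s1, s2
      END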

------------------------------------------------------------------------------
[Indirect reference to] Eric Bush at Language Technology.  Their
main product was a Cobol reformatter, and I think it initially was
written in PL/I and ran on an IBM mainframe.  The last time I heard,
they were working on a Prolog version, and the company was doing pretty
well.  The company is located in Salem, MA.

------------------------------------------------------------------------------
I wrote a program to translate 4.5 Megalines of a very strange FORTRAN
dialect into idiomatic C.  By idiomatic, I mean that the resulting code
should look like it had been written by fluent C programmers, not
translated FORTRAN.

I used an ATN (augmented transition network) as the basic translating
engine, and I still think it was a reasonable choice.

------------------------------------------------------------------------------
As far as AI-based systems go, you might want to look at the Programmer's
Apprentice work done at MIT.  Those papers mention reversing their
techniques to perform reverse engineering.  I did some work in this area
when I was at Surrey University, and most of the results were published
at an IEEE conference.  Simple source-code restructuring is algorithmic
and probably does not require the application of AI techniques.  The
real win for AI would be in reverse engineering.

------------------------------------------------------------------------------
My experiences with such tools are usually quite good. Here is a
review I did of one.



		    Executive Summary
		   ======================

	If neither you nor your customers ever use Fortran, skip this
	review. 

	If all your Fortran code is brand new, this product is of some
	use, but not wonderful.

	If much of your (and your customers') Fortran code dates back to
	FORTRAN II ... you (they) probably need this product or one
	like it.

	Has some rough edges, but worth serious consideration.


I've just obtained a demo copy of Polyhedron's SPAG tool (and friends).
I've run a few tens of thousands of lines of code through it and have
formed some initial impressions.  A full review will come later.

It has two major components:
==== CRAM
	Provides a report of who calls whom.  Checks argument lists for a
	consistent number of arguments.  Provides a potential overlay map
	(really handy on virtual-memory systems :>).

==== SPAG

SPAG provides several functions:

0)   Can make your life easier as you try to port huge hunks of code
     for benchmarking. Can also make optimizations easier to spot ...
     also handy for you benchmarkers out there.

1)   A simple "pretty printing" mode that can be used to help encourage
     (enforce) coding-style guidelines.  Especially helpful when you
     start out with a million lines of raw sewage.

2)   Can be configured to replace all 

	 DO 1 J=1, N ... 1 x=10

     with

         DO j=1, n .... x=10 ; END DO

     which, as it happens, can help to expose good candidates for loop
     fusion, viz.

      DO     4 I=1,20
    4 DPDTT  (I)  =INI
      DO     5 I=1,20
    5 TDPDT  (I)  =INI
      DO     6 I=1,20
    6 DBSOR  (I)  =INI
      DO     7 I=1,20
    7 TSOR   (I)  =INI
becomes
      DO i = 1 , 20
         dpdtt(i) = ini
      ENDDO
      DO i = 1 , 20
         tdpdt(i) = ini
      ENDDO
      DO i = 1 , 20
         dbsor(i) = ini
      ENDDO
      DO i = 1 , 20
         tsor(i) = ini
      ENDDO
which is trivially changed to

      DO i = 1 , 20
         dpdtt(i) = ini
         tdpdt(i) = ini
         dbsor(i) = ini
         tsor(i) = ini
      ENDDO

The fused loop executes faster.  When we have run out of bigger
automatic optimization wins, and other scheduled work, the compiler will
probably do this for you.  But for now, SPAG can help you spot this sort
of thing.  It is especially likely that the original code had assigned
GOTOs and such ... and only after SPAG has run is this sort of thing
obvious.

      IF(Z6.GT.ZMAX)GOTO 50
      IF(Z6.LT.ZMIN)GOTO 45
      I6FT=I
      J6FT=J
  45  IF(Z8.GT.ZMAX)GOTO 50
      I8FT=I
      J8FT=J
      GOTO 55
   50 CONTINUE
   55 CONTINUE

becomes

            IF ( z6.LE.zmax ) THEN
               IF ( z6.GE.zmin ) THEN
                  i6ft = i
                  j6ft = j
               ENDIF
               IF ( z8.LE.zmax ) THEN
                  i8ft = i
                  j8ft = j
                  GOTO 100
.... etc.

     Don't get unreasonable expectations: the revised code is likely
     to run a few percent faster, but not a lot faster ... and it can be
     modestly slower, too.  The point is that after SPAGization (pity
     they didn't call it DESPAG :>) further transformations are likely
     to be easier.

3)   About 50 user-configurable transformations.  I suspect that most
     users will pick a set of transformations and stick to it.

As I said in the opening, I've run a few tens of thousands of lines
through SPAG and it works; aside from the MVS/PC style of interaction it
does quite well.  The code produced worked correctly (there were some
warnings which the dedicated would have fixed up) and produced identical
answers (although FP arithmetic being what it is, this could vary a bit
... just as optimization levels can affect 'em).  With the nattering
redirected to a file, on a 4/260 operating on NFS-mounted files (so it's
very like a 4/60), the conversion rate was a respectable 20K lines/min.

------------------------------------------------------------------------------
There have been such tools around for many years, typically
site-specific and maintained by the local gurus (remember that most of
the C tools we take for granted, cxref, cflow, etc., are based on
Fortran tools which predate them ... but which never made it to Unix).
Alas, much of
the good culture got left behind when we (as an industry) unplugged
the mainframes and shot all the support staff. This is the first
commercial tool which I've seen running on Sun boxes.

-------------------------------------------------------------------------------
I looked at this sort of thing several years ago when contracted to
create a "smart" FORTRAN to Ada translator (it was never built).
Anyway, we were able to track down many papers on program
restructuring.  This was a really hot topic in the 70s.  When it proved
not to solve the bigger problems, "software engineering" came into the
picture.

I remember contacting the Army about tools for restructuring (the Rome
Defense Establishment, I think), and most of them were not currently in
use.

-------------------------------------------------------------------------------
My feeling about program restructuring tools: GIGO.

-------------------------------------------------------------------------------
		     Summary of Doctoral Research

			     John Hartman

		 Artificial Intelligence Laboratory*
		   Department of Computer Sciences
		  The University of Texas at Austin
		       Austin, Texas 78712-1188
	       hartman@cs.utexas.edu     (512) 471-9586


Program understanding involves recognizing abstract concepts like
``read-process loop'' in existing programs.  Programmers spend much of
their time understanding programs, so studying and automating the
process has many benefits.  For example, program understanding plays a
major role in software maintenance, and progress in program
understanding can make an important contribution to maintenance tools
and methods.

Programming plans are units of programming knowledge connecting
abstract concepts and their implementations. Existing research assumes
that plan instances can be recognized to recover the programmer's
abstract concepts and intentions.  This approach has not been
confirmed empirically.  Because of input restrictions and performance
limitations, current program understanders have only been demonstrated
with small example programs.  A program understander that runs on a
significant sample of real-world programs is needed to investigate the
feasibility, limits, and applications of plan instance recognition.

This dissertation describes a general and useful program understanding
method and its implementation, evaluation and applications. Control
concepts are abstract notions about interactions between control flow,
data flow, and computation, e.g. ``do loop'', ``read-process loop'',
and ``bounded linear search''. We present a practical method for
bottom-up control concept recognition in large, unstructured
imperative programs.  The method is implemented in the UNPROG program
understander and tested with Cobol and Lisp source programs. UNPROG's
value is demonstrated in Cobol restructuring and the empirical study
of programs. Contributions include:
  
  A recognition method that is robust, efficient and scalable  

  Implementation, testing, and evaluation 

  Improved Cobol restructuring using recognized concepts

  Empirical study of programs at the conceptual level.


Recognition Method 
  
Plan instances and concepts are recognized by comparing a program
representation against a library of standard plans.  The program
representation is an abstract program model given by a hierarchical
control flow/data flow graph. The model is decomposed into a tree of
sub-models using propers (single entry/exit control flow sub-graphs).
Plans are represented by similar graphs with added qualifications.
Recognition is based on simple matching between sub-models and plans.
It produces model-plan-concept bindings.
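
To make the flavor of this concrete (the analyzers here handle Cobol and
Lisp; the fragment below is only a hand-written Fortran illustration,
not UNPROG's plan-library notation), this is the kind of code in which a
recognizer would bind a ``bounded linear search'' plan instance,
regardless of variable names or of whether it is written with GOTOs or
structured control flow:

      PROGRAM search
      INTEGER n, i, pos, target
      PARAMETER (n=8)
      INTEGER key(n)
      LOGICAL found
      DATA key/4, 9, 2, 7, 5, 1, 8, 3/
      target = 7
C     Bounded linear search: scan key(1..n) for target, stop at the
C     first hit, and record whether and where it was found.
      found = .FALSE.
      pos = 0
      DO i = 1 , n
         IF ( key(i).EQ.target ) THEN
            found = .TRUE.
            pos = i
            GOTO 100
         ENDIF
      ENDDO
  100 CONTINUE
      WRITE (*,*) found, pos
      END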


Implementation, Testing and Evaluation 
  
The method was implemented in the UNPROG program understander and
tested using Lisp and Cobol analyzers and a plan library.  The
method's design and analysis were confirmed in tests where efficient
control and data flow checking procedures produced low recognition
costs.
 
The method was evaluated for performance and generality over
languages, programs, and concepts.  Comparing sub-models and plans is
efficient because sub-models are small and have restricted, canonical
control flow.  Their abstraction structure focuses recognition on
criterial program features and hides non-plan material.  The model can
be formed for all language constructs which permit static
determination of control and data flow. The number of sub-models and
comparisons increases linearly with program size. These performance
and scaling advantages decrease as plans and concepts correspond less
well to the sub-models produced by proper decomposition.


Cobol Restructuring 

To show how our method can raise the level of reverse engineering and
re-engineering tools, we demonstrated its application to automatic
Cobol restructuring. The RESTRUC restructurer uses the
model-plan-concept bindings found by UNPROG.  Knowledge associated
with plans and concepts interprets the cause of unstructured program
parts and permits more specific and insightful transformation, code
generation, and documentation than is possible with existing syntactic
methods.


Empirical Study of Programs  

Our method can also be used for empirical study of programs at the
conceptual level.  We described a methodology using UNPROG to improve
recognizer performance, acquire plans, catalog natural plans and
concepts, and characterize program populations with respect to aspects
like recognizable and unrecognizable concepts, plan use, and domain
features.  For example, the distribution of plan recognition rates is
a measure of the population's planfulness and can be used to test the
hypothesis that programs are planful.


Future Work and Implications  

Future work includes empirical studies using UNPROG, method
extensions, and expansion to other program understanding domains.
Larger studies with UNPROG will characterize program populations and
method performance, and aid plan library acquisition and development.
Method extensions include changes in the program representation, plan
representation, and the recognition method. For example, abstract
branches and other decomposition principles will be investigated.
Expansion beyond bottom-up control plan instance recognition includes
non-control concepts, hierarchical recognition, use of additional
information, and advanced program understander architectures.

Besides restructuring, our method can be applied to reverse
engineering and re-engineering tools for applications like analysis,
documentation, and translation.  Recognizing and cataloging control
concepts and plans has implications for future maintenance and
development systems that represent and manipulate programs using
programming concepts, including programming assistants, environments,
and interfaces.

Empirical studies using program understanders can confirm and bound
program understanding theories.  Conceptual data and program
understanding models have many uses in software psychology.  For
example, they can identify natural concepts and programming knowledge,
and describe their role in program comprehension and programming.
Implications extend to complexity measures, language design,
structured programming, and education.

----
*   This work has taken place in the Qualitative Reasoning Group at the
Artificial Intelligence Laboratory, The University of Texas at Austin.

 * Origin: Seaeast - Fidonet<->Usenet Gateway - sunbrk (1:343/15.0)