[comp.compilers] Help on disassembler/decompilers

wwho@ucdavis.edu (W. Wilson Ho) (09/06/90)

	I am looking for any information related to disassembling
object code into assembly language or even higher-level language such
as C.  Would someone please give me pointers to program sources,
documentation or papers related to this?

	Thanks in advance!

  W. Wilson Ho		        |  INTERNET:  how@ivy.ucdavis.edu
  Division of Computer Science	|  UUCP:      ...!ucbvax!ucdavis!ivy!how
  EECS Department		|  BITNET:    wwho@ucdavis.bitnet
  University of California	|
  Davis, CA 95616		|
[Turning object code back into assembler is pretty straightforward, and
every debugger does it.  Someone else asked about disassembling into higher
level languages a little while ago, but I didn't see any responses. -John]
-- 
Send compilers articles to compilers@esegue.segue.boston.ma.us
{ima | spdcc | world}!esegue.  Meta-mail to compilers-request@esegue.

hrubin@l.cc.purdue.edu (Herman Rubin) (09/07/90)

In article <HOW.90Sep5173755@sundrops.ucdavis.edu>, wwho@ucdavis.edu (W. Wilson Ho) writes:
> 
> 	I am looking for any information related to disassembling
> object code into assembly language or even higher-level language such
> as C.  Would someone please give me pointers to program sources,
> documentation or papers related to this?
			...................

> [Turning object code back into assembler is pretty straightforward, and
> every debugger does it.  Someone else asked about disassembling into higher
> level languages a little while ago, but I didn't see any responses. -John]

It is not quite that straightforward, and disassemblers can be somewhat
hard to get.  Debuggers usually use object code set up specially for
debugging, with certain features available.  Some debuggers even require
that the source code be included in such a way that this can be displayed
at debug time.  Disassembly frequently is even ambiguous.
-- 
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907
Phone: (317)494-6054
hrubin@l.cc.purdue.edu (Internet, bitnet)	{purdue,pur-ee}!l.cc!cik(UUCP)
-- 
Send compilers articles to compilers@esegue.segue.boston.ma.us
{ima | spdcc | world}!esegue.  Meta-mail to compilers-request@esegue.

pardo@cs.washington.edu (David Keppel) (09/07/90)

>[Turning object code back into assembler is pretty straightforward, and
>every debugger does it.  Someone else asked about disassembling into higher
>level languages a little while ago, but I didn't see any responses. -John]

See anonymous ftp from cs.washington.edu (128.95.1.4) `pub/decomp.tar.Z'.
It takes VAX object code back to fake C.  One of my profs did a one-day hack
a while back to decompile assembly code into compiler IR.  From that, he
could have created source code in several languages.  My guess is that
decompiling into a language that is, e.g., saccharine-sweetened assembler (C)
is `easy', while decompiling, e.g., into APL is hard.

	;-D on  ( Now for a deprogrammer... )  Pardo
-- 
		    pardo@cs.washington.edu
    {rutgers,cornell,ucsd,ubc-cs,tektronix}!uw-beaver!june!pardo
-- 
Send compilers articles to compilers@esegue.segue.boston.ma.us
{ima | spdcc | world}!esegue.  Meta-mail to compilers-request@esegue.

raul@sdnp1.ucsd.edu (Raul Rathmann) (09/08/90)

In article <HOW.90Sep5173755@sundrops.ucdavis.edu> wwho@ucdavis.edu (W. Wilson Ho) writes:
>
>	I am looking for any information related to disassembling
>object code into assembly language or even higher-level language such
>as C.  Would someone please give me pointers to program sources,
>documentation or papers related to this?

I have used a disassembler called "unas" that spit out "as" assembly code
from object code on VAXen and Suns running BSD UNIX.  I picked it up from
some anonymous ftp site, possibly from a comp.sources.unix archive site.  It
was pretty old and I might still have it around, but no guarantees.

I used it to reverse engineer part of a library that I had no way of getting
source code for.  Getting it from assembly to C was pretty difficult and was
mostly an intuitive operation.  To make a program that automatically did
this would probably be a major undertaking and even then I don't think it
would be able to figure out certain things.

Raul Rathmann
raul@sdnp1.ucsd.edu
OR rrathmann@ucsd.edu
-- 
Send compilers articles to compilers@esegue.segue.boston.ma.us
{ima | spdcc | world}!esegue.  Meta-mail to compilers-request@esegue.

seanf@sco.COM (Sean Fagan) (09/09/90)

In article <2900@network.ucsd.edu> raul@sdnp1.ucsd.edu (Raul Rathmann) writes:
>Getting it from assembly to C was pretty difficult and was
>mostly an intuitive operation.  To make a program that automatically did
>this would probably be a major undertaking and even then I don't think it
>would be able to figure out certain things.

As I've pointed out before, someone once wrote a decompiler for the VAX,
which I've played with.  It *knew* how PCC generated code, so, after
"disassembling" (internal use only, generally; it didn't try to decompile
assembly text, in other words), it would structure the code generated
from that.

It's really not that difficult a thing to do, if you don't have to deal with
esoteric instructions and a smart compiler.

Oh, yeah:  it didn't use the symbol table, and the code it generated
certainly had some, uhm, quirks.  Variable names such as 'r0' through 'r15',
for example (except for global variables, that is).  Structure members were
considered different variables (I once tried decompiling fork.o, for
example; something like three quarters of the variables were offsets into
the user structure, and I came up with a little chart that had the offsets
(and sizes) of all the elements).  Etc.

But it did work.  Thus borneth BSD Empire.

-- 
Sean Eric Fagan, seanf@sco.COM, uunet!sco!seanf, (408) 458-1422
-- 
Send compilers articles to compilers@esegue.segue.boston.ma.us
{ima | spdcc | world}!esegue.  Meta-mail to compilers-request@esegue.

rwallace@vax1.tcd.ie (09/09/90)

In article <HOW.90Sep5173755@sundrops.ucdavis.edu>, wwho@ucdavis.edu (W. Wilson Ho) writes:
> 	I am looking for any information related to disassembling
> object code into assembly language or even higher-level language such
> as C.  Would someone please give me pointers to program sources,
> documentation or papers related to this?
> [Turning object code back into assembler is pretty straightforward, and
> every debugger does it.  Someone else asked about disassembling into higher
> level languages a little while ago, but I didn't see any responses. -John]

There's no unique mapping from machine code to HLL and hence (unlike machine
code to assembler) no simple algorithm (your algorithm might recognize
something it thinks is a loop but is it a for loop, a while loop or just
something hacked together with gotos? that's before you even think about what
optimized machine code will look like). You could probably figure out a
complicated algorithm, but to what purpose, since with no symbol table the HLL
code will not be significantly more readable than the assembler code anyway.
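
To make the loop question concrete: the three functions below (a made-up
array-summing example) will usually compile to essentially the same object
code, so a decompiler can only guess which form the programmer actually wrote.

	/* the same loop of compares and adds, written three ways */

	int sum_for(int *a, int n)		/* as a for loop */
	{
	 int i, s = 0;
	 for (i = 0; i < n; i++)
		s += a[i];
	 return s;
	}

	int sum_while(int *a, int n)		/* as a while loop */
	{
	 int i = 0, s = 0;
	 while (i < n) {
		s += a[i];
		i++;
	 }
	 return s;
	}

	int sum_goto(int *a, int n)		/* hacked together with gotos */
	{
	 int i = 0, s = 0;
	top:
	 if (i >= n) goto done;
	 s += a[i];
	 i++;
	 goto top;
	done:
	 return s;
	}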
-- 
Russell Wallace, Trinity College, Dublin
rwallace@vax1.tcd.ie
-- 
Send compilers articles to compilers@esegue.segue.boston.ma.us
{ima | spdcc | world}!esegue.  Meta-mail to compilers-request@esegue.

hankd@dynamo.ecn.purdue.edu (Hank Dietz) (09/10/90)

In article <HOW.90Sep5173755@sundrops.ucdavis.edu> you write:
>	I am looking for any information related to disassembling
>object code into assembly language or even higher-level language such
>as C.  Would someone please give me pointers to program sources,
>documentation or papers related to this?

Basic disassembly is trivial, particularly if you have an object
module with a name list.  The interesting problems are:

[1]	Determining which portions of a raw memory image are
	code and which are data.  Typically, this is done by
	providing a set of code entry points and having the
	disassembler trace program flow marking each word with
	type information as each flow path is followed.

[2]	Dealing with self-modifying code.  At least the
	technique of [1] can detect when this might happen....
	I don't know of any reasonable way to deal with it.

Notice that indirect jump tables are particularly difficult to flow
trace (see [1]), as are techniques which use a Call instruction but
follow the instruction with the argument values (raw data) and tweak
the return address appropriately (as in some threaded interpreters).
Notice that knowing that the code image came from a particular
compiler can make these problems much easier to deal with, since you
can simply recognize the compiler's code generation idiom.
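
A minimal sketch of the flow tracer described in [1], in C, might look
something like the following; decode() is a made-up stand-in for a
machine-specific instruction decoder, and bounds/overflow handling is
mostly omitted.

	#include <string.h>

	#define IMGSIZE 65536
	#define MAXWORK 4096

	enum { UNKNOWN = 0, CODE = 1 };		/* per-byte type marks */

	struct insn {
		int len;		/* bytes occupied by the instruction */
		long target;		/* branch/call target, or -1 if none */
		int falls_through;	/* 0 for unconditional jump or return */
	};

	/* hypothetical, machine-specific decoder */
	extern struct insn decode(unsigned char *image, long pc);

	void trace(unsigned char *image, unsigned char *type,
		   long *entry, int nentry)
	{
		long work[MAXWORK];
		int sp = 0, i;

		memset(type, UNKNOWN, IMGSIZE);
		for (i = 0; i < nentry && sp < MAXWORK; i++)
			work[sp++] = entry[i];	/* seed with the entry points */
		while (sp > 0) {
			long pc = work[--sp];
			while (pc >= 0 && pc < IMGSIZE && type[pc] == UNKNOWN) {
				struct insn in = decode(image, pc);
				if (in.len <= 0 || pc + in.len > IMGSIZE)
					break;		/* bad decode; give up on this path */
				memset(type + pc, CODE, in.len);	/* mark as code */
				if (in.target >= 0 && sp < MAXWORK)
					work[sp++] = in.target;	/* follow the branch later */
				if (!in.falls_through)
					break;		/* end of this flow path */
				pc += in.len;
			}
		}
	}

Everything still marked UNKNOWN when trace() returns gets treated as data,
modulo the jump-table and tweaked-return-address cases noted above.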

						-hankd@ecn.purdue.edu

PS: Back around 1981-2 I did a flow analyzing disassembler for several
    then-popular microprocessors (e.g., 8080).  I still have it, but
    it really isn't very impressive... especially when it hits some of
    those problem cases noted above (e.g., PCHL).
-- 
Send compilers articles to compilers@esegue.segue.boston.ma.us
{ima | spdcc | world}!esegue.  Meta-mail to compilers-request@esegue.

vu0310@bingvaxu.cc.binghamton.edu.cc.binghamton.edu (R. Kym Horsell) (09/10/90)

In article <6839.26ea3b0e@vax1.tcd.ie> rwallace@vax1.tcd.ie writes:
>There's no unique mapping from machine code to HLL and hence (unlike machine
>code to assembler) no simple algorithm (your algorithm might recognize
>something it thinks is a loop but is it a for loop, a while loop or just
>something hacked together with gotos? that's before you even think about what
>optimized machine code will look like). You could probably figure out a

There's no _unique_ mapping from HLL into assembler, or from assembler into
assembler for that matter, either! In _principle_ translation between
assembler and an HLL is no harder than the other way 'round. You can't even
say that "assembly language is close to the machine & you can do things in
it that you can't in HLLs" -- after all, assembly and HLL are both capable
of emulating Turning machines, right?

I have made various attempts at writing assembler -> HLL translators over
the years, mostly inspired by the `beatify' tool that attempts to transform
spaghetti FORTRAN into structured FORTRAN.  A bit of graph theory is all
that's needed for _most_ of that.

I ultimately see such tools as pointless, however, in that they only
encourage the maintenance of ``dusty deck'' stuff (in a new prettier form)
-- re-inventing the wheel occasionally comes up with a better wheel!

-Kym Horsell
-- 
Send compilers articles to compilers@esegue.segue.boston.ma.us
{ima | spdcc | world}!esegue.  Meta-mail to compilers-request@esegue.

Chuck.Phillips@FtCollins.NCR.COM (Chuck.Phillips) (09/10/90)

>>>>> On 9 Sep 90 12:52:29 GMT, rwallace@vax1.tcd.ie said:
> There's no unique mapping from machine code to HLL and hence
> (unlike machine code to assembler) no simple algorithm (your
> algorithm might recognize something it thinks is a loop but is it
> a for loop, a while loop or just something hacked together with
> gotos?

Very true, but it doesn't matter whether it uses "for" or "while" loops or
a combination based on heuristics.  (Turning a mess of "goto"s into "for"
or "while" loops sounds particularly attractive!)  Recreating the original
code _exactly_ is overkill; it's the _algorithms_ one generally wants to
see.  The more the decompiler is able to abstract "goto"s into "while"s,
"for"s, "do...while"s, etc. the better, even if it doesn't match the
original.  Further heuristics could be used for combining conditionals to
avoid 3 "if"s when a single one would do, finding candidates for "switch"
statements, etc.
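
As a toy illustration of the conditional-combining heuristic (the names here
are invented, not taken from any real decompiler), the two functions below
generate essentially the same tests and branches; the second is what you
would want the decompiler to emit.

	struct buf { int count; int flags; };
	#define DIRTY 1
	extern void flush(struct buf *);

	void maybe_flush_raw(struct buf *b)	/* a literal rendering of the branches */
	{
	 if (b != 0) {
		if (b->count > 0) {
			if (b->flags & DIRTY)
				flush(b);
		}
	 }
	}

	void maybe_flush(struct buf *b)		/* after combining the conditionals */
	{
	 if (b != 0 && b->count > 0 && (b->flags & DIRTY))
		flush(b);
	}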

Best of all: Combined with an optimizing compiler, you get a kluge source
code optimizer, and at least open the possibility of porting optimizations
to a platform with a less effective optimizer.  In some cases, it may even
make obvious new optimizations of the underlying algorithm.  (Just the tool
to have around when optimizing highly time critical subroutines. ;-)

Structure offsets and stack variable names do present a particular problem,
which may be _partially_ overcome by reading the symbol table (if it
exists) and annotating accordingly.  Patching your compiler to include
additional information in the object file or a side file can further the
possible abstraction.

In the very worst case, you get to read cryptic C instead of cryptic
assembler.

	Cheers,
--
Chuck Phillips  MS440
NCR Microelectronics 			Chuck.Phillips%FtCollins.NCR.com
2001 Danfield Ct.
Ft. Collins, CO.  80525   		uunet!ncrlnk!ncr-mpd!bach!chuckp
-- 
Send compilers articles to compilers@esegue.segue.boston.ma.us
{ima | spdcc | world}!esegue.  Meta-mail to compilers-request@esegue.

dik@cwi.nl (Dik T. Winter) (09/10/90)

In article <2524@l.cc.purdue.edu> hrubin@l.cc.purdue.edu (Herman Rubin) writes:
 > > [Turning object code back into assembler is pretty straightforward, and
 > > every debugger does it.  Someone else asked about disassembling into higher
 > > level languages a little while ago, but I didn't see any responses. -John]
 > 
 > It is not quite that straightforward, and disassemblers can be somewhat
 > hard to get.  Debuggers usually use object code set up specially for
 > debugging, with certain features available.  Some debuggers even require
 > that the source code be included in such a way that this can be displayed
 > at debug time.  Disassembly frequently is even ambiguous.

Moreover, debuggers know at what addresses instructions start; a disassembler
has no such knowledge.  This gets hairy if the program uses indirect jumps on
a machine with variable-sized instructions.  I once wrote a disassembler for a
CDC Cyber PPU.  It tried to follow all code threads, but failed horribly in
this respect and had to be guided.
--
dik t. winter, cwi, amsterdam, nederland
dik@cwi.nl
-- 
Send compilers articles to compilers@esegue.segue.boston.ma.us
{ima | spdcc | world}!esegue.  Meta-mail to compilers-request@esegue.

adamsf@turing.cs.rpi.edu (Frank Adams) (09/11/90)

In article <12976@june.cs.washington.edu> pardo@cs.washington.edu (David Keppel) writes:
>My guess is that decompiling into a language that is, e.g.,
>saccharine-sweetened assembler (C) is `easy', while decompiling, e.g.,
>into APL is hard.

If we assume that the program is to be decompiled into the language in
which it was written, it is in general easier to decompile the less the
compiler optimizes the generated code.

A second problem is type inference.  APL, with a fixed set of data types,
is easier in this respect than C.  For example, when the code loads a pointer
into a register and indexes off of it, what kind of struct is the pointer
pointing to?

If the object is only to get some kind of higher-level language representation
of an arbitrary executable, C will indeed be easier.  But this kind of
decompilation is not very useful -- what read as

	foo.bar = 0;

in the original is likely to come out as

	*(int *)(((char *)&foo + 8)) = 0;

-- 
Send compilers articles to compilers@esegue.segue.boston.ma.us
{ima | spdcc | world}!esegue.  Meta-mail to compilers-request@esegue.

markh@csd4.csd.uwm.edu (Mark William Hopkins) (09/11/90)

A while back I wrote a disassembler for the Intel 8052 family.  This is a
fairly straightforward task, so it only took a few hours.

However, there are some non-trivial aspects to even THIS process, and the
complications they brought about are what made the project so long-winded
(a few hours is too long ;) ).

(1) Recognizing data embedded within code segments.
To accomplish this, I basically 'told' the disassembler to treat everything
not accessible from the set of entry points as data.  This disassembler will
recursively search out all the code segments derivable from the entry points
by "jump"s and "call"s.   But ...

(2) Treating indirect jumps and calls.
... It could not locate the destinations of indirect jumps, since this
basically required a run-time analysis of the code.  The particular program I
was analysing, however, was of a special form where all the jumps were stored
in a jump table and could potentially be recognized as such by the
disassembler.  But I simply wasn't interested in making such an investment in
the effort at the time.  Consequently, the code segment contained numerous
'gaps'.  One consolation, though, was that I could add these destinations by
hand to the initial set of entry points after doing my own analysis of the
code disassembled up to that point.

This particular program happened to be an interpreter, and thus was filled
with jump tables...

(3) Symbol table generation.
This is an obvious problem with any object code lacking a symbol table.
Ultimately, you're talking about an ideal application of 'natural language
generation' which is one of the most central issues of AI.  Your labels and
variable names and routine names should be readable and should relate to
their purpose and use.

Of course, you can resort to the unimaginative alternative of having the
program give your decompiled code unimaginative names.  Or, as a compromise,
you could have it query the user for names...

------------------------------------------------------------

Recently, I've also developed a manual algorithm to allow me to disassemble
substantial portions of library routines embedded in compiled C code run on
this machine (whose processor is similar to the VAX).  The same kind of
disassembler can be written here as well.

In this case, however, (1) and (3) can often be resolved in part.  The format
of object files follows the a.out format typical of UNIX, and thus has neatly
separated data and code segments, and an embedded symbol table (for globals).

Problem (1) cannot be resolved completely though: there were often cases where
data was embedded in the code segment itself (for instance in our
implementation of the "as" assembler -- which is where I got the opcodes for
our machine from :) ).  Also, (3) cannot be resolved for 'hidden' local
variables that remain anonymous to files on the outside.

Similar issues, I assume, arise in trying to disassemble the object code
output by, say, the Microsoft series of Quick compilers or the Turbo
compilers...

------------------------------------------------------------

In the process of doing the above, I also developed a partial algorithm for
'compiling' assembly into C.  When you talk about producing a 'decompiler'
you are actually confronting the task of writing a compiler: an unusual one,
though, that takes an unstructured language and creates from it a structured
language.

There is a standard UNIX utility (at least for the 4.3 bsd we're running) that
does something just like this: "struct".  This utility takes standard
Fortran-77 programs and generates Ratfor code from them.  Ratfor is a
'rationalized' Fortran that includes all the Algol-derived control structures.

Structuring code is not a really substantial issue, if you're talking about
generating loops from jumps.  Basically, you have to recognize combinations
of "segments" and "jumps" that can be replaced by loops.  I've developed a set
of rules while generating structured BASIC from BASIC on a few occasions, or
while generating C from assembly.  All these rules will ultimately be
based on equivalences of the form:

	      while (Exp) Sta   <--->   x:
					   if (Exp) {
					      Sta; goto x;
                                           }

or
	       *: goto x;       <--->    *: goto y;
	       ...
	       x: goto y
or
	       <CODE SEGMENT>   <--->    <CODE SEGMENT>
	       goto x;                   x:
	       x:
and so on.

To generate control structures, you're teaching your decompiler an algebra of
jumps and segments.  I've often used the last rule in reverse to split off
segments in order to move them around in other parts of the code, for
instance.
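
Applied right-to-left, the first rule is what lets you recover a loop.  In C
(more_input() and consume() are just placeholder names) the rewrite looks
like:

	extern int more_input(void);
	extern void consume(void);

	void drain_raw(void)		/* a literal rendering of the jumps */
	{
	x:
	 if (more_input()) {
		consume();
		goto x;
	 }
	}

	void drain(void)		/* after applying the while-rule in reverse */
	{
	 while (more_input())
		consume();
	}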

There's also the fact that compilers will often produce code in very regular
ways, since they are often written around the syntax of the language.  So you
can spot 'signatures' in the object code (if it was unoptimized) that
represent the output of certain kinds of structures.  While loops ALWAYS get
translated this way, functions ALWAYS get translated this way, etc.

However, there is the really substantial issue of generating *data*
structures, especially the dynamic structures involving pointers.  For
example, would your decompiler translate a jump table into a static array of
function pointers in C?  Would it create structured types if it found enough
"instances" of them floating around in the unstructured code to justify the
abstraction?
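
For the jump-table case, the kind of abstraction I have in mind would be
rebuilding a dispatch like the one below; the op_*() names are of course
invented, which is the symbol-table problem (3) all over again.

	extern void op_add(void), op_sub(void), op_halt(void);

	/* the jump table, recovered as a static array of function pointers */
	static void (*dispatch[])(void) = { op_add, op_sub, op_halt };

	void step(unsigned char opcode)
	{
	 if (opcode < sizeof dispatch / sizeof dispatch[0])
		dispatch[opcode]();	/* the original indirect jump */
	}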

Now you're getting into the issue of guessing what the programmer's intent was
from the code he or she wrote.  It's doable -- I do it by hand all the time --
but it's very non-algorithmic.  This is AI, pure and simple.

I can see the possibility of setting up a neural net which can be trained by
actual examples to 'recognize' data structures that may potentially lie in
the object code.  A decompiler with this kind of tool would not just be a
decompiler; it would potentially be a program that created improved versions
of the source code based on its analysis of the object code: it could, for
instance, pick up many data structures that eluded the programmer.

My claim is that you can't really have a decompiler that structures data
without having this extra feature.
-- 
Send compilers articles to compilers@esegue.segue.boston.ma.us
{ima | spdcc | world}!esegue.  Meta-mail to compilers-request@esegue.

harrison@necssd.NEC.COM (Mark Harrison) (09/12/90)

In article <1990Sep9.010032.23235@sco.COM>, seanf@sco.COM (Sean Fagan) writes:

> As I've pointed out before, someone once wrote a decompiler for the VAX,
> which I've played with.  It *knew* how PCC generated code, so, after
> "disassembling" (internal use only, generally; it didn't try to decompile
> assembly text, in other words), it would structure the code generated
> from that.

On a related topic, does anyone know of any research into machine
translation of hand-written assembly code into an HLL?
-- 
Mark Harrison             harrison@necssd.NEC.COM
(214)518-5050             {necntc, cs.utexas.edu}!necssd!harrison
[How about all those Autocoder to Cobol translators there were 30 years
ago? -John]
-- 
Send compilers articles to compilers@esegue.segue.boston.ma.us
{ima | spdcc | world}!esegue.  Meta-mail to compilers-request@esegue.

freek@fwi.uva.nl (Freek Wiedijk) (09/12/90)

vu0310@bingvaxu.cc.binghamton.edu.cc.binghamton.edu (R. Kym Horsell) writes:
>I have made various attempts at writing assembler -> HLL translators over
>the years, ...
>
>I ultimately see such tools as pointless, however, in that they only
>encourage the maintenance of ``dusty deck'' stuff (in a new prettier form)
>-- re-inventing the wheel occasionally comes up with a better wheel!

No, no, no, they're VERY useful!  No one will EVER be able to complain
anymore that a system is sold without source code.  Take a look at the
complaints about the NeXT operating system.  Also, Nosy is my favorite
Mac application.  Despite its interface.

Freek "the Pistol Major" Wiedijk                      E-mail: freek@fwi.uva.nl
-- 
Send compilers articles to compilers@esegue.segue.boston.ma.us
{ima | spdcc | world}!esegue.  Meta-mail to compilers-request@esegue.

Jonathan.Bowen@prg.oxford.ac.uk (Jonathan Bowen) (09/13/90)

Speaking of decompilers, I have recently written a simple compiler in
Prolog based almost directly on a formal compiling specification.
Currently a high-level program can be supplied and object code produced
(non-deterministically, so several (possibly an infinite number of)
versions may be generated) or a high-level program and some object code
may be supplied and checked against one another.  Prolog in principle
can run backwards, so it may be possible to supply some object code and
produce a high-level program or programs as output. The main problem is
running the necessary arithmetic backwards (i.e. avoiding the use of
"is") and I am currently looking into this. Has anyone else done any
similar work or can anyone supply any useful references?

I am cross posting this to comp.lang.prolog as well as comp.compilers
for wider coverage.

--
Jonathan Bowen, Oxford.
-- 
Send compilers articles to compilers@esegue.segue.boston.ma.us
{ima | spdcc | world}!esegue.  Meta-mail to compilers-request@esegue.

nreadwin@miclon.uucp (Neil Readwin) (09/13/90)

In article <9009092238.AA28341@dynamo.ecn.purdue.edu> hankd@dynamo.ecn.purdue.edu (Hank Dietz) writes:
>Basic disassembly is trivial

Not for all instruction sets.  The VAX has a CASE instruction where the number
of cases (and therefore the size of the instruction) can be in a register.
Disassembling this is *not* trivial.  Neil.
-- 
Phone: +44 71 528 8282  E-mail: nreadwin@micrognosis.co.uk
-- 
Send compilers articles to compilers@esegue.segue.boston.ma.us
{ima | spdcc | world}!esegue.  Meta-mail to compilers-request@esegue.

td@alice.UUCP (Tom Duff) (09/13/90)

vu0310@bingvaxu.cc.binghamton.edu.cc.binghamton.edu (R. Kym Horsell @ SUNY Binghamton, NY) says:
>I have made various attempts at writing assembler -> HLL translators over
>the years, mostly inspired by the `beatify' tool that attempts to transform
>spaghetti FORTRAN into structured FORTRAN.

Wouldn't a `beatify' tool transform Profane Fortran into Blessed Fortran?
[We could always hope so. -John]
-- 
Send compilers articles to compilers@esegue.segue.boston.ma.us
{ima | spdcc | world}!esegue.  Meta-mail to compilers-request@esegue.

kym@bingvaxu.cc.binghamton.edu.cc.binghamton.edu (R. Kym Horsell) (09/14/90)

In article <679@culhua.prg.ox.ac.uk> Jonathan Bowen <Jonathan.Bowen@prg.oxford.ac.uk> writes:
>produce a high-level program or programs as output. The main problem is
>running the necessary arithmetic backwards (i.e. avoiding the use of
>"is") and I am currently looking into this. Has anyone else done any
>similar work or can anyone supply any useful references?

Funny, I didn't _see_ this in comp.lang.prolog...

To answer your query, arithmetic _can_ be run backward.

One simple, but inefficient, technique is to use ``unary'' arithmetic.  (I.e.
zero is 0, 1 is suc(0), 2 is suc(suc(0))). Rewriting the is/2, >/2, etc is a
pain but it _can_ be done.

Quite a lot of interesting work has been done on investigating which Prolog
relations can run backward -- look thru the Int Conf on Logic Prog & similar
things (they only go back a few years, so why not check out the refs for
yourself?).

DHD Warren wrote a nice article in Software Practice and Experience regarding
compiler-writing in Prolog -- check that out too (he writes fairly _neat_
code for one thing, and we're talking about readability here).

-Kym Horsell
-- 
Send compilers articles to compilers@esegue.segue.boston.ma.us
{ima | spdcc | world}!esegue.  Meta-mail to compilers-request@esegue.

ch@dce.ie (Charles Bryant) (09/15/90)

In article <3972@bingvaxu.cc.binghamton.edu> vu0310@bingvaxu.cc.binghamton.edu.cc.binghamton.edu (R. Kym Horsell) writes:
>There's no _unique_ mapping from HLL into assembler, or from assembler into
>assembler for that matter, either! In _principle_ translation between
>assembler and an HLL is no harder than the other way 'round. You can't even
>say that "assembly language is close to the machine & you can do things in
>it that you can't in HLLs" -- after all, assembly and HLL are both capable
>of emulating Turning machines, right?

Well, how would you translate this C function into Pascal?

	typedef struct list {
		struct list *next;
		int item;
	} list;

	list *head;
	
	insert(list *newelem)
	{
	 list **p;
	 for (p = &head; *p; p = &(*p)->next)
		if ( (*p)->item >= newelem->item) break;
	 newelem->next = *p;
	 *p = newelem;
	}

It seems to me that you either need an extra node, or you have to simulate the
pointers in an array. If you manage the Pascal, try BASIC (even with no
pointer operations at all it is still "equivalent" to C).
-- 
Charles Bryant (ch@dce.ie)
-- 
Send compilers articles to compilers@esegue.segue.boston.ma.us
{ima | spdcc | world}!esegue.  Meta-mail to compilers-request@esegue.

hankd@dynamo.ecn.purdue.edu (Hank Dietz) (09/15/90)

In article <11326@alice.UUCP> you write:
>vu0310@bingvaxu.cc.binghamton.edu.cc.binghamton.edu (R. Kym Horsell @ SUNY Binghamton, NY) says:
>>I have made various attempts at writing assembler -> HLL translators over
>>the years, mostly inspired by the `beatify' tool that attempts to transform
>>spaghetti FORTRAN into structured FORTRAN.
>
>Wouldn't a `beatify' tool transform Profane Fortran into Blessed Fortran?
>[We could always hope so. -John]

Nope.  The key transformation in fixing spaghetti code is an oldie but
goodie called "code straightening."  It's usually pretty easy.

However, such transformations are generally not very good at dealing with
comments.  Not only are they notoriously bad at creating meaningful new
comments automatically, but they don't even really know what to do with the
comments that were in the original source.

This last problem is rooted in the fact that most languages do not
syntactically group comments with the source code constructs to which they
apply (they are usually implemented as a lexical hack -- removed from the
source before syntax analysis).  Further, how does a beautifier know what
the comment refers to -- it would be pretty strange, for example, to retain
a comment about multiplying things when the multiply has been
strength-reduced out of existence?

One might argue that the big problem in programs today isn't unstructured
use of GOTO, but unstructured use of comments....  One of my PhD students
(Eng-Siong Tan) is currently investigating this problem....

						-hankd@ecn.purdue.edu
-- 
Send compilers articles to compilers@esegue.segue.boston.ma.us
{ima | spdcc | world}!esegue.  Meta-mail to compilers-request@esegue.

hawley@icot32.icot.or.jp (David John Hawley) (09/15/90)

In article <679@culhua.prg.ox.ac.uk> Jonathan Bowen <Jonathan.Bowen@prg.oxford.ac.uk> writes:
>.... The main problem is running the necessary arithmetic backwards (i.e.
>avoiding the use of "is") and I am currently looking into this. Has anyone
>else done any similar work or can anyone supply any useful references?

I'm not confident that that is the major problem, but as far as
"more declarative" realizations of arithmetic (and other non-logical goodies),
check out "constraint logic programming". See the recent pair of articles
in the July/90 CACM - an interpreter and compiler are available for the
CLP(R) language mentioned there (for academic use).

If you are concerned about completeness issues for your "invertible 'is'",
maybe you would be interested in our Grobner-base constraint solver,
the "elephant gun approach" also mentioned in the above articles ;-)

David Hawley,
CAL group, ICOT
-- 
Send compilers articles to compilers@esegue.segue.boston.ma.us
{ima | spdcc | world}!esegue.  Meta-mail to compilers-request@esegue.

kym@bingvaxu.cc.binghamton.edu.cc.binghamton.edu (R. Kym Horsell) (09/15/90)

In article <1990Sep14.181616.26890@dce.ie> ch@dce.ie (Charles Bryant) writes:
>Well, how would you translate this C function into Pascal?
>
>	typedef struct list {
>		struct list *next;
>		int item;
>	} list;
>
>	list *head;
>	
>	insert(list *newelem)
>	{
>	 list **p;
>	 for (p = &head; *p; p = &(*p)->next)
>		if ( (*p)->item >= newelem->item) break;
>	 newelem->next = *p;
>	 *p = newelem;
>	}

I'm not sure at what level to aim this so I'll play it straight.

I think we can try at least:

	type 
		list = 0..maxmem;
		listinfo = record next: list; item: integer; end
			{listinfo};
	var
		mem: array[0..maxmem] of listinfo;
		head: list;

	procedure insert(newelem: list);
	var	p: list; break: boolean;
	begin
		p := head;
		break := false;
		while (p<>0) and not break do
		if mem[p].item>=mem[newelem].item then break:=true
		else p := mem[p].next;
		mem[newelem].next := p;
		mem[p].next := newelem;
	end {insert};

It even duplicates the same bug as the original code! (:-)

Complaints about _efficiency_ can be countered by resorting to arguments re
code optimization (which has come a long way even since Pascal was designed)
and various tricks that architects get up to (e.g. since _all_ addressing on
a given machine might be register-relative, the indexing in my Pascal code
will not actually cost anything compared to the C code).  In any case,
efficiency issues were not (originally) in question.

We might also translate the code to Fortran as follows:

	subroutine insert(newelem)
	common mem(0:maxmem)
	integer list
	integer p
	p = head
10	if(p.eq.0) goto 20
	if(mem(p+ITEMOFFSET).ge.mem(newelem+ITEMOFFSET)) goto 20
	p = mem(p+NEXTOFFSET)
	goto 10
20	mem(newelem+NEXTOFFSET) = p
	mem(p+NEXTOFFSET) = newelem
	return
	end

Again, same bug...

You can perform any syntactic sugar to get this into (whatever)
version of Basic you like.

There may be some transcription errors here & there above, but
I don't think anything is essentially _wrong_ with the code,
so don't get too picky.

The good thing about theoretical results from formal language theory
(and thank you Manfred von Thun) is that we _can_ say (all too few) general 
things about programming and be sure we're right.

Of course none of this has _proved_ that things are ``equally hard''
as indicated in my original post -- that _would_ be much harder.

-Kym Horsell

[I suspect that the original author expected the Pascal version to use Pascal
pointers.  My Pascal is rusty enough that I don't immediately see what the
problem is, other than perhaps the static list head.
My experience is that translating from one language to another is 90% easy,
but the remaining 10% can be incredibly hard.  In about 1970, I took a very
vanilla Fortran calendar-printing program and ran it through IBM's Fortran to
PL/I translator.  Although most of the Fortran was recognizable more or less
statement by statement, some of it, particularly the I/O, expanded by orders
of magnitude.  Most of the blow-up handled cases that I knew wouldn't happen
(most notably reading into a literal in a Format statement), but it was hard
for the translator to know that.  This is a long way from decompiling
assembler into some language. -John]
-- 
Send compilers articles to compilers@esegue.segue.boston.ma.us
{ima | spdcc | world}!esegue.  Meta-mail to compilers-request@esegue.

roland@ai.mit.edu (Roland McGrath) (09/17/90)

In article <11326@alice.UUCP> td@alice.UUCP (Tom Duff) writes:
   Wouldn't a `beatify' tool transform Profane Fortran into Blessed Fortran?
   [We could always hope so. -John]

Yes, if one beats on Fortran enough, it eventually becomes a blessed pile of
mush, much more beautiful than the original.
--
	Roland McGrath
	Free Software Foundation, Inc.
roland@ai.mit.edu, uunet!ai.mit.edu!roland
[So how's Gnu Fortran coming? -John]
-- 
Send compilers articles to compilers@esegue.segue.boston.ma.us
{ima | spdcc | world}!esegue.  Meta-mail to compilers-request@esegue.

raulmill@usc.edu (09/17/90)

In article <_5A%GS%@rpi.edu> adamsf@turing.cs.rpi.edu (Frank Adams) writes:
>In article <12976@june.cs.washington.edu> pardo@cs.washington.edu
>(David Keppel) writes:
>>My guess is that decompiling into a language that is, e.g.,
>>saccharine-sweetened assembler (C) is `easy', while decompiling, e.g.,
>>into APL is hard.
>
>If we assume that the program is to be decompiled into the language in
>which it was written, it is in general easier to decompile the less the
>compiler optimizes the generated code.
>
>A second problem is type inference.  APL, with a fixed set of data types,
>is easier in this respect than C.  For example, when the code loads a pointer
>into a register and indexes off of it, what kind of struct is the pointer
>pointing to?

[Frank then goes on to state his opinion that C is pretty good for
exact transliteration of machine language.]

If I may point out...

[1] the first commercial use of APL was to describe the IBM 360 architecture.
APL has the ability to concisely describe just about any machine
architecture.

[2] As far as I know, the language analysis/verification tools available for
APL are pretty good [some would say better than those available for any other
language, but without first hand knowledge I'm not so sure.  I do know that 7
or 8 years ago 3 bugs were found in that 360 description by one of these
verifiers.]

If you want an exact HLL transliteration of raw machine code, or a
translation into an assembler-like language, there is no reason why APL
should be harder than any other language (though I'd recommend using J
instead, because there is an odd sort of problem getting APL to talk in
ascii, and J is better IMHO :)

To turn back to the original poster's question, the best disassemblers
I have seen often do a lot of interpolation based on system calls
whose arguments are known, various compiler conventions and, if you
are lucky enough, linking/debugging information left in the code by
the developers.

As far as I've seen, the worst problem in converting from machine language to
other representations is figuring out what to call a specific piece of
memory.  (code? text? struct? etc.)  A lot of this information can be
interpolated by logic on the order of 'well, if this instruction is illegal,
we know everything back to the last branch isn't instructions.'
[It seems to me that [1] is a red herring, the IBM POO describes the 370
in English, but disassembling into English is difficult. On the other hand,
decompiling into scalar APL expressions shouldn't be hard. -John]

-- 
Send compilers articles to compilers@esegue.segue.boston.ma.us
{ima | spdcc | world}!esegue.  Meta-mail to compilers-request@esegue.

ch@dce.ie (Charles Bryant) (09/18/90)

In article <4028@bingvaxu.cc.binghamton.edu> kym@bingvaxu.cc.binghamton.edu.cc.binghamton.edu (R. Kym Horsell) writes:
>In article <1990Sep14.181616.26890@dce.ie> ch@dce.ie (Charles Bryant) writes:
>>Well, how would you translate this C function into Pascal?
>>[simple linked list insertion]
>
>I'm not sure at what level to aim this so I'll play it straight.
>
>[pascal version using fixed size array]
>It even duplicates the same bug as the original code! (:-)

That's funny, I can't see any bug in the original. I can see a bug in the
Pascal one though: when the list is empty new elements can't be inserted
because 'head' is never changed.

>Complaints about _efficiency_ can be countered by resorting to arguments re
>code optimization ...

Quite true. But one important property of the original is that it has no
fixed limit. The Pascal version is limited by the space allocated in
the array. If the array is big, space is wasted; and any fixed size may turn
out to be too small.

>[I suspect that the original author expected the Pascal version to use Pascal
>pointers.  My Pascal is rusty enough that I don't immediately see what the
>problem is, other than perhaps the static list head.

Any method can be used if the basic properties of the original code are
preserved. One of these is that the list has no predetermined maximum size.
As far as I can see this means pointers are necessary, but it is not possible
to take the address of a variable in Pascal, so the case of an empty list
must be treated specially.  I'll use the array method here to make it easy to
compare.

	<same declarations>
	begin
		if head=0 then
			begin
				head := newelem;
				mem[newelem].next := 0;
			end
		else begin
			p := head; break:= false;
			while (p<>0) and not break do
			   if mem[p].item >= mem[newelem].item then break:=true
			   else p := mem[p].next;
			mem[newelem].next := mem[p].next;
			mem[p].next := newelem;
		end;
	end;

>...  This is a long way from decompiling assembler into
>some language. -John]

What I meant was that unless the decompiler produces code in the same
language that the code was originally written in, it may have to produce
things that bear no relation to the original code (nor the assembler). If the
target language is less powerful than the original, a decompilation may not be
possible.

(Well obviously you could write a program in the target language that emulated
the machine that the assembly came from, but I wouldn't consider that to be
decompilation).
-- 
Charles Bryant (ch@dce.ie)
-- 
Send compilers articles to compilers@esegue.segue.boston.ma.us
{ima | spdcc | world}!esegue.  Meta-mail to compilers-request@esegue.

ctl8588@rigel.tamu.edu (LAUGHLIN, CHET) (09/19/90)

In article <1990Sep14.181616.26890@dce.ie>, ch@dce.ie (Charles Bryant) writes...
>Well, how would you translate this C function into Pascal?
  [ code deleted ]

How one 'translates' into Pascal is not the point.  The point is whether you
can take the machine code (of the resulting program) and pull it into Pascal,
or Basic for that matter.  It may not be as pretty to look at, but that
should be more of a limitation imposed by the language structure itself.

-Chet Laughlin
-- 
Send compilers articles to compilers@esegue.segue.boston.ma.us
{ima | spdcc | world}!esegue.  Meta-mail to compilers-request@esegue.

djones@decwrl.dec.com (Dave Jones) (09/19/90)

I just wrote a SPARC object-code dumper a couple of days ago.  I am as much
interested in the symbol table and relocation information as I am in the
instructions, but I never even worried about data intertwined with the
instructions.  (It helps that all SPARC instructions begin on 32-bit
boundaries, of course.)  I simply have one flag that prints the text segment
as a hex and ASCII dump, and another that prints it as machine instructions.
You can't run the assembler over the output, but then we're not supposed to do
that, are we?  It's plenty good for my purposes, which are to inspect machine
code in order to infer undocumented conventions and to verify that compiled
code is correct.
-- 
Send compilers articles to compilers@esegue.segue.boston.ma.us
{ima | spdcc | world}!esegue.  Meta-mail to compilers-request@esegue.

markh@csd4.csd.uwm.edu (Mark William Hopkins) (09/19/90)

In article <1990Sep14.181616.26890@dce.ie> ch@dce.ie (Charles Bryant) writes:
>Well, how would you translate this C function into Pascal?
>
>	typedef struct list {
>		struct list *next;
>		int item;
>	} list;
>
>	list *head;
>	
>	insert(list *newelem)
>	{
>	 list **p;
>	 for (p = &head; *p; p = &(*p)->next)
>		if ( (*p)->item >= newelem->item) break;
>	 newelem->next = *p;
>	 *p = newelem;
>	}
...
>It seems to me that you either need an extra node, or you have to simulate the
>pointers in an array. If you manage the Pascal, try BASIC (even with no
>pointer operations at all it is still "equivalent" to C).

By first translating it into Proper C...

[A] Change the 'list = struct list' into 'list = struct list *', and
    make the original program readable :).

typedef struct List {
   struct List *Next; int Item;
} *List;

List Head;

Insert(List NewElem) {
   List *P;
   for (P = &Head; *P != NULL; P = &(*P)->Next)
      if ((*P)->Item >= NewElem->Item) break;
   NewElem->Next = *P;
   *P = NewElem;
}

[B] Add in a dummy variable, dP, to represent *P.

Insert(List NewElem) {
   List *P, dP;

   for (P = &Head, dP = *P; dP != NULL; P = &dP->Next, dP = *P)
      if (dP->Item >= NewElem->Item) break;
   NewElem->Next = dP;
   dP = *P = NewElem;
}

[C] Use semantic equalities to make P go away...

Insert(List NewElem) {
   List *P, dP;

   for (P = &Head, dP = Head; dP != NULL; P = &dP->Next, dP = dP->Next)
      if (dP->Item >= NewElem->Item) break;
   NewElem->Next = dP;
   *P = NewElem; dP = NewElem;
}

[D] Make P go away, and make believe that dP was P all along...

Insert(List NewElem) {
   List P;

   for (P = Head; P != NULL; P = P->Next)
      if (P->Item >= NewElem->Item) break;
   NewElem->Next = P; P = NewElem;
}

Proper C is a beautiful subset of C, in which all non-system identifiers are
syntactically required to begin in capitals, all "&"'s are removed, and all
recursively defined structures are required to have pointers to that structure
with the same name as the structure...

See?  No new node.  You closed your eyes in steps [B] and [C], didn't you? :)
-- 
Send compilers articles to compilers@esegue.segue.boston.ma.us
{ima | spdcc | world}!esegue.  Meta-mail to compilers-request@esegue.

td@alice.UUCP (Tom Duff) (09/21/90)

markh@csd4.csd.uwm.edu (Mark William Hopkins @ University of Wisconsin-Milwaukee)
rewrites the following code (due to ch@dce.ie (Charles Bryant)):
	typedef struct list {
		struct list *next;
		int item;
	} list;

	list *head;
	
	insert(list *newelem)
	{
	 list **p;
	 for (p = &head; *p; p = &(*p)->next)
		if ( (*p)->item >= newelem->item) break;
	 newelem->next = *p;
	 *p = newelem;
	}
as
	typedef struct List {
	   struct List *Next; int Item;
	} *List;
	
	List Head;
	
	Insert(List NewElem) {
	   List P;
	
	   for (P = Head; P != NULL; P = P->Next)
	      if (P->Item >= NewElem->Item) break;
	   NewElem->Next = P; P = NewElem;
	}

claiming that the result is trivially translatable into Pascal, (indeed, it
is written in the Pascal subset of C), and that the two are semantically
equivalent.

Unfortunately, the two are not equivalent.  In the first, the assignment
	*p=newelem; can change the value of `head' (say when head==0).  In
the rewritten version, no assignment to Head is possible, since it is never
mentioned on the left of an assignment, and no pointer to it is ever
developed.  Of course, the code is useless without the potential side-effect
on head.

The problem is that the rewrite has decreased p's level of indirection by 1.
So p (in the rewritten version) has the same rvalue as *p (in the original),
but a different lvalue.  Unfortunately, on the left side of an assignment it
is the lvalue that matters.

Mr. Bryant's whole point was that occasionally the assignment
	*p = newelem; can change head.  This effect is not directly
achievable in Pascal, because it is not possible to develop a pointer to data
not allocated on the heap.  (Well, maybe call-by-reference is a partial
exception, but the implicit pointer there is not a first class citizen.)

Many Pascal programmers (apparently including Mr. Hopkins) consider the
semantic notions of lvalue and rvalue to be needless complication in the
description of C, since there is little need for them in describing Pascal,
which contains no lvalue-to-rvalue conversion operator (unary & in C.)
-- 
Send compilers articles to compilers@esegue.segue.boston.ma.us
{ima | spdcc | world}!esegue.  Meta-mail to compilers-request@esegue.