wwho@ucdavis.edu (W. Wilson Ho) (09/06/90)
I am looking for any information related to disassembling object code into assembly langauge or even higher-level language such as C. Would someone please give me pointers to program sources, documentation or papers related to this? Thanks in advance! W. Wilson Ho | INTERNET: how@ivy.ucdavis.edu Division of Computer Science | UUCP: ...!ucbvax!ucdavis!ivy!how EECS Department | BITNET: wwho@ucdavis.bitnet University of California | Davis, CA 95616 | [Turning object code back into assembler is pretty straightforward, and every debugger does it. Someone else asked about disassembling into higher level languages a little while ago, but I didn't see any responses. -John] -- Send compilers articles to compilers@esegue.segue.boston.ma.us {ima | spdcc | world}!esegue. Meta-mail to compilers-request@esegue.
hrubin@l.cc.purdue.edu (Herman Rubin) (09/07/90)
In article <HOW.90Sep5173755@sundrops.ucdavis.edu>, wwho@ucdavis.edu (W. Wilson Ho) writes: > > I am looking for any information related to disassembling > object code into assembly langauge or even higher-level language such > as C. Would someone please give me pointers to program sources, > documentation or papers related to this? ................... > [Turning object code back into assembler is pretty straightforward, and > every debugger does it. Someone else asked about disassembling into higher > level languages a little while ago, but I didn't see any responses. -John] It is not quite that straightforward, and disassemblers can be somewhat hard to get. Debuggers usually use object code set up specially for debugging, with certain features available. Some debuggers even require that the source code be included in such a way that this can be displayed at debug time. Disassembly frequently is even ambiguous. -- Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907 Phone: (317)494-6054 hrubin@l.cc.purdue.edu (Internet, bitnet) {purdue,pur-ee}!l.cc!cik(UUCP) -- Send compilers articles to compilers@esegue.segue.boston.ma.us {ima | spdcc | world}!esegue. Meta-mail to compilers-request@esegue.
pardo@cs.washington.edu (David Keppel) (09/07/90)
>[Turning object code back into assembler is pretty straightforward, and >every debugger does it. Someone else asked about disassembling into higher >level languages a little while ago, but I didn't see any responses. -John] See anonymous ftp from cs.washington.edu (128.95.1.4) `pub/decomp.tar.Z'. It takes VAX object code back to fake C. One of my profs did a one-day hack a while back to decompile assembly code in to compiler IR. From that, he could have created source code in several languages. My guess is that decompiling in to a language that is e.g., saccarine-sweetened assembler (C) is `easy', while decompiling e.g., in to APL is hard. ;-D on ( Now for a deprogrammer... ) Pardo -- pardo@cs.washington.edu {rutgers,cornell,ucsd,ubc-cs,tektronix}!uw-beaver!june!pardo -- Send compilers articles to compilers@esegue.segue.boston.ma.us {ima | spdcc | world}!esegue. Meta-mail to compilers-request@esegue.
raul@sdnp1.ucsd.edu (Raul Rathmann) (09/08/90)
In article <HOW.90Sep5173755@sundrops.ucdavis.edu> wwho@ucdavis.edu (W. Wilson Ho) writes: > > I am looking for any information related to disassembling >object code into assembly langauge or even higher-level language such >as C. Would someone please give me pointers to program sources, >documentation or papers related to this? I have used a disassembler called "unas" that spit out "as" assembly code from object code on VAXen and Suns running BSD UNIX. I picked it up from some anonymous ftp site, possibly from a comp.sources.unix archive site. It was pretty old and I might still have it around, but no guarantees. I used it to reverse engineer part of a library that I had no way of getting source code for. Getting it from assembly to C was pretty difficult and was mostly an intuitive operation. To make a program that automatically did this would probably be a major undertaking and even then I don't think it would be able to figure out certain things. Raul Rathmann raul@sdnp1.ucsd.edu OR rrathmann@ucsd.edu -- Send compilers articles to compilers@esegue.segue.boston.ma.us {ima | spdcc | world}!esegue. Meta-mail to compilers-request@esegue.
seanf@sco.COM (Sean Fagan) (09/09/90)
In article <2900@network.ucsd.edu> raul@sdnp1.ucsd.edu (Raul Rathmann) writes: >Getting it from assembly to C was pretty difficult and was >mostly an intuitive operation. To make a program that automatically did >this would probably be a major undertaking and even then I don't think it >would be able to figure out certain things. As I've pointed out before, someone once wrote a decompiler for the VAX, which I've played with. It *knew* how PCC generated code, so, after "disassembling" (internal use only, generally; it didn't try to decompile assembly text, in other words), so it would structure the code generated from that. It's really not that difficult thing to do, if you don't have to deal with esoteric instructions and a smart compiler. Oh, yeah: it didn't use the symbol table, and the code it generated certainly had some, uhm, quirks. Variable names such as 'r0' through 'r15', for example (except for global variables, that is). Structure members were considered different variables (I once tried decompiling fork.o, for example; something like three quarters of the variables were offsets into the user structure, and I came up with a little chart that had the offsets (and sizes) of all the elements). Etc. But it did work. Thus borneth BSD Empire. -- Sean Eric Fagan, seanf@sco.COM, uunet!sco!seanf, (408) 458-1422 -- Send compilers articles to compilers@esegue.segue.boston.ma.us {ima | spdcc | world}!esegue. Meta-mail to compilers-request@esegue.
rwallace@vax1.tcd.ie (09/09/90)
In article <HOW.90Sep5173755@sundrops.ucdavis.edu>, wwho@ucdavis.edu (W. Wilson Ho) writes: > I am looking for any information related to disassembling > object code into assembly langauge or even higher-level language such > as C. Would someone please give me pointers to program sources, > documentation or papers related to this? > [Turning object code back into assembler is pretty straightforward, and > every debugger does it. Someone else asked about disassembling into higher > level languages a little while ago, but I didn't see any responses. -John] There's no unique mapping from machine code to HLL and hence (unlike machine code to assembler) no simple algorithm (your algorithm might recognize something it thinks is a loop but is it a for loop, a while loop or just something hacked together with gotos? that's before you even think about what optimized machine code will look like). You could probably figure out a complicated algorithm but to what purpose, since with no symbol table the HLL code will not be significantly more readible than the assembler code anyway. -- Russell Wallace, Trinity College, Dublin rwallace@vax1.tcd.ie -- Send compilers articles to compilers@esegue.segue.boston.ma.us {ima | spdcc | world}!esegue. Meta-mail to compilers-request@esegue.
hankd@dynamo.ecn.purdue.edu (Hank Dietz) (09/10/90)
In article <HOW.90Sep5173755@sundrops.ucdavis.edu> you write: > I am looking for any information related to disassembling >object code into assembly langauge or even higher-level language such >as C. Would someone please give me pointers to program sources, >documentation or papers related to this? Basic disassembly is trivial, particularly if you have an object module with a name list. The interesting problems are: [1] Determining which portions of a raw memory image are code and which are data. Typically, this is done by providing a set of code entry points and having the disassembler trace program flow marking each word with type information as each flow path is followed. [2] Dealing with self-modifying code. At least the technique of [1] can detect when this might happen.... I don't know of any reasonable way to deal with it. Notice that indirect jump tables are particularly difficult to flow trace (see [1]), as are techniques which use a Call instruction but follow the instruction with the argument values (raw data) and tweak the return address appropriately (as in some threaded interpreters). Notice that knowing that the code image came from a particular compiler can make these problems much easier to deal with, since you can simply recognize the compiler's code generation idiom. -hankd@ecn.purdue.edu PS: Back around 1981-2 I did a flow analyzing disassembler for several then-popular microprocessors (e.g., 8080). I still have it, but it really isn't very impressive... especially when it hits some of those problem cases noted above (e.g., PCHL). -- Send compilers articles to compilers@esegue.segue.boston.ma.us {ima | spdcc | world}!esegue. Meta-mail to compilers-request@esegue.
vu0310@bingvaxu.cc.binghamton.edu.cc.binghamton.edu (R. Kym Horsell) (09/10/90)
In article <6839.26ea3b0e@vax1.tcd.ie> rwallace@vax1.tcd.ie writes: >There's no unique mapping from machine code to HLL and hence (unlike machine >code to assembler) no simple algorithm (your algorithm might recognize >something it thinks is a loop but is it a for loop, a while loop or just >something hacked together with gotos? that's before you even think about what >optimized machine code will look like). You could probably figure out a There's no _unique_ mapping from HLL into assembler, or from assembler into assembler for that matter, either! In _principle_ translation between assembler and an HLL is no harder than the other way 'round. You can't even say that "assembly language is close to the machine & you can do things in it that you can't in HLLs" -- after all, assembly and HLL are both capable of emulating Turning machines, right? I have made various attempts at writing assemlber -> HLL translators over the years, mostly inspired by the `beatify' tool that attempts to transform spagetti FORTRAN into structured FORTRAN. A bit of graph thy is all that's needed for __most_ of that. I ultimately see such tools as pointless, however, in that they only encourage the maintenance of ``dusty deck'' stuff (in a new prettier form) -- re-inventing the wheel occasionally comes up with a better wheel! -Kym Horsell -- Send compilers articles to compilers@esegue.segue.boston.ma.us {ima | spdcc | world}!esegue. Meta-mail to compilers-request@esegue.
Chuck.Phillips@FtCollins.NCR.COM (Chuck.Phillips) (09/10/90)
>>>>> On 9 Sep 90 12:52:29 GMT, rwallace@vax1.tcd.ie said: > There's no unique mapping from machine code to HLL and hence > (unlike machine code to assembler) no simple algorithm (your > algorithm might recognize something it thinks is a loop but is it > a for loop, a while loop or just something hacked together with > gotos? Very true, but it doesn't matter whether it uses "for" or "while" loops or a combination based on heuristics. (Turning a mess of "goto"s into "for" or "while" loops sounds particularly attractive!) Recreating the original code _exactly_ is overkill; it's the _algorithms_ one generally wants to see. The more the decompiler is able to abstract "goto"s into "while"s, "for"s, "do...while"s, etc. the better, even if it doesn't match the original. Further heuristics could be used for combining conditionals to avoid 3 "if"s when a single one would do, finding candidates for "switch" statements, etc. Best of all: Combined with an optimizing compiler, you get a kluge source code optimizer, and at least open the possibility of porting optimizations to a platform with a less effective optimizer. In some cases, it may even make obvious new optimizations of the underlying algorithm. (Just the tool to have around when optimizing highly time critical subroutines. ;-) Structure offsets and stack variable names do present a particular problem, which may be _partially_ overcome by reading the symbol table (if it exists) and annotating accordingly. Patching your compiler to include additional information in the object file or a side file can further the possible abstraction. In the the very worst case, you get to read cryptic C instead of cryptic assembler. Cheers, -- Chuck Phillips MS440 NCR Microelectronics Chuck.Phillips%FtCollins.NCR.com 2001 Danfield Ct. Ft. Collins, CO. 80525 uunet!ncrlnk!ncr-mpd!bach!chuckp -- Send compilers articles to compilers@esegue.segue.boston.ma.us {ima | spdcc | world}!esegue. Meta-mail to compilers-request@esegue.
dik@cwi.nl (Dik T. Winter) (09/10/90)
In article <2524@l.cc.purdue.edu> hrubin@l.cc.purdue.edu (Herman Rubin) writes: > > [Turning object code back into assembler is pretty straightforward, and > > every debugger does it. Someone else asked about disassembling into higher > > level languages a little while ago, but I didn't see any responses. -John] > > It is not quite that straightforward, and disassemblers can be somewhat > hard to get. Debuggers usually use object code set up specially for > debugging, with certain features available. Some debuggers even require > that the source code be included in such a way that this can be displayed > at debug time. Disassembly frequently is even ambiguous. Moreover, debuggers know at what addresses instructions start, a disassembler does not have knowledge about it. This gets hairy if the program uses indirect jumps on a machine with variable sized instructions. I once wrote a disassembler for a CDC Cyber PPU. It tried to follow all code threads, but failed horribly in this aspect; and had to be guided. -- dik t. winter, cwi, amsterdam, nederland dik@cwi.nl -- Send compilers articles to compilers@esegue.segue.boston.ma.us {ima | spdcc | world}!esegue. Meta-mail to compilers-request@esegue.
adamsf@turing.cs.rpi.edu (Frank Adams) (09/11/90)
In article <12976@june.cs.washington.edu> pardo@cs.washington.edu (David Keppel) writes: >My guess is that decompiling in to a language that is e.g., >saccarine-sweetened assembler (C) is `easy', while decompiling e.g., >in to APL is hard. If we assume that the program is to be decompiled into the language in which it was written, it is in general easier to decompile the less the compiler optimizes the generated code. A second problem is type inference. APL, with a fixed set of data types, is easier in this respect than C. For example, when the code loads a pointer into a register and indexes off of it, what kind of struct is the pointer pointing to? If the object is only to get some kind of higher-level language representation of an arbitrary executable, C will indeed be easier. But this kind of decompilation is not very useful -- what read foo.bar = 0; in the original is likely to come out as *(int *)(((char *)&foo + 8)) = 0; -- Send compilers articles to compilers@esegue.segue.boston.ma.us {ima | spdcc | world}!esegue. Meta-mail to compilers-request@esegue.
markh@csd4.csd.uwm.edu (Mark William Hopkins) (09/11/90)
A while back I wrote a disassembler for the Intel 8052 family. This is a
fairly straightforward task, so it only took a few hours.
However, there are some non-trivial aspects to even THIS process that brought
about complications which were primarily responsible for making the project
so long-winded (a few hours is too long ;) ).
(1) Recognizing data embedded within code segments.
To accomplish this, I basically 'told' the disassembler to treat everything
not accessible from the set of entry points as data. This disassembler will
recursively search out all the code segments deriveable from the entry points
by "jumps"'s and "call"'s. But ...
(2) Treating indirect jumps and calls.
... It could not locate the destinations of indirect jumps, since this
basically required a run-time analysis of the code. The particular program I
was analysing, however, was of a special form where all the jumps were stored
in a jump table and could potentially be recognized as such by the
disassembler. But I simply wasn't interested in making such an investment in
the effort at the time. Consequentally, the code segment contained numerous
'gaps'. One consolation, though, was that I could add these destinations by
hand to the initial set of entry points after doing my own analysis of the
code disassembled up to that point.
This particular program happened to be an interpreter, and thus was filled
with jump tables...
(3) Symbol table generation.
This is an obvious problem with any object code lacking a symbol table.
Ultimately, you're talking about an ideal application of 'natural language
generation' which is one of the most central issues of AI. Your labels and
variable names and routine names should be reabable and should relate to
their purpose and use.
Of course, you can resort to the unimaginative alternative of having the
program give your decompiled code unimaginative names. Or, as a compromise,
you could have it query the user for names...
------------------------------------------------------------
Recently, I've also developed a manual algorithm to allow me to disassemble
substantial portions of library routines embedded in compiled C code run on
this machine (whose processor is similar to the VAX). The same kind of
disassembler can be written here as well.
In this case, however, (1) and (3) can often be resolved in part. The format
of object files follows the a.out format typical of UNIX, and thus has neatly
separated data and code segments, and an embedded symbol table (for globals).
Problem (1) cannot be resolved completely though: there were often cases where
data was embedded in the code segment itself (for instance in our
implementation of the "as" assembler -- which is where I got the opcodes for
our machine from :) ). Also, (3) cannot be resolved for 'hidden' local
variables that remain anonymous to files on the outside.
Similar issues, I assume, result in trying to disassemble object code output
to, say, the MicroSoft series of Quick compilers, or the Turbo compilers...
------------------------------------------------------------
In the process of doing the above, I also developed a partial algorithm for
'compiling' assembly into C. When you talk about producing a 'decompiler'
you are actually confronting the task of writing a compiler: an unusual one,
though, that takes an unstructured language and creates from it a structured
language.
There is a standard UNIX utility (at least for the 4.3 bsd we're running) that
does something just like this: "struct". This utility takes standard
Fortran-77 programs and generated from it Ratfor code. Ratfor is a
'rationalized' Fortran that includes all the Algol-derived control structured.
Structuring code is not a really substantial issue, if you're talking about
generating loops from jumps. Basically, you have to recognize combinations
of "segments" and "jumps" as being treatable by loops. I've developed a set
of rules while generating structured BASIC from BASIC on a few occasions, or
while generating C from assembly. All these rules will ultimately be
based on equivalences of the form:
while (Exp) Sta <---> x:
if (Exp) {
Sta; goto x;
}
or
*: goto x; <---> *: goto y;
...
x: goto y
or
<CODE SEGMENT> <---> <CODE SEGMENT>
goto x; x:
x:
and so on.
To generate control structures, you're teaching your decompiler an algebra of
jumps and segments. I've often used the last rule in reverse to split off
segments in order to move them around in other parts of the code, for
instance.
There's also the fact that compilers will often produce code in very regular
ways, since they are often written around the syntax of the language. So you
can spot out 'signatures' in the object code (if it was unoptimized) that
represent the output of certain kinds of structures. While loops ALWAYS get
translated this way, functions ALWAYS get translated this way, etc.
However, there is the really substantial issue of generating *data*
structures, especially those dynamic structure involving pointers. For
example, would your decompiler translate a jump table into a static array of
function pointers in C? Would it create structured types if it finds enough
"instances" of it floating around in the unstructured code to justify this
abstraction?
Now you're getting into the issue of guessing what the programmer's intent was
from the code he or she wrote. It's doable, I do it by hand all the time,
but very non-algorithmic. This is AI pure and simple.
I can see the possibility of setting up a neural net which can be trained by
actual examples to 'recognize' data structures that may potentially lie in
the object code. A decompiler with this kind of tool would not just be a
decompiler, it would potentially be a program that created improved versions
of the source code based on its analysis of the object code: it could, for
instance, pick up many data structures that eluded the programmer.
My claim is that you can't really have a decompiler that structures data
without having this extra feature.
--
Send compilers articles to compilers@esegue.segue.boston.ma.us
{ima | spdcc | world}!esegue. Meta-mail to compilers-request@esegue.
harrison@necssd.NEC.COM (Mark Harrison) (09/12/90)
In article <1990Sep9.010032.23235@sco.COM>, seanf@sco.COM (Sean Fagan) writes: > As I've pointed out before, someone once wrote a decompiler for the VAX, > which I've played with. It *knew* how PCC generated code, so, after > "disassembling" (internal use only, generally; it didn't try to decompile > assembly text, in other words), so it would structure the code generated > from that. On a related topic, does anyone know of any research into machine translation of hand-written assembly code into a HLL? -- Mark Harrison harrison@necssd.NEC.COM (214)518-5050 {necntc, cs.utexas.edu}!necssd!harrison [How about all those Autocoder to Cobol translators there were 30 years ago? -John] -- Send compilers articles to compilers@esegue.segue.boston.ma.us {ima | spdcc | world}!esegue. Meta-mail to compilers-request@esegue.
freek@fwi.uva.nl (Freek Wiedijk) (09/12/90)
vu0310@bingvaxu.cc.binghamton.edu.cc.binghamton.edu (R. Kym Horsell) writes: >I have made various attempts at writing assemlber -> HLL translators over >the years, ... > >I ultimately see such tools as pointless, however, in that they only >encourage the maintenance of ``dusty deck'' stuff (in a new prettier form) >-- re-inventing the wheel occasionally comes up with a better wheel! No, no, no, they're VERY useful! No one will be EVER able to complain anymore that a system is sold without source code. Take a look at the complaints about the NeXT operating system. Also, Nosy is my favorite Mac application. Despite its interface. Freek "the Pistol Major" Wiedijk E-mail: freek@fwi.uva.nl -- Send compilers articles to compilers@esegue.segue.boston.ma.us {ima | spdcc | world}!esegue. Meta-mail to compilers-request@esegue.
Jonathan.Bowen@prg.oxford.ac.uk (Jonathan Bowen) (09/13/90)
Speaking of decompilers, I have recently written a simple compiler in Prolog based almost directly on a formal compiling specification. Currently a high-level program can be supplied and object code produced (non-deterministically, so several (possibly an infinite number of) versions may be generated) or a high-level program and some object code may be supplied and checked against one another. Prolog in principle can run backwards, so it may be possible to supply some object code and produce a high-level program or programs as output. The main problem is running the necessary arithmetic backwards (i.e. avoiding the use of "is") and I am currently looking into this. Has anyone else done any similar work or can anyone supply any useful references? I am cross posting this to comp.lang.prolog as well as comp.compilers for wider coverage. -- Jonathan Bowen, Oxford. -- Send compilers articles to compilers@esegue.segue.boston.ma.us {ima | spdcc | world}!esegue. Meta-mail to compilers-request@esegue.
nreadwin@miclon.uucp (Neil Readwin) (09/13/90)
In article <9009092238.AA28341@dynamo.ecn.purdue.edu> hankd@dynamo.ecn.purdue.edu (Hank Dietz) writes: >Basic disassembly is trivial Not for all instruction sets. The VAX has a CASE statement where the number of cases (and therefore the size of the instruction) can be in a register. Disassembling this is *not* trivial. Neil. -- Phone: +44 71 528 8282 E-mail: nreadwin@micrognosis.co.uk -- Send compilers articles to compilers@esegue.segue.boston.ma.us {ima | spdcc | world}!esegue. Meta-mail to compilers-request@esegue.
td@alice.UUCP (Tom Duff) (09/13/90)
vu0310@bingvaxu.cc.binghamton.edu.cc.binghamton.edu (R. Kym Horsell @ SUNY Binghamton, NY) says: >I have made various attempts at writing assemlber -> HLL translators over >the years, mostly inspired by the `beatify' tool that attempts to transform >spagetti FORTRAN into structured FORTRAN. Wouldn't a `beatify' tool transform Profane Fortran into Blessed Fortran? [We could always hope so. -John] -- Send compilers articles to compilers@esegue.segue.boston.ma.us {ima | spdcc | world}!esegue. Meta-mail to compilers-request@esegue.
kym@bingvaxu.cc.binghamton.edu.cc.binghamton.edu (R. Kym Horsell) (09/14/90)
In article <679@culhua.prg.ox.ac.uk> Jonathan Bowen <Jonathan.Bowen@prg.oxford.ac.uk> writes: >produce a high-level program or programs as output. The main problem is >running the necessary arithmetic backwards (i.e. avoiding the use of >"is") and I am currently looking into this. Has anyone else done any >similar work or can anyone supply any useful references? Funny, I didn't _see_ this in comp.lang.prolog... To answer your query, arithmetic _can_ be run backward. One simple, but inefficient, technique is to use ``unary'' arithmetic. (I.e. zero is 0, 1 is suc(0), 2 is suc(suc(0))). Rewriting the is/2, >/2, etc is a pain but it _can_ be done. Quite a lot of interesting work has been done on investigating which Prolog relations can run backward -- look thru the Int Conf on Logic Prog & similar things (they only go a few years so why check out the ref for you)? DHD Warren wrote a nice article in Software Practice and Experience regarding compiler-writing in Prolog -- check that out too (he writes fairly _neat_ code for one thing, and we're talking about readability here). -Kym Horsell -- Send compilers articles to compilers@esegue.segue.boston.ma.us {ima | spdcc | world}!esegue. Meta-mail to compilers-request@esegue.
ch@dce.ie (Charles Bryant) (09/15/90)
In article <3972@bingvaxu.cc.binghamton.edu> vu0310@bingvaxu.cc.binghamton.edu.cc.binghamton.edu (R. Kym Horsell) writes: >There's no _unique_ mapping from HLL into assembler, or from assembler into >assembler for that matter, either! In _principle_ translation between >assembler and an HLL is no harder than the other way 'round. You can't even >say that "assembly language is close to the machine & you can do things in >it that you can't in HLLs" -- after all, assembly and HLL are both capable >of emulating Turning machines, right? Well how would you translate this C function into Pascal. typedef struct list { struct list *next; int item; } list; list *head; insert(list *newelem) { list **p; for (p = &head; *p; p = &(*p)->next) if ( (*p)->item >= newelem->item) break; newelem->next = *p; *p = newelem; } It seems to me that you either need an extra node, or you have to simulate the pointers in an array. If you manage the Pascal, try BASIC (even with no pointer operations at all it is still "equivalent" to C). -- Charles Bryant (ch@dce.ie) -- Send compilers articles to compilers@esegue.segue.boston.ma.us {ima | spdcc | world}!esegue. Meta-mail to compilers-request@esegue.
hankd@dynamo.ecn.purdue.edu (Hank Dietz) (09/15/90)
In article <11326@alice.UUCP> you write: >vu0310@bingvaxu.cc.binghamton.edu.cc.binghamton.edu (R. Kym Horsell @ SUNY Binghamton, NY) says: >>I have made various attempts at writing assemlber -> HLL translators over >>the years, mostly inspired by the `beatify' tool that attempts to transform >>spagetti FORTRAN into structured FORTRAN. > >Wouldn't a `beatify' tool transform Profane Fortran into Blessed Fortran? >[We could always hope so. -John] Nope. The key transformation in fixing spaghetti code is an oldie but goodie called "code straightening." It's usually pretty easy. However, such transformations are generally not very good at dealing with comments. Not only are they notoriously bad at creating meaningful new comments automatically, but they don't even really know what to do with the comments that were in the original source. This last problem is rooted in the fact that most languages do not syntactically group comments with the source code constructs to which they apply (they are usually implemented as a lexical hack -- removed from the source before syntax analysis). Further, how does a beautifier know what the comment refers to -- it would be pretty strange, for example, to retain a comment about multiplying things when the multiply has been strength-reduced out of existence? One might argue that the big problem in programs today isn't unstructured use of GOTO, but unstructured use of comments.... One of my PhD students (Eng-Siong Tan) is currently investigating this problem.... -hankd@ecn.purdue.edu -- Send compilers articles to compilers@esegue.segue.boston.ma.us {ima | spdcc | world}!esegue. Meta-mail to compilers-request@esegue.
hawley@icot32.icot.or.jp (David John Hawley) (09/15/90)
In article <679@culhua.prg.ox.ac.uk> Jonathan Bowen <Jonathan.Bowen@prg.oxford.ac.uk> writes: >.... The main problem is running the necessary arithmetic backwards (i.e. >avoiding the use of "is") and I am currently looking into this. Has anyone >else done any similar work or can anyone supply any useful references? I'm not confident that that is the major problem, but as far as "more declarative" realizations of arithmetic (and other non-logical goodies), check out "constraint logic programming". See the recent pair of articles in the July/90 CACM - an interpreter and compiler are available for the CLP(R) language mentioned there (for academic use). If you are concerned about completeness issues for your "invertible 'is'", maybe you would be interested in our Grobner-base constraint solver, the "elephant gun approach" also mentioned in the above articles ;-) David Hawley, CAL group, ICOT -- Send compilers articles to compilers@esegue.segue.boston.ma.us {ima | spdcc | world}!esegue. Meta-mail to compilers-request@esegue.
kym@bingvaxu.cc.binghamton.edu.cc.binghamton.edu (R. Kym Horsell) (09/15/90)
In article <1990Sep14.181616.26890@dce.ie> ch@dce.ie (Charles Bryant) writes: >Well how would you translate this C function into Pascal. > > typedef struct list { > struct list *next; > int item; > } list; > > list *head; > > insert(list *newelem) > { > list **p; > for (p = &head; *p; p = &(*p)->next) > if ( (*p)->item >= newelem->item) break; > newelem->next = *p; > *p = newelem; > } I'm not sure at what level to aim this so I'll play it straight. I think we can try at least: type list = 0..maxmem; listinfo = record next: list; item: integer; end {listinfo}; var mem: array[0..maxmem] of listinfo; head: list; procedure insert(newelem: list); var p: list; break: boolean; begin p := head; break := false; while (p<>0) and not break do if mem[p].item>=mem[newelem].item then break:=true else p := mem[p].next; mem[newelem].next = p; mem[p].next = newelem; end {insert}; It even duplicates the same bug as the original code! (:-) Complaints about _efficiency_ can be countered by resorting to arguments re code optimization (which has come a long way since even Pascal was designed) and various tricks that architects get up to (i.e. since _all addressing_ on a given machine might be register-relative the indexing in my Pascal code will not actually cost anything cf the C code). In any case efficiency issues were not (originally) in question. We might also translate the code to Fortran as follows: subroutine insert(newelem) common mem(0:maxmem) integer list integer p p = head 10 if(p.eq.0) goto 20 if(mem(p+ITEMOFFSET)>=mem(newelem+ITEMOFFSET)) goto 20 p = mem(p+NEXTOFFSET) goto 10 20 mem(newelem+NEXTOFFSET) = p mem(p+NEXTOFFSET) = newelem return end Again, same bug... You can perform any syntactic sugar to get this into (whatever) version of Basic you like. There may be some transcription errors here & there above, but I don't think anything is essentially _wrong_ with the code, so don't get too picky. The good thing about theoretical results from formal language theory (and thank you Manfred von Thun) is that we _can_ say (all too few) general things about programming and be sure we're right. Of course none of this has _proved_ that things are ``equally hard'' as indicated in my original post -- that _would_ be much harder. -Kym Horsell [I suspect that the original author expected the Pascal version to use Pascal pointers. My Pascal is rusty enough that I don't immediately see what the problem is, other than perhaps the static list head. My experience is that translating from one language to another is 90% easy, but the 10% can be incredibly hard. In about 1970, I took a very vanilla Fortran calendar printing program and ran it though IBM's Fortran to PL/I. Although most of the Fortran was recognizable more or less statement by statement, some of it, particularly the I/O, expanded by orders of magnitude. Most of the blow-up handled cases that I knew wouldn't happen (most notably reading into a literal in Format statement) but it was hard for the translator to know that. This is a long way from decompiling assembler into some language. -John] -- Send compilers articles to compilers@esegue.segue.boston.ma.us {ima | spdcc | world}!esegue. Meta-mail to compilers-request@esegue.
roland@ai.mit.edu (Roland McGrath) (09/17/90)
<3972@bingvaxu.cc.binghamton.edu> <11326@alice.UUCP>
Organization: Hackers Anonymous International, Ltd., Inc. (Applications
welcome)
Keywords: disassemble
In-Reply-To: td@alice.UUCP's message of 13 Sep 90 14:07:39 GMT
In article <11326@alice.UUCP> td@alice.UUCP (Tom Duff) writes:
Wouldn't a `beatify' tool transform Profane Fortran into Blessed Fortran?
[We could always hope so. -John]
Yes, if one beats on Fortran enough, it eventually becomes a blessed pile of
mush, much more beautiful than the original.
--
Roland McGrath
Free Software Foundation, Inc.
roland@ai.mit.edu, uunet!ai.mit.edu!roland
[So how's Gnu Fortran coming? -John]
--
Send compilers articles to compilers@esegue.segue.boston.ma.us
{ima | spdcc | world}!esegue. Meta-mail to compilers-request@esegue.
raulmill@usc.edu (09/17/90)
In article <_5A%GS%@rpi.edu> adamsf@turing.cs.rpi.edu (Frank Adams) writes: In article <12976@june.cs.washington.edu> pardo@cs.washington.edu (David Keppel) writes: >My guess is that decompiling in to a language that is e.g., >saccarine-sweetened assembler (C) is `easy', while decompiling e.g., >in to APL is hard. If we assume that the program is to be decompiled into the language in which it was written, it is in general easier to decompile the less the compiler optimizes the generated code. A second problem is type inference. APL, with a fixed set of data types, is easier in this respect than C. For example, when the code loads a pointer into a register and indexes off of it, what kind of struct is the pointer pointing to? [Frank then goes on to state his opinion that C is pretty good for exact transliteration of machine language.] If I may point out... [1] the first commercial use of APL was to describe the IBM 360 architecture. APL has the ability to concisely describe just about any machine architecture. [2] As far as I know, the language analysis/verification tools available for APL are pretty good [some would say better than those available for any other language, but without first hand knowledge I'm not so sure. I do know that 7 or 8 years ago 3 bugs were found in that 360 description by one of these verifiers.] If you want an exact HLL transliteration of raw machine code, or a translation into an assembler-like language, there is no reason why APL should be harder than any other language (though I'd recommend using J instead, because there is an odd sort of problem getting APL to talk in ascii, and J is better IMHO :) To turn back to the original poster's question, the best disassemblers I have seen often do a lot of interpolation based on system calls whose arguments are known, various compiler conventions and, if you are lucky enough, linking/debugging information left in the code by the developers. As far as I've seen, the worst problem in converting from machine language to other representations is figuring out what to call a specific piece of memory. (code? text? struct? etc.) A lot of this information can be interpolated by logic on the order of 'well, if this instruction is illegal, we know everything back to the last branch isn't instructions.' [It seems to me that [1] is a red herring, the IBM POO describes the 370 in English, but disassembling into English is difficult. On the other hand, decompiling into scalar APL expressions shouldn't be hard. -John] -- Send compilers articles to compilers@esegue.segue.boston.ma.us {ima | spdcc | world}!esegue. Meta-mail to compilers-request@esegue.
ch@dce.ie (Charles Bryant) (09/18/90)
In article <4028@bingvaxu.cc.binghamton.edu> kym@bingvaxu.cc.binghamton.edu.cc.binghamton.edu (R. Kym Horsell) writes: >In article <1990Sep14.181616.26890@dce.ie> ch@dce.ie (Charles Bryant) writes: >>Well how would you translate this C function into Pascal. >>[simple linked list insertion] > >I'm not sure at what level to aim this so I'll play it straight. > >[pascal version using fixed size array] >It even duplicates the same bug as the original code! (:-) That's funny, I can't see any bug in the original. I can see a bug in the Pascal one though: when the list is empty new elements can't be inserted because 'head' is never changed. >Complaints about _efficiency_ can be countered by resorting to arguments re >code optimization ... Quite true. But one important property of the original is that it has no fixed limit. The Pascal version is limited by the space allocated in the array. If the array is big, space is wasted; and any fixed size may turn out to be too small. >[I suspect that the original author expected the Pascal version to use Pascal >pointers. My Pascal is rusty enough that I don't immediately see what the >problem is, other than perhaps the static list head. Any method can be used if the basic properties of the original code are preserved. One of these is that the list has no predetermined maximum size. As far as I can see this means pointers are necessary, but it is not possible to take the address of a variable in Pascal, so the case of an empty list must be treated specially, I'll use the array method here to make it easy to compare. <same declarations> begin if head=0 then begin head := newelem; mem[newelem].next := 0; end else begin p := head; break:= false; while (p<>0) and not break do if mem[p].item >= mem[newelem].item then break:=true; else p := mem[p].next; mem[newelem].next := mem[p].next; mem[p].next := newelem; end; end; >... This is a long way from decompiling assembler into >some language. -John] What I meant was that unless the decompiler produces code in the same language that the code was originally written in, it may have to produce things that bear no relation to the original code (nor the assembler). If the target language is less powerful that the original a decompilation may not be possible. (Well obviously you could write a program in the target language that emulated the machine that the assembly came from, but I wouldn't consider that to be decompilation). -- Charles Bryant (ch@dce.ie) -- Send compilers articles to compilers@esegue.segue.boston.ma.us {ima | spdcc | world}!esegue. Meta-mail to compilers-request@esegue.
ctl8588@rigel.tamu.edu (LAUGHLIN, CHET) (09/19/90)
In article <1990Sep14.181616.26890@dce.ie>, ch@dce.ie (Charles Bryant) writes... >Well how would you translate this C function into Pascal. [ code deleted ] How one 'translates' into Pascal is not the point. The point is if you can take the machine code (of the resulting program) and pull it into Pascal, or Basic for that matter. It may not be as pretty to look at, but that should be more of a limitation inposed by the language structure itself. -Chet Laughlin -- Send compilers articles to compilers@esegue.segue.boston.ma.us {ima | spdcc | world}!esegue. Meta-mail to compilers-request@esegue.
djones@decwrl.dec.com (Dave Jones) (09/19/90)
I just wrote a SPARC object-code dumper a couple of days ago. I am as much interested in the symbol-table and relocation-information as I am in the instructions, but I never even worried about data intertwined with the instructions. (It helps that all SPARC instructions begin on 32 bit boundaries, of course.) I simply have one flag that prints the text segment as a hex and ascii dump, another that prints it as machine instructions. Can't run the assembler over the output, but then we're not supposed to do that are we? It's plenty good for my purposes, which is to inspect machine code in order to infer undocumented conventions, and to verify that compiled code is correct. -- Send compilers articles to compilers@esegue.segue.boston.ma.us {ima | spdcc | world}!esegue. Meta-mail to compilers-request@esegue.
markh@csd4.csd.uwm.edu (Mark William Hopkins) (09/19/90)
In article <1990Sep14.181616.26890@dce.ie> ch@dce.ie (Charles Bryant) writes: >Well how would you translate this C function into Pascal. > > typedef struct list { > struct list *next; > int item; > } list; > > list *head; > > insert(list *newelem) > { > list **p; > for (p = &head; *p; p = &(*p)->next) > if ( (*p)->item >= newelem->item) break; > newelem->next = *p; > *p = newelem; > } ... >It seems to me that you either need an extra node, or you have to simulate the >pointers in an array. If you manage the Pascal, try BASIC (even with no >pointer operations at all it is still "equivalent" to C). By first translating it into Proper C... [A] Change the 'list = struct list' into 'list = struct list *', and make the original program readable :). typedef struct List { struct List *Next; int Item; } *List; List Head; Insert(List NewElem) { List *P; for (P = &Head; *P != NULL; P = &(*P)->Next) if ((*P)->Item >= NewElem->Item) break; NewElem->Next = *P; *P = NewElem; } [B] Add in a dummy variable, dP, to represent *P. Insert(List NewElem) { List *P, dP; for (P = &Head, dP = *P; dP != NULL; P = &dP->Next, dP = *P) if (dP->Item >= NewElem->Item) break; NewElem->Next = dP; dP = *P = NewElem; } [C] Use semantic equalities to make P go away... Insert(List NewElem) { List *P, dP; for (P = &Head, dP = Head; dP != NULL; P = &dP->Next, dP = dP->Next) if (dP->Item >= NewElem->Item) break; NewElem->Next = dP; *P = NewElem; dP = NewElem; } [D] Make P go away, and make believe that dP was P all along... Insert(List NewElem) { List P; for (P = Head; P != NULL; P = P->Next) if (P->Item >= NewElem->Item) break; NewElem->Next = P; P = NewElem; } Proper C is a beautiful subset of C, in which all non-system identifiers are syntatically required to begin in capitals, all "&"'s are removed, and all recursively defined structures are required to have pointers to that structure with the same name as the structure... See? No new node. You closed your eyes in steps [B] and [C], didn't you? :) -- Send compilers articles to compilers@esegue.segue.boston.ma.us {ima | spdcc | world}!esegue. Meta-mail to compilers-request@esegue.
td@alice.UUCP (Tom Duff) (09/21/90)
markh@csd4.csd.uwm.edu (Mark William Hopkins @ University of Wisconsin-Milwaukee) rewrites the following code (due to ch@dce.ie (Charles Bryant)): typedef struct list { struct list *next; int item; } list; list *head; insert(list *newelem) { list **p; for (p = &head; *p; p = &(*p)->next) if ( (*p)->item >= newelem->item) break; newelem->next = *p; *p = newelem; } as typedef struct List { struct List *Next; int Item; } *List; List Head; Insert(List NewElem) { List P; for (P = Head; P != NULL; P = P->Next) if (P->Item >= NewElem->Item) break; NewElem->Next = P; P = NewElem; } claiming that the result is trivially translatable into Pascal, (indeed, it is written in the Pascal subset of C), and that the two are semantically equivalent. Unfortunately, the two are not equivalent. In the first, the assignment *p=newelem; can change the value of `head' (say when head==0). In the rewritten version, no assignment to Head is possible, since it is never mentioned on the left of an assignment, and no pointer to it is ever developed. Of course, the code is useless without the potential side-effect on head. The problem is that the rewrite has decreased p's level of indirection by 1. So p (in the rewritten version) has the same rvalue as *p (in the original), but a different lvalue. Unfortunately, on the left side of an assignment it is the lvalue that matters. Mr. Bryant's whole point was that occasionally the assignment *p = newelem; can change head. This effect is not directly achievable in Pascal, because it is not possible to develop a pointer to data not allocated on the heap. (Well, maybe call-by-reference is a partial exception, but the implicit pointer there is not a first class citizen.) Many Pascal programmers (apparently including Mr. Hopkins) consider the semantic notions of lvalue and rvalue to be needless complication in the description of C, since there is little need for them in describing Pascal, which contains no lvalue-to-rvalue conversion operator (unary & in C.) -- Send compilers articles to compilers@esegue.segue.boston.ma.us {ima | spdcc | world}!esegue. Meta-mail to compilers-request@esegue.