[comp.compilers] Disassembly

phorgan@cup.portal.com (09/10/90)

The problem with disassembling arbitrary object code is that data bears a
disturbing resemblance to code at times:) Even when running through code
disassembling starting at known code, it's not always possible to
determine when code stops and data begins.  Then it's not possible to tell
when object code starts up again.  This is easy to see using most
dissassemblers; when you hit the data, the unknown op-code indicator
appears (typically ???), then random sequences of ??? and op-codes, then
when the code starts, often the disassembler has just guessed wrong and
includes the first byte or two of the 'real' op-code in a previous 'false'
one.  It might take a while to 're-synchronize' and start showing 'real'
op-codes.  The only time this isn't a problem would be with fixed single
length op-codes with an alignment requirement.  It is possible to reduce
the problem with an algorithm that looks ahead starting byte-by-byte and
sees which one generates a most successful string of instructions.  From a
'good starting byte', you could disassemble in reverse to find a previous
starting location.  Even this fails in many cases of self modifying code
or in cases where strange things are done like overlapped code.  (A trick
often done on code to be put in machines with limited ROM space).  If
you're familiar with coding practices for the processor though, heuristic
methods can be applied with some success.

Patrick Horgan                   phorgan@cup.portal.com
-- 
Send compilers articles to compilers@esegue.segue.boston.ma.us
{ima | spdcc | world}!esegue.  Meta-mail to compilers-request@esegue.

meissner@osf.org (09/12/90)

| The problem with disassembling arbitrary object code is that data bears a
| disturbing resemblance to code at times:) Even when running through code
| disassembling starting at known code, it's not always possible to
| determine when code stops and data begins.  Then it's not possible to tell
| when object code starts up again.

I discovered that the MIPS assembler has a 'solution' to this problem.
It doesn't prohibit you from putting constants in the text section,
but if you do, the line numbers for debugging are messed up.  I found
this when I had GCC putting the switch label array in .text.  The MIPS
people I talked to about this said it was a feature, and not a bug....

Thus on a MIPS system, you don't have to worry about data being in the
text section.... (and of course each instruction is exactly 32-bits,
so you don't have to worry about starting in the middle of an
instruction, like you do on CISC machines.

| 				     This is easy to see using most
| dissassemblers; when you hit the data, the unknown op-code indicator
| appears (typically ???), then random sequences of ??? and op-codes, then
| when the code starts, often the disassembler has just guessed wrong and
| includes the first byte or two of the 'real' op-code in a previous 'false'
| one.  It might take a while to 're-synchronize' and start showing 'real'
| op-codes.  The only time this isn't a problem would be with fixed single
| length op-codes with an alignment requirement.  It is possible to reduce
| the problem with an algorithm that looks ahead starting byte-by-byte and
| sees which one generates a most successful string of instructions.  From a
| 'good starting byte', you could disassemble in reverse to find a previous
| starting location.

You could always do a complete scan of the text, using a bitmap or
some such to identify every place that has an instruction.  It would
have to be a backtracking scan, so that you can mark both the fall
through case, and conditional branch target cases.  IMHO though, it
would be too slow and consume too much memory to be useful.

|		      Even this fails in many cases of self modifying code
| or in cases where strange things are done like overlapped code.  ...

Fortunately for this case, self modifying code seems to be mostly on
the decline, and only used where needed (or because you have a macho
hacker type that thinks self modifying code is neat).

--
Michael Meissner	email: meissner@osf.org		phone: 617-621-8861
Open Software Foundation, 11 Cambridge Center, Cambridge, MA, 02142
-- 
Send compilers articles to compilers@esegue.segue.boston.ma.us
{ima | spdcc | world}!esegue.  Meta-mail to compilers-request@esegue.

pl@news.funet.fi.tut.fi (Lehtinen Pertti) (09/14/90)

>From article <9009121606.AA26236@curley.osf.org>, by meissner@osf.org:
> | The problem with disassembling arbitrary object code is that data bears a
> | disturbing resemblance to code at times:) ...

> I discovered that the MIPS assembler has a 'solution' to this problem.
> It doesn't prohibit you from putting constants in the text section,
> but if you do, the line numbers for debugging are messed up.  I found
> this when I had GCC putting the switch label array in .text.  The MIPS
> people I talked to about this said it was a feature, and not a bug....

I once wrote a disassembler for Z80. It started on given address and put
every jump or call destination into heap.  When it hit rts, or other
stopping intruction, it picked next address from heap and continued.  So I
was able to find all possible control flows forward from given place.

This was nice feature when looking at interrupt services and such things.

And it never dropped from sync, because it followed same paths as
execution.

Of course indirect jumps could cause problems.

On Sparc we hit one problem on disassembly (not actually but...)

Memory addressing on sparc is always register relative (no 32 absolute
addressing) so it is sometimes hard to find out, which symbol is accessed,
because base register initialization could be quite far away.

--
pl@tut.fi
Pertti Lehtinen
Tampere University of Technology
Software Systems Laboratory
-- 
Send compilers articles to compilers@esegue.segue.boston.ma.us
{ima | spdcc | world}!esegue.  Meta-mail to compilers-request@esegue.

Chuck.Phillips@FtCollins.NCR.COM (Chuck.Phillips) (09/14/90)

>>>>> On 9 Sep 90 17:32:55 GMT, phorgan@cup.portal.com said:

> The problem with disassembling arbitrary object code is that data
> bears a disturbing resemblance to code at times:) Even when
> running through code disassembling starting at known code, it's
> not always possible to determine when code stops and data begins.

> It is possible to reduce the problem with an algorithm that looks
> ahead starting byte-by-byte and sees which one generates a most
> successful string of instructions.

Suggestion:

Use a program that starts with the first executable instruction, marking
and decoding as it goes every instruction except for conditional branches.
Upon encountering a conditional branch, follow _both_ branches.  If
implemented recursively, the stop conditions are, 1) a branch to an already
marked instruction and b) an end-of-program.  Even this can fail if there
is a conditional branch to garbage, which never happens in practice due to
the underlying algorithm.

> Even this fails in many cases of self modifying code...

Ouch!  Now you're stuck.  

> If you're familiar with coding practices for the processor though,
> heuristic methods can be applied with some success.

...and no guarantees.

	Cheers,

Chuck Phillips  MS440
NCR Microelectronics 			Chuck.Phillips%FtCollins.NCR.com
2001 Danfield Ct.
Ft. Collins, CO.  80525   		uunet!ncrlnk!ncr-mpd!bach!chuckp
-- 
Send compilers articles to compilers@esegue.segue.boston.ma.us
{ima | spdcc | world}!esegue.  Meta-mail to compilers-request@esegue.

mason@uunet.UU.NET (Dave Mason) (09/16/90)

In article <9009091032.1.139@cup.portal.com> phorgan@cup.portal.com writes:
>The problem with disassembling arbitrary object code is that data bears a
>disturbing resemblance to code at times:) Even when running through code
>disassembling starting at known code, it's not always possible to
>determine when code stops and data begins.

As others have said, you must walk the code graph to determine all
possible code sections.

What I've done several times is to have an auxiliary file that contains hints
to the disassembler.  This contains known addresses, useful names for same,
and the class of object, e.g:
	ffd0	CONS_RCV	word
	ffd2	CONS_STS	byte
	d000	START		code
	d123	DISPTCH		jump-table 16
	d456	GETCHAR		code
	d678	HELP_MSG	ascii

So you run the disassembler against the input and this file, then examine the
output, discover more about the program, fill in more labels, and iterate.

This can produce a pretty good version of a program.  Particularly
useful for ROMs.
-- 
Send compilers articles to compilers@esegue.segue.boston.ma.us
{ima | spdcc | world}!esegue.  Meta-mail to compilers-request@esegue.

albaugh@dms.UUCP (Mike Albaugh) (09/18/90)

>From article <CHUCK.PHILLIPS.90Sep14170558@halley.FtCollins.NCR.COM>, by Chuck.Phillips@FtCollins.NCR.COM (Chuck.Phillips):
> 
> Suggestion:
> 
> Use a program that starts with the first executable instruction, marking
> and decoding as it goes every instruction except for conditional branches.
> Upon encountering a conditional branch, follow _both_ branches.  If
> implemented recursively, the stop conditions are, 1) a branch to an already
> marked instruction and b) an end-of-program.

	Even _I_ have re-invented this particular wheel, although I use
a more-or-less-human-readable "state file" to convey "marks", so I can
guide it "between runs" as to new avenues to explore (e.g. mark as "a new
entry" all the targets of what _I_ can see is a jump-table).

>  Even this can fail if there is a conditional branch to garbage, which
> never happens in practice due to the underlying algorithm.

	Never say never. If one is disassembling 6502 code, this happens
a LOT. For those unfamiliar with the 6502, it has _no_ unconditional
branch, hence programmers often use a conditional branch where the
condition is "known to be true." The simple cases are along the lines

	LDA	#1
	BNE	foo		; ALWAYS TAKEN

but there seems to be some sort of macho pride taken in establishing the
condition via a _long_ chain of computation. Even in the source, this
can be confusing. When reading raw machine code the problem of detecting
entry into the chain of computation (without even labels to go by) that
invalidates the assumptions can get, uh, very interesting. Oh, yeah, the
short reach of these branches means that they often get used as "stepping
stones", so even the one shown above _might_ fall through. Now it's my turn
to say "but that wouldn't happen in in practice" :-)

	And while we're at it, do typical disassemblers just handle the
documented instructions, or do they "simulate" the more commonly used
"illegal instructions" that hackers seem to love? I would ordinarily
assume that since this thread started with "de-compiling", we are _not_
interested in code that "a compiler would never generate", but...

				Mike

| Mike Albaugh (albaugh@dms.UUCP || {...decwrl!pyramid!}weitek!dms!albaugh)
| Atari Games Corp (Arcade Games, no relation to the makers of the ST)
| 675 Sycamore Dr. Milpitas, CA 95035		voice: (408)434-1709
-- 
Send compilers articles to compilers@esegue.segue.boston.ma.us
{ima | spdcc | world}!esegue.  Meta-mail to compilers-request@esegue.

aglew@dwarfs.crhc.uiuc.edu (Andy Glew) (09/19/90)

Many people have mentioned following branches, etc., to guide
disassembly.
    Static is obvious.  You can also do it dynamically, using the
techniques used in generating profiling feedback for a compiler.  More
branches can be followed - eg. out of a jump table.  Moreover, several
simplifying assumptions may be useful: 
    (1) code is never executed "out of phase" - ie. if a code sequence
begins with the 4 byte instruction at address A, there is no code
sequence beginning at address A+1.
    (2) Code and data may be emulsified, but they aren't miscible -
ie.  addresses that are executed are not data; similarly, addresses
that are fetched as data are not code (may be some boundary effects
here).
    Hackers may break these assumptions, but if all you are trying to
do is run binaries from machine A on machine B, they may be enough for
you.

Conceptually, given a mixed code/data address space A, you can create
multiple code data spaces c1(A), c2(A), etc., for every possible
placement of address boundaries.  Or, rather, you can start of your
disassembly in the following manner:

    code_0x0000f561:	/* 4 byte instruction */  ADD ...
    	    	    	goto code_0x0000f565;

    code_0x0000f562:	/* 2 byte instruction */  MOV ...
    	    	    	goto code_0x0000f564;

    code_0x0000f563:    /* 2 byte instruction */ ...
    	    	    	goto code_0x0000f561;

    code_0x0000f564:    /* 1 byte instruction */ ...
    	    	    	goto code_0x0000f565;

    code_0x0000f565:    /* lots of possible paths converge here */
    	    	    	goto code_0x0000f565;

    data_0x0000f561:    ....
    data_0x0000f562:    ....
    data_0x0000f563:    ....
    data_0x0000f564:    ....
    data_0x0000f565:    ....

Static branch following eliminates some code and data entries; dynamic
profiling eliminates a few more.  Not all of the ambiguities may be
resolved, but the amount of replication will quickly fall to tolerable
levels.

The same approach might be used for data representation - eg. have
separate spaces for data addressed as bytes, words, etc - except that
data is much more frequently accessed by different packet sizes.  It's
almost easier to have your disassembly/reassembly support library
convert "load word at data_0x0000f562" into the required sequence of
loads and shifts (to handle byte ordering) than it is to attempt to
order the data "naturally" for the target machine. This produces a
run-time penalty for the translated binary, but hopefully the new
machine is that much faster anyway, and all you are trying to do is
gain access to the wonderful world of IBM PC/VAX/IBM 360 software?
(Or, for startup UNIX companies, maybe you're just trying to get
MIPS/SUN/Ultrix binaries running on your new hardware).

In any case, the hardest thing about binary translation is handling
the stuff that isn't being disassembled - namely OS calls.
-- 
Send compilers articles to compilers@esegue.segue.boston.ma.us
{ima | spdcc | world}!esegue.  Meta-mail to compilers-request@esegue.

chris@cs.UMD.EDU (Chris Torek) (09/20/90)

Andy Glew suggests some simplifying assumptions:

>    (1) code is never executed "out of phase" - ie. if a code sequence
>begins with the 4 byte instruction at address A, there is no code
>sequence beginning at address A+1.

This one, particularly in hand-coded assembly for weak architectures
(that is, those with limited address space or `missing' instructions
such as unconditional branches), will prove false often enough to be
a problem.  A common trick in 8080/Z80 code was the mysterious sequence:

	0x1000:	ld	e,6
	0x1002:	ld	bc,0x071e
	0x1005:	ld	bc,0x081e
	0x1008:	<code>

The trick is that a branch to 0x1003 turns out to be

	0x1003:	ld	e,7
	0x1005:	ld	bc,0x081e
	0x1008:	<code>

and a branch to 0x1006 turns out to be

	0x1006:	ld	e,8
	0x1008:	<code>

and the code uses register `e' to do whatever it does.

Although I have never seen a compiler use this trick, it would not be
too difficult to arrange (if a register is dead, it can be used as the
target of a `useless' instruction that exists merely to embed another
instruction in the immediate data field).  Peephole optimizers do code
merging all the time; this is merely a (scary) variant on that.
-- 
Send compilers articles to compilers@esegue.segue.boston.ma.us
{ima | spdcc | world}!esegue.  Meta-mail to compilers-request@esegue.