phorgan@cup.portal.com (09/10/90)
The problem with disassembling arbitrary object code is that data bears a disturbing resemblance to code at times:) Even when running through code disassembling starting at known code, it's not always possible to determine when code stops and data begins. Then it's not possible to tell when object code starts up again. This is easy to see using most dissassemblers; when you hit the data, the unknown op-code indicator appears (typically ???), then random sequences of ??? and op-codes, then when the code starts, often the disassembler has just guessed wrong and includes the first byte or two of the 'real' op-code in a previous 'false' one. It might take a while to 're-synchronize' and start showing 'real' op-codes. The only time this isn't a problem would be with fixed single length op-codes with an alignment requirement. It is possible to reduce the problem with an algorithm that looks ahead starting byte-by-byte and sees which one generates a most successful string of instructions. From a 'good starting byte', you could disassemble in reverse to find a previous starting location. Even this fails in many cases of self modifying code or in cases where strange things are done like overlapped code. (A trick often done on code to be put in machines with limited ROM space). If you're familiar with coding practices for the processor though, heuristic methods can be applied with some success. Patrick Horgan phorgan@cup.portal.com -- Send compilers articles to compilers@esegue.segue.boston.ma.us {ima | spdcc | world}!esegue. Meta-mail to compilers-request@esegue.
meissner@osf.org (09/12/90)
| The problem with disassembling arbitrary object code is that data bears a | disturbing resemblance to code at times:) Even when running through code | disassembling starting at known code, it's not always possible to | determine when code stops and data begins. Then it's not possible to tell | when object code starts up again. I discovered that the MIPS assembler has a 'solution' to this problem. It doesn't prohibit you from putting constants in the text section, but if you do, the line numbers for debugging are messed up. I found this when I had GCC putting the switch label array in .text. The MIPS people I talked to about this said it was a feature, and not a bug.... Thus on a MIPS system, you don't have to worry about data being in the text section.... (and of course each instruction is exactly 32-bits, so you don't have to worry about starting in the middle of an instruction, like you do on CISC machines. | This is easy to see using most | dissassemblers; when you hit the data, the unknown op-code indicator | appears (typically ???), then random sequences of ??? and op-codes, then | when the code starts, often the disassembler has just guessed wrong and | includes the first byte or two of the 'real' op-code in a previous 'false' | one. It might take a while to 're-synchronize' and start showing 'real' | op-codes. The only time this isn't a problem would be with fixed single | length op-codes with an alignment requirement. It is possible to reduce | the problem with an algorithm that looks ahead starting byte-by-byte and | sees which one generates a most successful string of instructions. From a | 'good starting byte', you could disassemble in reverse to find a previous | starting location. You could always do a complete scan of the text, using a bitmap or some such to identify every place that has an instruction. It would have to be a backtracking scan, so that you can mark both the fall through case, and conditional branch target cases. IMHO though, it would be too slow and consume too much memory to be useful. | Even this fails in many cases of self modifying code | or in cases where strange things are done like overlapped code. ... Fortunately for this case, self modifying code seems to be mostly on the decline, and only used where needed (or because you have a macho hacker type that thinks self modifying code is neat). -- Michael Meissner email: meissner@osf.org phone: 617-621-8861 Open Software Foundation, 11 Cambridge Center, Cambridge, MA, 02142 -- Send compilers articles to compilers@esegue.segue.boston.ma.us {ima | spdcc | world}!esegue. Meta-mail to compilers-request@esegue.
pl@news.funet.fi.tut.fi (Lehtinen Pertti) (09/14/90)
>From article <9009121606.AA26236@curley.osf.org>, by meissner@osf.org: > | The problem with disassembling arbitrary object code is that data bears a > | disturbing resemblance to code at times:) ... > I discovered that the MIPS assembler has a 'solution' to this problem. > It doesn't prohibit you from putting constants in the text section, > but if you do, the line numbers for debugging are messed up. I found > this when I had GCC putting the switch label array in .text. The MIPS > people I talked to about this said it was a feature, and not a bug.... I once wrote a disassembler for Z80. It started on given address and put every jump or call destination into heap. When it hit rts, or other stopping intruction, it picked next address from heap and continued. So I was able to find all possible control flows forward from given place. This was nice feature when looking at interrupt services and such things. And it never dropped from sync, because it followed same paths as execution. Of course indirect jumps could cause problems. On Sparc we hit one problem on disassembly (not actually but...) Memory addressing on sparc is always register relative (no 32 absolute addressing) so it is sometimes hard to find out, which symbol is accessed, because base register initialization could be quite far away. -- pl@tut.fi Pertti Lehtinen Tampere University of Technology Software Systems Laboratory -- Send compilers articles to compilers@esegue.segue.boston.ma.us {ima | spdcc | world}!esegue. Meta-mail to compilers-request@esegue.
Chuck.Phillips@FtCollins.NCR.COM (Chuck.Phillips) (09/14/90)
>>>>> On 9 Sep 90 17:32:55 GMT, phorgan@cup.portal.com said: > The problem with disassembling arbitrary object code is that data > bears a disturbing resemblance to code at times:) Even when > running through code disassembling starting at known code, it's > not always possible to determine when code stops and data begins. > It is possible to reduce the problem with an algorithm that looks > ahead starting byte-by-byte and sees which one generates a most > successful string of instructions. Suggestion: Use a program that starts with the first executable instruction, marking and decoding as it goes every instruction except for conditional branches. Upon encountering a conditional branch, follow _both_ branches. If implemented recursively, the stop conditions are, 1) a branch to an already marked instruction and b) an end-of-program. Even this can fail if there is a conditional branch to garbage, which never happens in practice due to the underlying algorithm. > Even this fails in many cases of self modifying code... Ouch! Now you're stuck. > If you're familiar with coding practices for the processor though, > heuristic methods can be applied with some success. ...and no guarantees. Cheers, Chuck Phillips MS440 NCR Microelectronics Chuck.Phillips%FtCollins.NCR.com 2001 Danfield Ct. Ft. Collins, CO. 80525 uunet!ncrlnk!ncr-mpd!bach!chuckp -- Send compilers articles to compilers@esegue.segue.boston.ma.us {ima | spdcc | world}!esegue. Meta-mail to compilers-request@esegue.
mason@uunet.UU.NET (Dave Mason) (09/16/90)
In article <9009091032.1.139@cup.portal.com> phorgan@cup.portal.com writes: >The problem with disassembling arbitrary object code is that data bears a >disturbing resemblance to code at times:) Even when running through code >disassembling starting at known code, it's not always possible to >determine when code stops and data begins. As others have said, you must walk the code graph to determine all possible code sections. What I've done several times is to have an auxiliary file that contains hints to the disassembler. This contains known addresses, useful names for same, and the class of object, e.g: ffd0 CONS_RCV word ffd2 CONS_STS byte d000 START code d123 DISPTCH jump-table 16 d456 GETCHAR code d678 HELP_MSG ascii So you run the disassembler against the input and this file, then examine the output, discover more about the program, fill in more labels, and iterate. This can produce a pretty good version of a program. Particularly useful for ROMs. -- Send compilers articles to compilers@esegue.segue.boston.ma.us {ima | spdcc | world}!esegue. Meta-mail to compilers-request@esegue.
albaugh@dms.UUCP (Mike Albaugh) (09/18/90)
>From article <CHUCK.PHILLIPS.90Sep14170558@halley.FtCollins.NCR.COM>, by Chuck.Phillips@FtCollins.NCR.COM (Chuck.Phillips): > > Suggestion: > > Use a program that starts with the first executable instruction, marking > and decoding as it goes every instruction except for conditional branches. > Upon encountering a conditional branch, follow _both_ branches. If > implemented recursively, the stop conditions are, 1) a branch to an already > marked instruction and b) an end-of-program. Even _I_ have re-invented this particular wheel, although I use a more-or-less-human-readable "state file" to convey "marks", so I can guide it "between runs" as to new avenues to explore (e.g. mark as "a new entry" all the targets of what _I_ can see is a jump-table). > Even this can fail if there is a conditional branch to garbage, which > never happens in practice due to the underlying algorithm. Never say never. If one is disassembling 6502 code, this happens a LOT. For those unfamiliar with the 6502, it has _no_ unconditional branch, hence programmers often use a conditional branch where the condition is "known to be true." The simple cases are along the lines LDA #1 BNE foo ; ALWAYS TAKEN but there seems to be some sort of macho pride taken in establishing the condition via a _long_ chain of computation. Even in the source, this can be confusing. When reading raw machine code the problem of detecting entry into the chain of computation (without even labels to go by) that invalidates the assumptions can get, uh, very interesting. Oh, yeah, the short reach of these branches means that they often get used as "stepping stones", so even the one shown above _might_ fall through. Now it's my turn to say "but that wouldn't happen in in practice" :-) And while we're at it, do typical disassemblers just handle the documented instructions, or do they "simulate" the more commonly used "illegal instructions" that hackers seem to love? I would ordinarily assume that since this thread started with "de-compiling", we are _not_ interested in code that "a compiler would never generate", but... Mike | Mike Albaugh (albaugh@dms.UUCP || {...decwrl!pyramid!}weitek!dms!albaugh) | Atari Games Corp (Arcade Games, no relation to the makers of the ST) | 675 Sycamore Dr. Milpitas, CA 95035 voice: (408)434-1709 -- Send compilers articles to compilers@esegue.segue.boston.ma.us {ima | spdcc | world}!esegue. Meta-mail to compilers-request@esegue.
aglew@dwarfs.crhc.uiuc.edu (Andy Glew) (09/19/90)
Many people have mentioned following branches, etc., to guide disassembly. Static is obvious. You can also do it dynamically, using the techniques used in generating profiling feedback for a compiler. More branches can be followed - eg. out of a jump table. Moreover, several simplifying assumptions may be useful: (1) code is never executed "out of phase" - ie. if a code sequence begins with the 4 byte instruction at address A, there is no code sequence beginning at address A+1. (2) Code and data may be emulsified, but they aren't miscible - ie. addresses that are executed are not data; similarly, addresses that are fetched as data are not code (may be some boundary effects here). Hackers may break these assumptions, but if all you are trying to do is run binaries from machine A on machine B, they may be enough for you. Conceptually, given a mixed code/data address space A, you can create multiple code data spaces c1(A), c2(A), etc., for every possible placement of address boundaries. Or, rather, you can start of your disassembly in the following manner: code_0x0000f561: /* 4 byte instruction */ ADD ... goto code_0x0000f565; code_0x0000f562: /* 2 byte instruction */ MOV ... goto code_0x0000f564; code_0x0000f563: /* 2 byte instruction */ ... goto code_0x0000f561; code_0x0000f564: /* 1 byte instruction */ ... goto code_0x0000f565; code_0x0000f565: /* lots of possible paths converge here */ goto code_0x0000f565; data_0x0000f561: .... data_0x0000f562: .... data_0x0000f563: .... data_0x0000f564: .... data_0x0000f565: .... Static branch following eliminates some code and data entries; dynamic profiling eliminates a few more. Not all of the ambiguities may be resolved, but the amount of replication will quickly fall to tolerable levels. The same approach might be used for data representation - eg. have separate spaces for data addressed as bytes, words, etc - except that data is much more frequently accessed by different packet sizes. It's almost easier to have your disassembly/reassembly support library convert "load word at data_0x0000f562" into the required sequence of loads and shifts (to handle byte ordering) than it is to attempt to order the data "naturally" for the target machine. This produces a run-time penalty for the translated binary, but hopefully the new machine is that much faster anyway, and all you are trying to do is gain access to the wonderful world of IBM PC/VAX/IBM 360 software? (Or, for startup UNIX companies, maybe you're just trying to get MIPS/SUN/Ultrix binaries running on your new hardware). In any case, the hardest thing about binary translation is handling the stuff that isn't being disassembled - namely OS calls. -- Send compilers articles to compilers@esegue.segue.boston.ma.us {ima | spdcc | world}!esegue. Meta-mail to compilers-request@esegue.
chris@cs.UMD.EDU (Chris Torek) (09/20/90)
Andy Glew suggests some simplifying assumptions: > (1) code is never executed "out of phase" - ie. if a code sequence >begins with the 4 byte instruction at address A, there is no code >sequence beginning at address A+1. This one, particularly in hand-coded assembly for weak architectures (that is, those with limited address space or `missing' instructions such as unconditional branches), will prove false often enough to be a problem. A common trick in 8080/Z80 code was the mysterious sequence: 0x1000: ld e,6 0x1002: ld bc,0x071e 0x1005: ld bc,0x081e 0x1008: <code> The trick is that a branch to 0x1003 turns out to be 0x1003: ld e,7 0x1005: ld bc,0x081e 0x1008: <code> and a branch to 0x1006 turns out to be 0x1006: ld e,8 0x1008: <code> and the code uses register `e' to do whatever it does. Although I have never seen a compiler use this trick, it would not be too difficult to arrange (if a register is dead, it can be used as the target of a `useless' instruction that exists merely to embed another instruction in the immediate data field). Peephole optimizers do code merging all the time; this is merely a (scary) variant on that. -- Send compilers articles to compilers@esegue.segue.boston.ma.us {ima | spdcc | world}!esegue. Meta-mail to compilers-request@esegue.