[comp.compilers] How can a disassembler tell code from data?

dwex@mtgzfs3.att.com (David E Wexelblat) (05/09/91)

I am working on fixing a rather broken disassembler for the 680x0 series
(which is irrelevant to my general problem, but may help find a specific
answer).  My problem is trying to disassemble code compiled with GCC,
which puts constant character strings into the text segment.  The program
correctly figures out that this stuff is not executable code by tracing
all of the paths through the code.  But it cannot tell the difference
between word and byte data.

I think this is a general problem with disassembling any non-split-I/D
program.  I was wondering if there are any techniques for determining that
a given piece of data should be interpreted as a character string as
opposed to word data.  I would like a general-case answer, but the
following constraints can be applied, if necessary:

	1) 680x0 processor
	2) C compiler 
		- AT&T UNIX-PC v3.51 (which doesn't generally do this)
		- gcc
	3) COFF format object files
		- stripped
		- with symbols
		- with relocation
		- with debugging

I had thought about using a 'strings'-type algorithm, but this is prone to
generating garbage, so I'm looking for something better.
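
One way to tighten the 'strings' idea is sketched below in Python. It is an
illustration under stated assumptions, not a method from this thread. It
assumes the disassembler already has a set of data addresses that the traced
code (or the relocation entries) refers to, and tests only those addresses as
candidate string starts instead of scanning every byte. The names
`c_string_at`, `classify_referenced`, `base` and `referenced` are
hypothetical.

```python
import string

# Bytes that commonly appear in C string literals: printable ASCII plus
# tab, newline and carriage return (vertical tab and form feed excluded).
PRINTABLE = frozenset(string.printable.encode("ascii")) - {0x0B, 0x0C}


def c_string_at(data, offset, min_len=1):
    """Return the NUL-terminated printable run at offset, or None."""
    end = data.find(b"\x00", offset)
    if end < 0 or end - offset < min_len:
        return None
    run = data[offset:end]
    if all(b in PRINTABLE for b in run):
        return run
    return None


def classify_referenced(data, base, referenced):
    """Label each referenced address as 'string' or 'words'.

    data       -- bytes of a region already known not to be code
    base       -- address of data[0] (assumed known from the object file)
    referenced -- addresses the code is known to refer to (assumption:
                  collected during path tracing or from relocations)
    """
    labels = {}
    for addr in sorted(referenced):
        off = addr - base
        if 0 <= off < len(data):
            text = c_string_at(data, off)
            labels[addr] = ("string", text) if text is not None else ("words", None)
    return labels
```

A plain byte scan would glue the printable opcode bytes 0x4E 0x75 onto a
string that follows them; starting only at referenced addresses avoids that
particular kind of garbage, though it can still mislabel word data that
happens to look like text.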
--
David Wexelblat             | dwex@mtgzz.att.com
AT&T Bell Laboratories      | ...!att!mtgzz!dwex
200 Laurel Ave - 4B-421     |
Middletown, NJ  07748       | (201) 957-5871    
[In the absence of extensive symbol table info, this sounds like a tough
problem. -John]
-- 
Send compilers articles to compilers@iecc.cambridge.ma.us or
{ima | spdcc | world}!iecc!compilers.  Meta-mail to compilers-request.

rfg@ncd.com (Ron Guilmette) (05/12/91)

In article <91-05-072@iecc.cambridge.ma.us> dwex@mtgzfs3.att.com (David E Wexelblat) writes:
>...  My problem is trying to disassemble code compiled with GCC,
>which puts constant character strings into the text segment...
>... I would like a general-case answer, but the
>following constraints can be applied, if necessary:
>
>...
>	3) COFF format object files

The general case answer is to stop using COFF and use ELF instead.

In ELF, constant data can go into the .rodata or .rodata1 *sections*.  The
linker will normally combine all input .rodata sections (for all of the .o
files given to it as inputs) into one hunk of output .rodata stuff and it
will normally attach that to the output .text *segment*.  However, you can
use the MAPFILE option to override this behavior and get all of the
.rodata stuff placed into its own unique (LOADable) output segment.  You
could then just ignore that output segment when doing your disassembly.

Actually, you do not even need to use the MAPFILE option (necessarily).
As long as you do not strip the executable, it will include both a segment
header table *and* a section header table.  In the section header table
there will be one entry for the collected sum of all of the input .rodata
sections.  This header will indicate where (within the executable) the
start and end of all of the .rodata stuff is.  You could then just ignore
anything in that range.
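
The section-header lookup described above can be sketched roughly as follows.
This is a minimal Python illustration, assuming a 32-bit ELF file that still
has its section header table. The function name `rodata_ranges` is
hypothetical, and the field offsets follow the standard Elf32_Ehdr and
Elf32_Shdr layouts.

```python
import struct


def rodata_ranges(image):
    """Return {name: (file_offset, size, address)} for .rodata* sections.

    image -- the bytes of an ELF32 file that has a section header table
    """
    if image[:4] != b"\x7fELF" or image[4] != 1:
        raise ValueError("not an ELF32 file")
    end = "<" if image[5] == 1 else ">"  # EI_DATA: 1 = LSB, 2 = MSB

    # Elf32_Ehdr: e_shoff is at byte 32; e_shentsize, e_shnum and
    # e_shstrndx are three consecutive 16-bit fields starting at byte 46.
    (shoff,) = struct.unpack_from(end + "I", image, 32)
    shentsize, shnum, shstrndx = struct.unpack_from(end + "HHH", image, 46)
    if shoff == 0 or shnum == 0:
        raise ValueError("no section header table")

    def section(index):
        # Elf32_Shdr is ten 32-bit words; keep name, addr, offset, size.
        f = struct.unpack_from(end + "10I", image, shoff + index * shentsize)
        return f[0], f[3], f[4], f[5]

    # Section names live in the string table named by e_shstrndx.
    _, _, str_off, str_size = section(shstrndx)
    strtab = image[str_off:str_off + str_size]

    ranges = {}
    for i in range(shnum):
        name_off, addr, offset, size = section(i)
        name = strtab[name_off:strtab.index(b"\x00", name_off)].decode("ascii")
        if name.startswith(".rodata"):  # matches .rodata and .rodata1
            ranges[name] = (offset, size, addr)
    return ranges
```

A disassembler could then skip, or dump as data, any bytes whose file offset
falls inside one of the returned ranges.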

You may be able to duplicate one or both of these techniques with COFF,
but I'm not sure.

-- 
// Ron ("Loose Cannon") Guilmette
// Internet: rfg@ncd.com      uucp: ...uunet!lupine!rfg