[comp.arch] Object-code to object-code translation

mash@mips.COM (John Mashey) (11/09/88)

I mentioned something about object-to-object translation at MIPS and Ardent
in a recent posting, and several people sent mail wanting to know more.
Most of this is due to Earl Killian here.

The sections are:
1. MOXIE (what we used in 1985, before hardware)
2. PIXIE & friends (what we use now)
3. Some notes on architectural issues

1. MOXIE - MIPS On a vaX Instruction Emulation (or something like that)
This was used, starting in mid-1985, to debug R2000 software well in
advance of having hardware, using just VAX-11/780s.   Probably the
most crucial thing it accomplished was to let the language group
bootstrap the compilers and related tools much earlier than could
otherwise have been accomplished.  Here were the steps:

	a) MIPS compilers are written in a mixture of C and Pascal.
	b) Compile them on a VAX, using cc and pastel, generating
		vax-binaries.
	c) Using the vax-binaries, recompile the compiler sources to generate
		mips-binaries.
	d) Use moxie to convert the mips-binaries to equivalent
		vax-binaries [this is the object-object conversion step].
	e) Recompile the compiler sources again, using the output of d)
		to generate mips-binaries of the compilers.
	f) Compare c) to e).  If they're different, you have a problem,
	and you should fix it.  If they're the same, but your programs
	break, also fix it.
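
The compare in step f) is essentially cmp(1) run on the two generations of mips-binaries. A minimal C sketch of that check (not any actual MIPS tool, just an illustration of the fixed-point test):

```c
#include <stdio.h>

/* Step f) boils down to a byte-for-byte compare of the two
   generations of mips-binaries: return the offset of the first
   differing byte, or -1 if the files match (or can't be opened --
   a real version would report open failures separately). */
long first_diff(const char *a, const char *b)
{
    FILE *fa = fopen(a, "rb");
    FILE *fb = fopen(b, "rb");
    long off = -1, pos = 0;
    if (fa && fb) {
        int ca, cb;
        do {
            ca = getc(fa);              /* EOF on one side but not */
            cb = getc(fb);              /* the other also differs  */
            if (ca != cb) { off = pos; break; }
            pos++;
        } while (ca != EOF);
    }
    if (fa) fclose(fa);
    if (fb) fclose(fb);
    return off;
}
```

If the two binaries are identical, the compilers have reached a fixed point; if not, the offset tells you where to start digging.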

Of course, you can use this technique on programs other than the compilers,
like the linker, debugger, benchmarks, etc.  The version used most
often was the little-endian one, for speed, although there was a way to
simulate big-endian code on the vax if desired.

Moxie could only handle user-level programs, not UNIX kernels,
since it converted the code down to the system-call level,
but didn't try faking MMUs, caches, etc.  For the latter, we used a
more traditional instruction interpreter (sable, by Jim Moore),
for kernel and prom debugging.  However, the source debugger knew how
to deal with moxified code, and talk to sable, so one could get good
debugging in either case.

The crucial plus for moxie was its speed: it could run substantial
real programs at about 50%-60% of normal speed, i.e., it turned a
780 into a 750, rather than a 780/100.  Straightforward instruction-by-
instruction interpretation (including simulating fully-associative TLBs...)
can easily be 100X or 200X slower than the real machine.
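
The gap comes from paying a fetch, a decode dispatch, and operand bookkeeping on every simulated instruction. A toy sketch of that interpreter style (the opcodes and encoding here are invented, not the real R2000 decode):

```c
#include <stdint.h>

/* Toy fetch-decode-execute loop: every simulated instruction pays
   for a fetch, a decode switch, and register-file indexing -- the
   per-instruction overhead that makes interpretation 100X+ slower
   than translated code.  Opcodes are invented for illustration;
   the pc indexes instructions, not bytes. */
enum { OP_ADD, OP_LW, OP_BEQ, OP_HALT };

typedef struct { uint8_t op, rd, rs, rt; } insn;

static void interp(const insn *prog, int32_t reg[32], const int32_t *mem)
{
    uint32_t pc = 0;
    for (;;) {
        insn i = prog[pc++];                    /* fetch */
        switch (i.op) {                         /* decode */
        case OP_ADD: reg[i.rd] = reg[i.rs] + reg[i.rt];     break;
        case OP_LW:  reg[i.rd] = mem[reg[i.rs]];            break;
        case OP_BEQ: if (reg[i.rs] == reg[i.rt]) pc = i.rd; break;
        case OP_HALT: return;
        }
    }
}
```

Moxie avoided all of this by translating each mips instruction into native vax code once, up front, instead of re-deciding what it means on every execution.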

2. PIXIE (whose acronym, if any, I've forgotten), PROF, PIXSTATS
Pixie is the MIPS-based moxie equivalent.  We use it for instrumenting
programs so they:
	a) Can be easily profiled (prof)
	b) Can be analyzed for architectural features (pixstats)
	(i.e., look at cycle counts that are independent of memory system)
	c) Can be address-traced, for cache+memory simulations.

For example, to profile something, say dc:
	compile it normally, with no profiling options or libraries.
	"pixify" the linked executable:
		pixie dc
		This generates a new executable, dc.pixie, with extra
		instructions inserted to count basic block executions,
		and a dc.Addrs file that keeps track of where things moved.
	run the pixified executable:
		dc.pixie
	that generates dc.Counts, which counts the basic blocks.
	This would run 1.5-2X slower than normal.

	to profile it, run prof:
		prof -pixie dc
	it tells you:
		total instruction cycles per function
		average cycles per function call, per function
		total instruction cycles per source statement
		average cycles per source statement
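
Conceptually, what pixie inserts is just a counter bump at the head of every basic block; prof then weights each block's count by its cycle cost. A hypothetical sketch with the counters planted by hand (pixie actually rewrites the binary and never touches source):

```c
#include <stdint.h>

/* Hand-planted basic-block counters, mimicking what pixie inserts
   into the executable.  Each block gets a slot in bb_count[];
   prof-style cycle totals come from multiplying these counts by
   each block's (statically known) cycle cost.  The function and
   block numbering are hypothetical. */
enum { BB_ENTRY, BB_LOOP, BB_EXIT, NBLOCKS };
static uint64_t bb_count[NBLOCKS];

int sum_to(int n)
{
    bb_count[BB_ENTRY]++;             /* block 0: function entry */
    int s = 0;
    for (int i = 1; i <= n; i++) {
        bb_count[BB_LOOP]++;          /* block 1: loop body */
        s += i;
    }
    bb_count[BB_EXIT]++;              /* block 2: return */
    return s;
}
```

Because only one counter increment is added per block (not per instruction), the 1.5-2X slowdown quoted above is plausible.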

	to do architectural analysis, use pixstats:
		pixstats dc
	this gives you gory detail about % loads, % stores, % nops,
		% stalls due to mult/divide/FP; average number of
		registers saved/restored per function call;
		register usage; dynamic & static instruction frequencies,
		etc, etc, etc, etc.
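
The pixstats-style numbers fall out of the same block counts: if you know, statically, how many loads (etc.) each basic block contains, the dynamic mix is a weighted sum. A sketch with invented per-block numbers (not pixstats' actual data structures):

```c
#include <stdint.h>

/* Dynamic instruction mix, pixstats-style: static per-block
   instruction counts weighted by dynamic execution counts.
   The struct layout and all numbers are invented for illustration. */
typedef struct { uint32_t insns, loads, stores; } bb_static;

double pct_loads(const bb_static *bb, const uint64_t *count, int nblocks)
{
    uint64_t total = 0, loads = 0;
    for (int i = 0; i < nblocks; i++) {
        total += (uint64_t)bb[i].insns * count[i];  /* dynamic insns */
        loads += (uint64_t)bb[i].loads * count[i];  /* dynamic loads */
    }
    return total ? 100.0 * (double)loads / (double)total : 0.0;
}
```

The same weighting gives % stores, % nops, register-save counts per call, and so on, all from one cheap profiling run.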

Pixie can do other things, like:
	adding address trace code to be fed into a cache/memory system
	simulator.
	counting branch taken/not-taken statistics
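
On the consuming end, an address trace can be run through something as simple as a direct-mapped tag array. A minimal cache-simulator sketch (the line size and cache size are arbitrary picks, not any machine we built):

```c
#include <stdint.h>

/* Trivial direct-mapped cache simulator fed one address at a time,
   i.e. the consumer side of pixie's address-trace mode.
   64 lines of 16 bytes (1KB total) -- sizes chosen arbitrarily. */
#define NLINES     64
#define LINE_SHIFT 4

typedef struct {
    uint32_t tag[NLINES];
    int      valid[NLINES];
    uint64_t misses;
} cache;

void cache_access(cache *c, uint32_t addr)
{
    uint32_t line = addr >> LINE_SHIFT;   /* strip byte-in-line bits */
    uint32_t idx  = line % NLINES;        /* direct-mapped index */
    if (!c->valid[idx] || c->tag[idx] != line) {
        c->misses++;                      /* miss: fill the line */
        c->valid[idx] = 1;
        c->tag[idx] = line;
    }
}
```

Swapping in different sizes, associativities, or write policies is how you turn one trace into a family of memory-system design answers.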

With various tweaks, the commands can be used to analyze proposed
extensions, or changes in pipeline designs, or even get quick estimates
of the performance implications of features in OTHER people's architectures.
(Earl once tweaked this to mimic the 88K, and ran Spice and Doduc thru
the process.)  We still use an instruction simulator for kernel analysis,
but we use these tools for all of the user-level code.

Again, the crucial win for this technique is speed: you can afford to
run benchmarks of 100Million cycles or more thru the whole process
whenever you feel like it, and you can answer architectural questions by
trying variations and seeing what the answer is, rather than by
intuition.  [We look at quite a few design features that individually
might make only 1% difference, but that add up.]  This is especially
crucial in the forthcoming rounds of chip designs.

Some of these tools are used in Dave Patterson's architecture classes:
I hear there's a variant that converts MIPS code to SPARC code;
maybe someone will comment, especially if realistic benchmarks
are now running.

Finally, maybe someone from Ardent will comment.  They've used the
technique for various purposes.  One that I do recall was that while they
had some odd bug in the FP hardware, rather than having to jerk the compilers
around, they kept generating the standard code, then post-processed it
to work around the temporarily-broken hardware.

3. Architectural issues.

Some architectures are easier to translate from than others.
Condition-codes (or equivalent) turn out to be one of the most awkward
features in an architecture:
	a) Instructions that set cc's often set multiple bits,
	and it's sometimes nontrivial to figure out how much of
	that you must actually simulate, and how the bits will be
	used, if you want to do it efficiently.
	(You essentially need to look ahead until the cc is dead,
	and only create the bits that you really need.  Unless you
	can be sure that there are no branches into the middle of
	code, this can be hard.)
	Clearly, the closer the source and target machines, the better
	off you are; even subtle differences can cause simple
	implementations to be slow.
	b) Even for A-to-A translation (i.e., for pixie-style profiling),
	condition-codes can be a problem, unless they can be bypassed
	(as in SPARC, for example, which is OK here).  Specifically,
	you need instrumentation sequences that either avoid
	setting the cc, or else save it, do their work, and
	restore it.
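
One standard translator trick for problem a) is lazy flag evaluation: record the operands of the last flag-setting instruction and materialize only the bit a later branch actually tests. A sketch (the two-flag subset and the names are simplifications; this is the general technique, not necessarily what moxie did):

```c
#include <stdint.h>

/* Lazy condition codes: instead of computing every cc bit at each
   flag-setting instruction, remember the last compare/subtract's
   operands and derive a bit only when a branch consumes it.
   Only Z (zero) and N (negative) are modeled; real cc sets are
   larger and include carry/overflow. */
typedef struct { int32_t a, b; } cc_state;

/* "set flags from a - b" just records the operands */
void set_cc_sub(cc_state *cc, int32_t a, int32_t b)
{
    cc->a = a;
    cc->b = b;
}

/* each bit is computed on demand, and only if some branch needs it */
int cc_zero(const cc_state *cc)     { return cc->a - cc->b == 0; }
int cc_negative(const cc_state *cc) { return cc->a - cc->b < 0; }
```

If the next flag-setting instruction arrives before any branch looks at the cc (the common case), the bits are never computed at all, which is exactly the "look ahead until the cc is dead" optimization done statically instead at run time.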

R2000s have no condition codes, and the Set Less instructions set
a register to 0 or 1 (unlike the 88K compare, which sets a bunch of
bits at once according to different conditions), so it turns out to
be pretty easy to convert R2000 code into other things, although
the reverse is not necessarily true.  [We do run Insignia's SoftPC,
which emulates MS/DOS, but it uses different techniques,
which they can talk about if they feel like.]

SUMMARY
Object-to-object transformations have a bunch of uses, some of which
we sort of stumbled into, after getting started from the necessity
of trying to get software debugged before hardware, and being a little
startup that couldn't afford huge machines for simulation.
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086