mash@mips.COM (John Mashey) (11/09/88)
I mentioned something about object-to-object translation at MIPS and Ardent in a recent posting, and several people sent mail wanting to know more. Most of this is due to Earl Killian here. The sections are:

1. MOXIE (what we used in 1985, before hardware)
2. PIXIE & friends (what we use now)
3. Some notes on architectural issues

1. MOXIE - MIPS On a vaX Instruction Emulation (or something like that)

This was used, starting in mid-1985, to debug R2000 software well in advance of having hardware, using just VAX-11/780s. Probably the most crucial thing it accomplished was to let the language group bootstrap the compilers and related tools much earlier than could otherwise have been accomplished. Here were the steps:

a) MIPS compilers are written in a mixture of C and Pascal.
b) Compile them on a VAX, using cc and pastel, generating vax-binaries.
c) Using the vax-binaries, recompile the compiler sources to generate mips-binaries.
d) Use moxie to convert the mips-binaries to equivalent vax-binaries [this is the object-object conversion step].
e) Recompile the compiler sources again, using the output of d) to generate mips-binaries of the compilers.
f) Compare c) to e). If they're different, you have a problem, and you should fix it. If they're the same, but your programs break, also fix it.

Of course, you can use this technique on programs other than the compilers, like the linker, debugger, benchmarks, etc.

The version used most was the little-endian one, for speed, although there was a way to simulate big-endian code on the vax if desired. Moxie could only handle user-level programs, not UNIX kernels, since it converted the code down to the system-call level, but didn't try faking MMUs, caches, etc. For the latter, we used a more traditional instruction interpreter (sable, by Jim Moore) for kernel and prom debugging. However, the source debugger knew how to deal with moxified code, and could talk to sable, so one could get good debugging in either case.
The crucial plus for moxie was its speed: it could run substantial real programs at about 50%-60% of normal speed, i.e., it turned a 780 into a 750, rather than a 780/100. Straightforward instruction-by-instruction interpretation (including simulating fully-associative TLBs...) can easily be 100X or 200X slower than the real machine.

2. PIXIE (whose acronym, if any, I've forgotten), PROF, PIXSTATS

Pixie is the MIPS-based moxie equivalent. We use it to instrument programs so they:

a) Can be easily profiled (prof)
b) Can be analyzed for architectural features (pixstats) (i.e., look at cycle counts that are independent of the memory system)
c) Can be address-traced, for cache+memory simulations.

For example, to profile something, say dc: compile it normally, with no profiling options or libraries, then "pixify" the linked executable:
	pixie dc
This generates a new executable, dc.pixie, with extra instructions inserted to count basic-block executions, and a dc.Addrs file that keeps track of where things moved.

Run the pixified executable:
	dc.pixie
This generates dc.Counts, which counts the basic blocks, and runs 1.5-2X slower than normal.

To profile it, run prof:
	prof -pixie dc
It tells you:
	total instruction cycles per function
	average cycles per function call, per function
	total instruction cycles per source statement
	average cycles per source statement

To do architectural analysis, use pixstats:
	pixstats dc
This gives you gory detail about % loads, % stores, % nops, % stalls due to mult/divide/FP; average number of registers saved/restored per function call; register usage; dynamic & static instruction frequencies, etc, etc, etc, etc.

Pixie can do other things, like adding address-trace code to be fed into a cache/memory-system simulator.
It can also count branch taken/not-taken statistics.

With various tweaks, these commands can be used to analyze proposed extensions, or changes in pipeline designs, or even to get quick estimates of the performance implications of features in OTHER people's architectures. (Earl once tweaked this to mimic the 88K, and ran Spice and Doduc thru the process.) We still use an instruction simulator for kernel analysis, but we use these tools for all of the user-level code.

Again, the crucial win for this technique is speed: you can afford to run benchmarks of 100 million cycles or more thru the whole process whenever you feel like it, and you can answer architectural questions by trying variations and seeing what the answer is, rather than by intuition. [We look at quite a few design features that individually might make only 1% difference, but that add up.] This is especially crucial in the forthcoming rounds of chip designs.

Some of these tools are used in Dave Patterson's architecture classes: I hear there's a variant that converts MIPS code to SPARC code; maybe someone will comment, especially if realistic benchmarks are now running.

Finally, maybe someone from Ardent will comment. They've used the technique for various purposes. One that I do recall was that while they had some odd bug in the FP hardware, rather than having to jerk the compilers around, they kept generating the standard code, then post-processed it to work around the temporarily-broken hardware.

3. Architectural issues.

Some architectures are easier to translate from than others. Condition codes (or equivalent) turn out to be one of the most awkward features in an architecture:

a) Instructions that set cc's often set multiple bits, and it's sometimes nontrivial to figure out how much of them you must simulate, and how they'll be used, in order to do it efficiently. (You essentially need to look ahead until the cc is dead, and only create the bits that you really need.
Unless you can be sure that there are no branches into the middle of the code, this can be hard.) Clearly, the closer the source and target machines, the better off you are; even subtle differences can cause simple implementations to be slow.

b) Even for A-to-A translation (i.e., for pixie-style profiling), condition codes can be a problem, unless they can be bypassed (as in SPARC, for example, which is OK here). Specifically, you need instrumentation sequences that either avoid setting the cc, or else save it, do their work, and restore it.

R2000s have no condition codes, and the Set Less instructions set a register to 0 or 1 (unlike the 88K compare, which sets a bunch of bits at once according to different conditions), so it turns out to be pretty easy to convert R2000 code into other things, although the reverse is not necessarily true. [We do run Insignia's SoftPC, which emulates MS/DOS, but it uses different techniques, which they can talk about if they feel like.]

SUMMARY

Object-to-object transformations have a bunch of uses, some of which we sort of stumbled into, after getting started from the necessity of trying to get software debugged before hardware, and being a little startup that couldn't afford huge machines for simulation.
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086