srwmpnm@windy.dsir.govt.nz (03/09/91)
Ok folks, here are 8 methods for doing z80 emulation on a 68000, in software.
(Well, 8 ways to decode a z80 instruction and get to the right emulation
routine, anyway.)  The trade-offs are speed, space and cleanliness.  They all
fall short of "compiling and optimising", but most of these methods will speed
up most existing emulators.  As you might expect, the largest and dirtiest
code is usually the fastest (and least portable).  The same methods should
work for emulation of the 6502, PDP-11 and any other processor with a 16-bit
address space.

In all methods, I assume there is a 64kb block of memory representing the
z80's address space, allocated by AllocMem (say), and pointed to by "z80ram".

-------------------------------------------------------------------------------
Method 1:  The "standard" method:

I call this method "standard" because it's used in both of the CP/M z80
emulators I know about.  The general idea is to decode the current
instruction and jump to the appropriate emulation routine via a vector table.
That is, like a CASE statement with 256 selections.  The code is clean and
re-entrant.

; Setup
        move.l  z80ram,a2           ; load pseudo-pc
        lea.l   optabl(pc),a1       ; a1 always points to optabl
        lea.l   mloop(pc),a3        ; a3 always points to mloop

; Main loop (decode) starts here.  Execute appropriate subroutine.
mloop:  moveq   #0,d0               ; 4
        move.b  (a2)+,d0            ; 8  Grab the next opcode and inc pc.
        asl     #2,d0               ;10  D0 high word is still zero!
        move.l  0(a1,d0.w),a0       ;18  Get address of routine from table.
        jmp     (a0)                ; 8  Do the subroutine.
                                    ;48  total cycles to decode
        even
optabl: dc.l    nop00,lxib,staxb,inxb,inrb,dcrb,mvib,rlc
        dc.l    ...

Each z80 instruction emulation routine ends with:

        jmp     (a3)

-------------------------------------------------------------------------------
Method 2:  The "position-independent" method:

This is slightly quicker, the executable is more than 1500 bytes smaller, and
you get another register to play with in the emulator (a1 in this case).  I
currently use this method (or close to it) in my Spectrum emulator.  The code
is clean and re-entrant.

        move.l  z80ram,a2           ; load pseudo-pc
        lea.l   mloop(pc),a3        ; a3 always points to mloop

mloop:  moveq   #0,d0               ; 4  clear opcode word
        move.b  (a2)+,d0            ; 8  get opcode byte
        add.w   d0,d0               ; 4  2 bytes per entry
        move.w  optabl(pc,d0.w),d0  ;14  get offset of routine
        jmp     optabl(pc,d0.w)     ;14  do instruction
                                    ;44  total to decode
        even
optabl: dc.w    nop00-optabl,lxib-optabl,staxb-optabl,inxb-optabl
        dc.w    inrb-optabl,dcrb-optabl,mvib-optabl,rlc-optabl
        dc.w    ...

Each instruction emulation routine ends with:

        jmp     (a3)

-------------------------------------------------------------------------------
Method 3:  The "decode-at-end-of-instruction" method:

(There are really 2 methods described here.)  Take either method 1 or method
2.  Instead of ending each emulation routine with "jmp (a3)", end each one
with a complete copy of the code from mloop down to the indirect jmp.  There
is no longer a main loop, because each instruction jumps directly to the next
one.  This method is slightly faster, takes maybe twice as much code, is
clean, and is re-entrant.  It also saves yet another reserved register, in
this case a3.  (Personally, I find that a z80 emulator needs as many free
registers as you can get your fingers on.)
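For example, taking method 1 as the base and assuming (purely for
illustration) that d4 holds the z80 B register, the "inr b" routine might be
shaped like this under method 3 --- the last five instructions are simply
method 1's mloop copied onto the end:

inrb:   addq.b  #1,d4               ; the emulation itself
        ...                         ; (update the z80 flags as required)
        moveq   #0,d0               ; then decode the next instruction
        move.b  (a2)+,d0            ;   in-line, exactly as in method 1
        asl     #2,d0
        move.l  0(a1,d0.w),a0
        jmp     (a0)

(With method 2 as the base, note that the 8-bit displacement in
"optabl(pc,d0.w)" only reaches about 128 bytes, so each copy would first need
a "lea.l optabl(pc),aN" into a scratch address register and then 0(aN,d0.w)
references instead.)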
-------------------------------------------------------------------------------
Method 4:  The "threaded jsr's" method:

Warning:  This method uses self-modifying, non-re-entrant code, and therefore
is not recommended.  This code is hazardous to your cache!  (No flames please
--- read on.)

Introduce a 390kb contiguous block of code (called the thread) which looks
like this:

thread: jsr     patch               ; 0
        jsr     patch               ; 1
        ...
        jsr     patch               ; 65535
        jmp     thread

That is, there is a jsr instruction for each byte in the z80's address space.
This is in addition to z80ram.  To start the emulator, you transfer control
to "thread".

What the "patch" routine does is to replace the current "jsr patch" with
"jsr this_routine", where this_routine is the emulation routine for the
corresponding opcode in z80ram.  Then patch jmps to this_routine to execute
the instruction and to return to the next jsr in the thread.  (A rough sketch
of one possible patch routine follows method 5 below.)  After a while, patch
will no longer be called (except by z80 self-modifying code), and every jsr
made will be to emulate a z80 opcode directly.  Whenever a z80 instruction
writes to RAM, it patches the corresponding "jsr this_routine" back to
"jsr patch".  As a variation, it could patch "jsr this_routine" with
"jsr new_routine", but that would probably be slower in general.

Advantage:  It would be faster than methods 1 to 3 --- I think --- especially
in the Spectrum emulator, which has to do a lot of work with every write to
RAM to check for ROM and video RAM anyway.  The main reason for the extra
speed is that it no longer has to decode the opcode on every instruction.
There are the extra overheads of call and return though, and extra work to do
on every RAM write.

Disadvantages:
1:  The code breaks C='s self-modifying code law.  To run on Amigas with
    caches, it would have to either disable the caches or update them
    manually after every patch.  The code is extremely dirty, not re-entrant,
    and definitely not recommended.
2:  You need 390k of contiguous memory (plus another 64k somewhere else, plus
    whatever else you need for video).

Other characteristics:  Code would run slowly the first time round the loop,
then speed up.

--------------------------------------------------------------------------
Method 5:  The "replicated code" method:

Warning:  This also uses self-modifying, non-re-entrant code and is therefore
not recommended.

The thread consists of 65536 blocks of code, each long enough to emulate the
trickiest z80 instruction.  Initially it contains 65536 copies of patch.
(You will need A LOT of contiguous memory.)  What patch does is to actually
copy the code for the opcode over itself, then transfer control to the
beginning of itself.  (Tricky, but it can be done.)  Every emulation routine
finishes with a "bra.s next_instr" so they are all really the same length.
That saves the call and return overhead.  If an emulation routine is too
long, then just use a jmp to somewhere.

In practice, you would probably start with:

        jsr     patch
        bra.s   next_instr

in every slot, rather than a complete copy of patch.  Z80 RAM writes would
copy the above code back to the corresponding slot, if necessary, rather than
copying the whole patch routine.

Short of "compiling and optimising", this is the fastest method I can think
of, but it is incredibly space-wasting, self-modifying, extremely dirty, and
definitely not recommended.
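As promised, here is a very rough sketch of what method 4's patch routine
could look like.  Everything here is illustrative rather than tested: it
assumes the thread is built from 6-byte absolute "jsr" instructions, that a
routine-address table like method 1's optabl exists, and that d0, d1, a0 and
a1 are free for scratch (a real emulator would have to work around its
reserved registers):

patch:  move.l  (sp),a0             ; the return address is the jsr AFTER the
        subq.l  #6,a0               ;   one that called us; back up to it
        move.l  a0,d0
        sub.l   #thread,d0          ; byte offset into the thread
        divu    #6,d0               ; 6 bytes per jsr, so d0.w = z80 address
                                    ;   (remainder is 0, so d0.l = address)
        move.l  z80ram,a1
        moveq   #0,d1
        move.b  0(a1,d0.l),d1       ; fetch the opcode from z80ram
        asl.w   #2,d1
        lea.l   optabl(pc),a1
        move.l  0(a1,d1.w),d1       ; look up its emulation routine
        move.l  d1,2(a0)            ; rewrite the target of the "jsr patch"
        move.l  d1,a1               ; (any caches must be flushed about here)
        jmp     (a1)                ; run the instruction this time round;
                                    ;   its rts resumes the thread as usual

In methods 6..8, patch keeps this general shape but stores the routine
address into the vector-table entry instead of rewriting the jsr.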
--------------------------------------------------------------------------
Method 6:  The "threaded vector table" method:

Ok, now to fix the self-modifying code problem.  Take method 4 (threaded
jsr's), but use a 262kb vector table in a private data segment, instead of a
thread in the code segment:

vectors: dc.l   patch               ; 0
        dc.l    patch               ; 1
        ...
        dc.l    patch               ; 65535
        dc.l    jmp_thread

The main instruction loop looks like:

        lea.l   vectors,a0
        lea.l   mloop(pc),a2

mloop:  move.l  (a0)+,a1            ;12 cycles
        jmp     (a1)                ; 8 cycles

and every instruction finishes with "jmp (a2)".  A0 is acting as a
"pseudo-pc" into the vector table.  Of course patch performs the same
functions as before (except it is no longer self-modifying, it just patches a
vector).  The vector table still needs to be updated by every write to Z80
RAM.  The code is re-entrant provided each task has a separate copy of the
vector table.

--------------------------------------------------------------------------
Method 7:  The "position-independent threaded vector table" method:

Same as method 6, except that now the private data segment is:

thread: dc.w    patch-base          ; 0
        dc.w    patch-base          ; 1
        ...
        dc.w    patch-base          ; 65535
        dc.w    jmp_thread-base

and the main loop is:

        lea.l   thread,a0
        lea.l   mloop(pc),a1

mloop:  move.w  (a0)+,d0            ; 8 cycles
        jmp     base(pc,d0.w)       ;14 cycles

base:
patch:  ...
op00:   ...
op01:   ...
jmp_thread:
        ...

Now it is position-independent, it needs only 128kb of contiguous memory, and
the executable is 1500 bytes smaller, but it is slightly slower (only by 2
cycles per z80 instruction though).  The code is re-entrant provided each
task has a separate copy of the vector table.

--------------------------------------------------------------------------
Method 8:  The "decode-at-end-of-instruction threaded vector table" method:

Same as method 6 except that every opcode emulation routine finishes with:

        move.l  (a0)+,a1
        jmp     (a1)

instead of "jmp (a2)".  Now isn't that faster?  And it saves a2 for more
important things.

Unfortunately you can't do exactly the same thing to method 7 unless you can
write a complete z80 emulator in 256 bytes 8-) .  But you could take method 7
and end each emulation routine with:

        move.w  (a0)+,d0
        lea.l   base(pc),a1
        jmp     0(a1,d0.w)

instead.  The code is re-entrant provided each task has a separate copy of
the vector table.
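To make the RAM-write overhead of methods 6..8 a little more concrete, every
write to z80 RAM must refresh the vector as well as the byte itself.  A rough
sketch of such a write handler follows; everything here is illustrative:
d0 = the z80 address with its high word clear, d1 = the byte being written,
d2 is free scratch, a4 -> z80ram and a5 -> vectors:

wr_byte:
        move.b  d1,0(a4,d0.l)       ; the ordinary write into the 64k image
        move.l  d0,d2
        lsl.l   #2,d2               ; 4 bytes per vector entry
        move.l  #patch,0(a5,d2.l)   ; force a re-decode the next time this
        rts                         ;   byte is executed as an opcode

(Under method 7 the entry is a word offset, "patch-base", at 2 bytes per
entry; and in the Spectrum emulator this is also the natural place for the
ROM and video-RAM checks mentioned under method 4.)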
--------------------------------------------------------------------------

Personally I'm considering using one of the methods 6, 7 or 8 in the next
version of the Spectrum emulator (probably method 8).  (That is, if I ever
get enough spare time without more interesting things to do.)  I'll probably
make the source public domain.  That will use more Amiga RAM, but should go
faster (I hope).  Any guesses as to which method will be the fastest, and
still fit comfortably in a 512k machine?

Unfortunately I don't think any of the methods (except the first 3) are
suitable for an 8088 emulator, because of the huge memory requirements.  I'm
interested in any ideas anyone might have along these lines.  The discussion
of "compiling and optimising" is very interesting, but I don't see how the
details would work.  In particular, how do you cope with self-modifying code,
code loaders, overlays etc?

Peter McGavin.  (srwmpnm@wnv.dsir.govt.nz)

Disclaimer:  I haven't tested any of the above ideas (except 1 and 2).  If
you see any bugs, point them out.

rjc@wookumz.ai.mit.edu (Ray Cromwell) (03/12/91)
One method that might help to speed up your emulator is to recode the
Spectrum ROM into 68000 calls to AmigaDOS.  I don't know if the Spectrum is
very I/O bound.  For C64 emulators this doesn't provide much improvement,
since most C64 programs, even productivity software, bypass the ROMs.  But
for CP/M this is a great improvement.  Charlie Gibbs's CP/M emulator almost
runs at full speed!  I'm serious.  I have a C128 with CP/M, and my CP/M
emulator usually beats my 128 at unsqueezing/un-libraring programs.  If IBM
programs are I/O bound (they don't hit the hardware), an emulator could go
very fast.  Unfortunately, this means writing MS-DOS and the BIOS in 68k
code.  I have also heard that a lot of IBM programs poll the keyboard and
bang on the serial port hardware directly.  This could be a problem.
srwmpnm@windy.dsir.govt.nz (03/12/91)
Here are 4 more methods for doing z80 instruction decoding on a 68000.  These
methods are all based on fixed-size instruction emulation routines.  There
are some general comments on handling multiple-byte opcodes, writing to
hardware registers, and flag handling at the end of this post.

These 4 methods are faster than methods 1..3 (see previous post), they do not
have the RAM-write overheads of methods 4..8, and they do not require tables
of opcode routine address offsets.  Some of these methods might be superior
to all the previous methods.

In all these methods, each z80 instruction emulation routine is fixed at 256
bytes in length.  (See the earlier post by Ian Farquhar.)  The routines are
coded in the sequence op_80, op_81, ... op_ff, op_00, op_01, ... op_7f, so
there is exactly 64k of code.  If a routine is shorter than 256 bytes, the
extra space is wasted.  If a routine is longer than 256 bytes then it will
need a jsr to somewhere.

-------------------------------------------------------------------------------
Method 9:  The "self-modifying fixed-size routine" method:

Warning:  Self-modifying code follows.  This method is almost the same as the
one in Ian Farquhar's earlier c.s.a.e post.

setup:  lea.l   op_00(pc),a1        ; a1 always points to op_00
        move.l  z80ram,a2           ; load pseudo-pc

mloop:  move.b  (a2)+,1$-op_00+2(a1) ;16  patch $ff in jmp instruction, inc pc
1$:     jmp     $ff00(a1)            ;10  Jump to routine.
                                     ;26  total cycles

Every instruction emulation routine ends with a copy of mloop, rather than a
jump back to mloop.  The move.b patches the high byte of the displacement in
the jmp instruction, so the jump goes to the right routine.  (The
displacement word follows the jmp opcode word, hence the +2; if your
assembler lays it out differently, check the generated code.)  The code is
extremely fast (maybe the fastest yet), but it does not work on Amigas with
memory caches.

I thought of making it even faster by permanently setting a4 to the address
of the byte to patch, and then using:

mloop:  move.b  (a2)+,(a4)          ;12  Patch $ff in jmp instruction, inc pc
        jmp     $ff00(a1)           ;10  Jump to routine.
                                    ;22  total cycles

but you run out of reserved registers if there are lots of copies of mloop.
(I.e., you can't use the decode-at-end-of-instruction technique.)  Using
"jmp (a3)" at the end of every routine makes it slower.

-------------------------------------------------------------------------------
Method 10:  The "standard fixed-size routine" method:

Now we eliminate the self-modifying code.  The main loop is coded into the
wasted space at the end of op_ff (so that it is within 128 bytes of op_00 ---
remember that op_00 is in the middle of the 64k code block).

setup:  subq.l  #2,sp               ; make room for a scratch word on the stack
        clr.w   (sp)                ; low byte of scratch word is always 0
        lea.l   mloop(pc),a3        ; a3 always points to mloop
        move.l  z80ram,a2           ; load pseudo-pc

mloop:  move.b  (a2)+,(sp)          ;12  Opcode to scratch high byte, inc pc
        move.w  (sp),d0             ; 8  high byte is opcode, low byte is 0
        jmp     op_00(pc,d0.w)      ;14  Jump to routine.

Each z80 instruction emulation routine ends with:

        jmp     (a3)                ; 8
                                    ;42  total cycles

Unfortunately it's quite a bit slower.
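To make the fixed-size layout concrete, a single 256-byte slot under method
10 might be arranged like this (the "add a,b" body and the d3/d4 register
assignments are only a sketch, and padding-directive syntax varies from
assembler to assembler):

op_80:  add.b   d4,d3               ; "add a,b"  (d3 = z80 A, d4 = z80 B here)
        ...                         ; (save or convert flags as required)
        jmp     (a3)                ; back to mloop
        ds.b    256-(*-op_80)       ; pad the slot out to exactly 256 bytes
op_81:  ...                         ; "add a,c", and so on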
We can do better...

-------------------------------------------------------------------------------
Method 11:  The "decode-at-end-of-instruction fixed-size routine" method:

Register a3 always points to op_00 instead of to mloop, and we have:

setup:  subq.l  #2,sp               ; make room for a scratch word on the stack
        clr.w   (sp)                ; low byte of scratch word is always 0
        lea.l   op_00(pc),a3        ; a3 always points to op_00
        move.l  z80ram,a2           ; load pseudo-pc

mloop:  move.b  (a2)+,(sp)          ;12  Opcode to scratch high byte, inc pc
        move.w  (sp),d0             ; 8  high byte is opcode, low byte is 0
        jmp     0(a3,d0.w)          ;14  Jump to routine.
                                    ;34  total cycles

Each z80 instruction emulation routine ends with a copy of the decode
routine.  This is faster than method 10, and mloop can be coded anywhere.

Can we avoid using scratch memory and still be as fast?  Think about how you
might do this before you read on.  An 8-bit shift of register d0 avoids using
scratch memory, but is slower (on a plain 68000).  The next method shows how
to make the decode faster and avoid using scratch memory, but it (possibly)
introduces overhead elsewhere.

-------------------------------------------------------------------------------
Method 12:  The "stretched-address-space fixed-size-routine" method:

This method assumes that the z80 address space (z80ram) is stretched to 128k,
so that each byte in the z80's address space takes up a word in the Amiga.
The low-order byte of every word must always be 0.

setup:  move.l  z80ram,a2           ; load pseudo-pc
        lea.l   op_00(pc),a3        ; a3 always points to op_00

mloop:  move.w  (a2)+,d0            ; 8  Opcode to d0 high byte, inc pc
        jmp     0(a3,d0.w)          ;14  Jump to routine.
                                    ;22  total cycles

For best results, every routine ends with the mloop code (decode at end of
instruction).  The instruction decode is faster than method 11, but now many
instructions have extra work to do to convert byte z80 addresses to word
Amiga addresses.  Still, this code looks good enough to try.

Miscellaneous hint #9:  To convert a byte offset to a word offset, use
"add.w d0,d0", not "lsl.w #1,d0".

Another miscellaneous hint:  Maybe there's a use for movep here.

You could maintain 2 copies of the z80 address space --- one in 64k and the
other in 128k.  Then it's just a simple matter of writing a byte to both
places whenever the z80 does a write.  That gets rid of the overhead of
converting between offset types on memory reads.  But now our method is
starting to look like threaded code (method 8) again.  The threaded code
method uses the 128k block to store the offset to the handling routine,
rather than storing the opcode itself.  The overhead of a memory write is the
same in both methods, and threaded code has other advantages (like not having
to pad code to 256 bytes, and multiple-byte opcodes handled better).  So
we're back to threaded code again.

-------------------------------------------------------------------------------
A note on multiple-byte opcode instructions:

The z80 uses these.  They are prefixed with $cb, $dd, $ed and $fd.  There are
also $ddcb- and $fdcb-prefixed instructions.  For the fixed-size methods
(methods 9..12), the fastest way to cope with long instructions is to use
more tables.  But that means multiplying the number of tables by 5 or 7:  64k
of code has just jumped to 320k or 448k.  That's no good on a small machine.
Also, if you reserve a register to point to op_00 in each table, that's 5 or
7 registers gone.  Oops.
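For what it's worth, a second table under method 11 might be chained in like
this; cb_00, like op_00, would have to sit in the middle of its own 64k
block, a0 is assumed to be a free scratch register, and all the names here
are hypothetical:

op_cb:  move.b  (a2)+,(sp)          ; second opcode byte to the scratch word
        move.w  (sp),d0
        lea.l   cb_00,a0            ; second 64k block of fixed-size routines
        jmp     0(a0,d0.w)          ;   (cb_80..cb_7f, laid out like the first)

Each cb_xx routine then finishes with the usual method 11 decode.  The $dd,
$ed and $fd prefixes (and the $ddcb/$fdcb pairs) would each need the same
treatment, which is where the 320k..448k and the extra reserved registers go
--- the lea above dodges the reserved register at the cost of a few cycles
per prefixed instruction.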
-------------------------------------------------------------------------------
Some notes on threaded code:

I spent last evening trying threaded code (method 8) in the Spectrum
emulator.  (See previous post.)  I got a 20..40% speed improvement over the
position-independent standard method (the 256-way CASE statement).  It's
still several times slower than a real Spectrum, unfortunately.  There is
some slow code in places where there wasn't before, so there is scope for
more improvement.  Unfortunately I introduced some bugs during the systematic
changes, and they are proving hard to track down.  Everything in the Spectrum
ROM seems to work ok, but some machine-code programs that worked before have
stopped working.

Threaded code is extremely fast for multiple-byte-opcode instructions.
Control is vectored directly to the right routine first time, without having
to decode through multiple tables.  A problem with this is that if a Spectrum
program overwrites the second opcode byte of a multiple-byte instruction
($cb, $dd, $ed, $fd) without writing to the first byte, then the emulator
doesn't cope.

-------------------------------------------------------------------------------
Note on hardware registers:

Handling writes to hardware registers in the Spectrum isn't really very hard.
The z80 has a separate IO address space with a separate set of instructions
for handling it.  (The same is true of the 8088.)  The only thing to watch
for is writing to video RAM.  It is fixed size and at a fixed place, so it
takes 2 tests.  (There doesn't seem to be a faster way of doing a single bit
test --- the video RAM doesn't end on a power-of-2 boundary.)  I have a
separate task (sort of) which uses the blitter to keep the screen up-to-date
with the video RAM.  So when there is a write to video RAM, the emulator task
doesn't have to do much.  It just flags the blitter task "Hey, there's
something to update in character row n, when you wake up".  The blitter task
doesn't slow the emulator down much, because it's mostly running on another
processor, and it sleeps when there's nothing to update.

-------------------------------------------------------------------------------
Some notes on flag handling:

Both of the CP/M emulators I know about spend a lot of time handling z80
flags (condition codes).  After just about every instruction they do a
"move sr,d0" or "move ccr,d0" or call GetCC() to get the 68000 flags, then
they do a table lookup to translate them to z80 format.  After every logical
instruction (and, or, xor etc), a second table lookup is done to set the z80
parity flag.  (The 68000 does not have a parity flag.)  These table lookups
are slow.  In fact, they often take several times as long as the guts of the
instruction itself.

Both these table lookups are totally unnecessary!  It's faster to save the
flags in 68000 format (in a register).  Routines that test flags simply test
the corresponding 68000 flag.  For logical instructions, simply save the
parity byte (the result byte) away somewhere, and set another bit in the
register to say "use the parity byte, not the V flag".  The parity-testing
instructions (e.g., "jp po,nn") look at that bit, and then test either the V
flag or the saved parity byte.  The only times you need to translate flags
between z80 and 68000 formats are in "push af" and "pop af" instructions.  I
got a 10..20% speedup in my Spectrum emulator this way.
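A minimal sketch of the idea, assuming (purely for illustration) that d3 and
d4 hold the z80 A and B registers, that d6 is the register reserved for the
saved flags, and that method 2's main loop is in use:

; e.g. "add a,b" --- do the work, then save the 68000 flags as they are:
addab:  add.b   d4,d3
        move    sr,d6               ; 68000 flags straight into d6 (on a
        jmp     (a3)                ;   68010+ this must be "move ccr,d6")

; e.g. "jp z,nn" --- test the saved 68000 Z flag directly, no translation:
jpznn:  move    d6,ccr              ; restore the saved flags
        beq.s   1$                  ; z80 Z flag == 68000 Z flag
        addq.l  #2,a2               ; condition false: skip the 16-bit address
        jmp     (a3)
1$:     moveq   #0,d0               ; condition true: load the little-endian
        move.b  1(a2),d0            ;   target address and re-point the
        lsl.w   #8,d0               ;   pseudo-pc at it
        move.b  (a2),d0
        move.l  z80ram,a2
        add.l   d0,a2
        jmp     (a3)

A logical instruction such as "xor b" would additionally stash its result
byte somewhere and set a spare bit of d6 to say "use the parity byte", as
described above.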
-------------------------------------------------------------------------------
I said in my previous post that threaded code for 8088 emulation would use
too much memory to be practical.  In fact it would be perfectly practical on
an Amiga equipped with 3 Mbytes or more.

Peter McGavin.  (srwmpnm@wnv.dsir.govt.nz)

srwmpnm@windy.dsir.govt.nz (03/12/91)
> One method that might help to speed up your emulator is to recode
> the Spectrum ROM into 68000 calls to AmigaDOS.

Yes, that works extremely well for CP/M and IBM emulators.  But the Spectrum
ROM does not have (many) well-defined entry points, which makes it harder to
do.  The emulator has to be prepared for programs that jump into the ROM
anywhere, or which use any part of it as data.

With threaded code (method 8, 1st post), all that needs to be done at each
known ROM entry point is to set the vector for that location to point to a
system-call handling routine instead of an instruction handling routine.
That totally eliminates the overhead required in other methods to check for
system calls.
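A minimal sketch of installing such a hook, assuming method 6/8's
4-byte-per-entry "vectors" table and a hypothetical native handler emu_rst16
for (say) the Spectrum ROM's print-a-character entry at $0010:

        lea.l   vectors,a0
        move.l  #emu_rst16,$40(a0)  ; the vector for z80 address $0010
                                    ;   (4 bytes per entry), so any thread of
                                    ;   control reaching $0010 lands in the
                                    ;   native handler, with no per-call test

The handler would do its output through the Amiga's console device (or
whatever is appropriate), pop the z80 return address off the emulated stack,
point a0 at the corresponding vector entry and fall back into the usual
"move.l (a0)+,a1 / jmp (a1)" dispatch.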
Most machine-code programs on the Spectrum ignore the ROM and bang on the
hardware to do IO.  Fortunately the Spectrum hardware (keyboard, screen) is
relatively easy to emulate.

I would prefer some sort of automatic compiler to recoding the ROM in 68000
by hand.  I haven't looked at the ROM code, and I probably shouldn't anyway.

Peter McGavin.  (srwmpnm@wnv.dsir.govt.nz)

Radagast@cup.portal.com (sullivan - segall) (03/13/91)
>-------------------------------------------------------------------------------
>Note on hardware registers:
>
>Handling writes to hardware registers in the Spectrum isn't really very hard.
>The z80 has a separate IO address space with a separate set of instructions
>for handling it.  (The same is true of the 8088.)  The only thing to watch
>for is writing to video RAM.  It is fixed size and at a fixed place, so it
>takes 2 tests.  (There doesn't seem to be a faster way of doing a single bit
>test --- the video RAM doesn't end on a power-of-2 boundary.)  I have a
>separate task (sort of) which uses the blitter to keep the screen up-to-date
>with the video RAM.  So when there is a write to video RAM, the emulator task
>doesn't have to do much.  It just flags the blitter task "Hey, there's
>something to update in character row n, when you wake up".  The blitter task
>doesn't slow the emulator down much, because it's mostly running on another
>processor, and it sleeps when there's nothing to update.
>
>-------------------------------------------------------------------------------

If video writes are the rarer case (and I imagine they are), use the bit test
even though the memory doesn't end on a power-of-2 boundary.  If the bit test
fails you can continue on with your important code quickly.  If it doesn't
fail, you only have the extra overhead of checking whether it is really a hit
or only a close call.  There are many places in an emulator where two checks
are faster than one.

When it comes to decoding op-codes, it is usually a good idea to make the
faster path the same as the manufacturer's faster path --- not necessarily
because that is a better design, so much as because many programs are
optimized to use the faster paths more often.  Worst-case emulation time is
increased, but overall performance is improved dramatically.

                          -Sullivan_-_Segall (a.k.a. Radagast)

_______________________________________________________________
/V\  E-Credibility:  (n -- ME) The unguaranteed likelihood that
 '   the electronic mail you are reading is genuine rather than
     someone's made-up crap.
_______________________________________________________________
Mail to:  ...sun!portal!cup.portal.com!radagast
      or  radagast@cup.portal.com
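For the Spectrum case being discussed, the two-level check might look
something like this (assuming the write address is in d0.w, that most writes
land in the upper 32k, and the usual Spectrum screen layout of $4000..$5aff;
the labels are only illustrative):

wrchk:  tst.w   d0                  ; one cheap test rejects the common case:
        bmi.s   plain_write         ;   $8000-$ffff is negative as a signed word
        cmp.w   #$4000,d0           ; slower exact tests only for the rare case
        bcs.s   rom_write           ;   below $4000 is ROM --- ignore the write
        cmp.w   #$5b00,d0
        bcc.s   plain_write         ;   $5b00-$7fff is ordinary RAM
        ...                         ; $4000-$5aff: video RAM, flag the blitter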