kds@mipos3.UUCP (Ken Shoemaker ~) (05/08/87)
A week or so ago, Geoff Kuenning posted a note to comp.sys.intel asking how the 386 pipeline is organized. I thought that information would be interesting to people in comp.arch as well, so here it is:

The 386 is broken up into lots of little functional units: the bus unit, the instruction fetcher, the instruction cracker, the instruction execution unit, the linear address generator, the page translator and the arithmetic unit. In terms of instruction execution, they can be put into three groups: the first group does instruction preparation (the fetcher and cracker), the second group does instruction execution (the execution unit, linear address generator and arithmetic unit), and the last group does bus interface tasks at the bidding of the previous two groups (the paging unit and the bus unit).

There are various places where instructions can sit in the process of execution. There is a 12-byte prefetch queue of uncracked instructions. These are taken 1, 2 or 4 bytes at a time, cracked, and the cracked instructions are put in a 3-instruction-deep cracked instruction FIFO. Both of these FIFOs have zero fall-through time. Filling the prefetch queue is done by an autonomous unit whenever there is a spare bus cycle and there are empty slots. Instructions are cracked whenever there are bytes ready to be cracked and there is a free slot in the cracked instruction queue.

The rate at which instructions are cracked depends on the instruction. The cracker pulls the opcode bytes themselves, i.e., those bytes that specify the instruction execution, at a rate of one byte per clock. Immediate data and offsets are pulled out completely in one clock, i.e., a 32-bit immediate value takes one clock to get pulled out of the prefetch queue and be put in the cracked instruction queue.
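The cracking rate just described can be sketched as a toy model. This is an illustration of the counting rule only, not a description of the actual hardware; the split of bytes into "specification" and "data" is my own bookkeeping:

```python
# Toy model of the 386 cracker's throughput as described above:
# specification bytes (opcode, mod/rm, etc.) are pulled at one byte per
# clock, while an immediate or displacement of any width costs a single
# clock regardless of size.

def crack_clocks(spec_bytes, has_immediate=False, has_displacement=False):
    """Clocks to move one instruction from the prefetch queue to the
    cracked-instruction FIFO, assuming bytes are always available."""
    clocks = spec_bytes                  # one clock per specification byte
    clocks += 1 if has_immediate else 0  # any-width immediate: one clock
    clocks += 1 if has_displacement else 0
    return clocks

# mov reg, imm32: one opcode byte + 32-bit immediate
print(crack_clocks(1, has_immediate=True))  # 2
# add reg, reg: opcode byte + mod/rm byte
print(crack_clocks(2))                      # 2
```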
For example, a simple move immediate to register instruction would take 2 clocks to crack: one for the instruction specification and one for the immediate data "crack," assuming the appropriate parts of the instruction are available to the instruction cracker when it needs them (the complete instruction need not be present for the cracker to begin on an instruction). Likewise, a register to register arithmetic operation would take 2 clocks to crack, the first for the opcode byte and the second for the mod/rm byte. This instruction cracking is done in parallel with previous instruction execution and, except in the case of jumps taken, is not included in the instruction times given in the data sheet. In the case of a jump, that is where the "m" comes from in the 7 + m execution time specified for the jump-taken case.

All frequently executed instructions that access memory, e.g., loads, stores, arithmetic ops, jumps, etc., have the option of beginning their execution in the clock before the previous instruction finishes execution. This is called the early start clock. During this clock the address pipeline is set up to begin the bus cycle. The actual address calculation is done in specialized hardware which works in the first clock of the instruction, i.e., not the early start clock. If one of the registers to be used in the effective address calculation was written at the end of the previous instruction, that value is written both to the register and to the address calculation hardware at the end of the previous instruction, if possible (it may not be possible if, for example, only 8 bits of the register are being written while all 32 bits of the register are being used by the address calculation unit).

In the first clock of a non-memory-reference instruction, the register file and ALU are set up. During the second clock, the register file is read out and the ALU operates. At the end of the second clock the result is written back to the register file. Hence, most register/register operations on the 386 are 2 clocks long.

When memory references are involved, it is a little more complicated. As discussed above, the address generation hardware is set up the clock before the instruction actually begins. Then, during the first clock of the instruction execution, the linear address generation begins. Linear address generation can take 1 or 2 clocks, depending on the number of components in the linear address. The 386 supports 1, 2 or 3 components plus the segment base to generate the linear address. If 1 or 2 components are used, linear address generation takes 1 clock; if 3 components are used, it takes 2 clocks, i.e., there is a 3-input adder in the address generation hardware. During this time, the segmentation access and limit checks take place. In the clock after the linear address is calculated, the linear-to-physical address translation occurs using the paging hardware. This takes one clock, and the physical address is presented on the external address pins in the clock after that.

So, for example, for a memory load with a two-component effective address calculation, the final physical address will be presented on the external address pins at the beginning of the third clock into the instruction. If you are running a zero-wait-state system, data will be returned at the end of the 4th clock, and the instruction can terminate: the instruction takes 4 clocks. If it is an arithmetic instruction, the arithmetic can then continue, which takes two clocks in addition: the instruction takes 6 clocks.
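The load timing above reduces to a little clock-counting function. This is a sketch under the stated assumptions only: zero-fall-through queues, no bus contention, and a 2-clock zero-wait-state bus cycle:

```python
# Approximate clocks for a memory load on the 386, per the walk-through
# above: 1 clock of linear address generation for 1 or 2 components
# (2 clocks for 3 components, via the 3-input adder), 1 clock of paging
# translation, then a bus cycle of 2 clocks plus any wait states. An
# arithmetic operation on the loaded data adds 2 more clocks.

def load_clocks(components, wait_states=0, arithmetic=False):
    addr_gen = 1 if components <= 2 else 2  # 3 components need 2 passes
    paging = 1                              # linear -> physical translation
    bus = 2 + wait_states                   # address out, data back
    return addr_gen + paging + bus + (2 if arithmetic else 0)

print(load_clocks(2))                   # 4, as in the example above
print(load_clocks(2, arithmetic=True))  # 6
```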
For writes, the processor waits only long enough to ensure that all the protection checks complete (through the segmentation/linear address generation hardware and the paging hardware), so a store to memory takes two clocks: the processor goes on to the next instruction before the write completes on the external bus.

With respect to collisions, there are two places where they can occur. The first is access to the external bus: it is possible that a prefetch cycle and a processor-generated cycle could be generated concurrently. In this case, the processor-generated cycle is always given priority. If you are running a 0-wait-state system, the processor will never have to wait to gain access to the external bus because of prefetcher contention, because the internal arbitration logic knows far enough ahead of time when an instruction is going to request the bus that it can prevent another prefetch bus cycle from being generated if it won't complete before the processor's bus cycle is ready to run. You should be able to figure out what happens in a multiple-wait-state system!

The other place of contention is in accessing the register file. This can happen if a previous instruction modifies a register that is being used as an addressing component in the next instruction, and it is handled as described above. Note that for back-to-back arithmetic instructions you won't have collisions, since the register read-out for one instruction won't happen at the same time as the write-back from the previous instruction.

The above doesn't go into detail about how split data cycles are handled, about faults, or about how long microcode sequences are supported. These shouldn't be an issue in deciding what instruction sequences compilers should generate, except, of course, that data transfers do go faster when they are aligned (which should be obvious!)
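The bus-priority rule can be stated as a one-liner. This is just the policy described above restated as code, not the actual arbitration logic:

```python
# Internal bus arbitration per the text: a processor-generated cycle
# always wins over a prefetch cycle; the prefetcher only runs in
# otherwise-spare bus cycles.

def grant_bus(processor_wants_bus, prefetcher_wants_bus):
    if processor_wants_bus:
        return "processor"
    if prefetcher_wants_bus:
        return "prefetch"
    return "idle"

print(grant_bus(True, True))   # processor
print(grant_bus(False, True))  # prefetch
```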
I should add, also, that special hardware is present to deal with multiple back-to-back push and pop operations, such as might be encountered on procedure entry and exit. With the above covered, I can begin to address the points brought out in the original note.

> On the Intel 80386, this calculation cannot be done. For example, loading
> from memory into a register takes four clocks, because the processor must
> wait for the data to become actually available. By contrast, storing only
> takes two clocks, because the memory access can proceed in parallel with the
> next instruction. However, consider the sequence (Unix assembler operand
> order):
>
> mov %eax,ax_save / Store eax in ax_save
> mov ax_save,%eax / Reload eax from ax_save
>
> How long does this actually take? The book would lead you to believe 6
> clocks (2 for the store, 4 for the load). However, it is obvious that
> the load cannot proceed if the store is happening in parallel. The actual
> execution time is going to be 8 or 10 clocks; I haven't been able to find

Well, if you believe what I have said above, then it will take 6 clocks: the first 2 clocks generate the write cycle, the next 2 clocks generate the read cycle and actually perform the write to memory, and the next 2 clocks read back the value from memory.

> Similarly, there is no information that indicates how instruction fetches
> interact with operand accesses. Nor is there a specification of how
> the instruction-decode pipeline reloads after branches; from the manual
> one would conclude that you can speed up this:
>
> bra target
> ...
> target: data16; seg-override; mov 2[%eax,%edx,4],%eax
>
> by inserting a "nop" at the target, because the "m" variable in the branch
> execution time would be reduced from 4 or 5 to 1. This seems highly
> unlikely to be the actual case.

You should be able to figure out how long it actually takes to crack an instruction from what was presented above.
You are right in thinking that inserting a nop won't speed it up. You should also be able to figure out that the more valid data you can get into the processor after a branch, the faster the cracker can get started, so if you put all your branch targets on 32-bit boundaries, you can get the instructions started executing faster after the branch.

> I called up Intel about this, and an applications engineer promised to
> try to locate information (he was skeptical about its existence) and send
> it to me. Since a week has passed with no answer, I presume he couldn't
> find anything. He *did* inform me that the timing for branch instructions
> in the manual is just plain wrong (the 386 always starts instruction fetches
> at a "paragraph," or 16-byte, boundary, so you must also add target_addr mod 16
> to the branch instruction time).

Almost true, as described above. The 386 doesn't start instruction fetches on a 16-byte boundary, but rather on a 4-byte (32-bit) boundary, and throws away the bytes below the actual address jumped to. This doesn't add any time to the branch cycle, since the data bus is 32 bits anyway.

> So first a warning: the instruction times in the 386 manual are for
> "ideal" examples only, and real code will almost always take longer than
> the manual indicates. The problem is worse, of course, in systems with
> slow memory. If you really care about timings, you will have to measure
> your code with a scope or an ICE (don't have a 386 ICE? Neither do I;
> they're rather hard to come by just yet).

"Your mileage may vary." Seriously, the numbers are valid and are given as the number of clocks it takes the microcode to execute the sequence for the instruction, assuming zero-wait-state memory, plus the number of clocks it takes to get the next instruction ready to execute in the case of jumps. If you look at the above, I'm sure you will see how the real world can make these numbers seem somewhat unrealistic in some systems.
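The alignment advice follows directly from the fetch rule just described: fetches start on 4-byte boundaries and bytes below the target are discarded, so the first fetch after a branch delivers fewer usable bytes when the target is misaligned. A sketch of that arithmetic (my own illustration, not Intel's):

```python
# Usable instruction bytes delivered by the first 32-bit fetch after a
# branch: the 386 fetches the aligned word containing the target and
# throws away the bytes below the target address.

def useful_bytes_after_branch(target_addr):
    return 4 - (target_addr % 4)

print(useful_bytes_after_branch(0x1000))  # 4: aligned target, full word usable
print(useful_bytes_after_branch(0x1003))  # 1: only the last byte is usable
```

This is why putting branch targets on 32-bit boundaries gets the cracker started faster after a branch.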
> And second, a sincere request: are there any Intel engineers in this
> group who can give us a *complete* specification of the pipeline dependencies
> in the 386? It's not really that hard; check out an old 6600 or 7600 manual
> (or, presumably, a new Cray-1 or -2 manual) for an example of what needs
> to be covered. There really are those of us out here who care.
> --
> Geoff Kuenning geoff@ITcorp.com {hplabs,ihnp4}!trwrb!desint!geoff

I hope this is what you want. As you can see, on the 386 the details of the pipeline aren't nearly as important to know in determining how to make certain instruction sequences run fast. In your example, the main contribution is the speed of the external memory system, which has nothing to do with the pipeline details. Also, if you read the notes on the instruction timings for the 386 in the data sheet, I'm sure you can see why some of the notes are there, and you might even have been able to figure out most of this for yourself. I'm not sure why the information wasn't presented in the above form, however.

If you have any more questions, please feel free to respond via mail. If there are enough questions (have I managed to confuse anyone out there who cares?) I can post a followup.
--
The above views are personal.
...and they whisper and they chatter, but it really doesn't matter.
Ken Shoemaker, Microprocessor Design, Intel Corp., Santa Clara, California
uucp: ...{hplabs|decwrl|amdcad|qantel|pur-ee|scgvaxd|oliveb}!intelca!mipos3!kds
csnet/arpanet: kds@mipos3.intel.com
toma@tekgvs.TEK.COM (Thomas Almy) (05/08/87)
In article <648@mipos3.UUCP> kds@mipos3.UUCP (Ken Shoemaker ~) writes:

>A week or so ago, Geoff Kuenning posted a note to comp.sys.intel asking about
>how the 386 pipeline was organized. I thought that that information would
>be interesting to people in comp.arch, also, so here it is: [...]

I really appreciated this information; I have been confused, stumped and mystified over the operation of the 386 since getting one. The documentation sure left a lot to be desired.

>> Similarly, there is no information that indicates how instruction fetches
>> interact with operand accesses. Nor is there a specification of how
>> the instruction-decode pipeline reloads after branches; from the manual
>> one would conclude that you can speed up this: [...]
>> by inserting a "nop" at the target, because the "m" variable in the branch
>> execution time would be reduced from 4 or 5 to 1. This seems highly
>> unlikely to be the actual case.
>
>You should be able to figure out how long it actually takes to crack an
>instruction from what was presented above. You are right in thinking
>that inserting a nop won't speed it up. You should also be able to
>figure out that the more valid data you can get into the processor after
>a branch, the faster the cracker can get started, so if you put all
>your branch targets on 32-bit boundaries, you can get the instructions
>started executing faster after the branch.

Yet I discovered that what Geoff had said was true to a degree! In a loop where the first instruction is fetched from memory, the loop performance will increase if the start of the loop is offset from the 32-bit boundary! And inserting a nop in some cases did increase performance! I also found that the memory interleaving causes inconsistent results depending on the position of the instruction relative to the position of the data!
>> So first a warning: the instruction times in the 386 manual are for
>> "ideal" examples only, and real code will almost always take longer than
>> the manual indicates.
>
>"Your mileage may vary." Seriously, the numbers are valid and are given
>as the number of clocks it takes the microcode to execute the sequence
>for the instruction, assuming zero-wait-state memory, plus the number of
>clocks it takes to get the next instruction ready to execute in the case
>of jumps.
>
>I hope this is what you want. As you can see, on the 386 the details
>of the pipeline aren't nearly as important to know in determining
>how to make certain instruction sequences run fast. In your
>example, the main contribution is the speed of the external memory system,
>which has nothing to do with the pipeline details.

I am afraid I have to agree with Geoff's feelings about needing to measure the time for code. Consider this test performed by a friend of mine on his 386 system (running in 286 protected mode) and verified by me (running in 386 protected mode). Both systems are Intel 386/AT motherboards with one wait state plus interleave. My system has paging enabled.

The exercise involves filling a 64K segment with as many copies of an instruction as will fit, and then executing it. Thus the pipeline will be filled, and a steady-state instruction time can be calculated (in my case, interrupts were disabled as well). The two-byte instruction "MOV EAX,EBX" (AX,BX in my friend's case) took 2.1 clocks on average. The manual says 2 clocks. What is consuming the extra .1 clock? Even with the wait state, the instruction queues should always be full (3 clocks for each instruction word fetch + 1 for paging? = 4 clocks, but two instructions are fetched per clock).

The next instruction tested was "CLD", a single-byte status-flag-clearing instruction. The manual again says 2 clocks, but we both measured EXACTLY 3 clocks. Unless the manual is misprinted, there is even less reason for this.
Tom Almy Tektronix, Inc.
phil@amdcad.AMD.COM (Phil Ngai) (05/10/87)
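The refresh theory can be sanity-checked with back-of-the-envelope arithmetic. The numbers below (16 MHz clock, 15.6 us refresh interval, 5 clocks stolen per refresh) are typical figures for DRAM boards of the period, not measurements of these particular motherboards:

```python
# Rough DRAM refresh overhead: one stolen bus cycle per refresh interval.
CLOCK_MHZ = 16.0
REFRESH_INTERVAL_US = 15.6      # typical row-refresh spacing (assumed)
REFRESH_CYCLE_CLOCKS = 5        # assumed cost of one stolen bus cycle

clocks_between = REFRESH_INTERVAL_US * CLOCK_MHZ   # ~250 clocks
overhead = REFRESH_CYCLE_CLOCKS / clocks_between   # ~2%

# A 2-clock instruction would stretch to about:
print(round(2 * (1 + overhead), 2))  # 2.04
```

On these assumptions refresh stretches a 2-clock instruction to about 2.04 clocks, so it could explain part, though perhaps not all, of the measured 2.1.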
In article <2271@tekgvs.TEK.COM> toma@tekgvs.UUCP (Thomas Almy) writes:

>Thus the pipeline will be filled, and a steady state instruction time will
>be calculated (in my case, interrupts were disabled as well). The two byte
>instruction "MOV EAX,EBX" (AX,BX in my friends case) took 2.1 clocks on
>average. The manual says 2 clocks. What is consuming the extra .1 clock?

Just a shot in the dark, but does the board use dynamic memory, and could refresh be slowing the program down?
--
Phil Ngai, {ucbvax,decwrl,allegra}!amdcad!phil or amdcad!phil@decwrl.dec.com