[comp.arch] 386 pipeline details

kds@mipos3.UUCP (Ken Shoemaker ~) (05/08/87)

A week or so ago, Geoff Kuenning posted a note to comp.sys.intel asking
how the 386 pipeline is organized.  I thought that information would also
be interesting to people in comp.arch, so here it is:

The 386 is broken up into lots of little functional units.  These are the
bus unit, the instruction fetcher, the instruction cracker, the instruction 
execution unit, the linear address generator, the page translator and the 
arithmetic unit.  In terms of instruction execution, they could be put 
into three groups: the first group does instruction preparation (the
fetcher and cracker), the second group does instruction execution (the
execution unit, linear address generator and the arithmetic unit) and
the last group does bus interface tasks at the bidding of the previous
two groups (the paging unit and the bus unit).

There are various places where instructions can sit in the process of
execution.  There is a 12-byte prefetch queue of uncracked instructions.  These
are taken 1, 2 or 4 bytes at a time and cracked, and the cracked instructions
are put in a 3-instruction-deep cracked instruction fifo.  Both of these fifos
have zero fall-through time.  Filling the prefetch queue is done by an
autonomous unit whenever there is a spare bus cycle and there are empty
slots.  Instructions are cracked whenever there are bytes ready to be
cracked and there is a free slot in the cracked instruction queue.  The rate
at which instructions are cracked depends on the instruction.  The cracker
pulls the opcode bytes themselves, i.e., those bytes that specify the
instruction execution, at a rate of one byte per clock.  Immediate data and
offsets are pulled out completely in one clock, i.e., a 32-bit immediate
value takes one clock to get pulled out of the prefetch queue and put in
the cracked instruction queue.

For example, a simple move-immediate-to-register instruction would take
2 clocks to crack: one for the instruction specification and one for
the immediate data "crack," assuming that the appropriate parts of the
instruction are available to the instruction cracker when it needs them (the
complete instruction need not be present for the instruction cracker to
begin on an instruction).  Likewise, a register-to-register arithmetic
operation would take 2 clocks to crack, the first for the opcode byte and
the second for the mod/rm byte.
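The crack rates above (one clock per instruction-specifying byte, one clock total for an immediate or displacement of any width) can be sketched as a toy Python model.  The function and parameter names are mine, not Intel terminology:

```python
# Toy model of 386 instruction-crack timing, per the rates described
# above: each opcode/mod-rm byte costs one clock; an immediate or a
# displacement, regardless of width, is pulled out in one clock.
def crack_clocks(opcode_bytes, has_immediate=False, has_displacement=False):
    clocks = opcode_bytes          # one clock per instruction-specifying byte
    if has_immediate:
        clocks += 1                # 8/16/32-bit immediate: one clock total
    if has_displacement:
        clocks += 1                # likewise for an address displacement
    return clocks

# mov eax, 12345678h -> opcode byte + 32-bit immediate = 2 clocks
print(crack_clocks(1, has_immediate=True))   # 2
# add eax, ebx       -> opcode byte + mod/rm byte = 2 clocks
print(crack_clocks(2))                       # 2
```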

This instruction cracking proceeds in parallel with previous instruction
execution and, except in the case of taken jumps, is not included in
the instruction times given in the data sheet.  In the case
of a jump, that is where the "m" comes from in the specification of 7 + m
for the execution time in the taken-jump case.

All frequently executed instructions that access memory, e.g., loads, stores,
arithmetic ops, jumps, etc., have the option of beginning their execution in
the clock before the previous instruction finishes execution.  This is called
the early start clock.  During this clock the address pipeline is set up to
begin the bus cycle.  The actual address calculation is done in specialized
hardware which works in the first clock of the instruction, i.e., not the
early start clock.  If one of the registers to be used in the effective
address calculation was written at the end of the previous instruction, that
value is written both to the register and to the address calculation hardware
at the end of the previous instruction if possible (it may not be possible
if, for example, only 8 bits of the register are being written while all
32 bits of the register are being used by the address calculation unit).
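That forwarding condition can be stated as a one-line predicate.  This is my own framing of the rule in the paragraph above (the function name and bit-width parameters are assumptions, not Intel's description):

```python
# Can a register written at the end of one instruction be forwarded
# straight to the address-calculation hardware for the next?  Per the
# text, a partial write (e.g. 8 bits) can't feed a use of the full
# 32-bit register, so the forwarding doesn't happen in that case.
def can_forward(bits_written, bits_used):
    return bits_written >= bits_used

print(can_forward(32, 32))  # mov eax,... then [eax]: True, value forwarded
print(can_forward(8, 32))   # mov al,...  then [eax]: False, no forwarding
```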

In the first clock of a non-memory reference instruction, the register file
and alu are set up.  During the second clock, the register file is read out, 
and the alu operates.  At the end of the second clock the result is
written back to the register file.  Hence, most register/register operations
on the 386 are 2 clocks long.  When memory references are involved, it is
a little more difficult.  As discussed above, the address generation hardware
is set up the clock before the instruction actually begins.  Then during the
first clock of the instruction execution, the linear address generation
begins.  Linear address generation can take 1 or 2 clocks, depending on
the number of components in the linear address.  The 386 supports 1, 2 or
3 components plus the segment base to generate the linear address.  If
1 or 2 components are used to generate the linear address, the linear
address generation takes 1 clock; if 3 components are used the linear 
address generation takes 2 clocks, i.e., there is a 3 input adder in the
address generation hardware.  During this time, the segmentation access
and limit checks take place.  During the clock after the linear address
is calculated, the linear to physical address translation occurs using
the paging hardware.  This takes one clock, and the physical addresses
are presented on the external address pins in the clock after this.  So,
for example, if you have a two component effective address calculation memory
load, the final physical address will be presented onto the external address
pins at the beginning of the third clock into the instruction.  If you
are running with a zero wait state system data will be returned at the
end of the 4th clock, and the instruction can terminate: the instruction takes
4 clocks.  If it is an arithmetic instruction, the arithmetic instruction can
then continue, which takes two clocks in addition: the instruction takes
6 clocks.  For writes, the processor waits only long enough to ensure that all
the protection checks complete (through the segmentation/linear address
generation hardware and the paging hardware), so a store to memory takes
two clocks: the processor goes on to the next instruction before the
write completes on the external bus.
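The load timings above can be tallied with a small Python model.  This is a sketch of the clock counts given in the text for a zero-wait-state system, not a cycle-accurate simulator, and the function and parameter names are mine:

```python
# Clock count for a memory-read instruction on a toy 386 model, per the
# text: address generation takes 1 clock for 1-2 components, 2 clocks
# for 3 (there is a 3-input adder); paging translation takes 1 clock;
# the address reaches the pins the next clock and data returns at the
# end of the clock after that (plus any wait states).
def load_clocks(address_components, arithmetic=False, wait_states=0):
    addr_gen = 1 if address_components <= 2 else 2  # linear address generation
    paging = 1                                      # linear -> physical
    bus = 2 + wait_states                           # address on pins, data back
    total = addr_gen + paging + bus
    if arithmetic:
        total += 2                                  # ALU op on the loaded data
    return total

print(load_clocks(2))                   # 2-component load: 4 clocks
print(load_clocks(2, arithmetic=True))  # load + arithmetic: 6 clocks
```

A store, by contrast, retires in 2 clocks regardless, since the processor only waits out the protection checks and lets the bus unit finish the write.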

With respect to collisions, there are two places where they can occur.  The
first concerns access to the external bus: it is possible
that a prefetch cycle and a processor-generated cycle could be generated
concurrently.  In this case, the processor-generated cycle is always given
priority.  If you are running with a 0 wait state system, the processor
will never have to wait to gain access to the external bus because of
prefetcher contention, because the internal arbitration logic knows far
enough ahead of time when an instruction is going to request the bus that
it can prevent another prefetch bus cycle from being generated if it won't
complete before the processor's bus cycle is ready to be run.  You should
be able to figure out what happens in a multiple wait state system!  The
other place of contention is in accessing the register file.  This can happen
when the previous instruction modifies a register that is being used
as an addressing component in the next instruction, and is handled as described
above.  Note that for back-to-back arithmetic instructions you won't
have collisions, since the register read-out for one instruction won't happen
at the same time as the write-back from the previous instruction.

The above doesn't go into detail about how split data cycles
are handled, about faults, or how long microcode sequences are supported.  
These shouldn't be an issue when deciding which instruction sequences
compilers should generate, except, of course, that data transfers do go
faster when they are aligned (which should be obvious!)  I should add,
also, that special hardware is present to deal with multiple back-to-back
push and pop operations such as might be encountered on procedure entry
and exit.

With the above covered, I can begin to address the points brought out in
the original note.

> On the Intel 80386, this calculation cannot be done.  For example, loading
> from memory into a register takes four clocks, because the processor must
> wait for the data to become actually available.  By contrast, storing only
> takes two clocks, because the memory access can proceed in parallel with the
> next instruction.  However, consider the sequence (Unix assembler operand
> order):
> 
> 	mov	%eax,ax_save		/ Store eax in ax_save
> 	mov	ax_save,%eax		/ Reload eax from ax_save
> 
> How long does this actually take?  The book would lead you to believe 6
> clocks (2 for the store, 4 for the load).  However, it is obvious that
> the load cannot proceed if the store is happening in parallel.  The actual
> execution time is going to be 8 or 10 clocks;  I haven't been able to find

well, if you believe what I have said above, then it will take 6 clocks:
the first 2 clocks generate the write cycle, the next 2 clocks generate the
read cycle and actually perform the write to memory and the next 2 clocks
read back the value from memory.
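Laid out clock by clock, that 6-clock count looks something like the following.  This timeline is my own reconstruction from the description above, assuming zero wait states:

```python
# Clock-by-clock sketch of the store-then-load pair: the store retires
# after its 2 clocks of protection checks, and the write itself rides
# the external bus while the load's address is being generated, so the
# pair costs 2 + 4 = 6 clocks rather than 8 or 10.
timeline = [
    (1, "store: address generation / segmentation checks"),
    (2, "store: paging check; store retires, write handed to bus unit"),
    (3, "load: address generation; write runs on the external bus"),
    (4, "load: paging translation; write to memory completes"),
    (5, "load: physical address driven onto the pins"),
    (6, "load: data returned and written to eax"),
]
for clock, event in timeline:
    print(clock, event)
print(len(timeline), "clocks total")
```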

> Similarly, there is no information that indicates how instruction fetches
> interact with operand accesses.  Nor is there a specification of how
> the instruction-decode pipeline reloads after branches;  from the manual
> one would conclude that you can speed up this:
> 
> 	bra target
> 	...
> target:	data16; seg-override; mov 2[%eax,%edx,4],%eax
> 
> by inserting a "nop" at the target, because the "m" variable in the branch
> execution time would be reduced from 4 or 5 to 1.  This seems highly
> unlikely to be the actual case.

you should be able to figure out how long it actually takes to crack an
instruction from what was presented above.  You are right in thinking
that inserting a nop won't speed it up.  You should also be able to
figure out that the more valid data you can get into the processor after
a branch, the faster the cracker can get started, so if you put all
your branch targets on 32-bit boundaries, you can get the instructions
started executing faster after the branch.
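The alignment advice can be quantified: as described later in this post, the first fetch after a taken branch starts at the enclosing 32-bit boundary and the bytes below the target are thrown away.  A small sketch (function name mine):

```python
# Useful instruction bytes delivered by the first 32-bit fetch after a
# taken branch: the 386 fetches the aligned dword containing the target
# and discards the bytes below the target address.
def first_fetch_bytes(target_addr):
    return 4 - (target_addr % 4)

for addr in range(0x100, 0x104):
    print(hex(addr), "->", first_fetch_bytes(addr), "useful byte(s)")
# An aligned target (addr % 4 == 0) gives the cracker 4 bytes to chew
# on; a target at offset 3 gives it only 1, so cracking starts slower.
```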

> I called up Intel about this, and an applications engineer promised to
> try to locate information (he was skeptical about its existence) and send
> it to me.  Since a week has passed with no answer, I presume he couldn't
> find anything. He *did* inform me that the timing for branch instructions
> in the manual is just plain wrong (the 386 always starts instruction fetches
> at a "paragraph," or 16-byte, boundary,so you must also add target_addr mod 16
> to the branch instruction time).

almost true, as described above.  The 386 doesn't start instruction fetches
on a 16-byte boundary, but rather on a 4-byte (32-bit) boundary, and throws
away bytes below the actual address jumped to.  This doesn't add any time
to the branch cycle, since the data bus is 32-bits anyway.

> So first a warning:  the instruction times in the 386 manual are for
> "ideal" examples only, and real code will almost always take longer than
> the manual indicates.  The problem is worse, of course, in systems with
> slow memory.  If you really care about timings, you will have to measure
> your code with a scope or an ICE (don't have a 386 ICE?  Neither do I;
> they're rather hard to come by just yet).

"your milage may vary."  Seriously, the numbers are valid and are given 
as the number of clocks it takes the microcode to execute the
sequence for the instruction assuming zero wait-state memory+the number
of clocks it takes to get the next instruction ready to execute in the case
of jumps.  If you look at the above, I'm sure you will see how the real world
can make these numbers seem somewhat unrealistic in some systems.

> And second, a sincere request:  are there any Intel engineers in this
> group who can give us a *complete* specification of the pipeline dependencies
> in the 386?  It's not really that hard;  check out an old 6600 or 7600 manual
> (or, presumably, a new Cray-1 or -2 manual) for an example of what needs
> to be covered.  There really are those of us out here who care.
> -- 
> 	Geoff Kuenning   geoff@ITcorp.com   {hplabs,ihnp4}!trwrb!desint!geoff

I hope this is what you want.  As you see, on the 386, the details
of the pipeline aren't nearly as important in determining
how to make certain instruction sequences run fast.  In your
example, the main contribution is the speed of the external memory system,
which has nothing to do with the pipeline details.  Also, if you read
the notes on the instruction timings for the 386 in the data sheet, I'm
sure you can see why some of the notes are there, and you might have even
been able to figure out most of this for yourself.  I'm not sure why the
information wasn't presented in the above form, however.  If you have any
more questions, please feel free to respond via mail.  If there are enough
questions (have I managed to confuse anyone out there that cares?) I can
post a followup.
-- 
The above views are personal.

...and they whisper and they chatter, but it really doesn't matter.

Ken Shoemaker, Microprocessor Design, Intel Corp., Santa Clara, California
uucp: ...{hplabs|decwrl|amdcad|qantel|pur-ee|scgvaxd|oliveb}!intelca!mipos3!kds
csnet/arpanet: kds@mipos3.intel.com

toma@tekgvs.TEK.COM (Thomas Almy) (05/08/87)

In article <648@mipos3.UUCP> kds@mipos3.UUCP (Ken Shoemaker ~) writes:
>A week or so ago, Geoff Kuenning posted a note to comp.sys.intel asking about 
>how the 386 pipeline was organized.  I thought that that information would
>be interesting to people in comp.arch, also, so here it is: [...]

I really appreciated this information; I have been confused, stumped and
mystified over the operation of the 386 since getting one.  The documentation
sure left a lot to be desired.

>
>> Similarly, there is no information that indicates how instruction fetches
>> interact with operand accesses.  Nor is there a specification of how
>> the instruction-decode pipeline reloads after branches;  from the manual
>> one would conclude that you can speed up this: [...]
>> by inserting a "nop" at the target, because the "m" variable in the branch
>> execution time would be reduced from 4 or 5 to 1.  This seems highly
>> unlikely to be the actual case.
>
>you should be able to figure out how long it actually takes to crack an
>instruction from what was presented above.  You are right in thinking
>that inserting a nop won't speed it up.  You should also be able to 
>figure out that the more valid data you can get into the processor after 
>a branch, the faster the cracker can get started, so if you put all 
>your branch targets on 32-bit boundaries, you can get the instructions 
>started executing faster after the branch.

Yet I discovered that what Geoff had said was true to a degree!  In a loop
where the first instruction is fetched from memory, the loop performance
will increase if the start of the loop is offset from the 32-bit boundary!
And inserting a nop in some cases did increase performance!

I also found that the memory interleaving causes inconsistent results
depending on the position of the instruction relative to the position of
the data!

>
>> So first a warning:  the instruction times in the 386 manual are for
>> "ideal" examples only, and real code will almost always take longer than
>> the manual indicates.
>
>"your milage may vary."  Seriously, the numbers are valid and are given 
>as the number of clocks it takes the microcode to execute the
>sequence for the instruction assuming zero wait-state memory+the number
>of clocks it takes to get the next instruction ready to execute in the case
>of jumps.  
>
>I hope this is what you want.  As you see, on the 386, the details 
>of the pipeline aren't nearly as important to know in determining 
>how to make certain instruction sequences run fast.  In your
>example, the main contribution is the speed of the external memory system,
>which has nothing to do with the pipeline details.  

I am afraid I have to agree with Geoff's feelings about needing to measure
the time for code.

Consider this test performed by a friend of mine on his 386 system (running in
286 protected mode) and verified by me (running in 386 protected mode).  Both
systems are Intel 386/AT motherboards with one wait state + interleave.  My
system has paging enabled.  The exercise involves filling up a 64K segment
with as many copies of an instruction as will fit, and then executing it.
Thus the pipeline will be filled, and a steady-state instruction time can
be calculated (in my case, interrupts were disabled as well).  The two-byte
instruction "MOV EAX,EBX" (AX,BX in my friend's case) took 2.1 clocks on
average.  The manual says 2 clocks.  What is consuming the extra .1 clock?
Even with the wait state, the instruction queues should always be full.
(3 clocks for each instruction-word fetch + 1 for paging? = 4 clocks, but
two instructions arrive per fetch.)
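The steady-state arithmetic of that experiment works out as follows.  This is a Python sketch; the total-clocks figure below is illustrative, chosen to reproduce the 2.1 average, not a value from the original measurement:

```python
# Steady-state clocks-per-instruction from the experiment above:
# execute N back-to-back copies of a 2-byte instruction filling a 64K
# segment, then divide elapsed clocks by N.
def clocks_per_instruction(total_clocks, segment_bytes=64 * 1024, instr_bytes=2):
    n_instructions = segment_bytes // instr_bytes  # 32768 copies fit in 64K
    return total_clocks / n_instructions

# e.g. 68813 elapsed clocks for 32768 copies of MOV EAX,EBX
print(round(clocks_per_instruction(68813), 1))  # 2.1
```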

The next instruction tested was "CLD", a single byte status flag clearing
instruction.  The manual again says 2 clocks, but we both measured EXACTLY
3 clocks.  Unless the manual is misprinted, there is even less reason for
this.

Tom Almy
Tektronix, Inc.

phil@amdcad.AMD.COM (Phil Ngai) (05/10/87)

In article <2271@tekgvs.TEK.COM> toma@tekgvs.UUCP (Thomas Almy) writes:
>Thus the pipeline will be filled, and a steady state instruction time will
>be calculated (in my case, interrupts were disabled as well).  The two byte
>instruction "MOV EAX,EBX" (AX,BX in my friends case) took 2.1 clocks on 
>average.  The manual says 2 clocks.  What is consuming the extra .1 clock?

Just a shot in the dark, but does the board use dynamic memory and
could refresh be slowing the program down? 

-- 
Phil Ngai, {ucbvax,decwrl,allegra}!amdcad!phil or amdcad!phil@decwrl.dec.com