[comp.sys.intel] 386 pipelines/timings: warning and question

geoff@desint.UUCP (04/30/87)

This article has two parts:  first, a warning about the incompleteness of
the 386 hardware manuals, and then a question to see whether anybody knows
the complete answers.

I have extensive experience on pipelined computers.  I cut my teeth on
the CDC 6000 series, designed by Seymour Cray before he started his own
company.  On every previous high-performance computer I have worked with, I
have been able to open up the hardware or programmer's manual and find
a *complete* specification of the interdependencies of the various pipelines,
as well as instruction execution times, so that it is possible to calculate
the exact execution time of an instruction sequence.  For example, on the
CDC 6600, one can predict that of the two sequences:

	FX5	X1*X2	    |		FX5	X1*X2
	FX6	X3*X4	    |		FX7	X5+X0
	FX7	X5+X0	    |		FX6	X3*X4

the left-hand sequence will be faster than the right-hand one, because
the processor can do the two multiplies in parallel without waiting for
X5 to be computed as input to the addition (read "FX5" as "calculate
this floating expression into X5").

On the Intel 80386, this calculation cannot be done.  For example, loading
from memory into a register takes four clocks, because the processor must
wait for the data to actually arrive.  By contrast, storing only
takes two clocks, because the memory access can proceed in parallel with the
next instruction.  However, consider the sequence (Unix assembler operand
order):

	mov	%eax,ax_save		/ Store eax in ax_save
	mov	ax_save,%eax		/ Reload eax from ax_save

How long does this actually take?  The book would lead you to believe 6
clocks (2 for the store, 4 for the load).  However, it is obvious that
the load cannot proceed until the store has completed.  The actual
execution time is going to be 8 or 10 clocks;  I haven't been able to find
out exactly.  This is not documented ANYWHERE in the 386 reference books.
Similarly, there is no information that indicates how instruction fetches
interact with operand accesses.  Nor is there a specification of how
the instruction-decode pipeline reloads after branches;  from the manual
one would conclude that you can speed up this:

	jmp target
	...
target:	data16; seg-override; mov 2[%eax,%edx,4],%eax

by inserting a "nop" at the target, because the "m" variable in the branch
execution time would be reduced from 4 or 5 to 1.  This seems highly
unlikely to be the actual case.
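
(Spelling out the manual's arithmetic: if the taken-branch time is some base
figure plus m, a branch straight to the prefixed mov would be charged base+4
or base+5 clocks, while a branch to a nop would be charged only base+1, an
apparent saving of three or four clocks less whatever the nop itself costs.
Nothing in the manual says whether the prefetch and decode hardware actually
behaves this way.)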

I called up Intel about this, and an applications engineer promised to
try to locate information (he was skeptical about its existence) and send
it to me.  Since a week has passed with no answer, I presume he couldn't
find anything. He *did* inform me that the timing for branch instructions
in the manual is just plain wrong (the 386 always starts instruction fetches
at a "paragraph," or 16-byte, boundary, so you must also add target_addr mod 16
to the branch instruction time).
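
Taking that description at face value, the correction is at least easy to
state.  Here is a trivial C sketch of it (the function and its name are mine,
and the rule is only as good as the engineer's word):

	/*
	 * Estimated taken-branch time per the engineer's description: the
	 * manual's figure plus the target's offset within its 16-byte
	 * paragraph.  Purely illustrative.
	 */
	unsigned
	branch_clocks(unsigned manual_clocks, unsigned long target_addr)
	{
		return manual_clocks + (unsigned)(target_addr % 16);
	}

By this rule a branch to a target at, say, 0x100d picks up an extra 13 clocks
(0x100d mod 16) purely from fetch alignment, which is hardly a negligible
correction.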

So first a warning:  the instruction times in the 386 manual are for
"ideal" examples only, and real code will almost always take longer than
the manual indicates.  The problem is worse, of course, in systems with
slow memory.  If you really care about timings, you will have to measure
your code with a scope or an ICE (don't have a 386 ICE?  Neither do I;
they're rather hard to come by just yet).
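
Failing that, a poor man's approach is to time a long software loop, which at
least yields a rough per-iteration figure.  Here is the sort of harness I
mean, in C; it is only a sketch, the SEQUENCE and CPU_MHZ names are mine, it
assumes a compiler with clock(), and it cannot see prefetch, wait-state, or
refresh effects at all:

	#include <stdio.h>
	#include <time.h>

	#define ITERATIONS	1000000L
	#define CPU_MHZ		16.0	/* assumed clock rate; set for your part */

	/*
	 * The sequence under test (for example, the store/load pair above)
	 * goes here, typically as inline assembly or a call to a routine
	 * written in assembler.
	 */
	#define SEQUENCE()	/* ... */

	int main(void)
	{
		clock_t start, stop;
		long i;
		double secs, clocks;

		start = clock();
		for (i = 0; i < ITERATIONS; i++)
			SEQUENCE();
		stop = clock();

		secs = (double)(stop - start) / CLOCKS_PER_SEC;
		clocks = secs * CPU_MHZ * 1e6 / ITERATIONS;
		printf("roughly %.1f clocks per iteration, loop overhead included\n",
		    clocks);
		return 0;
	}

Run it once more with SEQUENCE empty to measure the loop overhead and
subtract.  The result is still only an average over whatever memory and
prefetch behavior the loop happens to provoke, which is exactly the
information the manual ought to be giving us.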

And second, a sincere request:  are there any Intel engineers in this
group who can give us a *complete* specification of the pipeline dependencies
in the 386?  It's not really that hard;  check out an old 6600 or 7600 manual
(or, presumably, a new Cray-1 or -2 manual) for an example of what needs
to be covered.  There really are those of us out here who care.
-- 
	Geoff Kuenning   geoff@ITcorp.com   {hplabs,ihnp4}!trwrb!desint!geoff