[comp.arch] High-Priority Instructions

schooler@oak.bbn.com (Richard Schooler) (07/25/90)

  The Motorola 88100's register file has only a single write port, and
thus on each cycle, only a single instruction can write a result.  This
is known as the "writeback slot".  When carefully hand-scheduling code
(and thinking about scheduling algorithms), the writeback slot can be a
real pain, mainly because of the fixed priority scheme: if instructions
of different latency contend for the writeback slot, then the integer
unit gets it first, then floating-point instructions, then loads
(roughly).  However, it often happens that loads or floating-point
instructions are on the critical path, and can get stalled because of
other instructions stealing the writeback slot.  Dealing with this is
painful by hand, and I haven't been able to come up with reasonable
algorithms, either.
  I thus propose that instructions have a "high-priority" bit, which the
compiler would place on critical-path instructions that would otherwise
get delayed.  Ties between instructions of the same priority would be
broken by the usual rules.  The scheduler could then mostly ignore the
writeback slot.  What do you think?
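
As an illustration only, here is a toy sketch of single-write-port
arbitration with such a bit.  The unit names, the integer-over-FP-over-load
order, and the Python form are assumptions for the example, not the 88100's
actual arbitration logic.

# Toy model: one writeback slot, fixed unit priority, plus an optional
# per-instruction "high-priority" bit that overrides the fixed order.
FIXED_PRIORITY = {"integer": 0, "fp": 1, "load": 2}    # lower value wins

def pick_writeback(requests):
    """requests: list of {"unit": ..., "high_priority": bool}.
    Returns the single request that gets the writeback slot."""
    if not requests:
        return None
    # A high-priority instruction beats everything; ties fall back
    # to the usual fixed unit priority.
    return min(requests,
               key=lambda r: (not r["high_priority"],
                              FIXED_PRIORITY[r["unit"]]))

# A critical-path load marked high-priority now wins over an integer
# op that would normally steal the slot.
winner = pick_writeback([
    {"unit": "integer", "high_priority": False},
    {"unit": "load",    "high_priority": True},
])
print(winner["unit"])   # -> load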

	-- Richard
	schooler@bbn.com

lewine@dg.dg.com (Don Lewine) (07/26/90)

In article <58428@bbn.BBN.COM> schooler@oak.bbn.com (Richard Schooler) writes:
>
>  The Motorola 88100's register file has only a single write port.
	{Stuff deleted}
>  I thus propose that instructions have a "high-priority" bit, which the
>compiler would place on critical-path instructions that would otherwise
>get delayed.  

There is nothing in the 88000 generic architecture that requires a single
write port.  Future 88000 family members may have multiple write ports.
Since the bit would require new silicon and a large amount of the opcode
space (the bit basically doubles the number of opcodes), I would vote for
additional write ports instead.

For the cases I have investigated, the additional write ports take less logic
than your bit would.

DISCLAIMER: This is my opinion as a computer architect.  It is not an
		announcement of anything by either Data General or Motorola!

colin@array.UUCP (Colin Plumb) (07/26/90)

I think that bit is just a trifle implementation-specific; an implementation
with a more intelligent arbitration scheme would simply ignore it.

How about write buffers between the functional unit outputs and the
register file?  The 29000 does this with async loads.  In fact, it is
always holding some register in the write buffer, because the only
write-back slots it uses are the ones from subsequent load
instructions... which, coincidentally, are the only instructions that
need to have the write buffer emptied!
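
Here is a toy sketch of the write-buffer idea, a FIFO parked in front of a
one-port register file.  It illustrates the concept only and is not a
description of the 29000's actual mechanism; the names and the
one-entry-per-cycle drain rule are assumptions.

from collections import deque

class WriteBuffer:
    """Toy result buffer between the functional units and a
    single-write-port register file (illustrative only)."""
    def __init__(self):
        self.pending = deque()              # (reg, value) results waiting

    def produce(self, reg, value):
        self.pending.append((reg, value))   # result parked, no port needed yet

    def drain(self, regfile, port_free):
        # Retire at most one buffered result per cycle, and only when
        # nothing else wants the write port this cycle.
        if port_free and self.pending:
            reg, value = self.pending.popleft()
            regfile[reg] = value

regfile = {}
wb = WriteBuffer()
wb.produce("r2", 42)                # e.g. an async load result arriving late
wb.drain(regfile, port_free=True)   # an idle write slot empties the buffer
print(regfile)                      # -> {'r2': 42}
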
-- 
	-Colin

marc@oahu.cs.ucla.edu (Marc Tremblay) (07/27/90)

In article <58428@bbn.BBN.COM> schooler@oak.bbn.com (Richard Schooler) writes:
>  The Motorola 88100's register file has only a single write port, and
>thus on each cycle, only a single instruction can write a result.  This
>is known as the "writeback slot".  When carefully hand-scheduling code
>(and thinking about scheduling algorithms), the writeback slot can be a
>real pain, mainly because of the fixed priority scheme: if instructions
>of different latency contend for the writeback slot, then the integer
>unit gets it first, then floating-point instructions, then loads
>(roughly).  However, it often happens that loads or floating-point
>instructions are on the critical path, and can get stalled because of
>other instructions stealing the writeback slot.

It looks like Motorola did not want to deal with functional unit
latencies when instructions are issued.
Otherwise they could use a result shift register where the "writeback slot"
is reserved in advance according to the latency of the functional unit used.
Conflicts are thus resolved in advance. Collisions cause stalling of
the issuing unit.
One of the advantages of a fixed priority scheme, though, is that one could
have variable-latency functional units (for example, an early-exit multiplier).
That kind of functional unit is hard to handle with a result shift register.
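
For concreteness, here is a toy version of such a reservation scheme; the
latencies, depth, and interface are made up for the example and do not
describe any real pipeline.

LATENCY = {"add": 1, "fmul": 4, "load": 3}    # made-up latencies

class ResultShiftRegister:
    """Toy writeback-slot reservation: slots[i] is the register that
    will use the single write port i cycles from now."""
    def __init__(self, depth=8):
        self.slots = [None] * depth

    def try_issue(self, op, dest):
        i = LATENCY[op]
        if self.slots[i] is not None:     # collision: stall the issuing unit
            return False
        self.slots[i] = dest              # writeback slot reserved in advance
        return True

    def cycle(self):
        done = self.slots[0]              # at most one result written per cycle
        self.slots = self.slots[1:] + [None]
        return done

rsr = ResultShiftRegister()
rsr.try_issue("fmul", "f2")               # reserves the slot 4 cycles out
rsr.cycle()                               # one cycle later...
print(rsr.try_issue("load", "r5"))        # -> False: both would finish in 3 cycles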

>  I thus propose that instructions have a "high-priority" bit, which the
>compiler would place on critical-path instructions that would otherwise
>get delayed.  Ties between instructions of the same priority would be
>broken by the usual rules.  The scheduler could then mostly ignore the
>writeback slot.  What do you think?

Let's see: instructions are held up because other instructions
have higher priority; they will proceed only if there is an empty slot.
If there is no empty slot, that means that other instructions are producing
useful work and that the write-back slot is running at full throughput
(since only one instruction is issued per cycle, only one result can be
produced per cycle in the steady state).
If the write-back slot is running at full throughput, that means that
data dependencies are resolved and that the stalled instruction is not needed
at the moment.
If you assign a high-priority bit to instructions, the flow
will be modified in the sense that another instruction will be stalled
while the high-priority instruction proceeds.  In the general case that will
add up to the same number of cycles.  There might be something to gain if, when
exiting the sequence of instructions executing at full throughput, there is
some data dependency on the previously stalled instruction
which takes a long time to resolve (a long load latency, for example).
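
A trivial way to see the "same number of cycles" point (pure toy arithmetic,
nothing machine-specific): with one write port and three ready results,
draining them always takes three cycles; a priority bit only changes which
result comes out early.

def drain(order):
    """One writeback per cycle: result i in the order is written at cycle i+1."""
    return {name: cycle for cycle, name in enumerate(order, start=1)}

print(drain(["int", "fp", "load"]))   # fixed priority: load written at cycle 3
print(drain(["load", "int", "fp"]))   # load marked high-priority: written at cycle 1
# The last writeback is cycle 3 either way; total throughput is unchanged.
# The bit only helps if something downstream is already waiting on that
# particular early result.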

_________________________________________________
Marc Tremblay
internet: marc@CS.UCLA.EDU
UUCP: ...!{uunet,ucbvax,rutgers}!cs.ucla.edu!marc

davec@nucleus.amd.com (Dave Christie) (07/27/90)

In <37310@shemp.CS.UCLA.EDU> marc@oahu.cs.ucla.edu (Marc Tremblay) writes:
>In article <58428@bbn.BBN.COM> schooler@oak.bbn.com (Richard Schooler) writes:
>> [description of problems with scheduling for writeback slot on 88k deleted]
>It looks like Motorola did not want to deal with functional unit
>latencies when instructions are issued.
>Otherwise they could use a result shift register where the "writeback slot"
>is reserved in advance according to the latency of the functional unit used.
>Conflicts are thus resolved in advance. Collisions cause stalling of
>the issuing unit.

Yep, a pretty straightforward thing to do control-wise.  You also want
to be able to forward either or both input operands from it, so it's a
bit more than a simple shift register, but nevertheless nice for handling 
medium latencies.  You might want to revert to a priority scheme for
divide though, or serialize.  Another benefit is that it keeps your 
register file updates in order.  Might have been just a tad too much 
real estate for them though, assuming they considered it.

>> [Richard's instruction-based priority bit scheme deleted]
>
>Let's see: instructions are held up because other instructions
>have higher priority; they will proceed only if there is an empty slot.
>If there is no empty slot, that means that other instructions are producing
>useful work and that the write-back slot is running at full throughput

But if a certain write slot doesn't satisfy a dependency that exists at
decode then issue stalls, which will cause an empty write slot downstream
and lower throughput.  You would typically want to separate an FP instruction
from an instruction that depends on it; if you throw as many integer
instructions in between as necessary to compensate for the FP latency,
those integer instructions just end up stretching out the FP latency
by taking priority for writeback, and the extra FP latency isn't hidden
at all - ouch!  The proper priority scheme, IMHO, would give preference
to earlier instructions in the sequence, which are of course the ones with
the longer latencies.  I'd like to hear their rationale for this.  Maybe
it was chosen arbitrarily and they just expected the compilers to keep
it all moving (bet they learned a lesson there!).  There's an example
of rescheduling in an article on the 88K in June's IEEE Micro, but it
conveniently avoids write-slot contention with an appropriate mix of
operations of various latencies.  Such mixes do of course occur in real
code, but there's no mention of what happens when things don't work out so nicely.
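
A toy timeline makes the stretching effect visible.  The latencies, the
all-integer padding, and the single-issue model are invented for the
illustration; this is not a cycle-accurate 88100 model.

FP_LATENCY = 4                              # made-up FP latency
program = ["fmul"] + ["int"] * 6            # pad the FP latency with integer work

pending = []                                # [ready_cycle, kind] results in flight
fp_written = None

for cycle in range(20):
    if cycle < len(program):                # single issue: one instruction per cycle
        op = program[cycle]
        latency = FP_LATENCY if op == "fmul" else 1
        pending.append([cycle + latency, op])
    ready = [p for p in pending if p[0] <= cycle]
    if ready:
        # fixed priority: any ready integer result beats the FP result
        winner = min(ready, key=lambda p: (p[1] != "int", p[0]))
        pending.remove(winner)
        if winner[1] == "fmul":
            fp_written = cycle

print("fmul ready at cycle", FP_LATENCY, "- written back at cycle", fp_written)
# -> ready at cycle 4, written back at cycle 8: the integer padding
#    stretched the visible FP latency instead of hiding it.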

In any case I personally wouldn't advocate putting something in the 
instruction set architecture simply to occasionally deal with the 
vagaries of a particular implementation.

----------------------
Dave Christie           My opinions only.

daveg@near.cs.caltech.edu (Dave Gillespie) (07/29/90)

I don't have the book handy, but as I recall the highest priority on the
88000's writeback slot was given to the single-cycle instructions.  I always
figured this was so that these instructions could be guaranteed to complete
in one cycle.  So, yes, other strategies like first-come-first-served might
work better, but they would have complicated the part of the processor that
was trying to be simplest and fastest.

Just my guess,

								-- Dave
--
Dave Gillespie
  256-80 Caltech Pasadena CA USA 91125
  daveg@csvax.caltech.edu, ...!cit-vax!daveg

marvin@oakhill.UUCP (Marvin Denman) (07/31/90)

In article <1990Jul27.161856.25701@mozart.amd.com> davec@nucleus.amd.com (Dave Christie) writes:
>In <37310@shemp.CS.UCLA.EDU> marc@oahu.cs.ucla.edu (Marc Tremblay) writes:
>>In article <58428@bbn.BBN.COM> schooler@oak.bbn.com (Richard Schooler) writes:
>>> [description of problems with scheduling for writeback slot on 88k deleted]
>>It looks like Motorola did not want to deal with functional unit
>>latencies when instructions are issued.
>>Otherwise they could use a result shift register where the "writeback slot"
>>is reserved in advance according to the latency of the functional unit used.
>>Conflicts are thus resolved in advance. Collisions cause stalling of
>>the issuing unit.
>
>Yep, a pretty straightforward thing to do control-wise.  You also want
>to be able to forward either or both input operands from it, so it's a
>bit more than a simple shift register, but nevertheless nice for handling 
>medium latencies.  You might want to revert to a priority scheme for
>divide though, or serialize.  Another benefit is that it keeps your 
>register file updates in order.  Might have been just a tad too much 
>real estate for them though, assuming they considered it.

I think that poster's proposal of a shift register was intended for writeback
reservations only, but I may have misunderstood.  At the time we designed the
88100 we did not seriously consider a shift register of results (which is what
Dave Christie seems to be talking about) because of circuit complexities.  We
definitely considered a shift register for write-back result reservations, but
due to several considerations we decided that the arbitration scheme was more
flexible.  The divide latencies would have required a very long shift register,
and load cache misses would have required stalling all of the pipelines,
negating one of the benefits of interlocks.  The implementation of a
shift-register writeback reservation scheme would have been fairly simple, so
that is not why we did not do it.

>
>>> [Richard's instruction-based priority bit scheme deleted]
>>
>>Let's see: instructions are held up because other instructions
>>have higher priority; they will proceed only if there is an empty slot.
>>If there is no empty slot, that means that other instructions are producing
>>useful work and that the write-back slot is running at full throughput
>
>But if a certain write slot doesn't satisfy a dependency that exists at
>decode then issue stalls, which will cause an empty write slot downstream
>and lower throughput.  You would typically want to separate an FP instruction
>from an instruction that depends on it; if you throw as many integer
>instructions in between as necessary to compensate for the FP latency,
>those integer instructions just end up stretching out the FP latency
>by taking priority for writeback, and the extra FP latency isn't hidden
>at all - ouch!  The proper priority scheme, IMHO, would give preference
>to earlier instructions in the sequence, which are of course the ones with
>the longer latencies.  I'd like to hear their rationale for this.  Maybe
>it was chosen arbitrarily and they just expected the compilers to keep
>it all moving (bet they learned a lesson there!).  There's an example
>of rescheduling in an article on the 88K in June's IEEE Micro, but it
>conveniently avoids write-slot contention with an appropriate mix of
>operations of various latencies.  Such mixes do of course occur in real
>code, but there's no mention of what happens when things don't work out so nicely.

Your point about just stalling the floating point writeback even further by
inserting integer instructions is valid to a degree.  Inserting only integer
instructions between when the fp operation starts and is used will delay
the fp result, but it does hide all of the latency except the 1 extra clock
that the sequencer stalls to let the data write back before it is used.  Even
more important is that in most realistic code the instructions that you can
insert will have some loads or maybe even branches that will free up writeback 
slots.  Of course if you do not have enough independent instructions to 
completely hide the latency it does not matter which scheme you use because
you will be waiting for the data anyway.

As for the rationale behind the prioritization, it is as follows.  Single-cycle
instructions are given the highest priority to avoid having to build pipelines
and unnecessarily stall single-stage function units.  The memory unit has the
lowest priority because it is not known soon enough to arbitrate for a
writeback slot whether the cache will hit or miss.  If it had a higher
priority, then writeback slots would be wasted completely during every
cache miss.  In between these two are the floating-point units, whose
priorities were much more arbitrary.  The multiplier was arbitrarily given
higher priority than the adder.  When the multiplier had internal contention
between integer and floating-point results, we went with the integer result.
This decision was somewhat arbitrary, but I seem to remember some arguments
that if both imul and fmul were occurring, the imul was probably being used in
an array address calculation and should have the higher priority.  Given these
constraints, I don't think any other ordering would have given significantly
higher performance over a mix of code.
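
Putting the order just described into one list, as a reader's paraphrase of
the posting rather than anything official:

# Writeback priority as described above, highest first (a paraphrase of
# the posting, not Motorola documentation).
PRIORITY = [
    "single-cycle integer unit",
    "multiplier, integer result (imul)",
    "multiplier, floating-point result (fmul)",
    "floating-point adder",
    "memory unit (loads)",          # lowest: hit/miss not known in time
]

def writeback_winner(requesting_units):
    """Pick the requesting unit that gets the single write port."""
    return min(requesting_units, key=PRIORITY.index)

print(writeback_winner(["memory unit (loads)", "floating-point adder"]))
# -> the adder wins; the load result waits for a free slot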


-- 
Marvin Denman
Motorola 88000 Design
cs.utexas.edu!oakhill!marvin

marc@oahu.cs.ucla.edu (Marc Tremblay) (07/31/90)

In article <3612@yogi.oakhill.UUCP> marvin@yogi.UUCP (Marvin Denman) writes:
>In <37310@shemp.CS.UCLA.EDU> marc@oahu.cs.ucla.edu (Marc Tremblay) writes:
>>It looks like Motorola did not want to deal with functional unit
>>latencies when instructions are issued.
>>Otherwise they could use a result shift register where the "writeback slot"
>>is reserved in advance according to the latency of the functional unit used.
>>Conflicts are thus resolved in advance. Collisions cause stalling of
>>the issuing unit.
>
>I think that posters proposal of a shift register was intended for writeback 
>reservations only, but I may have misunderstood. At the time we designed the
>88100 we did not seriously consider a shift register of results which is what
>Dave Christie seems to be talking about because of circuit complexities.  We 
>definitely considered a shift register for write back result reservations, but 
>due to several considerations we decided that the arbitration scheme was more
>flexible.

You are right: my original posting concerned a shift register for write-back
result reservation.

>Inserting only integer
>instructions between when the fp operation starts and is used will delay
>the fp result, but it does hide all of the latency except the 1 extra clock
>that the sequencer stalls to let the data write back before it is used.  Even
>more important is that in most realistic code the instructions that you can
>insert will have some loads or maybe even branches that will free up writeback 
                                           ^^^^^^^^
					   pretty weird code! 

>slots.  Of course if you do not have enough independent instructions to 
>completely hide the latency it does not matter which scheme you use because
>you will be waiting for the data anyway.

Good point.

...
[Marvin explains the rationale behind the prioritization scheme for the 88100]

Thanks for giving us the "inside" point of view.
A variation of another arbitration scheme, described in Mike Johnson's
dissertation "Super-Scalar Processor Design" (Stanford, 1989), could be used
to avoid starvation of floating-point results.
Johnson simulated an arbiter that gives priority as follows:

	- There are two arbiters, one for integer results and one
	  for floating-point results.
	- Top priority is given to requests that have been active
	  for more than one cycle (functional units make their
	  request one cycle before they finish).
	- Within each arbiter, priority runs from highest to lowest:

		integer:                      floating point:
		    ALU                           add
		    shifter                       multiply
		    branch (return address)       divide
		    loads                         convert

In this way long writeback delays are avoided, since older requests are
served first.  Johnson claims that the added complexity for such an arbiter
(compared to one that does not take age into account) is small, but of course
no layout metrics are given...
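
As a rough sketch of that aging rule, based only on the summary above rather
than on Johnson's actual design:

INT_ORDER = ["ALU", "shifter", "branch", "load"]
FP_ORDER  = ["add", "multiply", "divide", "convert"]

def arbitrate(requests, unit_order):
    """requests: list of (unit, cycles_waiting).  Requests that have been
    active for more than one cycle outrank everything; the listed unit
    order breaks ties.  Returns the winning request."""
    if not requests:
        return None
    return min(requests,
               key=lambda r: (r[1] <= 1,              # "old" requests first
                              unit_order.index(r[0])))

# A load that has already waited two cycles beats a fresh ALU result,
# so long-latency results cannot be starved indefinitely.
print(arbitrate([("ALU", 1), ("load", 2)], INT_ORDER))   # -> ('load', 2)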

_________________________________________________
Marc Tremblay
internet: marc@CS.UCLA.EDU
UUCP: ...!{uunet,ucbvax,rutgers}!cs.ucla.edu!marc