schooler@oak.bbn.com (Richard Schooler) (07/25/90)
The Motorola 88100's register file has only a single write port, and thus on each cycle only a single instruction can write a result. This is known as the "writeback slot".

When carefully hand-scheduling code (and thinking about scheduling algorithms), the writeback slot can be a real pain, mainly because of the fixed priority scheme: if instructions of different latency contend for the writeback slot, the integer unit gets it first, then floating-point instructions, then loads (roughly). However, it often happens that loads or floating-point instructions are on the critical path, and they can get stalled by other instructions stealing the writeback slot. Dealing with this is painful by hand, and I haven't been able to come up with reasonable algorithms either.

I thus propose that instructions have a "high-priority" bit, which the compiler would place on critical-path instructions that would otherwise get delayed. Ties between instructions of the same priority would be broken by the usual rules. The scheduler could then mostly ignore the writeback slot. What do you think?

-- Richard
   schooler@bbn.com
lewine@dg.dg.com (Don Lewine) (07/26/90)
In article <58428@bbn.BBN.COM> schooler@oak.bbn.com (Richard Schooler) writes:
>The Motorola 88100's register file has only a single write port.
 {Stuff deleted}
>I thus propose that instructions have a "high-priority" bit, which the
>compiler would place on critical-path instructions that would otherwise
>get delayed.

There is nothing in the 88000 generic architecture that requires a single write port; future 88000 family members may have multiple write ports. Since the bit would require new silicon and a large amount of the opcode space (the bit basically doubles the number of opcodes), I would vote for additional write ports instead. For the cases I have investigated, the additional write ports take less logic than your bit would.

DISCLAIMER: This is my opinion as a computer architect. It is not an announcement of anything by either Data General or Motorola!
colin@array.UUCP (Colin Plumb) (07/26/90)
I think that bit is just a trifle implementation-specific; a more intelligent arbitration scheme would ignore it again.

How about write buffers between the functional unit outputs and the register file? The 29000 does this with async loads. In fact, it is always holding some register in the write buffer, because the only write-back slots it uses are the ones from subsequent load instructions... which, coincidentally, are the only instructions that need to have the write buffer emptied!
--
	-Colin
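A one-entry write buffer of the sort Colin mentions might behave like this (a rough model under assumed behavior; the class and method names are invented for illustration and are not the Am29000 design):

```python
# Sketch: a one-entry write buffer sits between a functional unit and
# the register file. Reads are forwarded out of the buffer so results
# are visible immediately; the buffer drains into the register file
# whenever the write port would otherwise be idle.
class BufferedRegFile:
    def __init__(self):
        self.regs = {}     # the architectural register file
        self.buffer = None # (reg, value) waiting for a free write slot

    def write(self, reg, value):
        # A new result displaces the buffered one into the file first.
        if self.buffer is not None:
            old_reg, old_val = self.buffer
            self.regs[old_reg] = old_val
        self.buffer = (reg, value)

    def read(self, reg):
        # Forward from the buffer, so no one waits on the write port.
        if self.buffer is not None and self.buffer[0] == reg:
            return self.buffer[1]
        return self.regs[reg]

    def drain(self):
        # Called on cycles when the write port is free.
        if self.buffer is not None:
            self.regs[self.buffer[0]] = self.buffer[1]
            self.buffer = None

rf = BufferedRegFile()
rf.write("r3", 42)
print(rf.read("r3"))  # -> 42, forwarded before the write retires
```

The point of the buffer is that a result never has to fight for the write port on the cycle it is produced; arbitration pressure moves to the (much more elastic) question of when the buffer drains.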
marc@oahu.cs.ucla.edu (Marc Tremblay) (07/27/90)
In article <58428@bbn.BBN.COM> schooler@oak.bbn.com (Richard Schooler) writes:
>The Motorola 88100's register file has only a single write port, and
>thus on each cycle, only a single instruction can write a result. This
>is known as the "writeback slot". When carefully hand-scheduling code
>(and thinking about scheduling algorithms), the writeback slot can be a
>real pain, mainly because of the fixed priority scheme: if instructions
>of different latency contend for the writeback slot, then the integer
>unit gets it first, then floating-point instructions, then loads
>(roughly). However, it often happens that loads or floating-point
>instructions are on the critical path, and can get stalled because of
>other instructions stealing the writeback slot.

It looks like Motorola did not want to deal with functional unit latencies when instructions are issued. Otherwise they could have used a result shift register in which the "writeback slot" is reserved in advance according to the latency of the functional unit used. Conflicts are thus resolved in advance; collisions cause stalling of the issuing unit. One advantage of a fixed-priority scheme, though, is that one can have variable-latency functional units (for example an early-exit multiplier). That kind of functional unit is hard to handle with a result shift register.

>I thus propose that instructions have a "high-priority" bit, which the
>compiler would place on critical-path instructions that would otherwise
>get delayed. Ties between instructions of the same priority would be
>broken by the usual rules. The scheduler could then mostly ignore the
>writeback slot. What do you think?

Let's see: instructions are held up because other instructions have higher priority; they will proceed only if there is an empty slot.
If there is no empty slot, that means that the other instructions are producing useful work and that the write-back slot is running at full throughput (since only one instruction is issued per cycle, only one result can be produced per cycle in the steady state). If the write-back slot is running at full throughput, then data dependencies are being resolved and the stalled instruction is not needed at the moment. If you assign a high-priority bit to an instruction, the flow is modified in the sense that another instruction is stalled while the high-priority instruction proceeds. In the general case that adds up to the same number of cycles. There might be something to gain if, on exiting the sequence of instructions executing at full throughput, there is some data dependency associated with the previously stalled instruction which takes a long time to resolve (a long load latency, for example).

_________________________________________________
Marc Tremblay
internet: marc@CS.UCLA.EDU
UUCP: ...!{uunet,ucbvax,rutgers}!cs.ucla.edu!marc
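The reservation shift register Marc proposes can be sketched as follows (an illustrative model with an assumed depth and interface, not a claim about any actual design): slot i marks whether the write port is already claimed i cycles in the future, and an instruction of latency L may issue only if slot L is free.

```python
# Sketch of a writeback reservation shift register: conflicts are
# resolved at issue time instead of at writeback time. A collision
# stalls the issuing unit, exactly as described in the post above.
class ResultShiftRegister:
    def __init__(self, depth=8):
        # slots[i] == True means the write port is claimed i cycles ahead.
        self.slots = [False] * depth

    def try_issue(self, latency):
        """Reserve the writeback slot `latency` cycles ahead.
        Returns True on success, False if the issuer must stall."""
        if self.slots[latency]:
            return False  # collision: stall the issuing unit
        self.slots[latency] = True
        return True

    def tick(self):
        # Advance one cycle; slot 0 is the result writing back now.
        self.slots = self.slots[1:] + [False]

rsr = ResultShiftRegister()
print(rsr.try_issue(3))  # FP op, latency 3 -> True (slot reserved)
print(rsr.try_issue(3))  # op completing the same cycle -> False, stall
rsr.tick()
print(rsr.try_issue(2))  # still lands on the reserved cycle -> False
```

This also shows why a variable-latency unit (Marc's early-exit multiplier) is awkward here: the reservation must be made for the worst-case latency, or the early result has no guaranteed slot.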
davec@nucleus.amd.com (Dave Christie) (07/27/90)
In <37310@shemp.CS.UCLA.EDU> marc@oahu.cs.ucla.edu (Marc Tremblay) writes:
>In article <58428@bbn.BBN.COM> schooler@oak.bbn.com (Richard Schooler) writes:
>> [description of problems with scheduling for writeback slot on 88k deleted]
>It looks like Motorola did not want to deal with functional unit
>latencies when instructions are issued.
>Otherwise they could use a result shift register where the "writeback slot"
>is reserved in advance according to the latency of the functional unit used.
>Conflicts are thus resolved in advance. Collisions cause stalling of
>the issuing unit.

Yep, a pretty straightforward thing to do control-wise. You also want to be able to forward either or both input operands from it, so it's a bit more than a simple shift register, but nevertheless nice for handling medium latencies. You might want to revert to a priority scheme for divide, though, or serialize. Another benefit is that it keeps your register file updates in order. Might have been just a tad too much real estate for them, though, assuming they considered it.

>> [Richard's instruction-based priority bit scheme deleted]
>
>Let's see, instructions are held up because other instructions
>have higher priority, they will proceed only if there is an empty slot.
>If there is no empty slot, that means that other instructions are producing
>useful work and that the write-back slot is running at full throughput

But if a certain write slot doesn't satisfy a dependency that exists at decode, then issue stalls, which will cause an empty write slot downstream and lower throughput. You would typically want to separate an FP instruction from an instruction that depends on it; if you throw in as many integer instructions as necessary to compensate for the FP latency, those integer instructions just end up stretching out the FP latency by taking priority for writeback, and the extra FP latency isn't hidden at all - ouch!
The proper priority scheme, IMHO, would give preference to earlier instructions in the sequence, which are of course the ones with the longer latencies. I'd like to hear their rationale for this. Maybe it was arbitrarily chosen and they just expected the compilers to keep it all moving (bet they learned a lesson there!). There's an example of rescheduling in an article on the 88K in June's IEEE Micro, but it conveniently avoids write slot contention with an appropriate mix of various-latency operations. Contention does of course happen in the real world, but there's no mention of what happens when things don't work out so nicely. In any case I personally wouldn't advocate putting something in the instruction set architecture simply to occasionally deal with the vagaries of a particular implementation.

----------------------
Dave Christie		My opinions only.
daveg@near.cs.caltech.edu (Dave Gillespie) (07/29/90)
I don't have the book handy, but as I recall the highest priority on the 88000's writeback slot was given to the single-cycle instructions. I always figured this was so that these instructions could be guaranteed to complete in one cycle. So, yes, other strategies like first-come-first-served might work better, but they would have complicated the part of the processor that was trying to be simplest and fastest.

Just my guess,
								-- Dave
--
Dave Gillespie
  256-80 Caltech Pasadena CA USA 91125
  daveg@csvax.caltech.edu, ...!cit-vax!daveg
marvin@oakhill.UUCP (Marvin Denman) (07/31/90)
In article <1990Jul27.161856.25701@mozart.amd.com> davec@nucleus.amd.com (Dave Christie) writes:
>In <37310@shemp.CS.UCLA.EDU> marc@oahu.cs.ucla.edu (Marc Tremblay) writes:
>>In article <58428@bbn.BBN.COM> schooler@oak.bbn.com (Richard Schooler) writes:
>>> [description of problems with scheduling for writeback slot on 88k deleted]
>>It looks like Motorola did not want to deal with functional unit
>>latencies when instructions are issued.
>>Otherwise they could use a result shift register where the "writeback slot"
>>is reserved in advance according to the latency of the functional unit used.
>>Conflicts are thus resolved in advance. Collisions cause stalling of
>>the issuing unit.
>
>Yep, a pretty straightforward thing to do control-wise. You also want
>to be able to forward either or both input operands from it, so it's a
>bit more than a simple shift register, but nevertheless nice for handling
>medium latencies. You might want to revert to a priority scheme for
>divide though, or serialize. Another benefit is that it keeps your
>register file updates in order. Might have been just a tad too much
>real estate for them though, assuming they considered it.

I think that poster's proposal of a shift register was intended for writeback reservations only, but I may have misunderstood. At the time we designed the 88100 we did not seriously consider a shift register of results (which is what Dave Christie seems to be talking about) because of circuit complexities. We definitely considered a shift register for write-back result reservations, but due to several considerations we decided that the arbitration scheme was more flexible: the divide latencies would have required a very long shift register, and load cache misses would have required stalling all of the pipelines, negating one of the benefits of interlocks. The implementation of a shift-register writeback reservation scheme would have been fairly simple, so that is not why we did not do it.
>>> [Richard's instruction-based priority bit scheme deleted]
>>
>>Let's see, instructions are held up because other instructions
>>have higher priority, they will proceed only if there is an empty slot.
>>If there is no empty slot, that means that other instructions are producing
>>useful work and that the write-back slot is running at full throughput
>
>But if a certain write slot doesn't satisfy a dependency that exists at
>decode then issue stalls, which will cause an empty write slot downstream
>and lower throughput. You would typically want to separate a FP instruction
>from an instruction that depended on it; if you throw as many integer
>instructions in between as necessary to compensate for the FP latency,
>those integer instructions just end up stretching out the FP latency
>by taking priority for writeback, and the extra FP latency isn't hidden
>at all - ouch! The proper priority scheme, IMHO, would give preference
>to earlier instructions in the sequence, which is of course the ones with
>the longer latency. I'd like to hear their rationale for this. Maybe
>it was arbitrarily chosen and they just expected the compilers to keep
>it all moving (bet they learned a lesson there!). There's an example
>of rescheduling in an article on the 88K in June's IEEE Micro, but it
>conveniently avoids write slot contention with an appropriate mix of
>various latency operations. This of course does happen in the real world,
>but there's no mention of what happens when things don't work out so nicely.

Your point about stalling the floating-point writeback even further by inserting integer instructions is valid to a degree. Inserting only integer instructions between where the fp operation starts and where it is used will delay the fp result, but it does hide all of the latency except the 1 extra clock that the sequencer stalls to let the data write back before it is used.
Even more important is that in most realistic code the instructions that you can insert will include some loads or maybe even branches that will free up writeback slots. Of course, if you do not have enough independent instructions to completely hide the latency, it does not matter which scheme you use, because you will be waiting for the data anyway.

As for the rationale behind the prioritization, it is as follows. Single-cycle instructions are given the highest priority to avoid having to build pipelines for, and unnecessarily stall, single-stage functional units. The memory unit has the lowest priority because it is not known soon enough to arbitrate for a writeback slot whether the cache will hit or miss; if it had a higher priority, writeback slots would be wasted entirely during every cache miss. In between these two are the floating-point units, whose priorities were much more arbitrary. The multiplier was arbitrarily given higher priority than the adder. When the multiplier had internal contention between integer and floating-point results, we went with the integer result. This decision was somewhat arbitrary, but I seem to remember some arguments that if both an imul and an fmul were occurring, the imul was probably being used in an array address calculation and should have higher priority. Given these constraints, I don't think any other ordering would have significantly higher performance over a mix of code.

--
Marvin Denman
Motorola 88000 Design
cs.utexas.edu!oakhill!marvin
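The ordering Marvin lays out can be written down as a single priority table (a paraphrase of the post above with invented unit labels, not an official Motorola table):

```python
# Writeback priority as described above, highest first. The names are
# illustrative labels, not actual 88100 unit designations.
WRITEBACK_PRIORITY = [
    "single_cycle",  # highest: single-stage units never need a pipeline
    "imul",          # multiplier's integer result ahead of its FP results
    "fmul",          # multiplier arbitrarily ahead of the adder
    "fadd",
    "load",          # lowest: hit/miss isn't known in time to arbitrate
]

def writeback_winner(ready):
    """Pick which of the ready results takes the write port this cycle."""
    for unit in WRITEBACK_PRIORITY:
        if unit in ready:
            return unit
    return None

print(writeback_winner({"load", "fadd"}))  # -> fadd
```

Laid out this way, the design constraints are visible at the two ends of the list: the top entry protects the single-cycle units, the bottom entry protects writeback slots from being wasted on cache misses, and only the middle is arbitrary.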
marc@oahu.cs.ucla.edu (Marc Tremblay) (07/31/90)
In article <3612@yogi.oakhill.UUCP> marvin@yogi.UUCP (Marvin Denman) writes:
>I think that posters proposal of a shift register was intended for writeback
>reservations only, but I may have misunderstood. At the time we designed the
>88100 we did not seriously consider a shift register of results which is what
>Dave Christie seems to be talking about because of circuit complexities. We
>definitely considered a shift register for write back result reservations, but
>due to several considerations we decided that the arbitration scheme was more
>flexible.

You are right, my original posting concerned a shift register for write-back result reservation.

>Inserting only integer
>instructions between when the fp operation starts and is used will delay
>the fp result, but it does hide all of the latency except the 1 extra clock
>that the sequencer stalls to let the data write back before it is used. Even
>more important is that in most realistic code the instructions that you can
>insert will have some loads or maybe even branches that will free up writeback
                                 ^^^^^^^^ pretty weird code!
>slots. Of course if you do not have enough independent instructions to
>completely hide the latency it does not matter which scheme you use because
>you will be waiting for the data anyway.

Good point.

... [Marvin explains the rationale behind the prioritization scheme for the 88100]

Thanks for giving us the "inside" point of view.
A variation of another arbitration scheme, described in Mike Johnson's dissertation "Super-Scalar Processor Design" (Stanford, 1989), could be used to avoid starvation of floating-point results. He simulated an arbiter that gives priority as follows:

- there are two arbiters, one for integer results and one for floating-point results.
- top priority is given to requests that have been active for more than one cycle (functional units make their request one cycle before they finish).
- priorities within each arbiter, from top to bottom:

	integer:			floating-point:
	  ALU				  add
	  shifter			  multiply
	  branch (return address)	  divide
	  loads				  convert

In this way, long waits are avoided since old requests are serviced first. Johnson claims that the added complexity of such an arbiter (compared to one that does not take time into account) is small, but of course no layout metrics are given...

_________________________________________________
Marc Tremblay
internet: marc@CS.UCLA.EDU
UUCP: ...!{uunet,ucbvax,rutgers}!cs.ucla.edu!marc
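The aging rule in that scheme can be sketched for the floating-point side as follows (the interface and the cycle-count encoding are an illustration of the idea as summarized above, not Johnson's implementation):

```python
# Sketch of an aging arbiter: requests that have already waited more
# than one cycle outrank any new request, so no unit starves; ties
# within each class are broken by a fixed per-arbiter ordering.
FP_ORDER = ["add", "multiply", "divide", "convert"]  # top to bottom

def fp_arbitrate(requests):
    """requests: dict mapping unit name -> cycles the request has
    been active (1 = made this cycle). Returns the winner or None."""
    old = [u for u in FP_ORDER if requests.get(u, 0) > 1]
    new = [u for u in FP_ORDER if requests.get(u, 0) == 1]
    if old:
        return old[0]   # aged requests always win
    return new[0] if new else None

# A divide that has waited two cycles beats a newly ready add, even
# though the add sits higher in the fixed ordering:
print(fp_arbitrate({"add": 1, "divide": 2}))  # -> divide
```

Compare this with the 88100's purely fixed ordering: here a long-latency result can be delayed by at most a bounded number of cycles before its aged request takes priority.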