[comp.arch] [HS]W interlocks

mac@mrk.ardent.com (Michael McNamara) (02/12/89)

	I like a machine with hardware interlocks for compatibility,
and a compiler with instruction scheduling so that code compiled with
the newest compiler on the newest box runs as fast as possible.

	It may seem like sacrilege for the compiler not to rely
completely on the expensive hardware interlocks you built, but the
compiler can craft faster code by scheduling instructions so that they
rarely or never trigger a hardware interlock.*  Every cycle that an
instruction is delayed by a hardware interlock is a cycle the machine
could have spent issuing other operations.

	Then when the new box comes out, the old binary will run on
the new machine and generate the same answers as before, but not as
quickly as if the code were recompiled with the new compiler (or the
old compiler with a new machine description table). The old binary will
either 1) experience hardware interlocks because the new machine's
relative operation/memory/cache latencies are slower, or 2) issue
operations later than it could have, because the new machine's
operation/memory/cache speed ratios differ from the old's.

	If the compiler inserts only the architecturally required nops
(empty branch/load delay slots), then delays due to 2) will be reduced;
this is certainly a reasonable place for the compiler to take advantage
of hardware interlocks.  That is, the compiler should delay the
issuance of a data-dependent instruction only by moving other, NON
data-dependent instruction(s) above it. If there isn't anything else
to move before the instruction, DON'T insert nops; let the hardware
interlock scoreboard the operation, so that a later, faster machine
can run the same binary faster.
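The discipline above can be sketched as a toy scheduling pass (a
hypothetical illustration only, not the actual compiler discussed; the
tuple encoding and register names are made up):

```python
# Toy delay-slot filler: if the instruction after a load uses the load
# result, try to hoist a later, independent instruction into the slot.
# If nothing can legally move, leave the code alone -- no nop -- and
# let a hardware interlock cover the hazard.

def fill_load_delay_slots(instrs):
    """instrs: list of (op, dest, srcs) tuples in program order."""
    out = list(instrs)
    i = 0
    while i < len(out) - 1:
        op, dest, srcs = out[i]
        nxt = out[i + 1]
        if op == "load" and dest in nxt[2]:        # next instr uses load result
            for j in range(i + 2, len(out)):
                cand = out[j]
                hopped = out[i + 1:j]              # instrs the candidate moves over
                if (dest not in cand[2]            # candidate independent of load
                        and all(cand[1] not in m[2]    # no anti-dependence
                                and m[1] not in cand[2]  # no true dependence
                                and cand[1] != m[1]      # no output dependence
                                for m in hopped)):
                    out.insert(i + 1, out.pop(j))  # hoist into the delay slot
                    break
            # else: leave adjacent; rely on the interlock, not a nop
        i += 1
    return out
```

For example, an independent `sub` after a `load`/`add` pair gets hoisted
between them; if no independent instruction exists, the sequence is
returned unchanged rather than padded with nops.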

---------
	* While at Cydrome, I observed first hand the benefits of code
constructed by a compiler that really understood its machine.  The
machine/compiler pair got 15 MFLOPS out of a peak 25 on 100x100
Linpack, an efficiency of 60%; it got 5.8 MFLOPS out of a peak 25
MFLOPS on the 24 Livermore Loops, 23% efficiency.  Few other machines
come close to these efficiencies.
	Of course, it takes a while to write a compiler that so
completely understands a machine, and if you try to build both the
compiler and the hardware as a startup company, you can run out of
time.  Other companies have been more successful by taking an academic
research compiler and building a machine around it. [Hi Bob C.]

[disclaimer]
Michael McNamara 
  mac@ardent.com

mslater@cup.portal.com (Michael Z Slater) (02/13/89)

The MIPS processors (R2000 and R3000) are the only commercial uPs I'm aware
of that are not fully interlocked; are there others?  (Not counting delayed
branches, of course, which everyone does.)

The MIPS architecture definition has one load delay slot.  Processors that
have longer load latency will simply require interlocks.  John Hennessy
contends that it will never make sense to build a processor with no load
delay slot.  As I understand it, his argument is that even with on-chip cache,
the register file will be faster to access than the cache, and if there is no
delay slot, then the machine isn't running as fast as it could and would be
better off with a faster clock and a load delay slot.

Anyone disagree?  Will there be pipelined uPs that have no delay slot?

Michael Slater    mslater@cup.portal.com

tim@crackle.amd.com (Tim Olson) (02/14/89)

In article <14619@cup.portal.com> mslater@cup.portal.com (Michael Z Slater) writes:
| The MIPS processors (R2000 and R3000) are the only commercial uPs I'm aware
| of that are not fully interlocked; are there others?  (Not counting delayed
| branches, of course, which everyone does.)
| 
| The MIPS architecture definition has one load delay slot.  Processors that
| have longer load latency will simply require interlocks.  John Hennessy
| contends that it will never make sense to build a processor with no load
| delay slot.  As I understand it, his argument is that even with on-chip cache,
| the register file will be faster to access than the cache, and if there is no
| delay slot, then the machine isn't running as fast as it could and would be
| better off with a faster clock and a load delay slot.
| 
| Anyone disagree?  Will there be pipelined uPs that have no delay slot?

If an on-chip I-cache can be built that will supply an instruction in a
single-cycle (which it *has* to, in order to run at 1 inst/cycle), why
can't a D-cache with the same characteristics exist?  If there is a load
in the execute stage, then TLB translation can occur in parallel with
D-cache lookup, resulting in a value that can be forwarded to the ALU
for use in the very next instruction.

A single delay slot, with good scheduling, still causes about a 5% to 6%
pipeline stall (or equivalent nop execution) which could be reduced with
a fast on-chip D-cache. 
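The size of that penalty falls out of simple CPI arithmetic (the load
frequency and unfilled-slot fraction below are assumed, illustrative
numbers chosen to land in Tim's 5-6% range, not measurements from the
article):

```python
# Back-of-the-envelope cost of a one-cycle load delay slot.
load_freq    = 0.20   # fraction of dynamic instructions that are loads (assumed)
unfilled     = 0.28   # fraction of delay slots the scheduler can't fill (assumed)
stall_cycles = 1      # one-cycle load delay

cpi_penalty = load_freq * unfilled * stall_cycles
print(f"extra CPI: {cpi_penalty:.3f} "
      f"({cpi_penalty * 100:.1f}% slowdown at a base CPI of 1)")
```

With these numbers the penalty is 0.056 CPI, i.e. about the 5-6%
figure quoted above; a delayless load would recover it.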


	-- Tim Olson
	Advanced Micro Devices
	(tim@crackle.amd.com)

aglew@mcdurb.Urbana.Gould.COM (02/15/89)

>The MIPS processors (R2000 and R3000) are the only commercial uPs I'm aware
>of that are not fully interlocked; are there others?  (Not counting delayed
>branches, of course, which everyone does.)

I have been told (by a MIPSco guy making a presentation) that the second
generation MIPS processor does in fact have some interlocks, in areas
where the new implementation had longer latencies than the old.
For example, the delay slot should, in fact, be 2 instructions,
but they interlock at 1.

I'm sure that somebody from MIPS will correct me.

My point is: the question is no longer whether your machine is entirely
hardware interlocked, or entirely software interlocked;
it is, what combination of hardware and software interlocks
gets the job done.

mahar@weitek.UUCP (Mike Mahar) (02/15/89)

In article <14619@cup.portal.com> mslater@cup.portal.com (Michael Z Slater) writes:
>The MIPS processors (R2000 and R3000) are the only commercial uPs I'm aware
>of that are not fully interlocked; are there others?  (Not counting delayed
>branches, of course, which everyone does.)
>
The Weitek XL processor doesn't have any interlocks.  There is no delay slot
on loads for this machine.  The multiply and divide instructions take 6 and 16
cycles respectively, but there is still no interlock.  The floating-point
instructions take two or three cycles and may be pipelined.
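On a machine like that, the burden of the interlock moves entirely into
software: the compiler must never schedule a read of a result before its
latency has elapsed.  A minimal checker for that discipline might look
like this (a hypothetical sketch, assuming one instruction issues per
cycle; the multiply/divide latencies come from the post above, the
others are assumed):

```python
# Software-interlock checker: with no hardware interlocks, reading a
# result before its producer's latency has elapsed yields garbage,
# so the schedule itself must respect the latencies.

LATENCY = {"mul": 6, "div": 16, "load": 1, "add": 1}

def check_schedule(instrs):
    """instrs: list of (op, dest, srcs); returns a list of hazard messages."""
    ready = {}                       # register -> cycle its value is valid
    hazards = []
    for cycle, (op, dest, srcs) in enumerate(instrs):
        for s in srcs:
            if ready.get(s, 0) > cycle:
                hazards.append(f"cycle {cycle}: {op} reads {s} "
                               f"{ready[s] - cycle} cycle(s) too early")
        ready[dest] = cycle + LATENCY[op]
    return hazards
```

An `add` issued right after a 6-cycle `mul` of its operand is flagged;
pad the gap with five independent instructions and the schedule checks
clean.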

			Mike Mahar
-- 
"The bug is in the package somewhere". |	Mike Mahar
 - Anyone who has used Ada	       | UUCP: {turtlevax, cae780}!weitek!mahar

earl@wright.mips.com (Earl Killian) (02/15/89)

In article <24435@amdcad.AMD.COM>, tim@crackle (Tim Olson) writes:
>If an on-chip I-cache can be built that will supply an instruction in a
>single-cycle (which it *has* to, in order to run at 1 inst/cycle), why
>can't a D-cache with the same characteristics exist?  If there is a load
>in the execute stage, then TLB translation can occur in parallel with
>D-cache lookup, resulting in a value that can be forwarded to the ALU
>for use in the very next instruction.
>
>A single delay slot, with good scheduling, still causes about a 5% to 6%
>pipeline stall (or equivalent nop execution) which could be reduced with
>a fast on-chip D-cache. 

You can easily build a data cache with the same latency as your
instruction cache.  But you need to provide an address to that data
cache, and it is the latency of the address formation + access that
creates the 1-cycle minimum delay that John Hennessy referred to.

Your statement is really only true in the context of the 29000 and
similar machines, which have no address add stage (addresses are simply
the contents of a register), and not for the MIPS instruction set, where
the address is formed from a base register plus a signed 16-bit
displacement.  This "feature" of the 29000 is unusual, and I think it is a
mistake.  You certainly can't use the fact that it is possible to implement a
delayless 29000 load to justify putting load interlocks into the MIPS
architecture!
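Earl's point reduces to simple stage arithmetic (all stage timings
below are assumed, illustrative unit latencies, not actual R2000 or
29000 numbers):

```python
# Why base+displacement addressing implies at least one load delay
# cycle: the address add serializes between the register read and the
# cache access, whereas a register-indirect load can start the cache
# access immediately after the register read.
regfile_read = 1   # cycles (assumed)
addr_add     = 1   # cycles (assumed)
cache_access = 1   # cycles (assumed)

reg_indirect   = regfile_read + cache_access              # 29000-style load
base_plus_disp = regfile_read + addr_add + cache_access   # MIPS-style load

print(base_plus_disp - reg_indirect)   # the minimum extra load latency
```

With unit stage times the difference is exactly one cycle, which is the
1-cycle minimum delay attributed to Hennessy above.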

I think Slater's question should have been "Will there ever be MIPS
instruction set implementations that have no delay slot?" instead of
"Will there be pipelined uPs that have no delay slot?" because the
higher-level question was "What does MIPS lose by having a load delay
slot instead of a load interlock?".  I agree with Hennessy that the load
delay slot will never cost MIPSco performance, except for a small
increase in the I-cache miss rate.
--
UUCP: {ames,decwrl,prls,pyramid}!mips!earl
USPS: MIPS Computer Systems, 930 Arques Ave, Sunnyvale CA, 94086

andrew@frip.wv.tek.com (Andrew Klossner) (02/22/89)

[]

	"If an on-chip I-cache can be built that will supply an
	instruction in a single-cycle (which it *has* to, in order to
	run at 1 inst/cycle), why can't a D-cache with the same
	characteristics exist?"

The I-cache has a strong hint as to which instruction will next be
fetched, and can have it ready.  The D-cache has no such hint.

And, on many machines, even the I-cache will take an extra cycle to
supply an instruction if you surprise it by taking an unexpected branch
(or by not taking an expected branch).

  -=- Andrew Klossner   (uunet!tektronix!orca!frip!andrew)      [UUCP]
                        (andrew%frip.wv.tek.com@relay.cs.net)   [ARPA]

mdeale@polyslo.CalPoly.EDU (Hmmm) (02/24/89)

In article <11058@tekecs.GWD.TEK.COM> andrew@frip.wv.tek.com (Andrew Klossner) writes:
>>[]
>>	"If an on-chip I-cache can be built that will supply an
>>	instruction in a single-cycle (which it *has* to, in order to
>>	run at 1 inst/cycle), why can't a D-cache with the same
>>	characteristics exist?"
>
>The I-cache has a strong hint as to which instruction will next be
>fetched, and can have it ready.  The D-cache has no such hint.

   Take the Am29K, for example. In Burst mode we do know what the
next instruction address will be; Pipeline mode too. And the
D bus also has these two modes.

>And, on many machines, even the I-cache will take an extra cycle to
>supply an instruction if you surprise it by taking an unexpected branch
>(or by not taking an expected branch).

   This is one way to leave Burst mode. But why can't I use some of
those fast 10-12ns SRAMs from Performance Semi. or Micron Tech.?
[assuming the cache control can respond quickly enough]

>  -=- Andrew Klossner   (uunet!tektronix!orca!frip!andrew)      [UUCP]
>                        (andrew%frip.wv.tek.com@relay.cs.net)   [ARPA]


Myron
#mdeale@polyslo.calpoly.edu
#don't collect, connect.