mac@mrk.ardent.com (Michael McNamara) (02/12/89)
I like a machine with hardware interlocks for compatibility, and a compiler with instruction scheduling, so that code compiled with the newest compiler on the newest box runs as fast as possible. It may seem like sacrilege for the compiler not to rely completely on the expensive hardware interlocks you built, but faster code can be crafted by the compiler scheduling code that will rarely or never trigger a hardware interlock.* All the cycles an instruction is delayed by a hardware interlock are cycles in which the machine could be issuing other operations.

Then when the new box comes out, the old binary will run on the new machine and produce the same answers as before, but not as quickly as if the code were recompiled with the new compiler (or the old compiler with a new machine description table). The old binary will either 1) experience hardware interlocks due to a slower relative operation/memory latency/cache read/fill in the new machine, or 2) issue operations later than it could have, because the new machine's operation/memory/cache speed ratios differ from the old's.

If the compiler inserts only the architecturally required nops (empty branch/load delay slots), then delays due to 2) will be reduced; this is certainly a reasonable place for the compiler to take advantage of hardware interlocks. That is, the compiler should delay the issue of a data-dependent instruction only by moving other, non-data-dependent instruction(s) above it. If there isn't anything else to move before the instruction, DON'T insert nops; let the hardware interlock (scoreboard) this operation, so that a later, faster machine can run the same binary faster.

---------
* I observed first hand the benefits of code constructed by a compiler that really understood its machine while at Cydrome. The machine/compiler pair got 15 MFLOPS out of a peak 25 on 100x100 Linpack, an efficiency of 60%; it got 5.8 MFLOPS out of a peak 25 MFLOPS on the 24 Livermore Loops, 23% efficiency. Few other machines come close to these efficiencies. Of course, it takes a while to write a compiler that so completely understands a machine, and if you try to build both the compiler and the hardware as a startup company, you can run out of time. Other companies have been more successful by taking an academic research compiler and building a machine around that. [Hi Bob C.]

[disclaimer]
Michael McNamara
mac@ardent.com
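McNamara's policy above can be sketched as a toy list scheduler (purely illustrative, not Cydrome's or any real compiler's algorithm): try to move an independent instruction into a load's delay slot, and if none exists, leave the dependent instruction adjacent and let the hardware interlock rather than padding with a nop.

```python
# Toy delay-slot filler for a 1-cycle load delay (illustrative sketch).
# An instruction is (name, dest_register, set_of_source_registers).
# Policy: if the instruction after a load uses the loaded register, hoist a
# later independent instruction into the slot; otherwise change nothing and
# rely on the hardware interlock -- never insert a nop.

def independent(candidate, load, between):
    """Conservatively test whether `candidate` may move above `between`
    into the load's delay slot (no RAW, WAR, or WAW hazards)."""
    name, dest, srcs = candidate
    for _, odest, osrcs in [load] + between:
        if odest in srcs or dest in osrcs or dest == odest:
            return False
    return True

def fill_delay_slots(code):
    code = list(code)            # work on a copy
    i = 0
    while i < len(code) - 1:
        name, dest, srcs = code[i]
        if name.startswith("load") and dest in code[i + 1][2]:
            for j in range(i + 2, len(code)):
                if independent(code[j], code[i], code[i + 1:j]):
                    code.insert(i + 1, code.pop(j))   # hoist into the slot
                    break
        i += 1
    return code

before = [
    ("load r1", "r1", {"r9"}),   # r1 <- mem[r9]
    ("add r2",  "r2", {"r1"}),   # uses r1 immediately -> would stall
    ("sub r3",  "r3", {"r4"}),   # independent: can fill the slot
]
after = fill_delay_slots(before)
print([ins[0] for ins in after])   # the sub is hoisted between load and add
```

With no independent `sub` available, the code comes back unchanged: the dependent `add` stays next to the load and the hardware (or a one-cycle interlock on a later machine) covers the latency.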
mslater@cup.portal.com (Michael Z Slater) (02/13/89)
The MIPS processors (R2000 and R3000) are the only commercial uPs I'm aware of that are not fully interlocked; are there others? (Not counting delayed branches, of course, which everyone does.)

The MIPS architecture definition has one load delay slot. Processors that have longer load latency will simply require interlocks. John Hennessy contends that it will never make sense to build a processor with no load delay slot. As I understand it, his argument is that even with an on-chip cache, the register file will be faster to access than the cache, and if there is no delay slot, then the machine isn't running as fast as it could and would be better off with a faster clock and a load delay slot.

Anyone disagree? Will there be pipelined uPs that have no delay slot?

Michael Slater   mslater@cup.portal.com
tim@crackle.amd.com (Tim Olson) (02/14/89)
In article <14619@cup.portal.com> mslater@cup.portal.com (Michael Z Slater) writes:
| The MIPS processors (R2000 and R3000) are the only commercial uPs I'm aware
| of that are not fully interlocked; are there others? (Not counting delayed
| branches, of course, which everyone does.)
|
| The MIPS architecture definition has one load delay slot. Processors that
| have longer load latency will simply require interlocks. John Hennessy
| contends that it will never make sense to build a processor with no load
| delay slot. As I understand it, his argument is that even with on-chip cache,
| the register file will be faster to access than the cache, and if there is no
| delay slot, then the machine isn't running as fast as it could and would be
| better off with a faster clock and a load delay slot.
|
| Anyone disagree? Will there be pipelined uPs that have no delay slot?

If an on-chip I-cache can be built that will supply an instruction in a single cycle (which it *has* to, in order to run at 1 inst/cycle), why can't a D-cache with the same characteristics exist? If there is a load in the execute stage, then TLB translation can occur in parallel with the D-cache lookup, resulting in a value that can be forwarded to the ALU for use in the very next instruction.

A single delay slot, with good scheduling, still causes about a 5% to 6% pipeline stall (or equivalent nop execution), which could be reduced with a fast on-chip D-cache.

-- Tim Olson
Advanced Micro Devices
(tim@crackle.amd.com)
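Olson's 5-6% figure can be reconstructed with back-of-the-envelope arithmetic. The load frequency and unfilled-slot rate below are my own illustrative assumptions, not numbers from his post:

```python
# Rough model of the pipeline loss from one load delay slot.
# Both inputs are assumed, illustrative values.
load_fraction  = 0.22   # loads as a share of dynamic instructions (assumed)
unfilled_slots = 0.25   # delay slots the scheduler fails to fill (assumed)

stall_cycles_per_inst = load_fraction * unfilled_slots   # nop or stall cycle
effective_cpi = 1.0 + stall_cycles_per_inst
print(f"overhead = {stall_cycles_per_inst:.1%}, CPI = {effective_cpi:.3f}")
# overhead = 5.5%, CPI = 1.055 -- in the 5-6% range quoted above
```

A zero-delay D-cache would recover exactly this `stall_cycles_per_inst` term, which is the improvement Olson is pointing at.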
aglew@mcdurb.Urbana.Gould.COM (02/15/89)
>The MIPS processors (R2000 and R3000) are the only commercial uPs I'm aware
>of that are not fully interlocked; are there others? (Not counting delayed
>branches, of course, which everyone does.)

I have been told (by a MIPSco guy making a presentation) that the second-generation MIPS processor does in fact have some interlocks, in areas where the new implementation had longer latencies than the old. For example, the delay slot should, in fact, be 2 instructions, but they interlock at 1. I'm sure that somebody from MIPS will correct me.

My point is: the question is no longer whether your machine is entirely hardware interlocked or entirely software interlocked; it is, what combination of hardware and software interlocks gets the job done.
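The mixed hardware/software arrangement described above can be sketched with a tiny cycle-count model (my own illustration, not MIPSco's design): a binary is scheduled for one architectural delay slot, and when the implementation's load latency grows, a hardware scoreboard supplies the extra stall so the old binary still runs correctly, just not optimally.

```python
# Minimal in-order, single-issue model with hardware load interlocks.
# An instruction is (is_load, dest_register, set_of_source_registers).
# load_latency = cycles from issue until the loaded value is usable
# (2 => one delay slot's worth of latency, 3 => two, and so on).

def run(code, load_latency):
    """Return total cycles to issue straight-line `code`, stalling an
    instruction until all of its source registers are ready."""
    ready = {}   # register -> cycle its value becomes available
    clock = 0
    for is_load, dest, srcs in code:
        issue = max([clock] + [ready.get(r, 0) for r in srcs])  # interlock
        ready[dest] = issue + (load_latency if is_load else 1)
        clock = issue + 1        # one instruction issued per cycle
    return clock

# Old binary: compiler filled the single delay slot with independent work.
code = [
    (True,  "r1", {"r9"}),   # load r1 <- mem[r9]
    (False, "r3", {"r4"}),   # filler scheduled into the one delay slot
    (False, "r2", {"r1"}),   # consumer of the loaded value
]
print(run(code, load_latency=2))  # old machine: 3 cycles, no stall
print(run(code, load_latency=3))  # new machine: 4 cycles, 1 hardware stall
```

The same binary loses only one cycle on the longer-latency machine; recompiling (to find a second filler) would recover it, which is exactly the hardware-plus-software trade-off being argued for.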
mahar@weitek.UUCP (Mike Mahar) (02/15/89)
In article <14619@cup.portal.com> mslater@cup.portal.com (Michael Z Slater) writes:
>The MIPS processors (R2000 and R3000) are the only commercial uPs I'm aware
>of that are not fully interlocked; are there others? (Not counting delayed
>branches, of course, which everyone does.)

The Weitek XL processor doesn't have any interlocks. There is no delay slot on loads for this machine. The multiply and divide instructions take 6 and 16 cycles respectively, but there is still no interlock. The floating-point instructions take two or three cycles and may be pipelined.

Mike Mahar
--
"The bug is in the package somewhere."   | Mike Mahar
 - Anyone who has used Ada               | UUCP: {turtlevax, cae780}!weitek!mahar
earl@wright.mips.com (Earl Killian) (02/15/89)
In article <24435@amdcad.AMD.COM>, tim@crackle (Tim Olson) writes:
>If an on-chip I-cache can be built that will supply an instruction in a
>single-cycle (which it *has* to, in order to run at 1 inst/cycle), why
>can't a D-cache with the same characteristics exist? If there is a load
>in the execute stage, then TLB translation can occur in parallel with
>D-cache lookup, resulting in a value that can be forwarded to the ALU
>for use in the very next instruction.
>
>A single delay slot, with good scheduling, still causes about a 5% to 6%
>pipeline stall (or equivalent nop execution) which could be reduced with
>a fast on-chip D-cache.

You can easily build a data cache with the same latency as your instruction cache. But you need to provide an address to that data cache, and it is the latency of address formation + access that creates the 1-cycle minimum delay that John Hennessy referred to. Your statement is really only true in the context of the 29000 and similar machines, which have no address-add stage (addresses are simply the contents of a register), and not for the MIPS instruction set, where the address is formed from a base register plus a signed 16-bit displacement. This "feature" of the 29000 is unusual, and I think it is a mistake. You certainly can't use the fact that it is possible to implement a delayless 29000 load to justify putting load interlocks into the MIPS architecture!

I think Slater's question should have been "Will there ever be MIPS instruction set implementations that have no delay slot?" instead of "Will there be pipelined uPs that have no delay slot?" because the higher-level question was "What does MIPS lose by having a load delay slot instead of a load interlock?" I agree with Hennessy that the load delay slot will never cost MIPSco performance, except for a small increase in the I-cache miss rate.

--
UUCP: {ames,decwrl,prls,pyramid}!mips!earl
USPS: MIPS Computer Systems, 930 Arques Ave, Sunnyvale CA, 94086
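Killian's address-formation argument can be sketched with stage arithmetic. The five-stage numbering below (IF=0, RF=1, EX=2, MEM=3, WB=4) is the usual textbook layout, my assumption rather than anything stated in the post:

```python
# Why base+displacement addressing implies a load delay slot (sketch).
#
# MIPS-style load:  address = base + 16-bit displacement must be added in
#                   EX (stage 2), so the D-cache is read in MEM (stage 3).
# 29000-style load: the address is simply a register's contents, so the
#                   D-cache could in principle be read in EX (stage 2).

EX = 2  # stage in which an ALU consumer needs its operands

def load_delay_slots(cache_stage, consumer_ex_stage=EX):
    """Delay slots needed before a consumer can receive the loaded value,
    assuming the result is forwarded from the end of `cache_stage` into
    the start of the consumer's EX stage (full forwarding assumed)."""
    return max(0, cache_stage - consumer_ex_stage)

print(load_delay_slots(cache_stage=3))  # MIPS-style: 1 delay slot
print(load_delay_slots(cache_stage=2))  # 29000-style: 0 delay slots
```

Under these assumptions, the one-cycle minimum delay falls directly out of the extra address-add stage, independent of how fast the D-cache array itself is.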
andrew@frip.wv.tek.com (Andrew Klossner) (02/22/89)
[]

	"If an on-chip I-cache can be built that will supply an
	instruction in a single-cycle (which it *has* to, in order to
	run at 1 inst/cycle), why can't a D-cache with the same
	characteristics exist?"

The I-cache has a strong hint as to which instruction will next be fetched, and can have it ready. The D-cache has no such hint.

And, on many machines, even the I-cache will take an extra cycle to supply an instruction if you surprise it by taking an unexpected branch (or by not taking an expected branch).

  -=- Andrew Klossner   (uunet!tektronix!orca!frip!andrew)      [UUCP]
                        (andrew%frip.wv.tek.com@relay.cs.net)   [ARPA]
mdeale@polyslo.CalPoly.EDU (Hmmm) (02/24/89)
In article <11058@tekecs.GWD.TEK.COM> andrew@frip.wv.tek.com (Andrew Klossner) writes:
>>[]
>> "If an on-chip I-cache can be built that will supply an
>> instruction in a single-cycle (which it *has* to, in order to
>> run at 1 inst/cycle), why can't a D-cache with the same
>> characteristics exist?"
>
>The I-cache has a strong hint as to which instruction will next be
>fetched, and can have it ready. The D-cache has no such hint.

Take the Am29K, for example. In burst mode we do know what the next instruction address will be; pipeline mode too. However, the D bus also has these two modes.

>And, on many machines, even the I-cache will take an extra cycle to
>supply an instruction if you surprise it by taking an unexpected branch
>(or by not taking an expected branch).

This is one way to leave burst mode. But why can't I use some of those fast 10-12 ns SRAMs from Performance Semi. or Micron Tech.? [assuming the cache control can respond quickly enough]

Myron
#mdeale@polyslo.calpoly.edu
#don't collect, connect.