johnl@haddock.UUCP (12/12/83)
#R:ucbesvax:27900003:haddock:9500003:000:1617
haddock!johnl    Dec 11 16:04:00 1983

I second M. Turner's remarks.  Cache memory isn't the same as registers.
There are speedup tricks people use on register files that don't make sense
for caches.  For example, the PDP-11/60 has two copies of the register file,
so that two register fetches can be active at once.  You need to cycle both
copies when storing a register, but register reads outnumber writes enough
in the PDP-11 that it's a win.  Storing the whole cache twice (which makes
it 1/2 as large as a singly stored cache) is not likely to be a good idea,
since you can't predict anywhere near as much about memory use as you can
about register use.

As another example, some machines with a regular register architecture use
this trick: instructions typically pick up one or two register values, do
something useful, and then store a value back.  To increase instruction
overlap, as the machine stores the result register from one instruction, it
checks whether the same register is used in the next instruction and, if
so, picks the value off the data paths as it is stored, avoiding the
register read.  The 360/91 and perhaps other machines tried to do this for
memory references in general, but it's a lot more practical on a small
regular address space (the registers) than on a big wad of memory.

Finally, register addresses take fewer bits in an instruction than general
memory references, so instructions are smaller, so you can pick up more of
them with a given number of instruction fetches, lessening interference
between instruction and data access.

John Levine, decvax!yale-co!ima!johnl, ucbvax!cbosgd!ima!johnl,
{allegra|floyd|amd70}!ima!johnl, Levine@YALE.ARPA
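The result-forwarding trick described above can be sketched in a few lines.
This is a minimal model (hypothetical instruction format, not any real ISA):
as one instruction writes its result register, a read of that same register
by the next instruction is satisfied off the data path rather than by a
register-file read.

```python
def count_reads(instrs):
    """instrs: list of (dest_reg, src_regs) pairs, in program order.
    Returns (register_file_reads, forwarded_reads)."""
    regfile_reads = 0
    forwarded = 0
    last_written = None          # register written by the previous instruction
    for dest, srcs in instrs:
        for r in srcs:
            if r == last_written:
                forwarded += 1      # value taken off the data path as it is stored
            else:
                regfile_reads += 1  # ordinary register-file read cycle
        last_written = dest
    return regfile_reads, forwarded

# r1 = r2 op r3;  r4 = r1 op r5;  r6 = r4 op r1
prog = [(1, (2, 3)), (4, (1, 5)), (6, (4, 1))]
```

With this program, two of the six operand fetches come off the forwarding
path; the same bookkeeping applied to a big memory address space would need
a comparator against every in-flight store, which is why it pays off on a
small regular register set.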
andrew@orca.UUCP (Andrew Klossner) (12/12/83)
"The 360/91 and perhaps other machines tried to do this to memory
references in general ..."

The designers of the IBM 360/91 not only tried but succeeded.  A standard
assembler programming trick was to compress instruction loops down to 16
bytes or less, so that the machine could keep the entire loop in its
internal registers and not have to fetch it from memory.  This feature,
combined with the pipelining (multiple instructions executing
simultaneously), reduced the cycle time for many loops down to the time for
the "branch backward" instruction that terminated them.

Of course, pipelining can be a disaster in an educational environment.  For
many years the 360/91 was the only machine available to novice programmers,
who had to deal with vague messages: "something was wrong at approximately
this point in your program ... the specific fault could have been any one
of many problems (odd addressing, write to foreign memory, etc.)."  Many
students quickly learned to terminate all PL/I statements with two
semicolons rather than one; the "null statement" between the two semicolons
was translated into a single NOOP, which caused the 360/91 to clear its
pipeline, so that errors could be isolated to a specific PL/I statement.

-- Andrew Klossner   (decvax!tektronix!orca!andrew)      [UUCP]
                     (orca!andrew.tektronix@rand-relay)  [ARPA]
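The double-semicolon trick above can be illustrated with a toy model (this
is hypothetical bookkeeping, not the actual 360/91 interrupt machinery):
with several instructions in flight, a fault can only be blamed on one of
the last few instructions, but a pipeline-draining NOOP forces everything
before it to complete, narrowing the blame window to one statement.

```python
def blame_window(fault_pc, depth, drains=()):
    """Instruction indices the machine can blame for a fault reported at
    fault_pc, with up to `depth` instructions in flight.  `drains` lists
    indices of draining NOOPs, which force all earlier work to complete."""
    lo = fault_pc - depth + 1          # oldest instruction still in flight
    for d in drains:
        if d < fault_pc:               # everything up to the drain finished
            lo = max(lo, d + 1)
    return list(range(max(lo, 0), fault_pc + 1))

# Depth-4 pipeline: a fault at instruction 10 implicates 7..10,
# but a draining NOOP at 9 pins it to instruction 10 alone.
```

The students' extra semicolons played the role of the drain points: one per
statement, so every fault landed in a window of size one.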
ucbesvax.turner@ucbcad.UUCP (12/15/83)
#N:ucbesvax:27900003:000:2645
ucbesvax!turner    Dec 8 12:23:00 1983

I don't like the idea of putting registers in an on-board cache memory (and
then translating register references to full memory addresses).  Some
reasons why:

    - It increases the amount of control logic required to interpret a
      register reference.  One must not only extract the reference, but add
      it to a full-address-space pointer and hand it through the
      cache-address translator.  As we will see below, this might involve
      serializing register access--involving yet more control logic.

    - One advantage of a true register file is that one can use dual-ported
      memory to gain speed by allowing overlapped fetches.  Making a whole
      cache (~256..~4K bytes) out of dual-ported memory would be rather
      expensive.  The only other way to achieve overlapped fetching in the
      cache would be to interleave the cache RAM--and that's only a
      statistical speed-up.  There will still be cases where register
      access must be serialized *unless* the interleave factor is equal to
      the number of registers.  This seems like a high cost to pay just to
      get register-to-register operations that are (nearly) as fast as they
      are in processors that don't map registers to memory.  One does NOT
      contort the design of a cache around the architecture!  In fact, I am
      in favor of quite the opposite, for the special case of single-chip
      microprocessors: violate the rule of transparency to the extent of
      adding instructions that address issues of control and optimization
      of caches, then contort the compiler (somewhat) around these
      instructions.

    - Runaway pointers can trash your whole context, making it very hard
      to debug programs with that problem.  Sure, you could trap such
      accesses if they were inappropriate.  But again, that means clapping
      on some special frob to test for indirect addressing of
      register-mapped memory.  With a special supervisor control bit,
      perhaps, so that one *can* do it when one wants to.  And a partridge
      in a pear tree.  It all adds up.
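The interleaving point above can be made concrete with a rough model
(assumptions of mine: registers mapped to consecutive cache addresses, and
each bank able to service one access per cycle): two simultaneous register
reads serialize whenever their addresses fall in the same bank, so the
speed-up is only statistical unless the interleave factor reaches the
number of registers.

```python
def cycles_for_pair(addr_a, addr_b, n_banks):
    """Cycles to service two simultaneous reads under n_banks-way
    interleaving: 1 if they hit different banks, 2 if they collide."""
    return 1 if addr_a % n_banks != addr_b % n_banks else 2

# 16 registers mapped to addresses 0..15:
#   4-way interleave: r0 and r4 land in the same bank and serialize;
#   16-way interleave (one bank per register) never collides.
```

True dual-ported memory corresponds to always returning 1 cycle, which is
exactly what the register-to-cache mapping gives up unless the whole cache
is built that way.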
Assuming that this discussion is concerned ONLY with the kind of cache one
puts on a single-chip microprocessor, I think people should realize that
you don't just say "oh, and let's add this".  On a chip, everything steals
something from everything else.  (In a TTL design, maybe you just have to
beef up the power supply a little to add new features.  Eventually you run
out of board space.  Bare silicon is a rather different medium.)

Don't let VLSI and its small packages fool you.  To take full advantage of
a million transistors on one die is going to be at least as hard as
designing Cray machines.
---
Michael Turner (ucbvax!ucbesvax.turner)