[net.arch] uP caches, cont'd.

johnl@haddock.UUCP (12/12/83)

#R:ucbesvax:27900003:haddock:9500003:000:1617
haddock!johnl    Dec 11 16:04:00 1983

I second M. Turner's remarks.  Cache memory isn't the same as registers.
There are speedup tricks people use on register files that don't make
sense for caches.  For example, the PDP-11/60 has two copies of the
register file, so that two register fetches can be active at once.  You
need to cycle both when storing a register, but register reads outnumber
writes enough in the PDP-11 that it's a win.  Storing the whole cache
twice (which makes it 1/2 as large as a singly stored cache) is not
likely to be a good idea, since you can't predict anywhere near as much
about memory use as you can about register use.

As another example, some machines with a regular register architecture
use this trick:  Instructions typically pick up one or two register
values, do something useful, and then store a value back.  To increase
instruction overlap, as the machine stores the result register from one
instruction, it checks whether the same register is used in the next
instruction and picks the value off the data paths as it's stored,
avoiding the register read.  The 360/91 and perhaps other machines
tried to do this to
memory references in general, but it's a lot more practical on a small
regular address space (the registers) than on a big wad of memory.

Finally, register addresses take fewer bits in an instruction than
general memory references, so instructions are smaller, so you can pick
up more of them with a given number of instruction fetches, lessening
interference between instruction and data access.

John Levine, decvax!yale-co!ima!johnl, ucbvax!cbosgd!ima!johnl,
{allegra|floyd|amd70}!ima!johnl, Levine@YALE.ARPA

andrew@orca.UUCP (Andrew Klossner) (12/12/83)

	"The 360/91 and perhaps other machines tried to do this to
	memory references in general ..."

The designers of the IBM 360/91 not only tried but succeeded.  A
standard assembler programming trick was to compress instruction loops
down to 16 bytes or less, so that the machine could keep the entire
loop in its internal registers and not have to fetch instructions from
memory.
This feature combined with the pipelining (multiple instructions
executing simultaneously) reduced the cycle time for many loops down to
the time for the "branch backward" instruction that terminated them.

Of course, pipelining can be a disaster in an educational environment.
For many years the 360/91 was the only machine available to novice
programmers, who had to deal with vague messages: "something was wrong
at approximately this point in your program ... the specific fault
could have been any one of many problems (odd addressing, write to
foreign memory, etc.)."  Many students quickly learned to terminate all
PL/I statements with two semicolons, rather than one; the "null
statement" between the two semicolons was translated into a single
NOOP, which caused the 360/91 to clear its pipeline, so that errors
could be isolated to a specific PL/I statement.

  -- Andrew Klossner   (decvax!tektronix!orca!andrew)      [UUCP]
                       (orca!andrew.tektronix@rand-relay)  [ARPA]

ucbesvax.turner@ucbcad.UUCP (12/15/83)

#N:ucbesvax:27900003:000:2645
ucbesvax!turner    Dec  8 12:23:00 1983

I don't like the idea of putting registers in an on-board cache memory
(and then translating register references to full memory addresses).
Some reasons why:

- it increases the amount of control logic required to interpret a
  register reference.  One must not only extract the reference, but
  add it to a full-address-space pointer and hand it through the
  cache-address translator.  As we will see below, this might mean
  serializing register access--requiring yet more control logic.

- one advantage of a true register file is that one can use dual-ported
  memory to gain speed by allowing overlapped fetches.  Making a whole
  cache (~256..~4K bytes) out of dual-ported memory would be rather
  expensive.  The only other way to achieve overlapped fetching in
  the cache would be to interleave the cache RAM--and that's only a
  statistical speed-up.  There will still be cases where register access
  must be serialized *unless* the interleave factor is equal to the number
  of registers.  This seems like a high cost to pay just to get register-
  to-register operations that are (nearly) as fast as they are in
  processors that don't map registers to memory.  One does NOT contort the
  design of a cache around the architecture!  In fact, I am in favor of
  quite the opposite, for the special case of single-chip microprocessors:
  violate the rule of transparency to the extent of adding instructions that
  address issues of control and optimization of caches, then contort the
  compiler (somewhat) around these instructions.

- runaway pointers can trash your whole context, making it very hard
  to debug programs with that problem.  Sure you could trap such
  accesses if they were inappropriate.  But again, that means clapping
  on some special frob to test for indirect addressing of register-
  mapped memory.  With a special supervisor control bit, perhaps, so
  that one *can* do it when one wants to.  And a partridge in a pear
  tree.  It all adds up.

Assuming that this discussion is concerned ONLY with the kind of cache
one puts on a single-chip microprocessor, I think people should realize
that you don't just say "oh, and let's add this".  On a chip, everything
steals something from everything else.  (In a TTL design, maybe you just
have to beef up the power supply a little to add new features.  Eventually
you run out of board space.  Bare silicon is a rather different medium.)

Don't let VLSI and its small packages fool you.  To take full advantage
of a million transistors on one die is going to be at least as hard as
designing Cray machines.
---
Michael Turner (ucbvax!ucbesvax.turner)