[net.arch] MC68030 Cache Organization

aglew@ccvaxa.UUCP (09/30/86)

Motorola 68030 Cache Organization
---------------------------------

Can someone explain to me why the change in cache organization between
the 68020 and the 68030 is such a win? I don't need numbers; I'd just like
a rationale that explains the mechanism.

NB. I'm not talking about the separate address/data lines to the I-cache
- that's obviously an improvement. What I'm referring to is the comment in
_Electronics_ that goes like this:

    To improve the likelihood of cache hits, Motorola is also reorganizing
    the 256-byte instruction cache into 16 entries of four long words each
    with 4 bytes per word. The 68020 instruction cache consists of 64 entries
    each of one long word... The reorganized instruction cache, along with
    the new burst mode addressing methods, should double the cache hit ratio
    and reduce the number of times the 68030 must access the system bus.

First off, reducing the number of entries that can be independently
associated seems to be a loss, not a win. But have they changed the cache
structure - is it fully associative now, where it wasn't before? Or maybe
they just needed fewer entries so that they could do the lookup fast enough
for a 1-cycle access with the separate A/D buses.
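
To make the comparison concrete, here's how I picture a 32-bit address
being carved up under the two organizations - assuming both caches are
direct-mapped, which the article doesn't actually say:

    /* Sketch: how a 32-bit address splits up under the two organizations.
     * Direct mapping is my assumption, not a Motorola statement. */
    #include <stdio.h>

    int main(void)
    {
        unsigned long addr = 0x00012A36UL;

        /* 68020-style: 64 entries of one long word (4 bytes) each */
        unsigned idx20 = (addr >> 2) & 0x3F;    /* bits 2-7 pick the entry */
        unsigned long tag20 = addr >> 8;        /* everything above is tag */

        /* 68030-style: 16 entries of four long words (16 bytes) each */
        unsigned off30 = addr & 0xF;            /* byte within the line    */
        unsigned idx30 = (addr >> 4) & 0xF;     /* bits 4-7 pick the entry */
        unsigned long tag30 = addr >> 8;        /* same tag width as above */

        printf("'020: entry %u, tag %lX\n", idx20, tag20);
        printf("'030: entry %u, byte %u of line, tag %lX\n",
               idx30, off30, tag30);
        return 0;
    }

Either way the tag is the same width; what changes is how many distinct
addresses fight over a single entry.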

Do the entries have to be strictly aligned on a 16-byte boundary, or can
they be skewed? I'd suspect the former. If so, there will be an advantage
to aligning the tops of your inner loops on 16-byte boundaries. NOPs,
anybody?
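
The padding arithmetic an assembler would need is trivial; pad_to_line()
below is a made-up helper, purely to illustrate it:

    /* Hypothetical helper: how many NOP bytes to emit so the next
     * instruction lands on a 16-byte cache-line boundary. */
    #include <stdio.h>

    static unsigned pad_to_line(unsigned long pc, unsigned line_size)
    {
        unsigned long rem = pc % line_size;            /* offset in line */
        return rem ? (unsigned)(line_size - rem) : 0;  /* 0 if aligned   */
    }

    int main(void)
    {
        printf("loop top at 0x103A needs %u pad bytes\n",
               pad_to_line(0x103AUL, 16));
        printf("loop top at 0x1040 needs %u pad bytes\n",
               pad_to_line(0x1040UL, 16));
        return 0;
    }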

Why the emphasis on "four long words each with 4 bytes per word"? I assume
that the four long words reflect how the cache line is filled, by a
modulo-4 burst-mode memory access. That's probably one of the big
advantages of this cache organization: it doesn't increase the cache hit
ratio so much as decrease the time needed to make good a cache miss, so
that you can get back to work sooner. Also, if you are accessing memory
sequentially, as you do in a linear instruction stream, a burst-mode line
fill may already have fetched the next word before the processor asks for
it. Without prefetching you'd take another miss, and even with prefetching
over a slower memory interface the word might not be ready in time.
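
My reading of "Modulo 4" is that the fill wraps around within the line,
starting at the long word that actually missed - an assumption on my part,
not something the article spells out:

    /* Assumed modulo-4 fill order: the missed long word arrives first,
     * then the rest of the line wraps around. */
    #include <stdio.h>

    int main(void)
    {
        unsigned first = 2;    /* say the processor missed on long word 2 */
        for (unsigned i = 0; i < 4; i++)
            printf("burst beat %u fills long word %u\n", i, (first + i) % 4);
        return 0;
    }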

The emphasis on `bytes' in the instruction cache probably means that it is
easier for the execution unit to pull funny-sized instructions out of the
cache. Ahh, the joys of variable-length instruction sets!

The orientation toward longer lines, filled faster by burst mode, is
probably a good thing for an instruction cache, but one wonders whether it
is as good for a data cache. It probably is for floating-point numbers,
which by themselves can fill up a cache line, or for matrix processing or
graphics, where you do a lot of sequential access to data; but it may not
be so good for systems that make a lot of pointer accesses to random fields
in structures, picking out, say, only one byte from every cache line
filled. Could Motorola have given us a 64-entry, one-word-per-line data
cache, like the 68020's instruction cache?
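
To put rough numbers on that worry, here's a toy direct-mapped cache model;
the access patterns and counts are invented for illustration, not
measurements of anything real:

    /* Toy direct-mapped cache model to make the line-size tradeoff
     * concrete.  A miss fills the whole line, so "bytes" counts the bus
     * traffic each organization would generate. */
    #include <stdio.h>

    #define MAX_LINES 64

    static unsigned run(const unsigned long *addr, unsigned n,
                        unsigned nlines, unsigned line_size, unsigned *bytes)
    {
        unsigned long tag[MAX_LINES];
        int valid[MAX_LINES] = { 0 };
        unsigned miss = 0;

        for (unsigned i = 0; i < n; i++) {
            unsigned long line = addr[i] / line_size;
            unsigned idx = (unsigned)(line % nlines);
            if (!valid[idx] || tag[idx] != line) {  /* miss: fill the line */
                valid[idx] = 1;
                tag[idx] = line;
                miss++;
            }
        }
        *bytes = miss * line_size;
        return miss;
    }

    int main(void)
    {
        unsigned long seq[256], ptr[256];
        unsigned bytes, miss;

        for (unsigned i = 0; i < 256; i++) {
            seq[i] = i;          /* march through memory a byte at a time */
            ptr[i] = i * 40UL;   /* hop between 40-byte structures        */
        }

        miss = run(seq, 256, 64, 4, &bytes);
        printf("sequential, 64 x  4B: %3u misses, %4u bytes\n", miss, bytes);
        miss = run(seq, 256, 16, 16, &bytes);
        printf("sequential, 16 x 16B: %3u misses, %4u bytes\n", miss, bytes);
        miss = run(ptr, 256, 64, 4, &bytes);
        printf("pointerish, 64 x  4B: %3u misses, %4u bytes\n", miss, bytes);
        miss = run(ptr, 256, 16, 16, &bytes);
        printf("pointerish, 16 x 16B: %3u misses, %4u bytes\n", miss, bytes);
        return 0;
    }

In the sequential case the long lines cut the miss count without costing
any extra bus traffic; in the pointer-chasing case the miss count doesn't
change, but four times as many bytes cross the bus for the same work.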

(Oh, another thing: TLB address translation is done in parallel with
cache access. Does this mean that the cache is virtual? Does it do 
invalidations according to physical addresses off the external bus,
or what?)

Summing up, I see these as the tradeoffs that came into the 68030 cache:
    LOSS    fewer independent entries
    GAIN    faster association on the fewer entries?
    GAIN    faster filling using burst mode
    Longer cache lines:
        GAIN    for instructions
        GAIN    for numerical and sequentially accessed data
        LOSS    for pointer/structure-oriented programs?

Am I missing or confused about anything? 

Andy "Krazy" Glew. Gould CSD-Urbana.    USEnet:  ihnp4!uiucdcs!ccvaxa!aglew
1101 E. University, Urbana, IL 61801    ARPAnet: aglew@gswd-vms

simoni@Shasta.STANFORD.EDU (Richard Simoni) (10/02/86)

In article <5100146@ccvaxa> aglew@ccvaxa.UUCP writes:
>
>Motorola 68030 Cache Organization
>---------------------------------
>Oh, another thing: TLB address translation is done in parallel with
>cache access. Does this mean that the cache is virtual?

This doesn't necessarily follow.  Address translation is often done in
parallel with cache access by using only the low-order bits of the virtual
address (i.e., the bits that indicate the offset within the page) to address
the cache.  This is possible because these offset bits do not change in
the virtual-to-physical mapping.  When the cache access is complete, the
tag (which is a physical page number) is compared with the result of the
address translation (which happened in parallel with the cache access) to
see if a hit occurred in the cache.
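
In code form, the scheme looks roughly like this - a sketch only, with the
page size, line size, and the tlb_lookup() stub all made up for
illustration rather than taken from the 68030:

    /* Sketch of a physically-tagged cache indexed only by page-offset
     * bits, so the TLB lookup can proceed in parallel with the cache
     * lookup.  Sizes here are assumptions for illustration. */
    #include <stdio.h>
    #include <stdint.h>

    #define PAGE_BITS 12                    /* assume 4 KB pages         */
    #define LINE_BITS  4                    /* assume a 16-byte line     */
    #define NSETS (1u << (PAGE_BITS - LINE_BITS))

    struct line { uint32_t phys_tag; int valid; };
    static struct line cache[NSETS];

    /* Stand-in for the MMU: pretend virtual page N maps to physical
     * page N + 0x100.  A real TLB would miss and walk tables sometimes. */
    static uint32_t tlb_lookup(uint32_t vpn) { return vpn + 0x100; }

    static int cache_hit(uint32_t vaddr)
    {
        /* Index with bits that don't change under translation ...       */
        uint32_t set = (vaddr >> LINE_BITS) & (NSETS - 1);
        /* ... while the virtual page number is translated in parallel.  */
        uint32_t ppn = tlb_lookup(vaddr >> PAGE_BITS);
        /* The stored tag is a physical page number; compare it once
         * both lookups are done. */
        return cache[set].valid && cache[set].phys_tag == ppn;
    }

    int main(void)
    {
        uint32_t va = 0x00012A30;
        uint32_t set = (va >> LINE_BITS) & (NSETS - 1);
        cache[set].valid = 1;                          /* prime one line */
        cache[set].phys_tag = tlb_lookup(va >> PAGE_BITS);
        printf("hit on 0x%X? %s\n", (unsigned)va,
               cache_hit(va) ? "yes" : "no");
        return 0;
    }

The key point is that the set index never uses a bit above the page
offset, which is exactly why the lookup and the translation can overlap.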

The problem with this scheme is that it can be difficult to build a large
cache since the page size limits the number of bits that can be used to
address the cache.  The size of the cache can be increased by making the
cache set-associative and/or by increasing the page size (thereby
increasing the number of bits that can address the cache).  Of course, an
on-chip cache (as in the 68030 case) will not be very large, anyway.
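
Putting numbers on that limit (my numbers, just to illustrate):

    /* If the cache must be indexed entirely by page-offset bits, then
     * cache size <= page size * associativity.  4 KB pages assumed. */
    #include <stdio.h>

    int main(void)
    {
        unsigned page_size = 4096;
        unsigned ways[] = { 1, 2, 4, 8 };

        for (unsigned i = 0; i < 4; i++)
            printf("%u-way: at most %u KB indexable before translation\n",
                   ways[i], page_size * ways[i] / 1024);
        return 0;
    }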

Rich Simoni
Center for Integrated Systems
Stanford University
simoni@sonoma.stanford.edu
...!decwrl!glacier!shasta!simoni

johnl@ima.UUCP (John R. Levine) (10/04/86)

In article <5100146@ccvaxa> aglew@ccvaxa.UUCP writes:
>
>Motorola 68030 Cache Organization
>---------------------------------
>
>Can someone explain to me why the change in cache organization between
>the 68020 and the 68030 is such a win? ...
>
>    To improve the likelihood of cache hits, Motorola is also reorganizing
>    the 256-byte instruction cache into 16 entries of four long words each
>    with 4 bytes per word. The 68020 instruction cache consists of 64 entries
>    each of one long word... The reorganized instruction cache, along with
>    the new burst mode addressing methods, should double the cache hit ratio
>    and reduce the number of times the 68030 must access the system bus.

According to an article in _Digital Design_, the big win with this kind of
cache design is that it takes advantage of nibble-mode RAM chips that can
cycle four sequential bits out very fast. That means you can get four times
the data in a bus transaction in much less than four times the time. Since
much read access is sequential anyway (instruction execution, or scanning a
string or a table), it's a big win.
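
The arithmetic behind that looks something like this - cycle counts
invented for illustration, not taken from any data sheet:

    /* Compare four separate RAM accesses against one nibble-mode burst.
     * Cycle counts are invented, purely to show the shape of the win. */
    #include <stdio.h>

    int main(void)
    {
        unsigned first  = 4;  /* assumed cycles for a full random access */
        unsigned nibble = 1;  /* assumed cycles for each later nibble    */

        printf("four separate accesses: %u cycles\n", 4 * first);
        printf("one 4-beat burst:       %u cycles\n", first + 3 * nibble);
        return 0;
    }
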
-- 
John R. Levine, Javelin Software Corp., Cambridge MA +1 617 494 1400
{ ihnp4 | decvax | cbosgd | harvard | yale }!ima!johnl, Levine@YALE.EDU
The opinions expressed herein are solely those of a 12-year-old hacker
who has broken into my account and not those of any person or organization.