[comp.arch] SPARC and multiprocessing

jeff@Alliant.COM (Jeff Collins) (04/29/88)

	Here's an interesting question for the SPARC gurus of the world:

	Given that the SPARC must use a virtual cache to get optimal
	performance, how does one build a multiprocessor with a SPARC?

	As far as I know, no one has solved the virtual cache coherency
	problem yet...


	

guy@gorodish.Sun.COM (Guy Harris) (04/29/88)

> 	Given that the SPARC must use a virtual cache to get optimal
> 	performance, how does one build a multiprocessor with a SPARC?

Excuse me?  Where is it "given that the SPARC must use a virtual cache to get
optimal performance?"  It is the case that the SPARC requires some form of very
fast memory to get optimal performance, but that's true of a hell of a lot of
machines these days.  The virtual cache permits you to bypass the MMU on a
cache hit, but I don't know that this is *required* for SPARC - or for any of a
number of other microprocessor architectures.

It may be the case that with the current SPARC implementations, with no on-chip
MMU, that it's easier to get high performance with a virtual cache than with a
physical cache, or even that you can't get optimal performance *for those
implementations* with a physical cache (although I suspect the latter is not
true).  This certainly doesn't say that the SPARC *architecture* requires a
virtual cache.

Another way of putting this is that I have no particular reason to believe that
all high-end SPARC machines built by Sun will have virtual caches (it is
already the case that not all SPARC machines built by Sun have virtual caches;
the 4/110 SCRAM memory acts more like a physical cache than a virtual one).

gert@prls.UUCP (Gert Slavenburg) (04/29/88)

> As far as I know, no one has solved the virtual cache coherency
> problem yet...

Coherency can be maintained in a virtual address cache in exactly the same
way as it is maintained in any multiprocessor: pick your favorite 'bus watch'
style multiprocessor consistency protocol.  Now use this protocol at the
back (bus) end of a virtual address cache, where requests go out in terms
of physical addresses, applying it TO YOURSELF AS WELL AS TO OTHERS.  For
example, if you have an ownership-like protocol, acquiring ownership
over a main memory slot (the memory unit that goes into a cache line)
requires 'negotiation' with other caches that hold a copy of this slot,
including yourself if you hold such copies under different virtual address
names.

In order to implement this, the virtually addressed cache needs to be able
to 'observe' (and/or 'interact with', in some ownership schemes) bus
transactions that occur on those physical addresses of which this cache
holds copies.  Again, this can be done in a variety of ways, either involving
a double set of tags or some form of reverse address translation (this may
be a one-to-many mapping).
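
To make the 'double set of tags' option concrete, here is a toy C sketch;
all sizes and names are invented, and the cache is kept one page deep so
the virtual and physical index bits coincide.  A real ownership protocol
needs more state than a single valid bit, but the tag plumbing is the same.

    #include <stdint.h>

    #define LINE_BYTES  32
    #define NUM_LINES   128                    /* 4 KB toy cache, one page deep */
    #define TAG(a)      ((a) / (LINE_BYTES * NUM_LINES))
    #define INDEX(a)    (((a) / LINE_BYTES) % NUM_LINES)

    struct line {
        int      valid;
        uint32_t vtag;                         /* used by the processor   */
        uint32_t ptag;                         /* used by the bus watcher */
        uint8_t  data[LINE_BYTES];
    };

    static struct line cache[NUM_LINES];

    /* Processor side: hit/miss is decided on virtual tags only, so the
     * MMU is bypassed on a hit.                                          */
    int cpu_hit(uint32_t vaddr)
    {
        struct line *l = &cache[INDEX(vaddr)];
        return l->valid && l->vtag == TAG(vaddr);
    }

    /* Bus side: a write observed on the bus carries a physical address;
     * the second set of tags lets us find and invalidate our copy with
     * no reverse translation.  Applied "to yourself as well as to
     * others", this also kills a copy held under another virtual name.  */
    void snoop(uint32_t paddr_written)
    {
        struct line *l = &cache[INDEX(paddr_written)];
        if (l->valid && l->ptag == TAG(paddr_written))
            l->valid = 0;
    }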

The architecture of a multiprocessor with virtual address caches that applies
the above ideas in some unconventional ways was presented at the 13th
Symposium on Computer Architecture in Tokyo (1986) (`Software-controlled
caches in the VMP Multiprocessor', by D.R. Cheriton, G.A. Slavenburg and
P.D. Boyle). At this year's 15th Symposium on Computer Architecture in Hawaii,
measurement results of the system, as in operation at Stanford University,
will be presented.

  just thought you might like to know that this problem has been solved,

  Gerrit A. Slavenburg

walter@garth.UUCP (Walter Bays) (04/30/88)

In article <1671@alliant.Alliant.COM> jeff@alliant.UUCP (Jeff Collins) writes:
>	Given that the SPARC must use a virtual cache to get optimal
>	performance, how does one build a multiprocessor with a SPARC?
>	As far as I know, no one has solved the virtual cache coherency
>	problem yet...

Clipper uses 'bus watch' to invalidate references to stale data, when
used in multiprocessor (including CPU-IOP) modes, and when using
'copy-back' (as opposed to write-through) cache modes.  The newly
announced Motorola 88000 uses a similar scheme, called 'bus snoop'.

With SPARC, Clipper without CAMMU chips, or 88000 without CAMMU chips,
you implement your own cache, and can build whatever you choose.
-- 
------------------------------------------------------------------------------
Any similarities between my opinions and those of the
person who signs my paychecks is purely coincidental.
E-Mail route: ...!pyramid!garth!walter
USPS: Intergraph APD, 2400 Geng Road, Palo Alto, California 94303
Phone: (415) 852-2384
------------------------------------------------------------------------------

jeff@Alliant.COM (Jeff Collins) (04/30/88)

In article <51321@sun.uucp> guy@gorodish.Sun.COM (Guy Harris) writes:
>> 	Given that the SPARC must use a virtual cache to get optimal
>> 	performance, how does one build a multiprocessor with a SPARC?
>
>Excuse me?  Where is it "given that the SPARC must use a virtual cache to get
>optimal performance?"  It is the case that the SPARC requires some form of very
>fast memory to get optimal performance, but that's true of a hell of a lot of
>machines these days.  The virtual cache permits you to bypass the MMU on a
>cache hit, but I don't know that this is *required* for SPARC - or for any of a
>number of other microprocessor architectures.
>
>It may be the case that with the current SPARC implementations, with no on-chip
>MMU, that it's easier to get high performance with a virtual cache than with a
>physical cache, or even that you can't get optimal performance *for those
>implementations* with a physical cache (although I suspect the latter is not
>true).  This certainly doesn't say that the SPARC *architecture* requires a
>virtual cache.
>

	Sorry, Guy, but you misunderstood the purpose of my posting.  I was not
	attempting to attack the SPARC.  I am simply attempting to see if
	anyone has given any thought to the problem of using the SPARC with a
	virtual cache in a multiprocessor.  It is my understanding (and I
	admit that you are closer to the issue than I am) that it is not easy,
	and may not be possible, to put a virtual-to-physical translation
	before the SPARC's cache and still get no-wait-state performance.
	(Please note that I didn't say that it was impossible to do this, only
	that the performance would be degraded.)

	If one looks at other current microprocessors (NS32532, Mot. 88000,
	MIPS 3000, etc.) they all have standard MMUs available with the part
	and have a fairly credible story of how to make them perform in a
	multiprocessor.  There is no standard MMU (to my knowledge) available
	for the SPARC, and at least the current implementations seem to have a
	major drawback when attempting to put them in a multiprocessor.

	The question still remains - if you want to run the SPARC with a
	virtual cache in a multiprocessor - how do you do it?  Other questions
	that get to the same issue are:

	    - How does one get no-wait-state performance in a multiprocessor
	      using the SPARC?
	    - Is there a standard MMU going to be available, and how will it
	      work with a multiprocessor?
	    - Is there a solution to the virtual cache problems?
	    - Is there an announced version of the SPARC that allows time
	      between address-ready and data-ready to have an MMU before the
	      cache?

mat@amdahl.uts.amdahl.com (Mike Taylor) (04/30/88)

In article <1671@alliant.Alliant.COM>, jeff@Alliant.COM (Jeff Collins) writes:
> 
> 	Here's an interesting question for the SPARC guru's of the world:
> 
> 	Given that the SPARC must use a virtual cache to get optimal
> 	performance, how does one build a multiprocessor with a SPARC?
> 
> 	As far as I know, no one has solved the virtual cache coherency
> 	problem yet...
> 

Amdahl machines starting with the 580 series in 1982 have used a coherent
virtual cache.
-- 
Mike Taylor                        ...!{ihnp4,hplabs,amdcad,sun}!amdahl!mat

[ This may not reflect my opinion, let alone anyone else's.  ]

dre%ember@Sun.COM (David Emberson) (04/30/88)

1)  There is nothing inherent in the SPARC architecture that requires the use
    of a virtual cache to obtain high performance.

2)  It is possible to implement snoopy caches with caches that appear to the
    processors as virtual caches.  'Nuff said.

3)  We have a very good architectural solution to this problem for SPARC
    which we are developing now.  We are not yet prepared to go public with it
    but it has been presented to some selected customers on a non-disclosure
    basis.  It is somewhat frustrating to us not to have an announced solution,
    but we'd rather do it right than do something that does not scale well
    (Yes, I did get my hands on the 88200 data sheet!).  All I can say is that
    your patience will be rewarded.

4)  When the pieces are in place, you can bet that the technology will be
    open (i.e. available to all) and will embrace the concept of standard-
    ization that is so dear to us here at Sun.  In other words, it is the
    stated policy of the company that we will make the components available
    so that if you want to build a SPARC machine--even one that competes with
    ours--Sun will encourage you to do so.  A license from Sun is not required
    to use SPARC components.

I may be shot for talking about this stuff, but I think it is important to
the success of SPARC that Sun not be perceived as ignoring this important
technology.  Support for multiprocessors has already been announced as a
deliverable item in Phase III of the AT&T-Sun Unix merge.

		Dave Emberson (dre@sun.com)

guy@gorodish.Sun.COM (Guy Harris) (04/30/88)

> 	Sorry, Guy but you misunderstood the purpose of my posting.  I was not
> 	attempting to attack the SPARC.

No, I didn't.  I merely pointed out that you were, as far as I could tell,
making an unwarranted assumption at the start of your discussion; this
assumption unnecessarily colored the rest of your discussion.  David Emberson
of Sun has noted that there is nothing in the *architecture* that demands a
virtual cache.  Current *implementations* may make a virtual cache the best, or
only, way to get maximum performance, but that's a different matter.  If you
were, in fact, referring to the current chips, rather than the SPARC
architecture, I apologize.

>       I am simply attempting to see if anyone has given any thought to
>       the problem of using the SPARC with a virtual cache in a
>       multiprocessor.

David Emberson has also already given an answer to this question; the
answer is "yes, Sun has".  Unfortunately, as he indicated, he's not in a
position to discuss it in detail now.

bcase@Apple.COM (Brian Case) (05/01/88)

In article <1671@alliant.Alliant.COM> jeff@alliant.UUCP (Jeff Collins) writes:
>
>	Here's an interesting question for the SPARC guru's of the world:
>
>	Given that the SPARC must use a virtual cache to get optimal
>	performance, how does one build a multiprocessor with a SPARC?
>
>	As far as I know, no one has solved the virtual cache coherency
>	problem yet...

Read about the SPUR project being done at Berkeley.  They have a large
virtual cache, in-cache address translation (only done on misses), and
the system concept is a small multiprocessor (about 10 processors).  Some neat
ideas.

brooks@lll-crg.llnl.gov (Eugene D. Brooks III) (05/01/88)

In article <1680@alliant.Alliant.COM> jeff@alliant.UUCP (Jeff Collins) writes:
>	    - Is there an announced version of the SPARC that allows time
>	      between address-ready and data-ready to have an MMU before the
>	      cache?
>
This is the key to bringing these RISC chips into the realm of real
supercomputing, and it will of course happen, driven by market
pressures.  You need to allow an "arbitrary" time between address-ready
and data-ready to allow successful use in a multiprocessor environment,
where the latency of memory is more or less undetermined due to conflicts
in the shared memory subsystem.  Basically, the address-ready lines
include a tag which identifies the request, and when the response to the
request returns, the copy of the tag that arrives with it allows the cpu to
figure out what to do with the data.  The number of tag bits limits the
number of outstanding requests for the cpu.  One is likely to sequence the
tag bits in order for speed, so the number of outstanding requests will be
further limited by fluctuations in the arrival order of the responses.  If the
request gets satisfied by the cache it comes back with a low latency, but if
it goes to main memory (shared memory in a multiprocessor) it might have a
substantial latency.  By allowing many requests to be pending at once one
can get "no wait state" performance in the same sense that internal pipelining
delivers "no wait state" performance for the internal cpu functions.  The cpu
must be able to efficiently handle the fact that requests come back out of
order, which means that simple FIFOs a la the WM machine won't do.
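
To sketch the bookkeeping in C (all names and sizes invented; the real
mechanism is of course hardware, not software):

    #include <stdint.h>

    #define TAG_BITS     3
    #define MAX_PENDING  (1 << TAG_BITS)       /* 8 outstanding loads */

    extern void send_request(int tag, uint32_t addr);  /* hypothetical bus op */

    struct pending {
        int busy;
        int dest_reg;            /* where the data goes when it returns */
    };

    static struct pending outstanding[MAX_PENDING];

    /* Issue a load: grab a free tag, send (tag, address) to the memory
     * system, and keep executing.  Returns -1 when every tag is in use,
     * i.e. the cpu has hit its limit on outstanding requests.           */
    int issue_load(uint32_t addr, int dest_reg)
    {
        for (int tag = 0; tag < MAX_PENDING; tag++) {
            if (!outstanding[tag].busy) {
                outstanding[tag].busy = 1;
                outstanding[tag].dest_reg = dest_reg;
                send_request(tag, addr);
                return tag;
            }
        }
        return -1;
    }

    /* A response arrives, possibly out of order; the tag that travels
     * back with it says which register the data was destined for.       */
    void response(int tag, uint32_t data, uint32_t regs[])
    {
        regs[outstanding[tag].dest_reg] = data;
        outstanding[tag].busy = 0;
    }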


Rest assured that some future SPARC, MOT88000, Clipper, ..., implementation
will provide this capability by some means, as it will be the only way to
further increase performance in the face of the memory latency of a shared
memory multiprocessor.  Whether MIPS will pull this off is not clear; their
basic design principle (religion) of not having hardware interlocks would seem
orthogonal to doing it.  In short, they will find that their basic design
principle was wrong when they try to pipeline cache misses in a shared memory
environment.

hankd@pur-ee.UUCP (Hank Dietz) (05/01/88)

I hate to add to this pile of news, but why hasn't anyone talked about the
fact that processors like SPARC are not really designed for large-scale
multiprocessing, e.g., they have no provision for "hiding" big, stochastic,
memory reference delays across a log n stage interconnection network, etc.?
I think it's pretty uninteresting to talk about multi-processor systems
which are small enough that "snooping caches" work; as many of you have
pointed out, that's been essentially a non-problem for quite some years.
How about some discussion of what processors do when simple cache protocols
aren't good enough or are not implementable?

It seems we nearly got into such a discussion when WM was brought up, but WM
isn't really billed as being a processor design for large-scale
multiprocessors (read MIMDs), hence people didn't seem to notice that it was
addressing the stochastic delay memory reference problem so endemic to big
MIMDs.  Consider all those other processors with microtasking or other
out-of-order or multiple-memory-pipeline structures.  Anyone care to get a
discussion along these lines going?

						-hankd

mike@arizona.edu (Mike Coffin) (05/02/88)

From article <8029@pur-ee.UUCP>, by hankd@pur-ee.UUCP (Hank Dietz):
> I hate to add to this pile of news, but why hasn't anyone talked about the
> fact that processors like SPARC are not really designed for large-scale
> multiprocessing, e.g., they have no provision for "hiding" big, stochastic,
> memory reference delays across a log n stage interconnection network, etc.?
> [...]
> Anyone care to get a discussion along these lines going?
> 						-hankd

One machine that attacked this problem was the Denelcor HEP.  Although
it's now defunct, my impression is that its demise had little to do
with technical merit.
-- 

Mike Coffin				mike@arizona.edu
Univ. of Ariz. Dept. of Comp. Sci.	{allegra,cmcl2,ihnp4}!arizona!mike
Tucson, AZ  85721			(602)621-4252

mbutts@mntgfx.mentor.com (Mike Butts) (05/02/88)

From article <1671@alliant.Alliant.COM>, by jeff@Alliant.COM (Jeff Collins):
> 
> 	Given that the SPARC must use a virtual cache to get optimal
> 	performance, how does one build a multiprocessor with a SPARC?
> 
> 	As far as I know, no one has solved the virtual cache coherency
> 	problem yet...
> 

Here's another log for the fire... Apollo says in their marketing booklet about
the new DN10000 architecture (which has 1 to 4 15-MIPS-RISC processors sharing
main memory):

"...the Series 10000's caches incorporate the best features of both fully physical
and fully virtual caches.  The result is a new *virtually indexed, physically 
tagged write-through cache* that lets cache RAM access proceed entirely overlapped
in time with any virtual-to-physical address translation.  This overlap reduces
memory access pipeline depth, guaranteeing single-cycle execution and eliminating
the translation penalty typical with physically indexed designs.

The physical tags allow cache validation across processors.  The addressing
scheme, based on the virtual address coupled with the physical tag, maintains
coherency and allows data to be shared by multiple processors."
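
In rough C terms (toy sizes, invented names, and certainly not Apollo's
actual design), the overlap works like this: the index comes entirely from
untranslated page-offset bits, so the cache RAM and the TLB are read in
parallel and only the final tag compare needs the physical address.

    #include <stdint.h>

    #define PAGE_BYTES  4096
    #define LINE_BYTES  32
    #define NUM_LINES   (PAGE_BYTES / LINE_BYTES)   /* one page deep */

    struct line {
        int      valid;
        uint32_t ptag;                 /* physical page number */
        uint8_t  data[LINE_BYTES];
    };

    static struct line cache[NUM_LINES];

    extern uint32_t tlb_translate(uint32_t vpage);  /* hypothetical TLB lookup */

    /* The index uses only bits below the page size, which the MMU never
     * changes, so the cache RAM read proceeds in parallel with the TLB
     * lookup; the physical address is needed only for the tag compare.  */
    int vipt_hit(uint32_t vaddr)
    {
        uint32_t index = (vaddr % PAGE_BYTES) / LINE_BYTES;
        uint32_t ppage = tlb_translate(vaddr / PAGE_BYTES);
        return cache[index].valid && cache[index].ptag == ppage;
    }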
-- 
Mike Butts, Research Engineer         KC7IT           503-626-1302
Mentor Graphics Corp., 8500 SW Creekside Place, Beaverton OR 97005
...!{sequent,tessi,apollo}!mntgfx!mbutts OR  mbutts@pdx.MENTOR.COM
These are my opinions, & not necessarily those of Mentor Graphics.

jeff@Alliant.COM (Jeff Collins) (05/03/88)

In article <623@garth.UUCP> walter@garth.UUCP (Walter Bays) writes:
>In article <1671@alliant.Alliant.COM> jeff@alliant.UUCP (Jeff Collins) writes:
>>	Given that the SPARC must use a virtual cache to get optimal
>>	performance, how does one build a multiprocessor with a SPARC?
>>	As far as I know, no one has solved the virtual cache coherency
>>	problem yet...
>
>Clipper uses 'bus watch' to invalidate references to stale data, when
>used in multiprocessor (including CPU-IOP) modes, and when using
>'copy-back' (as opposed to write-through) cache modes.  The newly
>announced Motorola 88000 uses a similar scheme, called 'bus snoop'.
>
>With SPARC, Clipper without CAMMU chips, or 88000 without CAMMU chips,
>you implement your own cache, and can build whatever you choose.
>-- 

	Actually I am familiar with the Clipper and the 88000.  I know how they
	support multiprocessing.  The point here was that these chips put the
	cache after the MMU.  This means that the caches contain physical
	addresses and the tag stores record physical addresses.  When an
	address goes across the bus it is not a big deal to "watch" it and
	determine whether the address is in the cache or not.  Current
	implementations of the SPARC, on the other hand, get optimal
	performance using a virtual cache - i.e. the cache is BEFORE the MMU.
	It is a much more complicated procedure to "watch" these addresses.
	With a virtual cache, each physical address on the bus must first be
	translated to its corresponding virtual address through some sort of
	reverse TLB, then the address can be provided to the bus watcher to
	see if the data is in the cache.

	This presents a number of problems that are not present in the case of
	the 88000 or the Clipper (or any other micro that uses physical
	caches).  One of the problems is aliasing - this is when multiple
	virtual addresses refer to the same physical address.  In order to
	properly snoop, all of the aliases must be checked for in the cache.
	Another problem is simply the implementation of the reverse TLB -
	what happens when the reverse TLB misses?  Which set of page tables
	should the TLB walk?

walter@garth.UUCP (Walter Bays) (05/04/88)

>>In article <1671@alliant.Alliant.COM> jeff@alliant.UUCP (Jeff Collins) writes:
>>>	Given that the SPARC must use a virtual cache to get optimal
>>>	performance, how does one build a multiprocessor with a SPARC?
>>>	As far as I know, no one has solved the virtual cache coherency
>>>	problem yet...

>>[I replied about 'bus watch' citing Clipper and 88000.]

In article <1699@alliant.Alliant.COM> jeff@alliant.UUCP (Jeff Collins) writes:
>	Actually I am familiar with the Clipper and the 88000.  I know how they
>	support multiprocessing.  The point here was that these chips put the
>	cache after the MMU.  [Good description of the problems in 'bus
>	watching' with virtual caches.]

You're right.  I missed your point.  Clipper has a physical cache, while
the Sun 4 SPARC has a virtual cache.
-- 
------------------------------------------------------------------------------
Any similarities between my opinions and those of the
person who signs my paychecks is purely coincidental.
E-Mail route: ...!pyramid!garth!walter
USPS: Intergraph APD, 2400 Geng Road, Palo Alto, California 94303
Phone: (415) 852-2384
------------------------------------------------------------------------------

gruger@convex.UUCP (05/04/88)

>/* Written  5:51 pm  May  2, 1988 by jeff@alliant.Sun.Com
>	...  It is a much more
>	complicated procedure to "watch" these addresses.  With a virtual
>	cache, each physical address on the bus must first be translated to
>	its corresponding virtual address through some sort of reverse TLB,
>	then the address can be provided to the bus watcher to see if the data
>	is in the cache.
>	...In order to
>	properly snoop, all of the aliases must be checked for in the cache.
>	Another problem is simply the implementation of the reverse TLB -
>	what happens when the reverse TLB misses?  Which set of page tables
>	should the TLB walk?

Nah, it ain't that hard.  One can have a virtually mapped, physically tagged
cache.  As the virtual cache is filled with data, the translated physical
address from the MMU is written into the tag RAMs.  The physical tags are 
then used by the bus watcher to selectively invalidate the cache entries.

This is in fact how cache coherency is maintained in the Convex C-2 machines.
Each of the CPUs has a set of remote invalidation tag memories which watches
all of the other physical address buses (3 other CPUs and one I/O).  When 
there is a hit, the validity bits are cleared for that particular entry in 
the virtually mapped cache.  
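
A stripped-down C sketch of that arrangement (invented sizes and names,
nothing like the real ECL hardware; the toy cache is one page deep so the
virtual and physical index bits agree): the MMU's result is captured into
the tag RAM at fill time, so the bus watcher never needs a reverse
translation.

    #include <stdint.h>

    #define LINE_BYTES  32
    #define NUM_LINES   128          /* toy 4 KB cache, one page deep */

    static uint8_t  data_ram[NUM_LINES][LINE_BYTES]; /* virtually addressed  */
    static int      valid_ram[NUM_LINES];            /* virtually addressed  */
    static uint32_t ptag_ram[NUM_LINES];             /* physically addressed */

    /* Fill: the MMU has just produced paddr for this vaddr, so write the
     * physical tag alongside the data while we still have it.           */
    void fill(uint32_t vaddr, uint32_t paddr, const uint8_t line[LINE_BYTES])
    {
        unsigned i, idx = (vaddr / LINE_BYTES) % NUM_LINES;
        for (i = 0; i < LINE_BYTES; i++)
            data_ram[idx][i] = line[i];
        ptag_ram[idx]  = paddr / (LINE_BYTES * NUM_LINES);
        valid_ram[idx] = 1;
    }

    /* Bus watcher: a write seen on another processor's (or the I/O) bus
     * carries only a physical address; the physical tag RAM is all that
     * is needed to clear the right validity bit.                        */
    void remote_invalidate(uint32_t paddr)
    {
        unsigned idx = (paddr / LINE_BYTES) % NUM_LINES;
        if (valid_ram[idx] && ptag_ram[idx] == paddr / (LINE_BYTES * NUM_LINES))
            valid_ram[idx] = 0;
    }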

Jeff Gruger
(ihnp4!convex!gruger)

przemek@gondor.cs.psu.edu (Przemyslaw Klosowski) (05/04/88)

In article <1671@alliant.Alliant.COM> jeff@alliant.UUCP (Jeff Collins) writes:
>
>	As far as I know, no one has solved the virtual cache coherency
>	problem yet...
>
See ``An in-cache address translation mechanism'' by D.A. Wood et al., in:
Proc. 13th Annual ACM/IEEE Symposium on Computer Architecture, Tokyo, June 1986.
They describe the SPUR solution to the problem.  Basically it consists of forcing
the Global Virtual address (which is almost, but not quite, a virtual address)
to be the same for all shared data.
Other solutions use reverse TLBs to retrieve the virtual address and then use
normal (i.e. bus snooping) techniques.
Last but not least, why not delegate the responsibility to software?  It is only
a question of finding a convenient paradigm.

				przemek@psuvaxg.bitnet
				psuvax1!gondor!przemek

przemek@gondor.cs.psu.edu (Przemyslaw Klosowski) (05/04/88)

In article <51321@sun.uucp> guy@gorodish.Sun.COM (Guy Harris) writes:
>> 	Given that the SPARC must use a virtual cache to get optimal
>> 	performance, how does one build a multiprocessor with a SPARC?
>
>Excuse me?  Where is it "given that the SPARC must use a virtual cache to get
>optimal performance?"
IMHO the virtual cache ($) has serious advantages, one of which is that a
separate TLB (which is nothing else but a separate cache for translation data)
is not necessary.  And of course there is the convenient overlapping of $
access with V->R translation.

				przemek@psuvaxg.bitnet
				psuvax1!gondor!przemek

jh@mancol.UUCP (John Hanley) (05/04/88)

In article <8029@pur-ee.UUCP> hankd@pur-ee.UUCP (Hank Dietz) writes:
>...why hasn't anyone talked about the fact that processors like SPARC are not
>really designed for large-scale multiprocessing, e.g., they have no provision
>for "hiding" BIG, stochastic, memory reference delays across a log n stage
>interconnection network, etc.? I think it's pretty uninteresting to talk about
>multi-processor systems which are small enough that "snooping caches" work.... 

My favorite method of keeping the CPU busy while a memory-read request is
traversing the network is the one used by the Denelcor HEP: context-switch
on a cache miss.  When a process requests that a disk block be read on a
time-sharing system, the scheduler tries to get useful work done during the
rotational latency by blocking the process and executing another one; when
a process requests a memory word on a HEP'ish system the CPU tries to get
useful work done during the network latency by executing another process.
This, of course, requires extremely light-weight processes so that time to
context-switch is comparable to time to execute any other instruction.
One way of doing this is to have a very memory-intensive architecture, with
almost no registers besides PC and PSW (and even the status word can be
dispensed with; cf. recent comp.arch discussion).  Another method is to
sacrifice single-process speed for parallel speedup, by putting a multiplexor
in front of every single register, so that a context switch is effected by
simply changing the index register that addresses the MUX.  To prevent the
need for reloading the MMU's page-descriptors on every context-switch, it is
preferable for most switches to be between "threads" of the same program
(same virtual address space) rather than between processes running unrelated
programs.
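
A toy C model of the MUX idea (all constants invented): every context owns
a register bank, and a "context switch" is nothing more than changing the
bank-select index.

    #include <stdint.h>

    #define NUM_CONTEXTS  8
    #define NUM_REGS      16

    static uint32_t regs[NUM_CONTEXTS][NUM_REGS];  /* one bank per thread */
    static uint32_t pc[NUM_CONTEXTS];              /* one PC per thread   */
    static int      runnable[NUM_CONTEXTS];
    static int      current;                       /* the MUX select      */

    /* On a cache miss, park this context and resume any other runnable
     * one; the switch costs no more than an ordinary instruction because
     * no registers are saved or restored.                               */
    void miss_taken(void)
    {
        int i;
        runnable[current] = 0;                     /* waiting on memory   */
        for (i = 1; i < NUM_CONTEXTS; i++) {
            int next = (current + i) % NUM_CONTEXTS;
            if (runnable[next]) {
                current = next;
                return;
            }
        }
        /* nothing runnable: the pipeline simply stalls until data returns */
    }

    /* When the memory reply for a parked context arrives, it becomes
     * runnable again.                                                   */
    void miss_returns(int ctx)
    {
        runnable[ctx] = 1;
    }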

Another tack is to have a conventional ("context switches are expensive")
processor that rarely waits on cache misses.  Writing to a memory location
is always fast because you don't have to wait around for it to finish.  Reads
cause problems.  Suppose you come to a code fragment that is about to do three
array references (low probability of being in the cache).  Rather than saying
LD A, <wait>, LD B, <wait>, LD C, <wait>, you could say PREFETCH A, PREFETCH B,
PREFETCH C, LD A, LD B, LD C.  If the time to execute the three non-blocking
prefetch instructions is comparable to the network latency, you win big, since
they execute during time that would have been spent idle anyway.  Code density
is shot to hell, and the compiler has to be _very_ smart about cache-hit
likelihoods (or else runtime profiling has to be done, which is tricky because
adding and moving PREFETCH instructions is going to change the
pattern of what's in the cache when).
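
In C-flavored pseudocode - prefetch() here is a hypothetical non-blocking
hint, not an instruction any of these chips actually has - the rescheduled
fragment looks something like:

    extern void   prefetch(const void *addr);  /* hypothetical: starts a fill,
                                                   never waits               */
    extern double other_useful_work(void);

    double sum3(const double *a, const double *b, const double *c,
                int i, int j, int k)
    {
        /* start all three fills up front; none of these block */
        prefetch(&a[i]);
        prefetch(&b[j]);
        prefetch(&c[k]);

        /* spend the network latency on something independent */
        double x = other_useful_work();

        /* by now the lines are (we hope) resident, so the loads hit */
        return x + a[i] + b[j] + c[k];
    }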

Something I haven't seen is the above PREFETCH instructions implemented in
hardware.  Call it an intelligent look-ahead cache, or an aux. CPU. 
Predictive memory requests are made not only on the instruction stream,
but also on the data stream, a few instructions ahead of time.  Is this
impractical because the aux. CPU has to be nearly as complicated as the CPU
itself, so you'd get better elapsed times from dual processors that spend a
lot of time waiting, rather than a single processor that hardly ever waits
on a cache miss?  (The aux. CPU could be on the same chip as the CPU -- do
any available processors do data prefetch as well as instruction prefetch?)
In some cases the address to prefetch simply can't be computed soon enough
(LD i, LD base_addr+4*i), but usually there's enough play in the data
dependencies that instructions can be rescheduled to allow predictions to be
made in time (LD i, do something else useful while simultaneously computing
base_addr+4*i, prefetch base_addr+4*i while doing some other useful things,
do the array reference (either by recalculating base_addr+4*i or by grabbing
the already computed result from the aux. CPU)). 

Since all we're interested in is reducing the percentage of cache misses, it
is by no means necessary to make the aux. CPU as intelligent as the primary
CPU; the aux. is permitted to give up on a complicated address calculation
and say, "I don't know," incurring only the penalty of a few wasted cycles
on the primary.  Is this a loose enough constraint to make the aux. CPU
practical, or is the idea economically infeasible (the dollars for extra
compute power would be better spent on another processor)?


             --John Hanley
              System Programmer, Manhattan College
              ..!cmcl2.nyu.edu!manhat!jh  or  hanley@nyu.edu   (CMCL2<=>NYU.EDU)

tve@alice.UUCP (05/05/88)

As far as the "new virtually indexed, physically tagged write-through
cache" in the Apollo goes, there ain't anything *new*. The name implies
they use the virtual address to address the cache and then compare
the tags to the result of the translation, which will have occurred
simultaneously (remember: the TLB is a cache as well and thus has about
the same access time).
Since nothing comes for free (usually), there is a penalty using this
scheme: the cache set size is limited by the number of address bits
available before translation (i.e. the bits which aren't translated).
This usually means the cache set size is limited to the size of a page.
This number is usually around 4Kbytes these days. If you want a larger
cache, you have to make it set associative, which is a pain in discrete
logic (multiplexing etc...).
I think the virtual-index/physical-tag caches will become very attractive
when integrated on a chip, since on a chip, set associative caches are
easier to build and are anyway (more or less) required for speed.
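
To put toy numbers on that limit (nothing vendor-specific):

    #include <stdio.h>

    int main(void)
    {
        const int page_bytes  = 4096;     /* 12 untranslated bits      */
        const int line_bytes  = 32;       /*  5 of them are the offset */
        const int cache_bytes = 64 * 1024;

        int lines_per_set = page_bytes / line_bytes;     /* 128 lines  */
        int set_bytes     = lines_per_set * line_bytes;  /* 4 KB == page */
        int ways_needed   = cache_bytes / set_bytes;     /* 16 ways    */

        printf("a %d KB cache indexed only by untranslated bits "
               "must be %d-way set associative\n",
               cache_bytes / 1024, ways_needed);
        return 0;
    }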

Thorsten von Eicken
AT&T Bell Laboratories
research!tve
tve@research.att.com

aglew@urbsdc.Urbana.Gould.COM (05/05/88)

>	Another problem, is simply the implementation of the reverse TLB -
>	what happens when the reverse TLB misses?  Which set of page tables
>	should the TLB walk?

If we are still talking about virtual caches, then you don't need a reverse TLB.
What you do is have two sets of cache directories, one physical, one virtual.
Use the virtual directory on local access to the cache.
Use the physical directory when snooping on the bus.
The bus carries physical addresses, not virtual.

The virtual/physical synonym problem can be solved in several ways.
(1) arrange the cache so all virtual addresses for the same physical location
map to the same set. Requires a bit of work when replacing synonyms in the 
same set, but is better than having to search the entire cache.
(2) still use a virtual to physical translation, but take it off the critical
path.  I.e., access the cache using the virtual address, assuming that there are
no synonyms.  But, at the same time, initiate the virtual to physical
translation.  Later - on the next cycle if your TLB is fast enough, or much
later if the TLB misses (you may have to stall the processor, but TLB misses
are rare) - when you've got the physical address, see if there was already such
a location in the cache; back up and repair if stale data was read.
(3) Arrange with the OS so that condition (1) is satisfied, or even to maintain
the invariant that there are no synonyms, or that synonyms are used in
carefully controlled situations.
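
A small C sketch of option (1), with invented constants: the OS accepts a
page placement only if its "color" - the index bits above the page offset -
matches, so every alias of a physical page lands in the same cache set.

    #include <stdint.h>

    #define PAGE_BYTES   4096
    #define CACHE_BYTES  (64 * 1024)
    #define NUM_COLORS   (CACHE_BYTES / PAGE_BYTES)  /* 16 page-sized slices */

    static unsigned color(uint32_t addr)
    {
        return (addr / PAGE_BYTES) % NUM_COLORS;
    }

    /* The OS refuses any virtual-to-physical pairing whose colors
     * differ; with that invariant, a physical address picked up off the
     * bus indexes the same set the processor used, and synonyms can only
     * collide within one set.                                           */
    int mapping_ok(uint32_t vaddr, uint32_t paddr)
    {
        return color(vaddr) == color(paddr);
    }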

hankd@pur-ee.UUCP (Hank Dietz) (05/06/88)

In article <381@mancol.UUCP>, jh@mancol.UUCP (John Hanley) writes:
> In article <8029@pur-ee.UUCP> hankd@pur-ee.UUCP (Hank Dietz) writes:
> >...why hasn't anyone talked about the fact that processors like SPARC are not
> >really designed for large-scale multiprocessing, e.g., they have no provision
> >for "hiding" BIG, stochastic, memory reference delays across a log n stage
> >interconnection network, etc.? I think it's pretty uninteresting to talk about
> >multi-processor systems which are small enough that "snooping caches" work.... 
> 
> My favorite method of keeping the CPU busy while a memory-read request is
> traversing the network is the one used by the Denelcor HEP: context-switch
> on a cache miss.... [or alternatively....]
> LD A, <wait>, LD B, <wait>, LD C, <wait>, you could say PREFETCH A, PREFETCH B,
> PREFETCH C, LD A, LD B, LD C.  If the time to execute the three non-blocking
> prefetch instructions is comparable to the network latency, you win big, since
> they execute during time that would have been spent idle anyway....
> Something I haven't seen is the above PREFETCH instructions implemented in
> hardware.  Call it an intelligent look-ahead cache, or an aux. CPU. 
> Predictive memory requests are made not only on the instruction stream,
> but also on the data stream, a few instructions ahead of time....

Burton Smith, of HEP fame, is sort of doing both in his latest machine; so
are we (CARP -- the Compiler-oriented Architecture Research group at Purdue).

I believe Burton's machine microtasks, a la HEP, but he also has a method
whereby many memory references (or other slow operations) can be initiated
without waiting for earlier ones to complete.  (I still don't know how much
of his design is in the public domain, so I can't say much more about it.)

The CARP machine doesn't microtask, but we do some very sneaky interrupt
enabling...  the CARP machine processor has provision for multiple delayed
operations to be initiated without waiting for earlier ones to complete, and
interrupts are only enabled when the processor has to wait LONGER than the
compiler expected.  The interrupt latency is high this way (perhaps dozens
of instructions between interrupt accept states), but this isn't such a bad
problem when you consider a multiprocessor machine where ANY processor could
service ANY interrupt.  There are actually two separate delayed operation
mechanisms in the CARP machine:  one for compile-time known delays and one
for delays where only the expected delay is known at compile-time.  For some
operations, the expected-delay-based mechanism is late targeting; i.e., the
destination register in register address space is not specified until the
item has arrived, hence the usable register address space is not reduced by
having multiple items pending (selection of a staging register is implicit
in the type of the delayed operation).

We look at it this way:  if you want to get high speedup by multiprocessing,
since not everything can be parallelized, we don't want to slow the
sequential parts by microtasking.  The result is that we implement machine
use priorities by dynamically changing the parallelism-width dedicated to
each task, and we concentrate on other mechanisms for hiding delays...
preferably mechanisms which do not "use-up" parallelism that we could have
used to achieve speedup through parallel execution.

						-hankd

daveb@geac.UUCP (David Collier-Brown) (05/06/88)

In article <381@mancol.UUCP> jh@mancol.UUCP (John Hanley) writes:
>Something I haven't seen is the above PREFETCH instructions implemented in
>hardware.  Call it an intelligent look-ahead cache, or an aux. CPU. 
>Predictive memory requests are made not only on the instruction stream,
>but also on the data stream, a few instructions ahead of time.  

Well, not on a RISC machine...  It is logically similar to (perhaps
identical to?) a pipeline on a CISC.  My old 'bun used to look
"forward" in the instruction stream with wild abandon, because it
took so long to decode the instructions (:-)).

--dave (data, now, is quite a different matter) c-b
-- 
 David Collier-Brown.                 {mnetor yunexus utgpu}!geac!daveb
 Geac Computers International Inc.,   |  Computer Science loses its
 350 Steelcase Road,Markham, Ontario, |  memory (if not its mind) 
 CANADA, L3R 1B3 (416) 475-0525 x3279 |  every 6 months.

jeffs@pyrthoth (Jeff Sewall) (05/07/88)

In article <63900014@convex> gruger@convex.UUCP writes:
>
>>/* Written  5:51 pm  May  2, 1988 by jeff@alliant.Sun.Com
>>	...In order to
>>	properly snoop, all of the aliases must be checked for in the cache.
>
>Nah, it ain't that hard.  One can have a virtually mapped, physically tagged
>cache.  As the virtual cache is filled with data, the translated physical
>address from the MMU is written into the tag RAMs.  The physical tags are 
>then used by the bus watcher to selectively invalidate the cache entries.
>
>This is in fact how cache coherency is maintained in the Convex C-2 machines.
>Each of the CPUs has a set of remote invalidation tag memories which watches
>all of the other physical address buses (3 other CPUs and one I/O).  When 
>there is a hit, the validity bits are cleared for that particular entry in 
>the virtually mapped cache.  
>
>Jeff Gruger

This only works for small caches. Once the size of each set of the cache
exceeds your page size, the location of an entry in a physically mapped
"remote invalidation tag memory" will be different than the location of
that entry in the virtually mapped cache. This is because the page index
alone is no longer sufficient to address the cache.

This problem can be solved by passing part of the virtual page address
along with the physical address on memory requests. Then the remote tag
can be addressed with the virtual address and guarantee mapping to the
same location. But this works only if synonyms are not allowed. BTW, this
is the approach taken in the SPUR architecture.
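
To make the mismatch concrete (toy numbers, nobody's real geometry): with a
16 Kbyte direct-mapped cache, 32-byte lines, and 4 Kbyte pages, two of the
nine index bits come from the translated part of the address, so the virtual
and physical indexes can disagree:

    #include <stdint.h>

    #define PAGE_BYTES   4096
    #define LINE_BYTES   32
    #define CACHE_BYTES  (16 * 1024)
    #define NUM_LINES    (CACHE_BYTES / LINE_BYTES)   /* 512 lines */

    static unsigned index_of(uint32_t addr)
    {
        return (addr / LINE_BYTES) % NUM_LINES;       /* bits 13..5 */
    }

    /* If this ever returns 0, a snooper indexing its remote tag RAM with
     * the physical address probes a different row than the one the
     * processor filled with the virtual address - unless, as in SPUR,
     * the extra virtual index bits travel with the bus request.         */
    int indexes_agree(uint32_t vaddr, uint32_t paddr)
    {
        return index_of(vaddr) == index_of(paddr);
    }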

I think that the original poster's question is still valid. Is there a
good solution to snooping a virtually mapped cache with the following
constraints:
(1) Synonyms are allowed
(2) The cache is large enough that the size of a set exceeds the page size

petolino%joe@Sun.COM (Joe Petolino) (05/07/88)

>>>	...In order to
>>>	properly snoop, all of the aliases must be checked for in the cache.
>>
>>Nah, it ain't that hard.  One can have a virtually mapped, physically tagged
>>cache.  As the virtual cache is filled with data, the translated physical
>>address from the MMU is written into the tag RAMs.  The physical tags are 
>>then used by the bus watcher to selectively invalidate the cache entries.
>
>This only works for small caches. Once the size of each set of the cache
>exceeds your page size, the location of an entry in a physically mapped
>"remote invalidation tag memory" will be different than the location of
>that entry in the virtually mapped cache. This is because the page index
>alone is no longer sufficient to address the cache.
>
>This problem can be solved by passing part of the virtual page address
>along with the physical address on memory requests. Then the remote tag
>can be addressed with the virtual address and guarantee mapping to the
>same location. But this works only if synonyms are not allowed. 
>
>            .    .    .                                     Is there a
>good solution to snooping a virtually mapped cache with the following
>constraints:
>(1) Synonyms are allowed
>(2) The cache is large enough that the size of a set exceeds the page size
						  ^^^
			      (you don't really mean 'set' here, do you :-) )

The problem of finding virtual aliases in a virtually-addressed cache is not
unique to multiprocessor systems.  A cache deeper than the page size can 
harbor more than one copy of the same line if unrestricted aliases are allowed.
That can cause problems even with a single processor.  I've seen a few
solutions to this problem:

1) Use a high enough degree of cache associativity so that you get the desired
   cache size with a one-page-deep cache.  This was used on the Amdahl 470/V7
   (eight-way set associative!).  This violates constraint 2 (above).

2) Restrict virtual addresses so that no two aliases map into different cache
   sets ('set' used differently here than above).  This is used by Sun systems. 
   It technically violates constraint 1 (but not by much).

3) When doing the snooping, sequentially search all the possible places
   where that data might be (this is done while waiting for memory to
   respond).  This is used in the Amdahl 5890, which has to look in four
   2-element instruction-cache sets and eight 2-element data-cache sets on
   each processor.

My first two examples are from single-processor machines.

-Joe

chris@mimsy.UUCP (Chris Torek) (05/07/88)

[Virtual caches vs bus snooping with physical addresses]

Another possibility---if perhaps somewhat unusual and maybe quite
difficult---would be to be able to spot invalid cache accesses and
fault the original instruction before it is allowed to complete.
That is, allow the instruction to execute using the cached data,
and if the cached data is stale, zap it before it is too late.

I imagine this would get nightmarish when debugging the hardware.
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163)
Domain:	chris@mimsy.umd.edu	Path:	uunet!mimsy!chris

fagin@eleazar.dartmouth.edu (Barry S. Fagin) (05/10/88)

Jeff Collins writes:

>	As far as I know, no one has solved the virtual cache coherency
>	problem yet...

	Check out the SPUR project at Berkeley, a RISC multiprocessor that
uses virtual address caches.

--Barry

bartlett@encore.UUCP (John Bartlett) (05/11/88)

In article <63900014@convex> gruger@convex.UUCP writes:
>
>>/* Written  5:51 pm  May  2, 1988 by jeff@alliant.Sun.Com
>>	...  It is a much more
>>	complicated procedure to "watch" these addresses.  With a virtual
>
>Nah, it ain't that hard.  One can have a virtually mapped, physically tagged
>cache.  As the virtual cache is filled with data, the translated physical
>address from the MMU is written into the tag RAMs.  The physical tags are 
>then used by the bus watcher to selectively invalidate the cache entries.


This sounds easy, but every time I have analyzed this I have come to the
conclusion that one of those tag stores has to be fully associative, to ensure
that the two tag stores will always have the same addresses allocated.  Am I
missing something here?

In our systems, we can't afford the real estate for a fully associative tag
store for each processor cache.


John Bartlett		{ihnp4,decvax,allegra,linus}!encore!bartlett
Encore Computer Corp.
257 Ceder Hill Street
Marlboro, Mass.  01752
(617) 460-0500

Opinions are not necessarily those of Encore Computer Corp.

gruger@convex.UUCP (05/12/88)

>/* Written  9:32 pm  May 10, 1988 by bartlett@encore.Sun.COM in convex:comp.arch */
>
>This sounds easy, but every time I have analyzed this I have come to the
>conclusion that one of those tag stores has to be fully associative, to ensure
>that the two tag stores will always have the same addresses allocated.  Am I
>missing something here?
>
>In our systems, we can't afford the real estate for a fully associative tag
>store for each processor cache.
>

Maybe there is some semantics problem here in the communication...
Which "two tag stores" are you referring to?   By "fully associative"
I take it you mean a true content addressable memory for the entire tag 
memory??  I believe you only need to have high associativity when
searching through multiple sets of your data cache.

Our cache structures consist of:
	a) a virtually addressed data RAM
	b) a virtually addressed validity RAM 
	c) a physically addressed tag RAM
The tag ram is only written as read data returns to the cache.  The 
tag ram is read only as remote processor writes occur, and if there
is a hit, the validity bits are cleared.

All these RAMs certainly take up a lot of space (although we manage
to pull a lot inside gate arrays).  We also reached the conclusion
that we could not afford the real estate of multiple cache sets and
the increased complexity/cost/low-pay-back.

Prior responses have discussed the increasing complexity of the tag RAM
as your data cache gets deeper - you have to search more than just
one tag as your cache size increases beyond the page size.  We have found
it quite effective in a _vector_processing_ machine to have a fairly
small cache equal to page size for scalar operands only.  We bypass vector 
operands around the cache and invalidate entries if a vector load/store
encounters one.  There is NO performance improvement in running
vector data through a cache if you have enough basic bandwidth (which
not all parallel-vector machines have).   Large caches best serve plain
old scalar machines that have to stuff entire data arrays into cache
in order to achieve performance.

Jeff Gruger

bartlett@encore.UUCP (John Bartlett) (05/23/88)

In article <63900015@convex> gruger@convex.UUCP writes:
>Which "two tag stores" are you referring to?   

In our system we have two separate tag stores, one for processor access
to the cache, one for bus invalidation.  This is because the bus and
the processors run on different clocks.

>By "fully associative"
>I take it you mean a true content addressable memory for the entire tag 
>memory??  I believe you only need to have high associativity when
>searching through multiple sets of your data cache.
>
>Our cache structures consist of:
>	a) a virtually addressed data RAM
>	b) a virtually addressed validity RAM 
>	c) a physically addressed tag RAM
>The tag ram is only written as read data returns to the cache.  The 
>tag ram is read only as remote processor writes occur, and if there
>is a hit, the validity bits are cleared.
>
You must have some trick for ensuring complete mapping between your virtual
index and your physical index.  In order to ensure complete mapping, one
of them has to be fully associative (yes, CAM), does it not?  If you limit
the combinations in some way to prevent this problem, don't you take a hit
on hit rate?  (oh, sorry 'bout the bad pun)


John Bartlett		{ihnp4,decvax,allegra,linus}!encore!bartlett
Encore Computer Corp.
257 Ceder Hill Street
Marlboro, Mass.  01752
(617) 460-0500

Opinions are not necessarily those of Encore Computer Corp.

gruger@convex.UUCP (05/25/88)

>/* Written  9:02 pm  May 22, 1988 by bartlett@encore.Sun.COM 
>You must have some trick for ensuring complete mapping between your virtual
>index and your physical index.  In order to ensure complete mapping, one
>of them has to be fully associative (yes, CAM), does it not?  If you limit
>the combinations in some way to prevent this problem, don't you take a hit
>on hit rate?  (oh, sorry 'bout the bad pun)
>
>
>John Bartlett		{ihnp4,decvax,allegra,linus}!encore!bartlett
>Encore Computer Corp.

It's not a very fancy trick.  Physical index = virtual index.  Yes, the
cache has to be smaller than you might like it to be to maximize hit rate.
However, when you work with 5nsec ECL RAMs it's difficult (impossible
today) to find any larger than 4K bits.  The raw cycle time this permits
more than compensates for a small sacrifice of hit rate.

Jeff Gruger
Convex Computer Corp.  {ihnp4!convex!gruger}