[comp.arch] Phys vs Virtual Addr Caches

lm@cottage.WISC.EDU (Larry McVoy) (07/16/87)

Here's a question.  Why do people build their caches to respond to physical
addresses instead of virtual addresses?  Another way to state the question
is: why not put the VM -> PM translation logic next to (in parallel with)
the data cache, rather than before it?

If you cache virtual addresses you can present the address to the cache
as soon as it is generated, with no delay for translation.  At the same time you
are doing the cache lookup you can be doing the translation in case there
is a miss.

Am I missing something or is this the wave of the future?  

Thank you fer yer support,

Larry McVoy 	        lm@cottage.wisc.edu  or  uwvax!mcvoy

bader+@andrew.cmu.edu (Miles Bader) (07/16/87)

> Here's a question.  Why do people build their caches to respond to physical
> addresses instead of virtual addresses?  Another way to state the question
> is: why not put the VM -> PM translation logic next to (in parallel with)
> the data cache, rather than before it?

If different processes have different parts of their virtual address space
mapped to the same physical memory, a physical cache allows them to share the
same cache entries.  Also, each cache entry in a virtual cache has to have a
field describing which address map it's from if you don't want to have to
flush the cache upon context switch, etc.
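
To make that second point concrete, here is a minimal sketch in C of a
virtually-tagged cache entry carrying such a field (all names and sizes
are made up for illustration, not any particular machine's layout):

#include <stdint.h>

#define NUM_SETS   256
#define LINE_BYTES 32

struct vcache_line {
    uint32_t vtag;          /* high bits of the virtual address      */
    uint8_t  asid;          /* which address map this line came from */
    uint8_t  valid;
    uint8_t  data[LINE_BYTES];
};

struct vcache_line vcache[NUM_SETS];

/* A hit requires BOTH the virtual tag and the current process's ASID
 * to match; without the asid field the whole cache would have to be
 * flushed on every context switch. */
int vcache_hit(uint32_t va, uint8_t cur_asid)
{
    struct vcache_line *l = &vcache[(va / LINE_BYTES) % NUM_SETS];
    return l->valid &&
           l->vtag == va / (LINE_BYTES * NUM_SETS) &&
           l->asid == cur_asid;
}

Note that this also illustrates the first point: two processes mapping
the same physical page under different ASIDs can never hit on each
other's entries, so shared data gets cached twice.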

				-Miles

petolino%joe@Sun.COM (Joe Petolino) (07/16/87)

>Here's a question.  Why do people build their caches to respond to physical
>addresses instead of virtual addresses? [ . . . ]
>If you cache virtual addresses you can present the address to the cache
>as soon as it is generated, with no delay for translation.  At the same time you
>are doing the cache lookup you can be doing the translation in case there
>is a miss.
>
>Am I missing something or is this the wave of the future?  

Not only is it the wave of the future, but of the past and present as
well.  Designers of high-performance machines (IBM, Amdahl, and Sun, to name
only a few) have been using virtual-addressed caches for years, mainly for
the speed advantage noted above.

The reason that this is not done universally has to do with cache consistency
problems.  If a word's location in the cache is a function of the virtual
address by which it was last accessed, it can be difficult to find that
word again if another processor (e.g. a DMA controller), or even another
process running on the same processor, tries to access it by a different
virtual address.  Two processes that share a word but don't agree on where it
is in the cache are bound to confuse each other.

There are a number of tricks used to solve this problem so that virtual
addresses can be used to access a cache - I think this topic has been
discussed here before.
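
A toy illustration of the problem, with made-up sizes: in a direct-mapped
virtual cache the set is chosen by virtual-address bits, so two virtual
aliases of the same physical word can live in two different lines.

#include <stdio.h>
#include <stdint.h>

#define LINE_BYTES 32
#define NUM_SETS   256                  /* an 8KB direct-mapped cache */

static unsigned set_of(uint32_t va) { return (va / LINE_BYTES) % NUM_SETS; }

int main(void)
{
    /* With 4KB pages both addresses have page offset 0, so an MMU
     * really could map them to the same physical word; but they differ
     * in bit 12, which is part of the cache index. */
    uint32_t va1 = 0x00010000;          /* indexes set 0   */
    uint32_t va2 = 0x00025000;          /* indexes set 128 */
    printf("alias 1 -> set %u, alias 2 -> set %u\n", set_of(va1), set_of(va2));
    /* A write through va1 leaves a stale copy reachable through va2 --
     * exactly the confusion described above. */
    return 0;
}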

-Joe

tim@amdcad.AMD.COM (Tim Olson) (07/16/87)

In article <3904@spool.WISC.EDU> lm@cottage.WISC.EDU (Larry McVoy) writes:
+-----
| Here's a question.  Why do people build their caches to respond to physical
| addresses instead of virtual addresses?  Another way to state the question
| is: why not put the VM -> PM translation logic next to (in parallel with)
| the data cache, rather than before it?
+-----

The potential benefit of this (assuming an external MMU) is a decrease
in the latency from virtual-address-valid to cache access.  However,
there are also problems:

	1)	Cache tags must include a process-id field (more RAM for
		the tags, larger tag comparators) or the cache
		must be flushed on every context switch (very expensive
		for large caches).

	2)	It is very hard to provide for cache consistency in a
		multiprocessor (or even uniprocessor + I/O, but less so)
		environment; it basically requires a reverse mapping
		from physical address -> virtual address.

All in all, if you can hide the address translation time in a pipeline
stage, you are probably better off using physical caches.

	-- Tim Olson
	Advanced Micro Devices
	(tim@amdcad.amd.com)

ps@celerity.UUCP (Pat Shanahan) (07/16/87)

In article <3904@spool.WISC.EDU> lm@cottage.WISC.EDU (Larry McVoy) writes:
>Here's a question.  Why do people build their caches to respond to physical
>addresses instead of virtual addresses?  Another way to state the question
>is: why not put the VM -> PM translation logic next to (in parallel with)
>the data cache, rather than before it?
>
...
>
>Larry McVoy 	        lm@cottage.wisc.edu  or  uwvax!mcvoy


Virtual address caches are limited in their applications because of the
difficulty of maintaining consistency. It is possible for a single item of
data to have several different addresses. For example, an area of System V
style shared memory can be attached at different virtual addresses by
different processes. If the system either has multiple caches, or does not
purge the cache on context switch, the same shared data can be in the cache
system under several different virtual addresses in different address spaces.

Suppose one of the processes modifies the data. The system has to ensure
that all cache copies of that data are either deleted or updated. If the
cache is real-addressed this is relatively easy.  If it is virtual addressed
the system has the problem of determining all the addresses the data might
have.

There is also a minor difficulty with using virtual addresses for caches
that are not purged on context switch. The virtual address has to be
extended by appending some form of context identifier so that equal virtual
addresses in different address spaces will not be confused.

Virtual addressed caches can work very well, for example for instruction
caches. Instruction modification is a much rarer event than data
modification and can be handled by doing general purges rather than purging
only the specific item.

-- 
	ps
	(Pat Shanahan)
	uucp : {decvax!ucbvax || ihnp4 || philabs}!sdcsvax!celerity!ps
	arpa : sdcsvax!celerity!ps@nosc

amos@nsta.UUCP (Amos Shapir) (07/16/87)

In article <3904@spool.WISC.EDU> lm@cottage.WISC.EDU (Larry McVoy) writes:
>Here's a question.  Why do people build their caches to respond to physical
>addresses instead of virtual addresses?

Well, not all do: CCI's 6/32 (also sold by Harris and Sperry) has virtual
caches; the trouble is, in Unix all user processes use the same range of
virtual addresses - each starts at its own virtual 0.  Having a virtual
cache requires the kernel to either purge the whole cache at context switch,
or manage complicated bookkeeping of who has what in which cache (CCI
does the latter).

>If you cache virtual addresses you can present the address to the cache
>as soon as it is generated, with no delay for translation.

In machines with a physical cache (such as the NS32532), this is accomplished
by an auxiliary Translation Look-aside Buffer (TLB); it should be big enough
to be useful, yet small enough to be purged on every context switch
without a significant reduction in performance.

-- 
	Amos Shapir
National Semiconductor (Israel)
6 Maskit st. P.O.B. 3007, Herzlia 46104, Israel  Tel. (972)52-522261
amos%nsta@nsc.com @{hplabs,pyramid,sun,decwrl} 34 48 E / 32 10 N

rrs@amdahl.amdahl.com (Bob Snead) (07/17/87)

In article <3904@spool.WISC.EDU>, lm@cottage.WISC.EDU (Larry McVoy) writes:
> ........  Why do people build their caches to respond to physical
> addresses instead of virtual addresses? ...
> 
> Am I missing something or is this the wave of the future?  

In fact, it's the wave of the present.  Amdahl 580s have virtually
addressed caches.


Claimer:  "There is no way of exchanging information
that does not demand an act of judgment." - Jacob Bronowski

Disclaimer:  If you perceived opinions in what I have
written they are probably your own and certainly not
Amdahl Corp's.

Bob Snead
Future Computing Technologies
Amdahl Corp.
UUCP: ..!{ihnp4, hplabs, amd, sun, ...}!amdahl!rrs

roy@phri.UUCP (Roy Smith) (07/17/87)

In article <3904@spool.WISC.EDU> lm@cottage.WISC.EDU (Larry McVoy) writes:
> Why do people build their caches to respond to physical addresses instead
> of virtual addresses?

	The scheme Larry describes sure sounds like it would be a win but
for one problem.  How do you deal with invalidating cache lines when some
DMA I/O device writes into the corresponding main memory location?  The I/O
device is generating physical addresses but the cache is keying on virtual
addresses.  Now that I've posed the problem, I'll throw out some possible
answers.

	1) When you store to cache, install VA as the key and store the PA
along with the data.  CPU-generated addresses key on VA, I/O-generated
addresses key on PA for purposes of invalidation.  Makes the cache wider
and adds more key-match logic (sketched at the end of this post).

	2) Do I/O with VA's instead of PA's and have DMA go through the VM
machinery.  Doesn't make the I/O device any more complicated to build (in
fact, doesn't change it one whit) but adds more complexity to the I/O
bus adaptor.

	Add to all this the fact that you can (and very well might want to)
have multiple VA's mapping to the same PA.  Nasty wrinkle all around.
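
A rough sketch in C of option 1, with made-up sizes (real hardware would
compare all lines at once instead of looping):

#include <stdint.h>

#define NUM_LINES  256
#define LINE_BYTES 32

struct dual_line {
    uint32_t vtag;              /* key for CPU lookups                */
    uint32_t pa;                /* line-aligned physical address,
                                   kept only so I/O can find the line */
    int      valid;
    uint8_t  data[LINE_BYTES];
};

struct dual_line dcache[NUM_LINES];

/* A DMA device wrote the memory at physical address 'pa' (line-aligned):
 * hunt down and invalidate any cached copy.  This loop is the extra
 * key-match logic mentioned in option 1. */
void invalidate_by_pa(uint32_t pa)
{
    for (int i = 0; i < NUM_LINES; i++)
        if (dcache[i].valid && dcache[i].pa == pa)
            dcache[i].valid = 0;
}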
-- 
Roy Smith, {allegra,cmcl2,philabs}!phri!roy
System Administrator, Public Health Research Institute
455 First Avenue, New York, NY 10016

montnaro@sprite.steinmetz (Skip Montanaro) (07/17/87)

Some folks from Sun presented a paper on the virtual address cache
mechanism developed for the Sun-3/200 product line at the recent
Usenix conference in Phoenix. It presents the pros and cons of this
scheme (as I recall).

         Skip|  ARPA:      montanaro@ge-crd.arpa
    Montanaro|  UUCP:      montanaro@desdemona.steinmetz.ge.com

"How sweet it is!"	-- The Great One

fdr@apollo.uucp (Franklin Reynolds) (07/17/87)

In article <3904@spool.WISC.EDU> lm@cottage.WISC.EDU (Larry McVoy) writes:
>Here's a question.  Why do people build their caches to respond to physical
>addresses instead of virtual addresses?  Another way to state the question
>is: why not put the VM -> PM translation logic next to (in parallel with)
>the data cache, rather than before it?
>
>If you cache virtual addresses you can present the address to the cache
>as soon as it is generated, with no delay for translation.  At the same time you
>are doing the cache lookup you can be doing the translation in case there
>is a miss.

This idea has merit and some people already build virtual caches.
Isn't the 68020 Icache virtual? I have heard rumors that the
caches of the 68030 will be virtual. However, virtual caches are
tricky. In order to avoid excessive cache flushing you usually
have to include some sort of address space identification tag
for each entry. You also have to decide whether you want to
support mapping different virtual addresses to the same physical
address (a very useful feature for systems that support shared
memory or mapped files).

Franklin Reynolds, Apollo Computer
fdr@apollo.uucp
mit-eddie!apollo!fdr
 

jroberts@attvcr.UUCP (John Roberts) (07/17/87)

In article <3904@spool.WISC.EDU>, lm@cottage.WISC.EDU (Larry McVoy) writes:
> Here's a question.  Why do people build their caches to respond to physical
> addresses instead of virtual addresses?  Another way to state the question
> is: why not put the VM -> PM translation logic next to (in parallel with)
> the data cache, rather than before it?
> 
> Am I missing something or is this the wave of the future?  

Actually, the 3B2/600 (and I would imagine many other machines) uses a
virtual cache scheme.  In the case of the 3B2, it's 6K, partitioned as
4K instruction and 2K data (I think).  As to how the potential pitfalls
of this scheme are handled, I don't know.  Perhaps someone with more
detailed knowledge will post something.

BTW, the 600 will have a multi-processor board this fall.  It's a slave,
but it still may complicate things.

-- 
John M. Roberts            AT&T Canada  Vancouver  BC
(604) 689-8911             {ihnp4!alberta,uw-beaver}!ubc-vision!attvcr!jroberts
What! Me Worry?		   attsayi fsh!would 

corbin@encore.UUCP (07/17/87)

In article <3904@spool.WISC.EDU> lm@cottage.WISC.EDU (Larry McVoy) writes:
>Here's a question.  Why do people build their caches to respond to physical
>addresses instead of virtual addresses?  Another way to state the question
>is: why not put the VM -> PM translation logic next to (in parallel with)
>the data cache, rather than before it?
>
>Larry McVoy 	        lm@cottage.wisc.edu  or  uwvax!mcvoy

Take a look at Prime Computer's architecture; they have been doing virtual
caches for over 15 years.  Parallel lookup on the cache and TLB is faster
than single-threading them, but this type of design can pose serious
problems with shared data if the architecture of the machine and the OS
are not done right.
-- 

Stephen Corbin
{ihnp4, allegra, linus} ! encore ! corbin

corbin@encore.UUCP (07/17/87)

In article <YUzCb7y00UkaU7k0Rj@andrew.cmu.edu> bader+@andrew.cmu.edu (Miles Bader) writes:
>> Here's a question.  Why do people build their caches to respond to physical
>> addresses instead of virtual addresses?  Another way to state the question
>> is: why not put the VM -> PM translation logic next to (in parallel with)
>> the data cache, rather than before it?
>
>If different processes have different parts of their virtual address space
>mapped to the same physical memory, a physical cache allows them to share the
>same cache entries.  Also, each cache entry in a virtual cache has to have a
>field describing which address map it's from if you don't want to have to
>flush the cache upon context switch, etc.
>
>				-Miles


The `Segmented Address Space` architecture of Prime systems solves the
problem of multiple cached entries of the same data and doesn't require
the address map identifier in the cache.  It works as follows:

	A specific number of segments in the virtual space are used for
	sharing and are common to the address space of every process in
	the system.  For example, if segment 1000 is a shared segment then
	every process's virtual segment 1000 will map to the same
	physical segment in memory.  Thus sharing is achieved, duplicate
	cached entries of the same data are avoided, and the mapping for
	the shared data is maintained in one table.

The disadvantage of this approach is that the amount of private memory
each process has available is reduced by the defined amount of shared
space in the system.  On the Prime machines 1/4 of the virtual space
is allocated as shared data (actually 1/2, since the operating system
is embedded in another 1/4 of the virtual space).
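
In rough C, translation under such a scheme might look like this (the
segment sizes, the shared boundary, and both map names are my own
illustration, not Prime's actual layout):

#include <stdint.h>

#define OFF_BITS    20                  /* 1MB segments (illustrative)    */
#define SHARED_BASE 3072                /* top 1/4 of 4096 segments is    */
                                        /* shared by all processes        */

extern uint32_t shared_map[];                        /* one system-wide table */
extern uint32_t private_map(int pid, uint32_t seg);  /* per-process tables    */

uint32_t translate(int pid, uint32_t va)
{
    uint32_t seg = va >> OFF_BITS;
    uint32_t off = va & ((1u << OFF_BITS) - 1);
    uint32_t frame = (seg >= SHARED_BASE)
                   ? shared_map[seg - SHARED_BASE]   /* same in every process */
                   : private_map(pid, seg);
    return (frame << OFF_BITS) | off;
}

Because a shared segment has the same segment number in every process, a
virtually-addressed cache can never hold two differently-named copies of
shared data, which is the point of the scheme.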


-- 

Stephen Corbin
{ihnp4, allegra, linus} ! encore ! corbin

kenm@sci.UUCP (Ken McElvain) (07/17/87)

In article <3904@spool.WISC.EDU>, lm@cottage.WISC.EDU (Larry McVoy) writes:
- Here's a question.  Why do people build their caches to respond to physical
- addresses instead of virtual addresses?  Another way to state the question
- is: why not put the VM -> PM translation logic next to (in parallel with)
- the data cache, rather than before it?
- 
- If you cache virtual addresses you can present the address to the cache
- as soon as it is generated, with no delay for translation.  At the same time you
- are doing the cache lookup you can be doing the translation in case there
- is a miss.

There are quite a few machines that use virtual addresses for the data
cache (Amdahl, Sun, ...).  That way the translation is only needed on
cache misses.  This creates major problems if the virtual address
space can be aliased (two virtual addresses mapping to one physical address),
because IO is usually done with physical addresses and a reverse
translation is needed.  This is apparently not insoluble, since the
Amdahl machines live with it.  Multiprocessor systems have a similar
problem.

Another thing you can do is to limit the cache set size to the virtual
page size.  Then the address translation can happen at the same time
as the tag set access and finish in time to compare with the output
of the tag RAMs.  This doesn't work too well on machines with small
page sizes like the VAX, but is reasonable if the page size is ~4KB
and up.
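
Concretely, under assumed numbers:

/* Ken's constraint: if line size * number of sets is no larger than the
 * page size, every index bit lies in the untranslated page offset, so
 * the TLB lookup can run in parallel with the tag RAM access and only
 * the final tag compare needs the physical address. */
#define PAGE_BYTES 4096                 /* 12 untranslated offset bits */
#define LINE_BYTES 64
#define NUM_SETS   (PAGE_BYTES / LINE_BYTES)    /* 64 sets, bits 6..11 */

static unsigned cache_index(unsigned long addr)
{
    /* Uses only bits 6..11, which are identical in the virtual and
     * physical address, so it doesn't matter which one we pass in. */
    return (addr % PAGE_BYTES) / LINE_BYTES;
}

One way of the cache is then limited to 4KB, so a 16KB cache must be
4-way set-associative; with a VAX-style 512-byte page a way could hold
only 512 bytes, which is why small pages make this awkward.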

Ken McElvain
decwrl!sci!kenm

ps@celerity.UUCP (Pat Shanahan) (07/20/87)

In article <1762@encore.UUCP> corbin@encore.UUCP (Steve Corbin) writes:
...
>
>
>The `Segmented Address Space` architecture of Prime systems solves the
>problem of multiple cached entries of the same data and doesn't require
>the address map identifier in the cache.  It works as follows:
>
>	A specific number of segments in the virtual space are used for
>	sharing and are common to the address space of all processes in
>	the system.  For example, if segment 1000 is a share segment then
>	multiple processes virtual segment 1000 will map to the same
>	physical segment in memory.  Thus sharing is achieved, duplicate
>	cached entries of the same data is avoided and the mapping for
>	the shared data is maintained in one table.
>
...

I'm curious about this. How would one use this to implement, for example,
System V shared memory? The shared memory interfaces seem to allow for
processes to attach the same block of shared memory at different addresses,
and for different processes to use the same virtual address for different
blocks of shared memory.

-- 
	ps
	(Pat Shanahan)
	uucp : {decvax!ucbvax || ihnp4 || philabs}!sdcsvax!celerity!ps
	arpa : sdcsvax!celerity!ps@nosc

thomson@uthub.UUCP (07/20/87)

In article <2798@phri.UUCP> roy@phri.UUCP (Roy Smith) writes:
>	The scheme Larry describes sure sounds like it would be a win but
>for one problem.  How do you deal with invalidating cache lines when some
>DMA I/O device writes into the corresponding main memory location?  The I/O
>device is generating physical addresses but the cache is keying on virtual
>addresses.  

These IOs were presumably scheduled by software, and the software
presumably knows where the device write was directed, so there should be no
difficulty in having the processor(s) do a software invalidate of the
appropriate virtual addresses once the IOs complete.

Note that the transient inconsistency between completion of a device
write to a location and the (possibly much) later software invalidation
does not pose a problem, since the software will already be structured
such that those locations are not read until the IO operation terminates.
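
In rough C, with hypothetical names for the OS and hardware hooks
(dma_start, wait_for_io, and cache_inval_va are stand-ins, not a real API):

extern void dma_start(void *buf, unsigned len);      /* device writes memory */
extern void wait_for_io(void);                       /* sleep until complete */
extern void cache_inval_va(void *buf, unsigned len); /* invalidate by VA     */

void read_block(void *buf, unsigned len)
{
    dma_start(buf, len);        /* device writes using physical addresses  */
    wait_for_io();              /* nothing reads buf until this returns... */
    cache_inval_va(buf, len);   /* ...so the (possibly much) later software
                                   invalidation by virtual address is safe */
}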

-- 
		    Brian Thomson,	    CSRI Univ. of Toronto
		    utcsri!uthub!thomson, thomson@hub.toronto.edu

welland@cbmvax.UUCP (Bob Welland) (07/24/87)

>Here's a question.  Why do people build their caches to respond to physical
>addresses instead of virtual addresses? [ . . . ]
>If you cache virtual addresses you can present the address to the cache
>as soon as it is generated, with no delay for translation.  At the same time you
>are doing the cache lookup you can be doing the translation in case there
>is a miss.
>
>Am I missing something or is this the wave of the future?  

There are a few reasons why people use physical address caches
instead of virtual address caches (to reverse the perspective):

1. Cache consistency is very difficult with virtual address caches.
   This is because virtual addresses are "private" to the process
   they are associated with, while physical addresses are the "normal
   form" for the system as a whole.  Cache consistency is basically
   collision detection.  To detect a collision you need to compare
   addresses.  Normal-form addresses are easy to compare (for
   equality) while private-form addresses require a more complex
   comparison algorithm.

2. Extra tag space is needed for a process ID to distinguish colliding
   virtual addresses from different processes.  It is also necessary
   to flush the cache when you reuse a process ID (or use a big PID
   field).  This can be rather ugly.

3. Often it is possible to use the low-order address bits (which in a
   paging system are untranslated) to access the cache in parallel
   with the address translation.  In VLSI these paths can be very
   well matched.  Usually you use a small content-addressable memory
   (CAM) for address translation and a large RAM for the tags.  This
   often means that address translation is "free" because it is done
   in parallel.  This is easy to do in VLSI but quite difficult with
   random logic, because in random logic it is basically impossible
   to build a CAM, so you end up using some more elaborate translation
   scheme (e.g. Sun's two-level translation table) that makes
   translation time consuming.
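
For illustration, here is what the small CAM in point 3 does, modeled
sequentially in C (in hardware every entry's comparator fires at once;
the sizes are made up):

#include <stdint.h>

#define TLB_ENTRIES 64
#define PAGE_BITS   12

struct tlb_entry { uint32_t vpn; uint32_t pfn; int valid; };
struct tlb_entry tlb[TLB_ENTRIES];

/* Returns the physical frame number for va, or -1 on a TLB miss. */
int64_t cam_translate(uint32_t va)
{
    uint32_t vpn = va >> PAGE_BITS;
    for (int i = 0; i < TLB_ENTRIES; i++)       /* all of these compares */
        if (tlb[i].valid && tlb[i].vpn == vpn)  /* happen at once in CAM */
            return tlb[i].pfn;
    return -1;      /* fall back to the slower table-walk scheme */
}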

Most of the people who build very fast caches build them out of discrete
parts, and so they end up with the translate-then-cache dilemma described
above.  As VLSI technology evolves, building more complex structures will
become possible, allowing the MMU and cache to be one and the same.

So in summary: yes, virtual caches are the wave of the present, but not
(in my mind) the wave of the future.


					Robert Welland


Opinions expressed are my own and not those of Commodore.

aglew@ccvaxa.UUCP (07/27/87)

...> Physical vs. virtual caches.

Some of you may remember a discussion I started last year about systems
where all processes would live in the same virtual address space.
The bottom line was that UNIX fork() makes it highly desirable for
processes to be able to duplicate their address space (although there
are ways around it).

This was prompted by the desire to avoid the consistency problems
inherent in a virtual cache.

Andy "Krazy" Glew. Gould CSD-Urbana.    USEnet:  ihnp4!uiucdcs!ccvaxa!aglew
1101 E. University, Urbana, IL 61801    ARPAnet: aglew@gswd-vms.arpa

ross@hpihoah.HP.COM (Ross LaFetra) (07/28/87)

By the time I read this, many people had already responded about a great many
different machines.  Most of the problems/solutions I'll discuss here have
been seen in pieces elsewhere, but I'll describe them as they pertain to one
machine, the Hewlett-Packard Precision Architecture, of which the
HP9000/Series 840 is an example (that is out there today):

First of all, for further reading, there is a great deal of information on
the hardware, software, architecture, and operating system (UNIX System V
compatible) in the Hewlett-Packard Journal over the last two to three years.

The HP9000/840 (I'll just call it the 840 from now on) has what are known as
virtual/physical caches (split I and D).  What this means is that the
cache is indexed virtually, but checked (by means of tags) physically.  This
has the advantage of using the cache and TLB (the virtual-to-physical
translation cache) in parallel, which was the purpose of the original
basenote.  It also has the advantage of keeping the cache tag small, because
of the limited physical address space (32 bits on HPPA - HP Precision
Architecture), rather than the large virtual address space (48 bits on the
840, up to 64 bits on HPPA).
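
A sketch of such a lookup in C (the sizes match the 840's 128KB cache
mentioned later in this post, but tlb_translate and the layout are my own
illustration, not HP's):

#include <stdint.h>

#define LINE_BYTES 32
#define NUM_SETS   4096                 /* 128KB direct-mapped (illustrative) */

struct pline { uint32_t ptag; int valid; uint8_t data[LINE_BYTES]; };
struct pline dcache[NUM_SETS];

extern uint32_t tlb_translate(uint64_t space, uint32_t offset); /* hypothetical */

int dcache_hit(uint64_t space, uint32_t offset)
{
    unsigned set = (offset / LINE_BYTES) % NUM_SETS;    /* virtual index */
    uint32_t pa  = tlb_translate(space, offset);        /* runs in parallel
                                                           with the set read */
    /* The hit is decided on the PHYSICAL line number, so the tag stays
     * 32 bits wide even though a full virtual address is up to 64 bits.
     * Since HPPA software forbids two virtual names for one physical
     * address (see below), a match here is unambiguous. */
    return dcache[set].valid && dcache[set].ptag == pa / LINE_BYTES;
}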

The problem of consistency with shared memory isn't much of a problem.  Since
the machine supports a large number of spaces (32-bit segments of virtual
memory), you can assign each piece of shared memory its own virtual space.
There are 65536 spaces on the 840, so you are not likely to run out soon.
Thus, you can avoid the need to assign two virtual addresses to the same
physical address; this is prohibited on HPPA machines (by software
convention).

Multi-processor cache consistency (not implemented on the 840 since it is a
uniprocessor) is not a problem either.  Each processor can broadcast the
virtual address as well as the physical address when accessing memory.  In
reality, it is even simpler:  only the cache index needs to be broadcast.

IO presents a bit of a problem, but it is solved from the software side.
Since the OS knows what the IO is going to do, the OS manages the cache
for the IO system.  There are a couple of opcodes that do this.

HPPA solves a lot of the problems associated with purging the cache and TLB
on process switches by use of its large address space.  Each process is
assigned its own space(s), and no purges of either the cache or TLB are
needed on process switches.  Only when spaces are removed is this action
needed.  This allows the 840 to use very large caches (128KB) and large
TLBs (4K entries).  Thus no time is wasted invalidating and reloading
the cache or TLB on process switches (invalidation is typically fast in
other machines, I believe, but the reloading happens by missing the cache
and TLB.  Note that physical caches don't need to do this).

Each process can use its own virtual address zero, because it is really
a virtual offset of zero, where the virtual address is a space and an offset
(each up to 32 bits, for a max of 64 bits in HPPA).

I hope this clarifies some of the issues.  I tried to gather them in one place.
I'm a little weak on the OS side of things, as I'm just a cache designer.

Ross La Fetra
hpda!ross