[net.arch] How Many Virtual Spaces

steve@gondor.UUCP (Stephen J. Williams) (04/19/86)

In article <909@umd5.UUCP> zben@umd5.UUCP (Ben Cranston) writes:
>In article <6581@utzoo.UUCP> henry@utzoo.UUCP (Henry Spencer) writes:
>>In article ?? somebody writes:
>
>>> .... But, to maintain multiple
>>> cache consistency where there are multiple virtual address spaces, caches
>>> have to keep (and be able to associate on) physical addresses.
>
>>A simpler approach is to have the "address" in the cache include a few bits
>>of process number.  This is increasingly common.  It amounts to putting
>>the processes in a common address space for caching purposes without giving
>>the individual processes the ability to address each other's data.
>
>But, this means that shared writable memory becomes tricky.  There are now
>several cache keys under which a given memory word might be stored, one for
>each process that might be accessing the memory.

Personally, I have no trouble with the caches that I plan on.  They
are part of the I/O system, and as such, are kept in REAL memory.  It
is the kernel's job to decide how to use these caches, and it is the
kernel's job to multiplex them amongst the processes.  Note that the
data within the cache has to be given to one process or another, no
matter how many address spaces you have.

Shared writable memory is a different thing altogether.  Shared memory
is a form of communication between processes.  The way that I see this
implemented is by putting one page into the page tables of 2 or more
processes.  (I'm speaking in terms of separate address spaces, of course.)
This is not tricky.
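
The arrangement in C, as a minimal sketch (table sizes and names are
invented, not from any particular kernel):

    /* Two per-process page tables whose entries hold physical frame
     * numbers.  "Sharing" a page is nothing more than writing the
     * same frame number into both tables. */
    #define NPAGES 16

    struct pte { unsigned frame; int valid; };

    struct pte ptbl_a[NPAGES];      /* process A's page table */
    struct pte ptbl_b[NPAGES];      /* process B's page table */

    void share_page(unsigned frame, unsigned vpn_a, unsigned vpn_b)
    {
        ptbl_a[vpn_a].frame = frame;  ptbl_a[vpn_a].valid = 1;
        ptbl_b[vpn_b].frame = frame;  ptbl_b[vpn_b].valid = 1;
    }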

--Scal

aglew@ccvaxa.UUCP (04/21/86)

I'm afraid this is another long post. I go into painful detail about the
advantages of one virtual address space for multiple cache systems;
essentially, the advantages are not qualitative, only quantitative, in that
they permit you to run a multiple cache system a bit faster with less
hardware.

>/* Written  2:52 pm  Apr 18, 1986 by petolino@chronon.UUCP */
>Sharing writable memory among address spaces in a cached system is indeed
>tricky.  As the above postings suggest, problems of consistency arise
>not only between copies of the same physical memory location residing in
>different caches in the system, but also among copies of a physical memory
>location residing in different places in a single cache (by 'consistency'
>we mean making sure that all cached copies of a physical memory location
>accurately reflect any writes that have been done to that location - the
>terms 'cache coherence' and 'data integrity' have also been used for this
>concept).
>
>Many solutions to these problems have been implemented and/or proposed.
(Paraphrased): (1) No shared memory; (2) don't cache shared memory;
(3) single cache location (doesn't help the inter-cache case); (4) temporarily
`own' cached shared memory locations.

I don't think you mentioned (unless that was what you meant by number 4) the
widely used multiple cache synchronization technique of write-through,
either with invalidate, or with update.

Add to this the Snoopy cache techniques from Xerox which attempt to avoid
unnecessary write-throughs.

The write-through techniques all involve having the cache listen to the
system bus (at the same time as they are responding to their own processor's
requests) and updating their own entries that correspond to traffic they see
on the bus. Yes, this requires multiporting. Update according to bus traffic
requires two full write ports; invalidate means that you only have to write
one bit, which the processor's cache port only reads, so it's considerably
less expensive.
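
A sketch of the invalidate flavor in C (sizes and names invented; a
real snoop port is hardware, of course):

    /* Direct-mapped write-through cache snooping bus writes.
     * Invalidation only clears one valid bit on a tag match;
     * update needs a full second write port for the data. */
    #define NLINES  256
    #define LINE(a) (((a) >> 2) & (NLINES - 1))  /* 8 index bits */
    #define TAG(a)  ((a) >> 10)                  /* rest is tag  */

    struct line { unsigned tag, word; int valid; };
    static struct line cache[NLINES];

    /* Called for every write this cache sees on the system bus. */
    void snoop_write(unsigned addr, unsigned data, int update)
    {
        struct line *l = &cache[LINE(addr)];
        if (l->valid && l->tag == TAG(addr)) {
            if (update) l->word  = data;  /* update path       */
            else        l->valid = 0;     /* invalidate: 1 bit */
        }
    }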

Whatever you do, you have to be able to associate on the addresses that you
see at both ports. Consider these configurations:

    Processor A                                     Processor B
        virtual address                             virtual address
        physical address                            physical address
             |----------------Physical Memory--------------------|

There are two places you can put caches for any particular processor: 
before the virtual address gets translated, or after. 

   Processor --virtual address--> :cache: --physical address--> Physical memory
   Processor --virtual address--> --physical address--> :cache: Physical memory

If you are only dealing with one processor that doesn't even have to worry
about things like DMA, and if your virtual mapping doesn't change very often
(depends on your application mix), then there is an advantage in caching on
virtual addresses rather than physical, since you can simply start the cache
at the same time as you start your translation (using your TLB or whatever)
and you can reduce the cycle time for cache hits by that much.  I.e., caching
on virtual addresses can let you go a bit faster.
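
A sketch of the virtual-tagged read path in C (translate() standing in
for the TLB, fetch() for physical memory; both are hypothetical):

    #define NLINES  256
    #define LINE(a) (((a) >> 2) & (NLINES - 1))
    #define TAG(a)  ((a) >> 10)

    struct vline { unsigned tag, word; int valid; };
    static struct vline vcache[NLINES];

    extern unsigned translate(unsigned vaddr);  /* TLB lookup      */
    extern unsigned fetch(unsigned paddr);      /* physical memory */

    unsigned read_word(unsigned vaddr)
    {
        struct vline *l = &vcache[LINE(vaddr)];
        if (l->valid && l->tag == TAG(vaddr))
            return l->word;             /* hit: translation never ran */
        l->tag = TAG(vaddr);            /* miss: translate and refill */
        l->valid = 1;
        return l->word = fetch(translate(vaddr));
    }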

When you have multiple caches, if you want to do bus monitoring, you have
three possible configurations: 

(1) PP caching on your processor's physical addresses and listening to
physical addresses on the bus:

   Processor --virtual address--> --physical address--> :cache: Physical memory
							    \___/
							    /	\
   Processor --virtual address--> --physical address--> :cache: Physical memory

(2) VP caching on your processor's virtual addresses, listening to
physical addresses on the bus:

   Processor --virtual address--> :cache: --physical address--> Physical memory
				     \______/
				     /	    \
   Processor --virtual address--> :cache: --physical address--> Physical memory

(3) VV caching on your processor's virtual addresses and listening to
other processor's virtual addresses:

   Processor --virtual address--> :cache: --physical address--> Physical memory
                             \______/
	                     /	    \
   Processor --virtual address--> :cache: --physical address--> Physical memory

(The fourth possibility, PV, has no advantages.)

PP is what is usually done, but it has the same `problem' (whether it is a
problem depends on how much you are worried about speed) that the cache
doesn't get started until the address has been translated.

VP is alright, in that it means that you can start responding to your own
processor before translating the address, but it doubles the complexity of
your cache in that you have to be able to associate on both physical and
virtual addresses.
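
The doubling shows up directly in the line format; a sketch (field
widths invented):

    /* A VP cache line carries BOTH tags, so the processor port can
     * match on the virtual tag while the bus port matches bus
     * (physical) traffic against the physical tag. */
    struct vp_line {
        unsigned vtag;      /* associate on this for the processor */
        unsigned ptag;      /* associate on this for the bus       */
        unsigned word;
        int      valid;
    };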

VV gives you faster response to your own processor by starting the cache
lookup before address translation begins. It works like this:
    Give virtual address to cache
    If a miss then
	put virtual address on cache synchronization bus
	|| put physical address on physical memory bus
        || if a write, put data on data bus
where || means operations performed in parallel.  The speed advantage isn't
carried over to the bus - the advantage over VP is only that both cache
ports associate on the same thing, the virtual address - but it only works
if all processors share the same virtual address space.

---

Please, don't anyone say `who cares about the little increment of speed you
get from virtual caching?'. Caching is purely about speed - if you're not
worried about speed, don't do caching at all. 

No, sorry, that's not quite just - some people may be happy with the
increment gained from PP caching, and not need to worry about getting that
little bit faster (particularly since a good cache is faster than many of
the microprocessors it might get attached to). But if your goals are speed,
speed, and more speed, then the little bit of gain from V* caching is
tempting, but the extra size necessary for VP caching turns you off.

Andy "Krazy" Glew. Gould CSD-Urbana.    USEnet:  ihnp4!uiucdcs!ccvaxa!aglew
1101 E. University, Urbana, IL 61801    ARPAnet: aglew@gswd-vms

stewart@magic.UUCP (04/23/86)

A physical address cache need not suffer in performance because one must
somehow do the virtual-to-physical translation "first."  The early
Stanford Sun boards, for example, applied the high order bits of the
virtual address to the TLB and the low order bits to the cache tag
memory; the physical page number came out of the TLB just in time
to be compared with the cache tag, giving a hit or miss indication.

The particular implementation had some oddities like the page size
equalling the cache size, but having them be different doesn't hurt
too much.

For example, if the page size is smaller than the cache size (likely,
these days) then it might appear that you need some bits of the
physical page number to address the cache tag memory, but that can be
solved by requiring that a few low order bits of the physical page number
match the low order bits of the virtual page number.   Any real
memory allocator worth its salt should be able to deal with that
kind of restriction.
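
In C, with made-up sizes (4K pages, a 16K direct-mapped cache of
4-byte words, so the index needs 2 bits above the page offset):

    #define COLOR_BITS 2                    /* log2(16K / 4K) */
    #define NLINES     4096                 /* 16K / 4 bytes  */
    #define COLOR(pn)  ((pn) & ((1 << COLOR_BITS) - 1))

    /* Cache index comes straight from the virtual address, so the
     * tag lookup starts in parallel with the TLB read. */
    #define INDEX(va)  (((va) >> 2) & (NLINES - 1))

    /* The allocator-side restriction from the text: a frame may back
     * a virtual page only if their low page-number bits agree. */
    int frame_ok(unsigned vpn, unsigned pfn)
    {
        return COLOR(vpn) == COLOR(pfn);
    }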

If the page number and the cache index overlap by, say, two bits, then one
could also look up the four possible tags and then select, based on
the TLB output.  Kind of a set associative trick....
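
Sketched the same way, with the same made-up sizes (tagmem() is a
hypothetical tag-RAM read):

    extern unsigned tagmem(unsigned index);       /* one tag RAM read */

    #define BELOW_PAGE(va) (((va) >> 2) & 0x3FF)  /* bits under 4K    */

    int cache_hit(unsigned va, unsigned pfn)      /* pfn: TLB output  */
    {
        unsigned base = BELOW_PAGE(va);
        unsigned cand[4];
        int i;
        for (i = 0; i < 4; i++)                 /* hardware would read */
            cand[i] = tagmem(base | (i << 10)); /* these in parallel   */
        return cand[pfn & 3] == pfn;            /* select via TLB bits */
    }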

Let your imagination roam.  -Larry

eric@chronon.UUCP (Eric Black) (04/23/86)

In article <7711@watdaisy.UUCP> dneudoerffer@watdaisy.UUCP (Dave Neudoerffer) writes:
>I think we need some clarification on what cache we're talking about here.
>I agree with Henry that a process identification tag on addresses
>in the memory management translation cache (TLB) is a bonus and saves
>flushes of this cache.
>Also if a data cache is used between processor and memory management, 
>ie caching virtual addresses, then this tag may also be useful.  However,
>as Ben points out, you can get physical address aliasing if two
>processes are accessing the same memory page.
>However, if a data cache is put in the system after the memory management
>hardware, there is no aliasing problem since only physical addresses
>are being cached.  I believe this last setup is the one found in most
>systems with a data cache.
>

Ah, but putting the cache AFTER the memory management hardware puts
the address translation in the critical path for ALL memory accesses,
whether the reference hits the cache or not.

Too bad if TLB lookup (or whatever) is more than an insignificant fraction
of cache access time.  It is easier to implement, though.  Of course,
if the processor is slow enough that time(TLB+cache) is still quick from
the processor's point of view, it doesn't matter.  Nobody would ever
want to buy a faster processor, would they? :-)
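
To put rough numbers on it (purely illustrative): a 40ns TLB in series
with a 60ns cache makes every access cost 100ns, hit or miss; overlapping
the two costs only the 60ns.  Behind a processor with a 200ns cycle either
looks fast; behind a 100ns processor the serial version has already eaten
the whole cycle.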

-- 
Eric Black   "Garbage In, Gospel Out"
UUCP:        {sun,pyramid,hplabs,amdcad}!chronon!eric
WELL:        eblack
BIX:         eblack

jer%peora@peora.UUCP (04/29/86)

> Seems to me that you've worked around to one single virtual address space.
> You've got a fixed number of address bits for your address within a process,
> plus a fixed number of address bits for the process number => one bigger
> virtual address with a fixed number of bits.
>
> What you've proposed is segments, with a one-to-one correspondence between
> segments and processes.

It was my intent to work around to a single virtual address space!  Because
that was what you were asking about.

However, note that this address space has one unusual property, viz., that
if the "process number" bits are all zero (or some other unique value,
but let's say zero WLOG) then you get your "own" process's address space,
just as if you'd filled in your process number there.

I *think* this provides what you concluded you require, viz., a way for
the code to tell "what processor it's running on" (by which I think you
meant "what process it's running on behalf of" -- clearly in a symmetrical
multiprocessor the code doesn't have to know what processor it's running
on at all, except for the small amount of code that actually schedules
processes onto processors).

The issue of "where to put the caches" also applies to "where to put the
address translation hardware".  Since such hardware tends to be slow, you
get better performance if you put it with the processors, so that you have
a number of them working in parallel.  On the other hand, you then have
your basic consistency problem (a favorite topic of mine since my research
back in graduate school involved a model of memory where data objects,
rather than memory locations, had names, in order to avoid this problem),
i.e., keeping multiple copies of what are really the same object consistent.

On the other hand, you can eliminate this problem by putting the translation
hardware out at the memory (which I believe is what was done by
Gottlieb et al. in their supercomputer project, along with putting
some adders and so on out there), but then you only have one of them,
which means it has to be very fast to avoid a bottleneck.  I recall reading
a comment by Gottlieb about that fairly recently, where he was saying he
wished he'd put his memory management at the processors instead.

Regarding the "segments", I have some difficulty with that term because it
seems to constrain a lot of thinking.  What I am actually proposing is that
you have multiple translation tables for the lower-order bits of your
address, one such table per process, and you select the table number based
on the high-order bits of the address.  The processor would also have a
register identifying the current process number, which it would use to fill
in these high-order bits whenever they would otherwise be zero.
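
A sketch of the selection rule with hypothetical widths (an 8-bit
process number over a 24-bit offset):

    #define PROC_SHIFT 24
    #define PROC_OF(a) ((a) >> PROC_SHIFT)

    unsigned current_proc;     /* the current-process register above */

    /* Which per-process translation table handles this access?
     * Process number 0 is shorthand for "my own" address space. */
    unsigned table_for(unsigned gva)
    {
        unsigned p = PROC_OF(gva);
        return p ? p : current_proc;
    }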
-- 
E. Roskos