clif@intelca.intel.com (Ken Shoemaker) (03/04/89)
The following information is taken from the i860 TM 64-Bit Microprocessor data sheet, order number 240296-001. I hope that this posting does not generate a meta-discussion about the appropriateness of the posting. I believe that it contains more technical information than a typical comp.arch posting. We, Intel, will try to answer questions regarding the architecture. However, due to work pressures and the need for approval prior to posting non-technical information, there will probably be a delay.

i860 64-bit Microprocessor Highlights:

Parallel Architecture: 3 Instructions/Clock
 - one integer or control instruction
 - up to two floating-point instructions

High Performance Design
 - 33.3/40 MHz Clock Rate
 - 80 MFLOPs Peak Single Precision
 - 60 MFLOPs Peak Double Precision
 - 64-bit External Data Bus
 - 64-bit Internal Instruction Cache Bus
 - 128-bit Internal Data Cache Bus

Measured Performance with Current Compilers
 - 24 Megawhetstones (40 MHz)
 - 83K Dhrystones (40 MHz)

Highly Integrated
 - 32/64-bit Pipelined Floating-Point Adder and Multiplier
 - 32-bit Integer and Control Unit
 - 64-Bit 3-D Graphics Unit
 - Paging Unit with TLB
 - 4K Byte Instruction Cache
 - 8K Byte Data Cache

The core execution unit controls overall operation of the i860 TM CPU. The core unit executes load, store, integer, bit, and control-transfer operations, and fetches instructions for the floating-point unit as well. A set of 32 32-bit general-purpose registers is provided for the manipulation of integer data. Load and store instructions move 8-, 16-, and 32-bit data to and from these registers. Its full set of integer, logical, and control-transfer instructions gives the core unit the ability to execute complete systems software and applications programs. A trap mechanism provides rapid response to exceptions and external interrupts. Debugging is supported by the ability to trap on data or instruction references.

The floating-point hardware is connected to a separate set of floating-point registers, which can be accessed as 16 64-bit registers or 32 32-bit registers. Special load and store instructions can also access these same registers as 8 128-bit registers. All floating-point instructions use these registers as their source and destination operands. The floating-point control unit controls both the floating-point adder and the floating-point multiplier, issuing instructions, handling all source and result exceptions, and updating status bits in the floating-point status register. The adder and multiplier can operate in parallel, producing up to two results per clock. The floating-point data types, floating-point instructions, and exception handling all support the IEEE Standard for Binary Floating-Point Arithmetic (ANSI/IEEE Std 754-1985).

The floating-point adder performs addition, subtraction, comparison, and conversions on 64- and 32-bit floating-point values. An adder instruction executes in three to four clocks; however, in pipelined mode, a new result is generated every clock. The floating-point multiplier performs floating-point and integer multiply and floating-point reciprocal operations on 64- and 32-bit floating-point values. A multiplier instruction executes in three to four clocks; however, in pipelined mode, a new result can be generated every clock for single precision and every other clock for double precision.
The graphics unit has special integer logic that supports three-dimensional drawing in a graphics frame buffer, with color intensity shading and hidden surface elimination via the Z-buffer algorithm. The graphics unit recognizes the pixel as an 8-, 16-, or 32-bit data type. It can compute individual red, blue, and green color intensity values within a pixel; but it does so with parallel operations that take advantage of the 64-bit internal word size and 64-bit external bus. The graphics features of the i860 microprocessor assume that the surface of a solid object is drawn with polygon patches whose shapes approximate the original object. The color intensities of the vertices of the polygon and their distances from the viewer are known, but the distances and intensities of the other points must be calculated by interpolation. The graphics instructions of the i860 CPU directly aid such interpolation.

The paging unit implements protected, paged, virtual memory via a 64-entry, four-way set-associative memory called the TLB (Translation Lookaside Buffer). The paging unit uses the TLB to perform the translation of logical addresses to physical addresses, and to check for access violations. The access protection scheme employs two levels of privilege: user and supervisor. {Editor's note: the i860 CPU's paging mechanism is the same as the 386 CPU's.}

The instruction cache is a two-way set-associative memory of four Kbytes, with 32-byte blocks. It transfers up to 64 bits per clock (266 Mbyte/sec at 33.3 MHz). The data cache is a two-way set-associative memory of eight Kbytes, with 32-byte blocks. It transfers up to 128 bits per clock (533 Mbyte/sec at 33.3 MHz). The i860 CPU normally uses writeback caching, i.e., memory writes update the cache (if applicable) without necessarily updating memory immediately; however, caching can be inhibited by software where necessary.

The bus and cache control unit performs data and instruction accesses for the core unit. It receives cycle requests and specifications from the core unit, performs the data-cache or instruction-cache miss processing, controls TLB translation, and provides the interface to the external bus. Its pipelined structure supports up to three outstanding bus cycles.

Clif Purkiser
Intel Corp, Santa Clara Microcomputer Division
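{Editor's note: the interpolation described above can be sketched in scalar C. This is a minimal illustration, not Intel's code: the function and buffer names are invented, the Z-buffer is assumed initialized to a far value, and the chip itself steps several pixels at once within a 64-bit word.}

    /* Scanline interpolation of the kind the i860's graphics
     * instructions accelerate: depth (z) and intensity (i) are known
     * at the span endpoints and stepped linearly across the scanline,
     * with the Z-buffer deciding visibility per pixel. */
    void span(int x0, int x1, int y, float z0, float z1,
              float i0, float i1,
              float *zbuf, unsigned char *fbuf, int width)
    {
        if (x1 <= x0)
            return;                          /* degenerate span */
        float dz = (z1 - z0) / (x1 - x0);
        float di = (i1 - i0) / (x1 - x0);
        float z = z0, inten = i0;
        for (int x = x0; x <= x1; x++, z += dz, inten += di) {
            if (z < zbuf[y * width + x]) {   /* in front: visible */
                zbuf[y * width + x] = z;
                fbuf[y * width + x] = (unsigned char)inten;
            }
        }
    }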
mark@mips.COM (Mark G. Johnson) (03/05/89)
Thanks to Clif Purkiser for an informative posting! <208@intelca.intel.com> did raise a question, though:

>Highly Integrated
> - 32/64-bit Pipelined Floating-Point Adder and Multiplier
> - 32-bit Integer and Control Unit
> - 64-Bit 3-D Graphics Unit
> - Paging Unit with TLB
> - 4K Byte Instruction Cache
> - 8K Byte Data Cache

Perhaps the list above is simply incomplete; by an omission it leads to speculations like:

 1. Is there a Floating-Point Divider in hardware?
 2. Are there Floating-Point Divide instructions (IEEE 32b & 64b)
    in the 80860 architecture?
 3. How many clocks does it take to do an IEEE 32b divide? 64b?

Thanks.
-- 
 -- Mark Johnson
MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
...!decwrl!mips!mark        (408) 991-0208
mash@mips.COM (John Mashey) (03/05/89)
In article <208@intelca.intel.com> clif@intelca.intel.com (Ken Shoemaker) writes:
>The following information is taken from the i860 TM 64-Bit Microprocessor
>data sheet order number 240296-001. I hope that this posting does not
>generate a meta-discussion about appropriateness of the posting....

Appropriate posting; thanx; it's much better than seeing random rumors and misinformation, and there's plenty of technical content. There are a few questions, though. I suspect this was just an oversight, as somebody MUST know the answers, but 2 of the numbers need clarification, or they are almost meaningless:

>Measured Performance with Current Compilers

I assume this was measured on real hardware, so are you allowed to say what the memory system looks like? i.e., read latency and write retirement rates, for example? (Of course, for these particular benchmarks it probably doesn't matter too much, since their cache miss rates are negligible :-)

> - 24 Megawhetstones (40 MHz)

1) Was this single precision or double precision?
2) Whichever it was, what was the other one?

> - 83K Dhrystones (40 MHz)

1) Which version: 1.1 or 2.1? I assume this wasn't 1.0, whose numbers are 15% better than 1.1.
2) What level of optimization? Any inlining? Any unusual options? (Like, for example: the manual shows normal use of a frame pointer, which costs 4 cycles/call, but could be suppressed if you know things like alloca won't be used. Since a typical 32-bit RISC would use 30-40 cycles/call, suppressing the fp-manipulation gains about 10%.)
-- 
-john mashey    DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP:    {ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:     408-991-0253 or 408-720-1700, x253
USPS:    MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
seanf@sco.COM (Sean Fagan) (03/06/89)
In article <14616@obiwan.mips.COM> mark@mips.COM (Mark G. Johnson) writes:
>Perhaps the list above is simply incomplete; by an omission it leads to
>speculations like:
> 1. Is there a Floating-Point Divider in hardware?

No.

> 2. Are there Floating-Point Divide instructions (IEEE 32b & 64b)
>    in the 80860 architecture?

No.

> 3. How many clocks does it take to do an IEEE 32b divide? 64b?

Depends. I think it might be somewhere around 30-40 (40-50?), but I'm not sure. It doesn't have divide in hardware; what it has is reciprocal approximations (1.0/x), so you do that (plus a little bit to get rid of the errors), then multiply. Kinda like a Cray, right? 8-)
-- 
Sean Eric Fagan  | "What the caterpillar calls the end of the world,
seanf@sco.UUCP   |  the master calls a butterfly." -- Richard Bach
(408) 458-1422   | Any opinions expressed are my own, not my employers'.
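{Editor's note: the "little bit to get rid of the errors" is Newton-Raphson refinement of the reciprocal seed. A minimal C sketch, assuming an ~8-bit hardware seed and three refinement steps; the seed function below is a stand-in, not the i860's actual reciprocal instruction, and the iteration count is an assumption.}

    #include <stdio.h>

    /* Refine a reciprocal approximation with Newton-Raphson:
     *     r' = r * (2 - x*r)
     * Each step roughly doubles the number of correct bits, so an
     * ~8-bit seed reaches double precision in three steps
     * (8 -> 16 -> 32 -> 64). */
    static double recip_seed(double x)
    {
        return 1.0 / x;     /* stand-in for a hardware approximation */
    }

    double divide(double a, double b)
    {
        double r = recip_seed(b);
        r = r * (2.0 - b * r);      /* ~16 bits */
        r = r * (2.0 - b * r);      /* ~32 bits */
        r = r * (2.0 - b * r);      /* ~64 bits */
        return a * r;               /* a/b = a * (1/b) */
    }

    int main(void)
    {
        printf("%.17g\n", divide(355.0, 113.0));   /* ~pi */
        return 0;
    }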
earl@wright.mips.com (Earl Killian) (03/11/89)
In article <1895@oakhill.UUCP> tomj@oakhill.UUCP (Tom Johnson) and article <21570@shemp.CS.UCLA.EDU>, marc@oahu (Marc Tremblay) discuss the choice of I-cache and D-cache size on the i860.

The reason that most systems have identically sized caches is just simplicity, and secondarily because it is often the best overall performance choice. On a large suite of benchmarks you'll find a large number of programs that miss more in the i-cache, and a large number that miss more in the d-cache. The average result depends on the set of benchmarks chosen, and on other cache parameters, such as the number of words refilled on a cache miss. There is no universal truth. For example, the M/500 had a 16KB I-cache and 8KB D-cache because its performance was more I-cache limited. But if it had a larger cache refill size, it might have become more D-limited, because large refill works better for I than D.

As for why the i860 D-cache is twice the size of the I-cache, my guess is a very simple explanation: the D-cache had to be twice as wide in order to support 128-bit load/store.

I've now simulated i860-style caches on a wide range of programs, and the size of both caches is a problem, with the I-cache more of a bottleneck. The overall cpi is in the range of 2.2-2.7, resulting in >native< MIPS of 12 to 15 at 33.3MHz. A tad far from "150-mips", eh?
-- 
UUCP: {ames,decwrl,prls,pyramid}!mips!earl
USPS: MIPS Computer Systems, 930 Arques Ave, Sunnyvale CA, 94086
mash@mips.COM (John Mashey) (03/11/89)
In article <21570@shemp.CS.UCLA.EDU> marc@cs.ucla.edu (Marc Tremblay) writes:
>For other systems, such as the MIPS M/2000 processor board, or
>the Tadpole VME board based on the 88000, cache sizes for both
>instructions and data are the same.
>Although the sizes of the caches for these system are independent,
>none of them have the "opposite" (having a larger instruction cache).
>Last time I heard, they intended to run UNIX on these machines :-) .
>Based on the choices made at MIPS (2 64k caches), they probably
>obtained simulations that show that performance/cost was optimized
>for equal size caches.

The MIPS M/500 uses 16K I + 8K D cache, having grown up from the 8K I + 8K D predecessor (about 1 BVUP difference by doubling the I-cache). No, the reason was to get the largest caches that were electrically reasonable. We'd lessen the D-cache first....
-- 
-john mashey    DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP:    {ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:     408-991-0253 or 408-720-1700, x253
USPS:    MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
jeff@Alliant.COM (Jeff Collins) (03/11/89)
In article <21570@shemp.CS.UCLA.EDU> marc@cs.ucla.edu (Marc Tremblay) writes:
:In article <1895@oakhill.UUCP> tomj@oakhill.UUCP (Tom Johnson) writes:
::4) Still on the subject of the caches: There is no way to externally
::   invalidate cache lines. This makes the part virtually unusable in multi-
::   processing configurations, since cache coherency cannot be maintained.
:
:Invalidating cache lines externally is not an absolute requirement for
:using caches in a multi-processor environment.
:There are policies that do not require this feature at all.

Which policies are these? How does one share memory in a multiprocessor when you can't have external bus watchers? Actually, never mind that, how do you switch a process from one processor to another (I don't count flushing the D-cache on each context switch as a viable answer)?
newton@kahuna.UUCP (Mike Newton) (03/13/89)
In article <1895@oakhill.UUCP> tomj@oakhill.UUCP (Tom Johnson) writes:
>2) The ISSCC paper indicates a Dhrystone _1.1_ rating of 105,000 Dhrystones
>   at 50 MHz. Why was the 1.1 version used?

WARNING: (mild) flaming

Intel is (in)famous for this. For some good (though biased!) comments on this aspect of Intel marketing, see Motorola's performance briefs for the 680x0 series micros. Over the years I have repeatedly been shocked by Intel's perverse use of benchmarks. Dhrystone 1.1 can have VERY special optimizations done to it to make it run much faster than you would expect.

Remember that in this industry (currently) rule #2 is: if you are behind, promise everything and confuse everyone as much as possible. By announcing this "fast" micro now, they will put doubts into everyone's minds about using a different machine. Later, when the part is "real", it doesn't matter so much if it isn't really that great. Just look at initial promises of the 386 vs. reality. There were a lot of people who waited for it, rather than use the 68020 -- even though it caused product delays.

Note that I am NOT saying that this (i860) is not a good chip. I don't know: [1] the specs, [2] whether it is "real"!

- mike

PS: Disclaimer Disclaimer
>tomj@oakhill.UUCP
>Disclaimer: Motorola does NOT NECESSARILY ENDORSE ANY OR ALL COMMENTS
>            CONTAINED HEREIN. ALL THOUGHTS/COMMENTS ARE PURELY MY OWN.

ps: from who you appear to work for (Motorola), I suspect (but do not imply!) that you already assumed what I wrote above, but could not say it in those words without being blasted to death. I, however, have no (known to me) stake in this -- and have more friends who have worked for Intel than Motorola.
-- 
From the bit bucket in the middle of the Pacific...
Mike Newton                          newton@csvax.caltech.edu
Caltech Submillimeter Observatory    kahuna!newton@csvax.caltech.edu
Post Office Box 4339                 808 935 1909
Hilo Hawaii 96720
"Reality is a lie that hasn't been found out yet..."
mash@mips.COM (John Mashey) (03/14/89)
In article <3024@alliant.Alliant.COM> jeff@alliant.Alliant.COM (Jeff Collins) writes:
......
>:There are policies that do not require this feature at all.
....
>	Actually never mind that, how do you switch a process from one
>	processor to another (I don't count flushing the D-cache on
>	each context switch as a viable answer)?

As I posted in <15016@winchester>, you have to flush the caches on context-switch in a single CPU, much less a multiprocessor.

As the details of the i860 become clearer, it seems obvious that the Intel engineers designed it as a very powerful numerics coprocessor (and at least the Intel engineers we know say so, also), and all the tradeoffs were made in that direction. Another way to put it is that all the design tradeoffs one sees cry out "mini-super"-style design (not supermini or mainframe-style). I'm hardly privy to the internal decisions, but it wouldn't surprise me if it hadn't been "redesigned" (i.e. relabeled) by marketing very late in the game to become a general-purpose chip, for such things as multi-user systems. :-) But don't zing the engineers too much....
-- 
-john mashey    DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP:    {ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:     408-991-0253 or 408-720-1700, x253
USPS:    MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
mash@mips.COM (John Mashey) (03/15/89)
In article <175@kahuna.UUCP> newton@kahuna.UUCP (Mike Newton) writes:
>In article <1895@oakhill.UUCP> tomj@oakhill.UUCP (Tom Johnson) writes:
>>2) The ISSCC paper indicates a Dhrystone _1.1_ rating of 105,000 Dhrystones
>>   at 50 MHz. Why was the 1.1 version used?

Despite the existence of Dhrystone 2.1, it turns out that most people, seeing unlabeled Dhrystones, think it's 1.1, if only because when you backtrack what they meant, that's what you find. At least it was labeled, and their performance document gives numbers for both, and they have a typical ratio. [Now, I still find the number puzzling, but..]
-- 
-john mashey    DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP:    {ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:     408-991-0253 or 408-720-1700, x253
USPS:    MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
brooks@maddog.llnl.gov (Eugene Brooks) (03/15/89)
In article <3024@alliant.Alliant.COM> jeff@alliant.Alliant.COM (Jeff Collins) writes:
>	Which policies are these? How does one share memory in a
>	multiprocessor when you can't have external bus watchers?
>	Actually never mind that, how do you switch a process from one
>	processor to another (I don't count flushing the D-cache on
>	each context switch as a viable answer)?

Although you don't count flushing the D-cache as a viable answer, for large scalable multiprocessor systems such a solution is indeed viable. In particular, if you are not talking about a bus architecture at all, where snooping is not very easy to do, write-through caches which use the volatile keyword in C as a mechanism to force a cache miss on a read are quite useful. Non-bus systems don't have a memory bandwidth problem; they generally have a memory latency problem. The write-through cache can be tolerated because the needed bandwidth is available, and the cache flush is quick because cache lines only need to be forgotten, not written to memory. A single instruction could nail the whole cache.

This is of course somewhat of a tangent to the issue for the i860. The i860's cache is not write-through with explicit user code management of the cache lines, so it does not fit in the "non-bus" scheme mentioned above very well, and the current on-chip cache does not have a coherence strategy useful for bus-based shared memory multiprocessors (other than to keep shared data out of the on-chip cache). For such a small on-chip cache, the "flush the cache on a context switch" is not really much of an issue. I suspect that you have to do this anyway for a single CPU machine, as the cache is a virtual memory cache.

Is the news software incompatible with your mailer too?
brooks@maddog.llnl.gov, brooks@maddog.uucp, uunet!maddog.llnl.gov!brooks
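{Editor's note: a minimal C sketch of the volatile-based scheme Eugene describes. The producer/consumer framing and the variable names are invented for illustration; the point is that reads of data communicated from another processor go through volatile objects, so they cannot be satisfied from a stale cached copy. On the write-through cache described, every write already reaches memory.}

    /* Shared data communicated between processors.  volatile forces
     * each read to be re-issued rather than reusing a cached value. */
    volatile int flag;          /* set by the producer when data is ready */
    volatile double result;     /* the datum being communicated */

    void producer(double r)
    {
        result = r;             /* write-through: reaches memory at once */
        flag = 1;
    }

    double consumer(void)
    {
        while (flag == 0)       /* volatile read: forced cache miss */
            ;                   /* spin until the producer signals */
        return result;          /* fresh from memory, not a stale line */
    }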
doug@ross.UUCP (doug carmean) (03/15/89)
In article <21923@lll-winken.LLNL.GOV> brooks@maddog.llnl.gov (Eugene Brooks) writes:
>Although you don't count flushing the D-cache as a viable answer, for
>large scalable multiprocessor systems such a solution is indeed viable.
>In particular, if you are not talking about a bus architecture at all
>where snooping is not very easy to do, write-through caches which
>use the volatile keyword in C as a mechanism to force a cache miss
>on a read are quite useful. Non-bus systems don't have a memory
>bandwidth problem, they generally have a memory latency problem.
.
It seems to me that you are proposing a multiprocessing system that implements a cache but never actually uses the cache. Your cache scheme uses write-through and then forces misses on the reads. Why even bother implementing a D-cache?
.
>This is of course somewhat of a tangent to the issue for the i860.
>The i860's cache is not write-through with explicit user code
>management of the cache lines so it does not fit in the "non-bus"
>scheme mentioned above very well, and the current on chip cache
>does not have a coherence strategy useful for bus based shared
>memory multiprocessors (other than to keep shared data out of the
>on chip cache). For such a small on chip cache the "flush
>the cache on a context switch" is not really much of an issue.
.
I think what you really mean here is that using such a cache system in an application with heavy context switching is not really much of an issue - you would never want to do it! From what I understand of the i860, you must flush the I-cache, the D-cache and the TLB on a context switch. This seems like a very big penalty to pay every single time you want to switch contexts.
.
>I suspect that you have to do this anyway for a single cpu machine
>as the cache is a virtual memory cache.
.
A virtual memory cache does not necessarily imply that you must flush the cache on every context switch. An easy solution to this problem is to store the context number along with the cache tag entry. This way the context number is compared along with the tag entry to determine whether a cache entry is a hit or not. Also note that there are alternative solutions for virtual caches in a multiprocessing system other than the solution you have presented here. A virtual cache that implements a copyback scheme with bus snooping is very feasible in a multiprocessing environment.
-- 
-doug carmean
-ROSS Technology, 7748 Hwy 290 West Suite 400, Austin, TX 78736
-ross!doug@cs.utexas.edu
rodman@mfci.UUCP (Paul Rodman) (03/15/89)
In article <15213@winchester.mips.COM> mash@mips.COM (John Mashey) writes:
>In article <3024@alliant.Alliant.COM> jeff@alliant.Alliant.COM (Jeff Collins) writes:
>......
>>:There are policies that do not require this feature at all.
>....
>>	Actually never mind that, how do you switch a process from one
>>	processor to another (I don't count flushing the D-cache on
>>	each context switch as a viable answer)?
>
>As I posted in <15016@winchester>, you have to flush the caches
>on context-switch in a single CPU, much less a multiprocessor.

Hmmmm. I'm not sure I understand this; I haven't read your posting, so forgive my ignorance here.

On our single-cpu system we use a process-id-tagged icache, and we assign the process a new hardware "pid" (or "asid", if you prefer :-), which takes essentially no time. Obviously, on a cache miss the cache is written with the hardware pid, and on a read this tag is checked against the current hardware pid. [Sorry for boring those of you that already know this from basic comp.arch 101.]

Once you've cycled around all the hardware pids, you *then* have to actually flush the cache. But this is only one time in 256, 1024, etc., etc.

So, I assume this is what you guys mean when you say "flush the cache".

Bye,
Paul K. Rodman
rodman@mfci.uucp
__... ...__   _.. .   _._ ._ .____ __.. ._
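{Editor's note: the pid/asid tag check Paul describes, as a C sketch. The geometry (direct-mapped, 8K, 32-byte lines) and the 8-bit asid are illustrative assumptions, not any particular machine's.}

    #include <stdint.h>

    #define NLINES 256                /* 8K direct-mapped, 32-byte lines */

    struct line {
        uint32_t tag;                 /* virtual-address tag */
        uint8_t  asid;                /* hardware pid written on refill */
        int      valid;
    };

    struct line icache[NLINES];
    uint8_t current_asid;             /* reassigned on context switch */

    /* A hit requires BOTH the tag and the asid to match.  A newly
     * dispatched process gets a fresh asid, so the old process's
     * lines simply stop matching; a full flush is needed only when
     * the asid space (256 here) has been cycled through. */
    int lookup(uint32_t vaddr)
    {
        struct line *l = &icache[(vaddr >> 5) % NLINES];
        return l->valid
            && l->tag  == vaddr >> 13
            && l->asid == current_asid;
    }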
marc@oahu.cs.ucla.edu (Marc Tremblay) (03/16/89)
In article <3024@alliant.Alliant.COM> jeff@alliant.Alliant.COM (Jeff Collins) writes:
>In article <21570@shemp.CS.UCLA.EDU> marc@cs.ucla.edu (Marc Tremblay) writes:
>:Invalidating cache lines externally is not an absolute requirement for
>:using caches in a multi-processor environment.
>:There are policies that do not require this feature at all.
>
>	Which policies are these? How does one share memory in a
>	multiprocessor when you can't have external bus watchers?
>	Actually never mind that, how do you switch a process from one
>	processor to another (I don't count flushing the D-cache on
>	each context switch as a viable answer)?

1) As mentioned in another article, making sharable pages uncachable can be a viable answer for various configurations. Remember, not everything is bus-based (e.g. interconnection networks), so broadcasts may not be allowed.

2) Directory-based cache coherency schemes: several papers have been published about this method. One of them is:

	"A New Solution to Coherence Problems in Multicache Systems"
	L.M. Censier and P. Feautrier,
	IEEE Transactions on Computers, Vol. C-27, No. 12, Dec. 1978, pp. 1112-1118.

The basic trick consists of appending a small vector of bits to each main memory block. The vector needs to be N+1 bits long, where N is the number of processors in the system. One bit per cache is necessary to indicate the presence or absence of the block in each cache, and an extra bit indicates whether the block has been modified or not. Reads, writes, and misses are explained in the paper.

A "cheaper" solution has been proposed in:

	"An Economical Solution to the Cache Coherence Problem"
	James Archibald and Jean-Loup Baer,
	11th Annual Symposium on Computer Architecture, June 1984, Ann Arbor MI, pp. 355-371.

In this paper the overhead of adding a vector to each memory block is reduced to 2 bits/block.

Bus watchers are nice, but they are not an absolute necessity for implementing efficient multiprocessor systems.

					Marc Tremblay
					marc@CS.UCLA.EDU
					Computer Science Department, UCLA
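{Editor's note: a sketch of the Censier/Feautrier directory entry just described: N presence bits plus a modified bit per main-memory block. N=8 and the helper names are illustrative assumptions; the invalidate step is exactly what the i860 cannot accept externally, per the discussion above.}

    #include <stdint.h>

    #define NCPUS 8                    /* assumed system size (N) */

    struct dir_entry {                 /* one per main-memory block */
        uint8_t present;               /* bit i set: cached by CPU i */
        int     modified;              /* one CPU holds a dirty copy */
    };

    /* Read miss by CPU c: if some cache holds a dirty copy, it must
     * be written back first; then record CPU c as a sharer. */
    void read_miss(struct dir_entry *d, int c)
    {
        if (d->modified) {
            /* force_writeback(owner(d));  -- hypothetical helper */
            d->modified = 0;
        }
        d->present |= (uint8_t)(1u << c);
    }

    /* Write by CPU c: invalidate every other cached copy, then
     * record CPU c as the sole, dirty owner. */
    void write_miss(struct dir_entry *d, int c)
    {
        /* for each set bit i != c in d->present:
         *     invalidate(i, block);    -- hypothetical helper */
        d->present  = (uint8_t)(1u << c);
        d->modified = 1;
    }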
loving@lanai.cs.ucla.edu (Mike Loving) (03/16/89)
In article <15213@winchester.mips.COM> mash@mips.COM (John Mashey) writes:
>In article <3024@alliant.Alliant.COM> jeff@alliant.Alliant.COM (Jeff Collins) writes:
>...... 
>>:There are policies that do not require this feature at all.
>....
>>	Actually never mind that, how do you switch a process from one
>>	processor to another (I don't count flushing the D-cache on
>>	each context switch as a viable answer)?
>
>As I posted in <15016@winchester>, you have to flush the caches
>on context-switch in a single CPU, much less a multiprocessor.

A popular misconception. It is NOT necessary to flush the cache on a context switch. If your cache is physically addressed and you do not include PIDs in the tags, then yes you do have to. An example circumventing this is the new HP machines which use a virtually addressed (no address xlat delay) cache and do not flush the cache on context switches.
-------------------------------------------------------------------------------
Mike Loving
loving@lanai.cs.ucla.edu
. . . {hplabs,ucbvax,uunet}!cs.ucla.edu!loving
-------------------------------------------------------------------------------
brooks@vette.llnl.gov (Eugene Brooks) (03/16/89)
In article <222@ross.UUCP> doug@ross.UUCP (doug carmean) writes, with regard to a write-through, explicitly managed cache system:
>It seems to me that you are proposing a multiprocessing system that
>implements a cache but never actually uses the cache. Your cache
>scheme uses write through and then forces misses on the reads. Why
>even bother implementing a D-cache?

You don't force misses on all the reads. You force misses only on reads in your shared-memory parallel program that you know are of data communicated from another processor. As an example, consider a parallel linear system solver using Gauss elimination. It is quite easy to write a parallel algorithm which explicitly manages the cache. The last row of the matrix is reused N times, where N is the matrix dimension, before it is communicated to the rest of the processors. Even explicit cache flushing for the communication could be used, but the cost of doing this is horrible context switching overhead for cache sizes large enough to be useful. Your notion of including a context descriptor in the cache line is useful for this, but one will still pay a cost when you need to clear the cache of a specific descriptor upon process death. At least it is a better situation than having to clear the cache on every context switch.

>I think what you really mean here is that using such a cache system
>in an application with heavy context switching is not really much
>of an issue - you would never want to do it! From what I understand
>of the i860, you must flush the I-cache, the D-cache and the TLB on
>a context switch. This seems like a very big penalty to pay every
>single time you want to switch contexts.

Agreed, or as in the i860, the size of the cache you are willing to treat in this manner would be quite small.

>other than the solution you have presented here. A virtual cache
>that implements a copyback scheme with bus snooping is very
>feasible in a multiprocessing environment.

No doubt there are several ways to do this. I made no claim that you couldn't. I only indicated what one might want to do for a multiprocessor using a multistage interconnection network to the memory modules, where snooping might not be that practical an option. Microprocessor speeds are cranking up to the point that the number you could hang on a bus will be very limited, even with the very best of write-back coherent cache protocols. The fact that the whole processor, memory management, cache, etc., is now appearing on one chip is making systems with large numbers of processors VERY feasible. The processors are free, and it's the memory subsystem which will cost the bucks. This will likely drive the commercial development of scalable shared memory systems soon. Scalable message passing systems, of course, are already here.

Is the news software incompatible with your mailer too?
brooks@maddog.llnl.gov, brooks@maddog.uucp, uunet!maddog.llnl.gov!brooks
tim@crackle.amd.com (Tim Olson) (03/16/89)
In article <21795@shemp.CS.UCLA.EDU> loving@cs.ucla.edu (Mike Loving) writes:
| In article <15213@winchester.mips.COM> mash@mips.COM (John Mashey) writes:
| >As I posted in <15016@winchester>, you have to flush the caches
| >on context-switch in a single CPU, much less a multiprocessor.
|
| A popular misconception. It is NOT necessary to flush the cache on a context
| switch. If your cache is physically addressed and you do not include PIDs in
| the tags, then yes you do have to. An example circumventing this is the new
| HP machines which use a virtually addressed (no address xlat delay) cache and
| do not flush the cache on context switches.

The context (no pun intended ;-) of John Mashey's posting was the i860, which must have its caches flushed on context switches, even in uniprocessor configurations.

I believe you have swapped the terms "physically" and "virtually" above -- you need PIDs in *virtually* addressed caches to avoid address aliasing problems. This assumes that you have a write-through cache.

Has anyone tried to use virtual, copy-back caches with PIDs to prevent flushing like the i860 requires? Problems of how to save modified data, as well as the consistency of shared memory, come to mind...
-- 
	Tim Olson
	Advanced Micro Devices
	(tim@amd.com)
mash@mips.COM (John Mashey) (03/16/89)
In article <706@m3.mfci.UUCP> rodman@mfci.UUCP (Paul Rodman) writes:
>In article <15213@winchester.mips.COM> mash@mips.COM (John Mashey) writes:
>>As I posted in <15016@winchester>, you have to flush the caches
>>on context-switch in a single CPU, much less a multiprocessor.
>
>Hmmmm. I'm not sure I understand this, I haven't read your posting
>so forgive my ignorance here.
...Description of classic process-tagged cache, etc...

The <15016@winchester> posting was a discussion of the i860 mechanism. There are, of course, numerous ways to avoid cache flushes per context-switch, or more often, even with virtual caches. I probably should have written the following: "on context-switch in a single i860, much less a multiprocessor i860."
-- 
-john mashey    DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP:    {ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:     408-991-0253 or 408-720-1700, x253
USPS:    MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
mash@mips.COM (John Mashey) (03/16/89)
In article <21795@shemp.CS.UCLA.EDU> loving@cs.ucla.edu (Mike Loving) writes:
>In article <15213@winchester.mips.COM> mash@mips.COM (John Mashey) writes:
>>As I posted in <15016@winchester>, you have to flush the caches
>>on context-switch in a single CPU, much less a multiprocessor.
>
>A popular misconception. It is NOT necessary to flush the cache on a context
>switch. If your cache is physically addressed and you do not include PIDs in
>the tags, then yes you do have to. An example circumventing this is the new
>HP machines which use a virtually addressed (no address xlat delay) cache and
>do not flush the cache on context switches.

Apparently <15016@winchester> got lost, or people 'n'd it. This is NOT a popular misconception. If you have a "simple" virtual-addressed/virtual-tagged cache, i.e., as in the i860, with neither pids/asids nor the (more complex) segment-style scheme of HP PA, then you will flush the caches on context switches, and you might do it more often, depending on the TLB/cache interactions and how tricky the OS wants to get in deferring flushes.

Physically-addressed caches don't need any flushes on context-switch or re-maps; you must have meant "If your cache is virtually addressed".
-- 
-john mashey    DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP:    {ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:     408-991-0253 or 408-720-1700, x253
USPS:    MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
doug@ross.UUCP (doug carmean) (03/17/89)
In article <24869@amdcad.AMD.COM> tim@amd.com (Tim Olson) asks:
>Has anyone tried to use virtual, copy-back caches with PIDs to prevent
>flushing like the i860 requires? Problems of how to save modified data,
>as well as the consistency of shared memory come to mind...
.
Cypress will be offering two parts that both implement a virtual, copy-back cache with context numbers. The CY7C604 supports 4096 contexts in an on-chip 2K tag. The '605 will be a multiprocessing version of the '604 which will have both virtual and physical tags to support bus snooping. Note that both parts support both copy-back and write-through policies.
-- 
-doug carmean
-ROSS Technology, 7748 Hwy 290 West Suite 400, Austin, TX 78736
-ross!doug@cs.utexas.edu
jeff@Alliant.COM (Jeff Collins) (03/17/89)
In article <706@m3.mfci.UUCP> rodman@mfci.UUCP (Paul Rodman) writes:
>In article <15213@winchester.mips.COM> mash@mips.COM (John Mashey) writes:
>>As I posted in <15016@winchester>, you have to flush the caches
>>on context-switch in a single CPU, much less a multiprocessor.
>
>Once you've cycled around all the hardware pids, you *then* have to
>actually flush the cache. But this is only one time in 256, 1024, etc., etc.
>
>So, I assume this is what you guys mean when you say "flush the cache".

Yep, that is the way most people do it, but my understanding (I could be wrong) is that the i860 does not have hardware pids. In other words, the TLB only maintains the virtual address -- meaning that you have to flush on EACH context switch (I wouldn't have brought it up if there were hardware pids). It is hard for me to believe that this is really the case, but that is the impression that I have gotten (my only information is the net and _Microprocessor Report_).
jeff@Alliant.COM (Jeff Collins) (03/17/89)
In article <21784@shemp.CS.UCLA.EDU> marc@cs.ucla.edu (Marc Tremblay) writes:
>In article <3024@alliant.Alliant.COM> jeff@alliant.Alliant.COM (Jeff Collins) writes:
>>In article <21570@shemp.CS.UCLA.EDU> marc@cs.ucla.edu (Marc Tremblay) writes:
>>:Invalidating cache lines externally is not an absolute requirement for
>>:using caches in a multi-processor environment.
>>
>>	Which policies are these? How does one share memory in a
>>	multiprocessor when you can't have external bus watchers?
>
>1) As mentioned in another article, making sharable pages uncachable
>   can be a viable answer for various configurations. Remember, not
>   everything is bus-based (e.g. interconnection networks),
>   so broadcasts may not be allowed.

You are right. My bias for shared memory, symmetric, bus-based multiprocessors with snoopy caches is so strong that I forget about interconnection networks.

>2) Directory-based cache coherency schemes: several papers have been
>   published about this method. One of them is:
>
>Bus watchers are nice, but they are not an absolute necessity for
>implementing efficient multiprocessor systems.

Actually, I know how directory-based cache coherency works, but you still need the ability to invalidate the internal cache from external logic. The memory will detect that a cache location on a processor needs to be invalidated and send an invalidate signal (with the physical address) to the CPU card. The CPU card must then (because the internal cache is virtual) do a reverse TLB lookup and then signal the internal cache to invalidate. All of this works just fine, with two exceptions on the i860: 1) it is not possible to do the reverse TLB lookup (not enough information external to the processor), and 2) there is no way to send the invalidate (even if you could figure out the correct virtual address to invalidate).

I stand by my claim. There is no way to put this processor into a shared memory, bus-based, symmetric multiprocessor unless you disable the D-cache or make all software that might share data manage the D-cache.
viggy@hpsal2.HP.COM (Viggy Mokkarala) (03/18/89)
John Mashey writes:
>This is NOT a popular misconception. If you have a "simple"
>virtual-addressed/virtual-tagged cache, i.e., as in the i860, with
>neither pids/asids nor the (more complex) segment-style scheme of HP PA,
>then you will flush the caches on context switches, and you might do it
>more often, depending on the TLB/cache interactions and how tricky the
>OS wants to get in deferring flushes.
>Physically-addressed caches don't need any flushes on context-switch
>or re-maps; you must have meant "If your cache is virtually addressed"
>-- 
>-john mashey    DISCLAIMER: <generic disclaimer, I speak for me only, etc>

Just to set the record straight about HP PA, I'd like to make the following observations:

1. The "synonym" problem is usually not present. Only the most privileged software can create a synonym situation, and it has the responsibility of managing the aliasing situations it creates.

2. The "segment-style" scheme that John refers to actually alludes to the fact that all processes share the same global virtual address space, but that individual processes are given their own SPACES (segments, if you want to think of them that way) for instructions and data, except for when they share data (in which case the virtual address is the same). Specified in space registers, each space is 4 GBytes long (in HP PA, there can be up to 4 G spaces). The space portion of the address is a part of the virtual address, which can be up to 64 bits wide.

3. Since the caches are virtually indexed, this scheme prevents the occurrence of the synonym problem (no two virtual addresses will map to the same physical address) and therefore the need to sweep the caches on a context switch.

Viggy Mokkarala, Hewlett Packard Company
viggy@hpda.hp.com
robertb@june.cs.washington.edu (Robert Bedichek) (03/19/89)
In article <3040@alliant.Alliant.COM> jeff@alliant.Alliant.COM (Jeff Collins) writes:
>	Yep, that is the way most people do it, but my understanding
>(I could be wrong) is that the i860 does not have hardware pids. In
>other words the TLB only maintains the virtual address - meaning that
>you have to flush on EACH context switch (I wouldn't have brought it
>up if there were hardware pids). It is hard for me to believe that
>this is really the case, but that is the impression that I have gotten
>(my only information is the net and _Microprocessor Report_).

What benefit would hardware PIDs give the i860? Hardware PIDs make sense if you have a large cache and frequent context switches, where it's likely that data from a process will stay in the cache long enough so that it is still in the cache when the process is resumed, IMHE (In My Humble Estimation). Otherwise, the hardware PID decreases performance slightly because you have to maintain them, and just spreads the cache-flush and reload time over a longer period.

If you find your system spending a lot of time flushing the cache, then try to decrease the number of context switches per second. I think some workstations have a higher context switch rate than is necessary for good response. Also, if you have a multiprocessor, you don't need as high a switch rate. In fact, while the number of processes is less than or equal to the number of processors, you don't need to context switch at all. OS people find it too easy to solve response time problems by just turning up the context switch rate. There are other ways to do it, like figuring out which task that is ready to run is likely to be the one that the user is waiting for.

Now this doesn't mean that you can't interrupt frequently for the purposes of, say, profiling. You don't have to flush the i860's cache/tlb when you switch to supervisor mode, do you? That would be pretty bad, IMHE.

	Rob Bedichek  (robertb@cs.washington.edu)
"When the last snickerdoodle is eaten, and the last
Safeway is closed, you will discover that you can not eat money."
marc@oahu.cs.ucla.edu (Marc Tremblay) (03/19/89)
In article <3041@alliant.Alliant.COM> jeff@alliant.Alliant.COM (Jeff Collins) writes:
>	Actually I know how directory based cache coherency works, but you
>still need the ability to invalidate the internal cache from external logic.
>The memory will detect that a cache location on a processor needs to be
>invalidated and send an invalidate signal (with the physical address) to
>the CPU card. The CPU card must then (because the internal cache is virtual)
>do a reverse TLB lookup and then signal the internal cache to invalidate.

Using a directory-based cache coherency method with processors that do not have the capability to have their internal cache entries invalidated externally, the memory controller has to send a vectored interrupt to one of the processors, which can then invalidate the correct cache line. I don't know how much flexibility the "flush" instruction of the i860 offers (the data sheet does not say much), but it should be possible to do it; if not, then too bad :-) .

					Marc Tremblay
					marc@CS.UCLA.EDU
					Computer Science Department, UCLA
mash@mips.COM (John Mashey) (03/19/89)
In article <2280004@hpsal2.HP.COM> viggy@hpsal2.HP.COM (Viggy Mokkarala) writes:
>John Mashey writes:
>>This is NOT a popular misconception. If you have a "simple"
>>virtual-addressed/virtual-tagged cache, i.e., as in the i860, with
>>neither pids/asids nor the (more complex) segment-style scheme of HP PA,
>>then you will flush the caches on context switches, and you might do it
>>more often, depending on the TLB/cache interactions and how tricky the
>>OS wants to get in deferring flushes.
>Just to set the record straight about HP PA, I'd like to make the following
>observations:
(description of HP PA scheme - thanx)

Sorry if my complex sentence caused confusion. It was trying to say:
	IF "simple" scheme, THEN flush caches on context switch
	ELSE /* PIDs or segment/space scheme of HP PA */ don't flush caches.

HP PA is clearly one of the RISCs designed with attention to multi-tasking, since it, for example, manages to use virtual-addressed caches while still sharing sharable code & data in the cache efficiently, unlike simple PID schemes.

For example, suppose, in a multi-user system using a simple cache-PID scheme, you have several people using the same program (such as an editor, compiler, DBMS client, etc., or one other program described later). Suppose you have executed process A, using that program, and you context-switch to process B, using that same program. You switch to the new process's PID, and even though it may be executing the exact same code, it I-cache-misses on all of it, because it has the wrong PID. Same thing happens for shared libraries. Same thing happens if it uses shared data, which things like DBMSs do, or perhaps, on some systems, X-clients & X-server; only this time it D-cache misses. This is especially exciting for dirty data: A writes some data. B attempts to read it. Since the PIDs mismatch, you flush it to memory, then you re-read it to get one there with the right copy.

Does this matter? If the caches are really small, then not much, as far as I can tell, although there isn't a lot of data published on this kind of thing, for good reason. {If you know a lot about this, you're probably a vendor who treats this as serious competitive info.} However, high-performance systems don't have small caches, and the caches MUST continue to grow, since DRAM refuses to offer seriously-better access times.

Note that HP PA avoids most of this problem via the space identifiers, which are, in some sense, a collection of PIDs or ASIDs. Thus, if you switch to a different process that's using the same program, the second process's space-maps (or whatever they call them) can at least point at the same code as did the first process, and there can be shared data regions, etc., without requiring cache-misses to refill the cache. The obvious issue that's left is doing fork-with-copy-on-write, as that looks a bit difficult to do in the classical way. Still, the scheme OBVIOUSLY was thinking of multi-tasking environments with efficiently-shared code and data, in the presence of large caches. (Ross Bott of Pyramid gave a good talk at Uniforum that included some discussions of cache issues in OLTP environments.)

Oh, the one other program that's real important is the UNIX kernel itself, which, in some sense, is often treated as a giant shared library. It has terrible locality, and so you have to kick and scratch to get every % of hit rate you can. Note that the most straightforward of PID schemes causes considerable extra thrashing around in the presence of serious context-switch rates. The typical choices come down to letting the kernel use the same PID as the current user program, which means you communicate well with it, or using a specific pid for the kernel itself, which means better hit rates for the kernel code itself, but more overhead in dealing with user processes, or combinations that switch back and forth. (Maybe somebody that does this kind of thing can post some useful info.)

Of course, numerous variations are possible, but the basic idea is: if you're using simple virtual caches, with no PIDs, or even with PIDs, you're probably thinking more about single-user performance than you are about multi-tasking performance. (There's nothing wrong with that tradeoff, of course, if that's what you're trying to do.)
-- 
-john mashey    DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP:    {ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:     408-991-0253 or 408-720-1700, x253
USPS:    MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
jeff@Alliant.COM (Jeff Collins) (03/20/89)
In article <7630@june.cs.washington.edu> robertb@uw-june.UUCP (Robert Bedichek) writes:
>In article <3040@alliant.Alliant.COM>
> jeff@alliant.Alliant.COM (Jeff Collins) writes:
>>	Yep, that is the way most people do it, but my understanding
>>(I could be wrong) is that the i860 does not have hardware pids.
>
>What benefit would hardware PIDs give the i860? Hardware PIDs make
>sense if you have a large cache and frequent context switches, where
>it's likely that data from a process will stay in the cache long enough
>so that it is still in the cache when the process is resumed, IMHE (In My
>Humble Estimation). Otherwise, the hardware PID decreases performance
>slightly because you have to maintain them, and just spreads the
>cache-flush and reload time over a longer period.
>
> <comments on increasing context switch time deleted>

It is true that an 8k data cache is fairly small, and hardware PIDs for a cache of that size may not be a win - all my mail was attempting to say, however, was that the cache must be flushed at each context switch. I, personally, am not a fan of caches that must be flushed at each context switch - this is independent of the rate of context switches (having to flush the caches will always be slower than not having to flush them). Yes, you can decrease the average cost of the cache flush by having longer times between context switches. The real objection is more from an architectural point of view: what happens if/when Intel increases the size of the D-cache? Will they then add hardware PIDs?
doug@ross.UUCP (doug carmean) (03/21/89)
In article <15531@winchester.mips.COM> John Mashey writes:
...
>	Suppose you have executed process A, using that program,
>	and you context-switch to process B, using that same program.
>	You switch to the new process's PID, and even though it may
>	be executing the exact same code, it I-cache-misses on all of it,
>	because it has the wrong PID. Same thing happens for shared libraries.
>	Same thing happens if it uses shared data, which things like
>	DBMSs do, or perhaps, on some systems, X-clients & X-server;
>	only this time it D-cache misses. This is especially exciting
>	for dirty data: A writes some data. B attempts to read it.
>	Since the PIDs mismatch, you flush it to memory, then you
>	re-read it to get one there with the right copy.
...

Mr. Mashey's description is entirely accurate for a system that uses a very simple cache controller, i.e. one that does not detect aliases. Higher performance systems will check the physical address of the requested virtual address against the physical address of the cache line that is to be replaced. If the two physical addresses match, the cache is used to handle the access; else the cache miss is processed as a normal miss. In most implementations, the aliased data/instruction can be provided with only a one cycle penalty over the cache hit access.

An implementation might choose to overwrite the cache tag context with the new context so that subsequent accesses will hit cache without the alias detection penalty (assuming the context switching scenario that Mr. Mashey proposed). If yet another context switch occurs, the same sequence of operations would occur, with the first access incurring the alias detection penalty and subsequent accesses hitting the cache.

This approach may not offer quite the performance or the glamour that the HP PA offers, but it is considerably higher performance than the approach Mr. Mashey outlined.
-- 
-doug carmean                        ross!doug@cs.utexas.edu
-ROSS Technology, 7748 Hwy 290 West Suite 400, Austin, TX 78736
-disclaimer: <insert anything you like here>
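{Editor's note: the alias check Doug describes, sketched in C. The structure layout, the tag arithmetic, and the translate() helper are assumptions for illustration; the essential point is comparing physical addresses before treating a tag/context mismatch as a real miss.}

    /* Each line keeps the physical address alongside the virtual tag
     * and context, so an apparent miss can be checked for aliasing. */
    struct vline {
        unsigned vtag, context;
        unsigned paddr;                /* physical address of the line */
        int      valid;
    };

    extern unsigned translate(unsigned vaddr);  /* TLB lookup (assumed) */

    int access(struct vline *l, unsigned vaddr, unsigned ctx)
    {
        if (l->valid && l->vtag == vaddr >> 13 && l->context == ctx)
            return 1;                  /* ordinary hit */
        if (l->valid && l->paddr == translate(vaddr)) {
            l->vtag = vaddr >> 13;     /* alias: data is already here;  */
            l->context = ctx;          /* retag it -- one extra cycle,  */
            return 1;                  /* no refill from memory         */
        }
        return 0;                      /* genuine miss: refill as usual */
    }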
w-colinp@microsoft.UUCP (Colin Plumb) (03/21/89)
jeff@alliant.Alliant.COM (Jeff Collins) wrote:
> Yep, that is the way most people do it, but my understanding
> (I could be wrong) is that the i860 does not have hardware pids. In
> other words the TLB only maintains the virtual address - meaning that
> you have to flush on EACH context switch (I wouldn't have brought it
> up if there were hardware pids). It is hard for me to believe that
> this is really the case, but that is the impression that I have gotten
> (my only information is the net and _Microprocessor Report_).

This is the case. You have to flush the cache every context switch. (Since it's a write-back cache, you have to at least flush the pending writes before messing with the page table, anyway.)

For those interested, as far as I can tell, the i860's flush instruction is half a load, that forces the write-back (you play with bits in a status register to force a given set to be used) but loads a bogus value into the cache and the destination register. I don't think the cache has valid bits.
-- 
	-Colin (uunet!microsoft!w-colinp)
"Don't listen to me. I never do." - The Doctor
jeff@Alliant.COM (Jeff Collins) (03/21/89)
In article <21972@shemp.CS.UCLA.EDU> marc@cs.ucla.edu (Marc Tremblay) writes:
>Using a directory based cache coherency method with processors that do not
>have the capability to have their internal cache entries invalidated
>externally, the memory controller has to send a vectored interrupt to one of
>the processors, which can then invalidate the correct cache line.
>I don't know how much flexibility the "flush" instruction of the i860 offers
>(the data sheet does not say much), but it should be possible to do it;
>if not, then too bad :-) .

Good point. But still, all that the memory can do is send the physical address to invalidate (it is all that it knows). This still means that the interrupt handler has to perform a reverse translation from physical to virtual (after determining which virtual space to use - actually I guess the current virtual space is all that needs to be looked at, as the D-cache must have been flushed at the last context switch), and only then can it flush the appropriate cache line. (I infer, from the fact that a complete D-cache flush is a loop, that it is possible to flush a particular line from the D-cache.) Doing this conversion is possible, but I think the conversion involves an exhaustive search of the process's page table (this can probably be optimized).

Oh, by the way, while you are doing the vectored interrupt, the reverse translation and the invalidate, don't you have to hold off the memory request that initiated this whole process (the memory has to wait until the flush or invalidate happens before it can reply)? Can you say memory latency?
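{Editor's note: the reverse translation Jeff describes amounts to a search of the current page table for the physical frame. A sketch, assuming a simple one-level table with a present bit in bit 0; the i860's actual tables are two-level, 386-style, so the real search is nested.}

    #define NPTES   1024
    #define PGSHIFT 12                 /* 4K pages */

    unsigned page_table[NPTES];        /* pte i maps virtual page i */

    /* Find a virtual address mapping the given physical address, so
     * the right cache line can be flushed.  Exhaustive, O(table size),
     * which is why holding off the memory request meanwhile is so
     * painful for latency. */
    long phys_to_virt(unsigned paddr)
    {
        unsigned frame = paddr >> PGSHIFT;
        for (int i = 0; i < NPTES; i++)
            if ((page_table[i] & 1) &&                   /* present? */
                (page_table[i] >> PGSHIFT) == frame)
                return ((long)i << PGSHIFT)
                     | (paddr & ((1 << PGSHIFT) - 1));
        return -1;                                       /* not mapped */
    }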
andrew@frip.wv.tek.com (Andrew Klossner) (03/22/89)
[] "Still on the subject of the caches: There is no way to externally invalidate cache lines. This makes the part virtually unusable in multi-processing configurations, since cache coherency cannot be maintained." "Invalidating cache lines externally is not an absolute requirement for using caches in a multi-processor environment. There are policies that do not require this feature at all." Furthermore, you can achieve the necessary effect by sending the other CPU a message telling it to invalidate its cache lines. If exception handling (get in, do it, get out) is fast enough, and if you don't do this every few nanoseconds, then the performance degradation shouldn't be a big deal. You don't absolutely need an external invalidation signal in hardware for multiprocesisng. -=- Andrew Klossner (uunet!tektronix!orca!frip!andrew) [UUCP] (andrew%frip.wv.tek.com@relay.cs.net) [ARPA]
w-colinp@microsoft.UUCP (Colin Plumb) (03/22/89)
marc@cs.ucla.edu (Marc Tremblay) wrote:
> Using a directory based cache coherency method with processors that do not
> have the capability to have their internal cache entries invalidated
> externally, the memory controller has to send a vectored interrupt to one
> of the processors, which can then invalidate the correct cache line.
> I don't know how much flexibility the "flush" instruction of the i860 offers
> (the data sheet does not say much), but it should be possible to do it;
> if not, then too bad :-) .

The flush instruction is half a load; it basically forces the data in the cache line it conflicts with to be written out and removed from the cache. I believe doing a flush specifying an address already in the cache is useless; the point is to create a fake address that conflicts with the real one (using knowledge of how the cache works) and some magic bits in a status register that let you control which set a cache miss chooses to replace. (Default is random replacement.)

So, to force a flush of the data at a certain address, you'd have to flush all 4 lines that might contain that address. (There's no way to examine the cache's contents.) With the status register munging, a bit ugly. Add my existing comments about the horribleness of the i860's exception handling, and I don't think the software approach would be acceptably fast. (BTW, no vectored interrupts.)
-- 
	-Colin (uunet!microsoft!w-colinp)
"Don't listen to me. I never do." - The Doctor
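{Editor's note: the fake-conflicting-address trick Colin describes, sketched in C. The geometry (8K, two-way, 32-byte lines, hence 128 sets) is taken from the data-sheet excerpt at the top of the thread; the i860_flush() wrapper and the fixed fake base address are hypothetical, and the status-register write that selects the replacement way is left as a comment since its encoding isn't given here.}

    #define LINE  32
    #define NSETS 128                 /* 8K / (2 ways * 32 bytes) */
    #define WAYS  2

    extern void i860_flush(unsigned long addr);  /* hypothetical wrapper
                                                    around the flush
                                                    instruction */

    /* Evict the line holding vaddr by replacement: flush addresses
     * that index the same cache set but carry a different tag, once
     * per way, so whichever way holds the victim gets written back. */
    void flush_line(unsigned long vaddr)
    {
        unsigned long set  = (vaddr / LINE) % NSETS;
        unsigned long fake = 0x80000000ul + set * LINE;  /* same set,
                                                            new tag */
        for (int w = 0; w < WAYS; w++) {
            /* select replacement way w via the status register
               (encoding elided) */
            i860_flush(fake);
        }
    }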