[comp.arch] i860 cache flushing

lance@Ricerca.orc.olivetti.com (Lance Berc) (03/19/89)

As has been said, the i860 has two on-chip virtual caches (4K I + 8K
D) and a 64-entry TLB, all of which need to be invalidated when
context switching (and sometimes when the memory map is changed). The
D-cache has to be flushed as well as invalidated (code is treated as
immutable; self-modifying code won't work unless it lives in
non-cached pages).
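To make the sequencing concrete, here's a minimal C sketch of the
cache work a switch implies on a virtually-cached part like this. The
helper names are invented stand-ins, not Intel's interface; on the
real chip these are short privileged instruction sequences.

    /* Hypothetical outline of the cache work in a context switch on
     * a virtually-cached CPU like the i860.  All helper names are
     * invented for illustration. */

    struct context;                  /* register state, memory map, ... */

    void flush_dcache(void);         /* write dirty D-cache lines to memory */
    void invalidate_dcache(void);    /* then discard all D-cache lines */
    void invalidate_icache(void);    /* code is immutable: invalidate only */
    void invalidate_tlb(void);       /* no PIDs in the TLB, so it goes too */
    void load_state(struct context *next);

    void context_switch(struct context *next)
    {
        flush_dcache();       /* dirty data must reach memory first...  */
        invalidate_dcache();  /* ...before the virtual tags are reused  */
        invalidate_icache();
        invalidate_tlb();
        load_state(next);     /* new map + registers; caches refill on demand */
    }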

Intel estimates that at 33MHz flushing the D-cache takes on average
30usec (30%-50% of lines dirty) and 60usec worst case. I believe that
these numbers assume no-wait-state memory (fastest possible 5-2-2-2
CPU-to-memory write cycles).
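For what it's worth, the average figure is consistent with a
back-of-the-envelope model. The cache geometry (8K, 32-byte lines) and
the 11-cycle 5-2-2-2 burst per line write-back below are my
assumptions, not Intel's published model:

    /* Back-of-the-envelope check of the flush estimate. */
    #include <stdio.h>

    int main(void)
    {
        const double cycle_ns  = 1000.0 / 33.0;  /* ~30ns at 33MHz */
        const int    lines     = 8192 / 32;      /* 256 D-cache lines */
        const int    burst_cyc = 5 + 2 + 2 + 2;  /* one line write-back */

        printf("40%% dirty: %.1f usec\n",
               0.40 * lines * burst_cyc * cycle_ns / 1000.0);  /* ~34 */
        printf("all dirty: %.1f usec\n",
               1.00 * lines * burst_cyc * cycle_ns / 1000.0);  /* ~85 */
        return 0;
    }

The 40%-dirty case lands near the quoted 30usec average; the all-dirty
case overshoots the quoted 60usec worst case, so Intel presumably
assumes the flush loop overlaps the writes, or a different geometry.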

I'd be interested in seeing some numbers on the frequency of both
context switching and interrupt handling in `typical' state-of-the-art
machines under some sort of well-defined load (such as compiles using
local disks under Unix on Sun-3s and -4s, MIPS boxes, etc). This might
help determine just how important raw context switch times are.

Fast context switches are important, but standard Unix time quanta are
not shrinking even as the amount of work done per quantum increases.
So the fraction of CPU time spent context switching, versus time spent
doing `useful' work, may be becoming small enough that raw context
switch time matters less.

The i860 seems to favor using silicon to gain sheer straight-line
speed at the expense of some performance in the curves. Sounds like a
good trade-off to me, but it depends on where you drive...

lance
Lance Berc                lance@orc.olivetti.com       Beer as an alternate
Olivetti Research Center  lance%orc.uucp@unix.sri.com       currency!
Menlo Park, California    (415) 496-6248
< These opinions bear no resemblance to those of Ing. C. Olivetti & C. SpA. >

rpw3@amdcad.AMD.COM (Rob Warnock) (03/23/89)

In article <39485@oliveb.olivetti.com> (Lance Berc) writes:
+---------------
| Intel estimates that at 33MHz flushing the D-cache takes on average
| 30usec (30%-50% of lines dirty) and 60usec worst case. I believe that
| these numbers assume no-wait-state memory (fastest possible 5-2-2-2 CPU-to
+---------------

Similarly, because of the very large register file, the Am29000 appears
at first to have a problem with full context switching (*not* system calls
or interrupts, which continue the stack-cache discipline), since you have
to save/restore 160 of the 192 registers (if 32 are reserved to the kernel).
But at 25 MHz, using the load/store-multiple instructions and burst-mode
memories (normal static-column DRAMs bank-interleaved 2:1, still cheap),
you can save the old user's full register set and load up the new user's
full set in 12.8 microseconds. [The TLB has PIDs, so no flush there.]
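The 12.8 microseconds checks out exactly if the load/store-multiple
streams move one 32-bit register per cycle (burst setup overhead
ignored):

    /* Sanity check: 160 registers out plus 160 back in, one word
     * per cycle in burst mode, 40ns cycle at 25MHz. */
    #include <stdio.h>

    int main(void)
    {
        const int    regs     = 160;            /* of 192; 32 kernel-reserved */
        const double cycle_ns = 1000.0 / 25.0;  /* 40ns */

        printf("full register swap: %.1f usec\n",
               2 * regs * cycle_ns / 1000.0);   /* save + restore = 12.8 */
        return 0;
    }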

I just don't see 20-50 VUPS-class systems needing to do tens of
thousands of full context switches per second, at least not in
general-purpose timesharing...


Rob Warnock
Systems Architecture Consultant

UUCP:	  {amdcad,fortune,sun}!redwood!rpw3
ATTmail:  !rpw3
DDD:	  (415)572-2607
USPS:	  627 26th Ave, San Mateo, CA  94403

mash@mips.COM (John Mashey) (03/24/89)

In article <24958@amdcad.AMD.COM> rpw3@amdcad.UUCP (Rob Warnock) writes:
>In article <39485@oliveb.olivetti.com> (Lance Berc) writes:
>+---------------
>| Intel estimates that at 33MHz flushing the D-cache takes on average
>| 30usec (30%-50% of lines dirty) and 60usec worst case. I believe that
>| these numbers assume no-wait-state memory (fastest possible 5-2-2-2 CPU-to
>+---------------
>
>Similarly, because of the very large register file, the Am29000 appears
>at first to have a problem with full context switching (*not* system calls...

>I just don't see 20-50 VUPS-class systems needing to do tens of
>of full context switches per second, at least not in general-purpose
>timesharing...

A lightning look at busy machines around here showed 60-120 cs/sec.
If it only takes 30-60 microsec, that's 1.8-7.2 millisec/sec, or on
the order of half a percent.
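Spelling that out with both ends of the ranges (flush cost only, not
the whole switch):

    /* Flush overhead as a fraction of CPU time: switch rate times
     * per-switch flush cost, using the numbers quoted above. */
    #include <stdio.h>

    static double overhead_pct(double sw_per_sec, double usec_per_sw)
    {
        return sw_per_sec * usec_per_sw / 1e6 * 100.0;
    }

    int main(void)
    {
        printf("best:  %.2f%%\n", overhead_pct(60.0, 30.0));   /* 0.18% */
        printf("worst: %.2f%%\n", overhead_pct(120.0, 60.0));  /* 0.72% */
        return 0;
    }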
Now, that's the easy part.
-----
The hard parts are: 1) figuring out how often you must flush the caches
because you change a mapping in the kernel [maybe the Sun folks can say
something on this; I recall them talking about tuning to avoid unnecessary
flushes in virtual caches], and 2) figuring out what the aggregate impact
on cache miss rate is of flushing the caches more often.
(I have no idea, it surely is load-dependent, and there are probably
some nice papers sitting around waiting to be done.)

The modest size of on-chip caches makes this less detrimental,
in terms of what you're losing by flushing them.  As they get bigger,
it will get more noticeable, especially for OS performance itself,
which REALLY likes big caches, since it has bad locality.
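One way to put a crude bound on the refill side of point 2: assume the
worst, that a freshly-run process eventually reloads every line of
both caches through misses. Line size and the 11-cycle fill below are
my guesses:

    /* Crude upper bound on post-flush refill cost.  Real cost
     * depends on how much of the cache the new process actually
     * touches before it is switched out again. */
    #include <stdio.h>

    int main(void)
    {
        const int    cache_bytes = 4096 + 8192;   /* i860 I + D */
        const int    line_bytes  = 32;
        const int    fill_cycles = 11;            /* 5-2-2-2 burst */
        const double cycle_ns    = 1000.0 / 33.0; /* 33MHz */

        printf("full refill: %.0f usec\n",
               (double)cache_bytes / line_bytes
               * fill_cycles * cycle_ns / 1000.0);  /* ~128 usec */
        return 0;
    }

That is, refilling can cost more than the flush did, which is why the
aggregate miss-rate effect is the harder number to pin down.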
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

lance@orc.olivetti.com (Lance Berc) (03/25/89)

The 30usec average, 60usec worst case i860 cache flushing times did not
include the rest of the context switch. I believe that without save/restore
of the FPU a full switch is in the 60-90 usec range (this is mostly a
guess). Using John's 60-120 switch/sec guesstimate we're still at around
one percent, which should be acceptable in a multitasking situation.

Saving and restoring the FPU state is pretty hairy - the manual's example
runs about one hundred instructions. The time required will depend heavily
on the memory subsystem characteristics, since there probably won't be any
I- or D-cache hits here. Timeslicing number-crunching applications probably
isn't a good idea unless they are given larger timeslices. No surprise.
Lance Berc                lance@orc.olivetti.com      (415) 496-6248
Olivetti Research Center  lance%orc.uucp@unix.sri.com <standard disclaimer>

conte@bach.csg.uiuc.edu (Tom Conte) (03/28/89)

About physical i-caches: it is not clear to me that a context switch
can't render lines stale even in a physical i-cache.  In a processor
where the i-fetch unit is what fills the i-cache, there is a chance
that after a context switch the OS will map a different page (of
instructions, perhaps) into a physical page frame that still has lines
in the i-cache.  The pager's writes usually go through (or its data is
run through) the d-cache; hence, these updates won't reach the i-cache.

I see three ways around this: have the pager flush the i-cache
(selectively or always), use a hardware PID with the (*physical*)
i-cache, or somehow run paging of instruction pages through the i-cache.
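A sketch of the first option in C, with invented names (there's no
standard interface for this): when the pager reassigns a frame that
may hold instructions, it invalidates just the i-cache lines that
could still be tagged with that frame.

    /* Hypothetical pager hook for a physically-tagged i-cache.
     * Names are invented for illustration. */

    typedef unsigned long paddr_t;

    #define PAGE_SIZE 4096
    #define LINE_SIZE 32

    void icache_invalidate_line(paddr_t pa);  /* hardware primitive */

    void pager_reassign_frame(paddr_t frame_base)
    {
        paddr_t pa;

        /* Selective flush: one invalidate per line in the frame.
         * The always-flush alternative just dumps the whole
         * i-cache here instead. */
        for (pa = frame_base; pa < frame_base + PAGE_SIZE; pa += LINE_SIZE)
            icache_invalidate_line(pa);
    }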

Most of the arguments against always flushing the i-cache are irrelevant
anyway, since modern on-chip i-caches are too small to preserve a
process's lines across context switches.

------
Tom Conte      Computer Systems Group, Coordinated Science Lab
               University of Illinois, Urbana-Champaign, Illinois
...The opinions expressed are my own, of course.
uucp:	 ...!uiucdcs!uicsrd!conte    internet:	conte@bach.csg.uiuc.edu

keith@mips.COM (Keith Garrett) (03/30/89)

In article <665@garcon.cso.uiuc.edu> conte@bach.csg.uiuc.edu.UUCP (Tom Conte) writes:
>About physical i-caches: it is not clear to me that a context switch
>may not render lines in even physical i-caches invalid.  In a processor
>where the i-fetch unit is what fills the i-cache, there is a chance
>that after a context switch the OS will map a different page (of
>instructions, perhaps) into a physical page frame that has some lines
>in the i-cache.

This is a page swap, not a context switch. You have to flush/invalidate
both physical and virtual caches (TLBs too) for this, but the frequency
should be a lot lower.
-- 
Keith Garrett        "This is *MY* opinion, OBVIOUSLY"
UUCP: keith@mips.com  or  {ames,decwrl,prls}!mips!keith
USPS: Mips Computer Systems,930 Arques Ave,Sunnyvale,Ca. 94086