[comp.arch] Real Time/Cache

lindsay@gandalf.cs.cmu.edu (Donald Lindsay) (11/26/90)

One of the interesting problems in real-time systems is finding
the guaranteed-not-to-exceed execution time of a piece of code.

Caches make the problem nastier. First, you would hope that a cache
full of another task's stuff is no slower than an empty cache.

Next, if you assume interrupts, you pretty well have to take the
cache-disabled timing as your worst case. There is an out: you can
prefill the cache and freeze it, but I don't think that's commonly
done. Or, you can scrap interrupts, use non-preemptive scheduling,
and assume that each scheduling event trashes the whole cache.

I recently heard a new answer, which I thought I'd pass along. The
idea is to segment the cache(s), so that some tasks have the
exclusive use of parts of the cache(s). Those tasks are now immune to
interference from other tasks, and perhaps even from interrupts.
Their worst-case times are just their standalone times, on a machine
with a small cache. [Yes, I am ignoring shared data, multilevel
caches, assignment, task switching, etc.]

The suggested support hardware would put (say) four address lines
through four 4:1 muxes.  This would carve the cache into 16 segments,
and a sort of PID would enable some power-of-2 number of these
segments.
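The mux scheme above can be sketched in a few lines. This is a toy model, not hardware: I assume a direct-mapped cache carved into 16 segments, with a per-task descriptor (base segment plus power-of-2 segment count) standing in for the "sort of PID".

```python
# Toy model of the mux-based cache segmentation described above.
# Assumptions (mine, not from the post): a direct-mapped cache with
# 16 segments selected by 4 address index bits, and per-task
# (base, size) descriptors with power-of-2, naturally aligned sizes.

SEG_BITS = 4                      # 4 address lines through 4 muxes
NUM_SEGS = 1 << SEG_BITS          # 16 segments

def segment_select(addr_seg_bits: int, base: int, num_segs: int) -> int:
    """Map a task's natural segment-select bits into its private region.

    addr_seg_bits: the 4 index bits the address would normally supply
    base:          first segment of the task's private region
    num_segs:      power-of-2 number of segments the task owns
    """
    assert num_segs & (num_segs - 1) == 0, "must own a power-of-2 count"
    assert base % num_segs == 0, "region must be naturally aligned"
    # Low bits come from the address; high bits are forced to the
    # task's base -- exactly what a per-bit mux would do.
    return base | (addr_seg_bits & (num_segs - 1))

# A task owning segments 8..11 (base=8, 4 segments): whatever index
# bits its addresses produce, it can only ever touch those segments,
# so no other task can evict its lines.
hits = {segment_select(s, base=8, num_segs=4) for s in range(NUM_SEGS)}
```

The point of the natural-alignment requirement is that the selection is then a pure bit-merge, which is where the "not much logic, one mux delay" cost estimate comes from.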

The cost? In logic, not much. In time - well, one pre-set-up 4:1-mux
delay on some of the cache-line-select path. Does anyone care to say
how much of a hit that would be on their favorite machine? Or is
there a better way to get a similar effect?
-- 
Don		D.C.Lindsay

DXB132@psuvm.psu.edu (11/27/90)

Perhaps it would be better in many cases to forget about the cache, and
simply put some fast static RAM in the memory map, to be used for time-
critical routines and data storage. The not-so-critical routines would be
placed in relatively slow memory. (Presumably you are thinking about very
large chunks of software; otherwise a pure static RAM approach is simpler
and just as cheap. Of course my idea of very large is 8K or so :-))

-- Dan Babcock

lindsay@gandalf.cs.cmu.edu (Donald Lindsay) (11/29/90)

In article <90331.001007DXB132@psuvm.psu.edu> DXB132@psuvm.psu.edu writes:
>Perhaps it would be better in many cases to forget about the cache, and
>simply put some fast static RAM in the memory map, to be used for time-
>critical routines and data storage. The not-so-critical routines would be
>placed in relatively slow memory (presumably you are thinking about very
>large chunks of software else a pure static RAM approach is simpler and
>just as cheap. Of course my idea of very large is 8K or so :-))

I did the same, once: we had 768 words of 100 ns SRAM and 64 KW of
core (yes, core). The SRAM was enough to hold a few interrupt
handlers, and all of the important inner loops (e.g. the FFT).  We
could keep the data in core, since we had a scoreboarded load-
multiple and store-multiple.

I'm not sure that the SRAM idea completely overshadows the real-time-
cache idea. With some current machines, one can put memory SRAM in
parallel with the cache SRAM, or in place of it. But there's a big
trend to on-chip cache, and external SRAM is surely going to be
slower than that. (Aside from an extra clock or two of access, there
is often a bus width difference.) So, although uncached references
out to SRAM are faster than uncached references out to DRAM, we're
still looking at the cache giving better average-case performance.

The performance cost of the segmented cache scheme isn't clear to me,
so I'd still like comments on it.
-- 
Don		D.C.Lindsay

DXB132@psuvm.psu.edu (11/29/90)

In article <11228@pt.cs.cmu.edu>, lindsay@gandalf.cs.cmu.edu (Donald Lindsay)
says:

>trend to on-chip cache, and external SRAM is surely going to be
>slower than that. (Aside from an extra clock or two of access, there
>is often a bus width difference.) So, although uncached references

Any Transputer fans out there? I seem to recall (not really sure) that
the Transputer had 2K or 4K of static RAM on-chip, which was left to
software to use (i.e. it wasn't a cache in the usual sense).
Your point is well taken, however.

-- Dan Babcock

vestal@SRC.Honeywell.COM (Steve Vestal) (12/12/90)

In article <11190@pt.cs.cmu.edu> lindsay@gandalf.cs.cmu.edu (Donald Lindsay) writes:

>> One of the interesting problems in real-time systems, is finding
>> the guaranteed-not-to-exceed execution time of a piece of code.

There are actually two distinct effects to be considered here: the impact of
caching due to preemption and the impact of caching due to data-dependent
memory access patterns within each task.

As far as preemption, assuming that worst-case time equals time with caches
disabled may be a bit pessimistic (not to mention possibly incorrect in some
pathological cases).  It may be possible to arrive at a worst-case context swap
time that includes cache effects, which in turn can be handled in some
scheduling models (e.g. rate monotonic scheduling).  The major problem here
is accurately determining what the worst-case context swap time will be.

The second problem is determining the worst-case execution time for a
(nonpreempted) task.  What is required is a determination of some "longest"
path through a task, which isn't as simple as finding the longest path through
some control flow graph when caches are present.  Basic block execution times
(arc lengths in the control flow graph) may depend on where control came from
and on the contents of variables/registers.  (I'm ignoring data-dependent
control flow, e.g. real-time Euclid.)
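The longest-path computation above can be sketched as follows. This is a minimal sketch under exactly the simplification the paragraph warns about: each basic block is given one fixed, context-independent worst-case (all-miss) time, which is safe but pessimistic. The graph and the times are hypothetical.

```python
# Minimal WCET-by-longest-path sketch over a control flow DAG.
# Assumption (a deliberate simplification, per the discussion above):
# each basic block has a fixed worst-case time, independent of where
# control came from and of cache state.

from functools import lru_cache

# Hypothetical CFG: block -> successors, plus a per-block worst-case
# (all-miss) execution time.
cfg = {
    "entry": ["a", "b"],
    "a":     ["exit"],
    "b":     ["exit"],
    "exit":  [],
}
wcet_block = {"entry": 5, "a": 20, "b": 12, "exit": 3}

@lru_cache(maxsize=None)
def wcet(block: str) -> int:
    """Longest-path time from `block` to program exit (DAG assumed)."""
    succs = cfg[block]
    return wcet_block[block] + (max(wcet(s) for s in succs) if succs else 0)
```

With a cache, the "arc length" for block `a` would really depend on whether control arrived via a path that left its lines resident, which is why the real problem is harder than this DAG walk.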

To say that the only execution time measure of interest in hard real-time
systems is a verifiable upper bound on execution time (ignore average
throughput, ignore measurements based on a few sample inputs) may be
excessive.  There are techniques to handle occasional "timing faults" in hard
real-time systems.  However, I would be interested to know of 1)
tools/techniques to determine reasonably tight worst-case execution
time bounds for specific processors, and 2) comparisons of processors based on
worst-case execution time bounds rather than average execution times.

Steve Vestal
Mail: Honeywell S&RC MN65-2100, 3660 Technology Drive, Minneapolis MN 55418 
Phone: (612) 782-7049                    Internet: vestal@src.honeywell.com

jesup@cbmvax.commodore.com (Randell Jesup) (01/24/91)

In article <11228@pt.cs.cmu.edu> lindsay@gandalf.cs.cmu.edu (Donald Lindsay) writes:
>I'm not sure that the SRAM idea completely overshadows the real-time-
>cache idea. With some current machines, one can put memory SRAM in
>parallel with the cache SRAM, or in place of it. But there's a big
>trend to on-chip cache, and external SRAM is surely going to be
>slower than that. (Aside from an extra clock or two of access, there
>is often a bus width difference.) So, although uncached references
>out to SRAM are faster than uncached references out to DRAM, we're
>still looking at the cache giving better average-case performance.

	Instead of making on-chip cache, for real-time work on-chip FAST
memory may well be better, especially if it's managed properly.  The new
AT&T DSP uses this sort of memory (the 3210 has 2K of on-chip multi-port
SRAM).

	The cache does nothing for you in a realtime application since you
have to assume worst-case anyways (well, not nothing, but not far from it -
non-realtime tasks on the same machine might get an advantage).

-- 
Randell Jesup, Keeper of AmigaDos, Commodore Engineering.
{uunet|rutgers}!cbmvax!jesup, jesup@cbmvax.commodore.com  BIX: rjesup  
The compiler runs
Like a swift-flowing river
I wait in silence.  (From "The Zen of Programming")  ;-)

vestal@SRC.Honeywell.COM (Steve Vestal) (01/29/91)

In article mumble jesup@cbmvax.commodore.com (Randell Jesup) writes:

>> 	The cache does nothing for you in a realtime application since you
>> have to assume worst-case anyways (well, not nothing, but not far from it -
>> non-realtime tasks on the same machine might get an advantage).

My sympathies lie with Randell, but things aren't always this dire.

First, it has been argued that some real-time code is sufficiently
deterministic (i.e. memory access patterns are largely independent of input
data) that cache behavior will be largely deterministic.  Even if the access
patterns can't easily be predicted from an examination of the code, the
argument goes, execution time measurements taken during sample executions
are representative of true worst-case execution times.

Second, even though missing a deadline is a fault, this needn't be a fatal
fault in all cases.  There are scheduling techniques that can allow some
applications to tolerate occasional timing faults.

My impression is that most hardware/software trade-off studies have assumed
that the goal was to maximize average throughput.  This is not the same goal
as minimizing verifiable upper bounds on execution time, which is desirable
for safety-critical real-time applications.

Steve Vestal
Mail: Honeywell S&RC MN65-2100, 3660 Technology Drive, Minneapolis MN 55418 
Phone: (612) 782-7049                    Internet: vestal@src.honeywell.com

davidb@brac.inmos.co.uk (David Boreham) (01/29/91)

In article <17999@cbmvax.commodore.com> jesup@cbmvax.commodore.com (Randell Jesup) writes:
>	Instead of making on-chip cache, for real-time work on-chip FAST
>memory may well be better, especially if it's managed properly.  The new
>AT&T DSP uses this sort of memory (the 3210 has 2K of on-chip multi-port
>SRAM).
>
>	The cache does nothing for you in a realtime application since you
>have to assume worst-case anyways (well, not nothing, but not far from it -
>non-realtime tasks on the same machine might get an advantage).

Old hat. All our processors have had on-chip 4K SRAM for years.

Why not make a processor which has the facility to use the on-chip
SRAM as either a cache, or SRAM, or half-and-half?

David Boreham, INMOS Limited | mail(uk): davidb@inmos.co.uk or ukc!inmos!davidb
Bristol,  England            |     (us): uunet!inmos.com!davidb
+44 454 616616 ex 547        | Internet: davidb@inmos.com

gessel@masada.cs.swarthmore.edu (Daniel Mark Gessel) (01/30/91)

In article <17999@cbmvax.commodore.com> jesup@cbmvax.commodore.com (Randell Jesup) writes:
>	Instead of making on-chip cache, for real-time work on-chip FAST
>memory may well be better, especially if it's managed properly.  The new
>AT&T DSP uses this sort of memory (the 3210 has 2K of on-chip multi-port
>SRAM).

Aren't there SRAMs fast enough to allow 50 million accesses per
second (less than 20 nanoseconds)?  Couldn't a 50 MHz processor use this
memory without a cache or wait states?

I just wondered why processors are designed with a cache instead of
using these kinds of memory chips. I have always just assumed it was
expense.

Dan
--
Daniel Mark Gessel                          Independent Software Consultant
Internet: gessel@cs.swarthmore.edu                   and Developer
I do not represent Swarthmore College (thank God).

alex@vmars.tuwien.ac.at (Alexander Vrchoticky) (01/30/91)

vestal@SRC.Honeywell.COM (Steve Vestal) writes:

>First, it has been argued that some real-time code is sufficiently
>deterministic (i.e. memory access patterns are largely independent of input
>data) that cache behavior will be largely deterministic.  

One should point out that this does not generally hold for systems where 
context switches can occur at arbitrary points in time.
One task's cache hit is another task's cache miss.

>My impression is that most hardware/software trade-off studies have assumed
>that the goal was to maximize average throughput.  This is not the same goal
>as minimizing verifiable upper bounds on execution time, which is desirable
>for safety-critical real-time applications.

I wholeheartedly agree. Both performance measures have their place,
each in its area.  Unfortunately far too many people still tend to 
equate `fast' and `real-time', without asking which performance measures 
are relevant for real-time systems.
The idea that high average (as opposed to guaranteed) performance is 
useful for real-time systems must be based on the notion that the
processor is somehow more `comfortable' when processing the idle loop
than when doing something useful. :-) 


--
Alexander Vrchoticky            | alex@vmars.tuwien.ac.at
TU Vienna, CS/Real-Time Systems | +43/222/58801-8168
"i cant even remember if we were lovers or if i just wanted to" (violent femmes)

vestal@SRC.Honeywell.COM (Steve Vestal) (01/30/91)

(I'm cross-posting this message on the impacts of caching in real-time systems
to comp.realtime.  Should discussions of architectural considerations for
trustworthy and/or real-time systems be moved to comp.realtime?)

In article <2288@tuvie.UUCP> alex@vmars.tuwien.ac.at (Alexander Vrchoticky) writes:

 > vestal@SRC.Honeywell.COM (Steve Vestal) writes:
>>First, it has been argued that some real-time code is sufficiently
>>deterministic (i.e. memory access patterns are largely independent of input
>>data) that cache behavior will be largely deterministic.

> One should point out that this does not generally hold for systems where 
> context switches can occur at arbitrary points in time.
> One task's cache hit is another task's cache miss.

Yes, the impact of caching on context switch times is a slightly different
issue than the impact of caching on worst-case (unpreempted) execution times
for individual tasks.  In some cyclically scheduled systems there are no real
preemptions.  In rate monotonically scheduled systems, it is possible to
include context switch times in the performance model.  In general, one
context switch (or two, depending on how you define a switch) will eventually
occur as a result of each task dispatch in a system (but this is effectively
arbitrary if your scheduling model doesn't allow you to include it).  My
conjecture is that if you include the time required to completely flush/fill
the cache, TLB, etc.  in the context switch time then you will be safe.  This
means that any speed-up due to caching is discounted when bounding context
switch times but not necessarily when bounding task execution times (for those
willing to assume the memory access patterns of their tasks are largely
data-independent, of course).  (Personally, I would be hesitant to accept such
an analysis for a safety-critical system, but there are applications where
this approach should be reasonable.)
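The conjecture above can be made concrete with an ordinary rate monotonic utilization test in which each context switch is pessimistically charged a full cache/TLB flush-and-refill. The task parameters and the refill cost below are made-up numbers; the bound n*(2^(1/n) - 1) is the classic Liu and Layland sufficient condition.

```python
# Sketch: fold a worst-case cache/TLB refill cost into each context
# switch, then apply the rate monotonic utilization bound.  All the
# numbers are illustrative, not from any real system.

def rm_schedulable(tasks, switch_cost):
    """tasks: list of (C, T) pairs -- worst-case exec time and period.

    Each dispatch is charged two context switches (in and out), each
    of which pessimistically includes a complete cache/TLB refill.
    Returns True if the Liu-Layland sufficient condition holds.
    """
    n = len(tasks)
    util = sum((c + 2 * switch_cost) / t for c, t in tasks)
    bound = n * (2 ** (1 / n) - 1)
    return util <= bound

# Three tasks, times in microseconds; 200 us stands in for a full
# cache refill on a context switch.
tasks = [(1000, 10000), (2000, 20000), (4000, 50000)]
```

Note the asymmetry this models: the cache's speed-up is discounted in the switch cost (we assume a cold cache after every preemption) but is still allowed to show up in the per-task C values, for those willing to trust data-independent access patterns.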

Steve Vestal
Mail: Honeywell S&RC MN65-2100, 3660 Technology Drive, Minneapolis MN 55418 
Phone: (612) 782-7049                    Internet: vestal@src.honeywell.com

sjohn@poland.ece.cmu.edu (John Sasinowski) (01/31/91)

Dave Kirk, who just finished his Ph.D. here at Carnegie Mellon University,
wrote his dissertation on a cache design for real-time systems.  This
design, called SMART caching (Strategic Memory Allocation for Real Time),
provides predictable execution times for tasks that run in the cache.

Let me know if you want more information about this.  I can send you
a list of references if you wish.

John Sasinowski
sjohn@poland.ece.cmu.edu

jfc@athena.mit.edu (John F Carr) (02/01/91)

In article <GESSEL.91Jan29100936@masada.cs.swarthmore.edu>
	gessel@masada.cs.swarthmore.edu (Daniel Mark Gessel) writes:

>I just wondered why processors are designed with a cache instead of
>using these kinds of memory chips. I have always just assumed it was
>expense.

I would guess that it is harder to make a fast data path to main memory than
to a small on-chip memory.  The choice then becomes whether to make the fast
memory a cache or not.  For most applications (probably not including real
time) it is better to let the hardware take care of copying frequently
accessed memory locations into the fast memory.

--
    John Carr (jfc@athena.mit.edu)

cprice@mips.COM (Charlie Price) (02/02/91)

In article <GESSEL.91Jan29100936@masada.cs.swarthmore.edu> gessel@masada.cs.swarthmore.edu (Daniel Mark Gessel) writes:
>In article <17999@cbmvax.commodore.com> jesup@cbmvax.commodore.com (Randell Jesup) writes:
>>	Instead of making on-chip cache, for real-time work on-chip FAST
>>memory may well be better, especially if it's managed properly.
>>...
>
>Aren't there SRAMS fast enough to allow 50 million accesses per
>second? (less than 20 nanoseconds) Couldn't a 50Mhz processor use this
>memory without a cache or wait states?

There are three delays in a memory access:
propagating the address from the processor to every chip that needs to hear it,
in-chip time to retrieve the stored bits,
and sending the answer back to the processor.
For the large memory array needed with a fast processor in a "computer"
(rather than an embedded processor doing a specific thing),
the address and data transit times are very large.
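The three delays above make for simple back-of-the-envelope arithmetic. The numbers below are made up (but plausible for the era); the point is only that at 50 MHz the entire round trip must fit in a 20 ns cycle, and board-level transit alone can eat most of that.

```python
# Back-of-the-envelope model of the three memory-access delays above:
# address out to the chips, in-chip array access, data back.  All
# timing numbers are illustrative guesses, not measurements.

CYCLE_NS = 1000 / 50            # 50 MHz -> 20 ns per cycle

def access_ns(addr_transit, array_access, data_transit):
    """Total access time as the sum of the three delays."""
    return addr_transit + array_access + data_transit

on_chip  = access_ns(1, 8, 1)    # short on-chip wires: fits the cycle
off_chip = access_ns(8, 15, 8)   # board-level transit: blows the budget
```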

A further problem is that to run without waiting for memory
a recent-design machine needs an instruction and a data
transfer each cycle.

The key to a fast machine these days is to have the data you want
to use within a handful of centimeters of where you want to use it
nearly all the time.
-- 
Charlie Price    cprice@mips.mips.com        (408) 720-1700
MIPS Computer Systems / 928 Arques Ave. / Sunnyvale, CA   94086-23650

huy@rainbow.asd.sgi.com (Huy Nguyen) (02/02/91)

| Dave Kirk, who just finished his Ph.D. here at Carnegie Mellon University,
| wrote his dissertation on a cache design for real-time systems.  This
| design, called SMART caching (Strategic Memory Allocation for Real Time),
| provides predictable execution times for tasks that run in the cache.
| 
| Let me know if you want more information about this.  I can send you
| a list of references if you wish.
| 
| John Sasinowski
| sjohn@poland.ece.cmu.edu
This sounds like an interesting paper. Could you post this information to the
net so others can see it?

 huy@sgi.com

gillies@cs.uiuc.edu (Don Gillies) (02/03/91)

huy@rainbow.asd.sgi.com (Huy Nguyen) writes:

>| Dave Kirk, who just finished his Ph.D. here at Carnegie Mellon University,
>| wrote his dissertation on a cache design for real-time systems.  This
>| design, called SMART caching (Strategic Memory Allocation for Real Time),
>| provides predictable execution times for tasks that run in the cache.

Information on this appears in the IEEE Real-time Systems Symposium

"SMART (Strategic Memory Allocation for Real-Time) Cache Design"
	IEEE RTSS 1989, p 229.
"SMART (Strategic Memory Allocation for Real-Time) Using the MIPS R3000"
	IEEE RTSS 1990, p 322

These papers are about how to build a cache with different partitions
for different processes.

sjohn@poland.ece.cmu.edu (John Sasinowski) (02/04/91)

>Dave Kirk, who just finished his Ph.D. here at Carnegie Mellon University,
>wrote his dissertation on a cache design for real-time systems.  This
>design, called SMART caching (Strategic Memory Allocation for Real Time),
>provides predictable execution times for tasks that run in the cache.
>
>Let me know if you want more information about this.  I can send you
>a list of references if you wish.
>
	After receiving many requests for the information, here is a brief
summary of the SMART caching strategy along with the list of references:

	The SMART cache design strategy is a software controlled
partitioning technique which provides predictable cache performance in
real-time systems.  The cache is divided into segments which are assigned to
tasks to form private partitions.  The partitions are regions of cache that 
can be accessed by only one task.  For efficiency reasons, the number
of segments in a partition should be a power of two.  Since only one task has
access to each partition, no other tasks can overwrite the working set of that
task.  The partitions need not be the same size for every task, so the
divisions of the cache among the tasks can be chosen to provide the best
performance for the set of tasks.
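The partition assignment described above can be illustrated with a toy allocator: each task gets a private, power-of-2-sized region of a 16-segment cache. The packing policy here (largest request first) is my own illustration, not the algorithm from Kirk's thesis.

```python
# Toy partition allocator for the scheme described above: private,
# power-of-2-sized regions of a 16-segment cache.  The largest-first
# packing is an illustrative policy of mine, not SMART's.

def assign_partitions(requests, total_segs=16):
    """requests: dict task -> power-of-2 segment count.

    Returns dict task -> (base, size); raises if over-committed.
    Power-of-2 sizes packed largest-first are always naturally
    aligned, so the segment-select address bits can be merged with a
    simple mux, as in the hardware discussed earlier in the thread.
    """
    if sum(requests.values()) > total_segs:
        raise ValueError("partitions exceed cache size")
    out, base = {}, 0
    for task, size in sorted(requests.items(), key=lambda kv: -kv[1]):
        assert size & (size - 1) == 0, "sizes must be powers of two"
        out[task] = (base, size)
        base += size
    return out

# Hypothetical task set: an FFT loop, an interrupt handler, a control task.
parts = assign_partitions({"fft": 8, "isr": 2, "ctrl": 4})
```

Since no other task can index into a task's region, its working set survives preemption, which is what makes the per-task execution times predictable.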

	This work was done by Dave Kirk under the guidance of Jay Strosnider
as part of the ART (Advanced Real-Time Technology) project.  Currently, I am
working on compiler techniques to improve the predictable performance of
caches in real-time systems.

References:
	David B. Kirk
	Process Dependent Static Cache Partitioning for Real-Time Systems
	Proceedings of the Real-Time Systems Symposium, pages 181-190.
	IEEE, Huntsville, AL, December 1988

	David B. Kirk
	SMART (Strategic Memory Allocation for Real-Time) Cache Design
	Proceedings of the Real-Time Systems Symposium, pages 229-237.
	IEEE, Santa Monica, CA, December 1989

	David B. Kirk, Jay K. Strosnider
	SMART (Strategic Memory Allocation for Real-Time) Cache Design
		Using the R3000
	Proceedings of the Real-Time Systems Symposium
	IEEE, Orlando, FL, December 1990

	David B. Kirk
	Predictable Cache Design for Real-Time Systems
	PhD Thesis, Carnegie Mellon University, December 1990
	(Copies available by written request to:
		Jay K. Strosnider
		Department of Electrical and Computer Engineering
		Carnegie Mellon University
		Pittsburgh, PA  15213-3890)


John Sasinowski
sjohn@poland.ece.cmu.edu

russell@apple.com (Russell Williams) (02/09/91)

The Elxsi gamma CPU (running in the lab but never shipped before Elxsi
U.S. quit R&D) implemented cache partitioning for real-time support.
Its 1 MB I and D caches could be independently partitioned up to 8 ways
each.  The simple implementation (replacing a variable number of the
high-order address bits to the direct-mapped cache) meant that partitions
had to be power-of-two sizes, and that the performance cost was negligible.

A small piece of software handled partition allocation, sharing the 
largest unallocated partition among all the tasks that hadn't explicitly 
requested a partition.  Minor modifications were required to areas of the 
system software which handled cache flushing (e.g. for I/O).