lindsay@gandalf.cs.cmu.edu (Donald Lindsay) (11/26/90)
One of the interesting problems in real-time systems is finding the guaranteed-not-to-exceed execution time of a piece of code. Caches make the problem nastier. First, you would hope that a cache full of another task's stuff is no slower than an empty cache. Next, if you assume interrupts, you pretty well have to take the cache-disabled timing as your worst case. There is an out: you can prefill the cache and freeze it, but I don't think that's commonly done. Or, you can scrap interrupts, use non-preemptive scheduling, and assume that each scheduling event trashes the whole cache.

I recently heard a new answer, which I thought I'd pass along. The idea is to segment the cache(s), so that some tasks have exclusive use of parts of the cache(s). Those tasks are now immune to interference from other tasks, and perhaps even from interrupts. Their worst-case times are just their standalone times on a machine with a small cache. [Yes, I am ignoring shared data, multilevel caches, assignment, task switching, etc.]

The suggested support hardware would put (say) four address lines through four 4:1 muxes. This would carve the cache into 16 segments, and a sort of PID would enable some power-of-2 number of these segments. The cost? In logic, not much. In time - well, one pre-set-up 4:1-mux delay on some of the cache-line-select path. Does anyone care to say how much of a hit that would be on their favorite machine? Or is there a better way to get a similar effect?
-- 
Don		D.C.Lindsay
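A minimal sketch of the segment-select idea, assuming a direct-mapped cache; the 1024-line size, the field layout, and the function name are my own invention, not part of the proposed hardware:

```python
# Sketch of segmented-cache indexing (hypothetical sizes).
# A 1024-line direct-mapped cache is carved into 16 segments of 64 lines.
# Four index bits pass through muxes: for a task confined to a partition,
# those bits come from the task's "PID" field instead of the address,
# so the task can only ever touch its own segments.

CACHE_LINES = 1024                          # total lines in the cache
SEGMENTS = 16                               # 16 segments -> 4 muxed index bits
LINES_PER_SEG = CACHE_LINES // SEGMENTS     # 64 lines per segment

def cache_index(addr, pid_base, seg_bits):
    """Map an address to a cache line.

    pid_base -- the task's first segment (assumed aligned to its
                partition size, as a power-of-2 scheme would require)
    seg_bits -- log2 of the number of segments the task owns
    """
    line = addr % CACHE_LINES                       # normal direct-mapped index
    within = line % (LINES_PER_SEG << seg_bits)     # bits the task still controls
    forced = pid_base * LINES_PER_SEG               # bits forced by the muxes
    return forced + within

# A task owning 1 segment starting at segment 5 always lands in lines 320..383:
idx = cache_index(0xDEADBEEF, pid_base=5, seg_bits=0)
assert 5 * LINES_PER_SEG <= idx < 6 * LINES_PER_SEG
```

The mux delay Don asks about is the hardware analogue of the `forced` computation: a few of the line-select bits come from a per-task register rather than from the address.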
DXB132@psuvm.psu.edu (11/27/90)
Perhaps it would be better in many cases to forget about the cache, and simply put some fast static RAM in the memory map, to be used for time-critical routines and data storage. The not-so-critical routines would be placed in relatively slow memory. (Presumably you are thinking about very large chunks of software; otherwise a pure static RAM approach is simpler and just as cheap. Of course my idea of very large is 8K or so :-))
-- Dan Babcock
lindsay@gandalf.cs.cmu.edu (Donald Lindsay) (11/29/90)
In article <90331.001007DXB132@psuvm.psu.edu> DXB132@psuvm.psu.edu writes:
>Perhaps it would be better in many cases to forget about the cache, and
>simply put some fast static RAM in the memory map, to be used for time-
>critical routines and data storage. The not-so-critical routines would be
>placed in relatively slow memory (presumably you are thinking about very
>large chunks of software else a pure static RAM approach is simpler and
>just as cheap. Of course my idea of very large is 8K or so :-))

I did the same, once: we had 768 words of 100 ns SRAM and 64 KW of core (yes, core). The SRAM was enough to hold a few interrupt handlers, and all of the important inner loops (e.g. the FFT). We could keep the data in core, since we had a scoreboarded load-multiple and store-multiple.

I'm not sure that the SRAM idea completely overshadows the real-time-cache idea. With some current machines, one can put memory SRAM in parallel with the cache SRAM, or in place of it. But there's a big trend to on-chip cache, and external SRAM is surely going to be slower than that. (Aside from an extra clock or two of access, there is often a bus width difference.) So, although uncached references out to SRAM are faster than uncached references out to DRAM, we're still looking at the cache giving better average-case performance.

The performance cost of the segmented cache scheme isn't clear to me, so I'd still like comments on it.
-- 
Don		D.C.Lindsay
DXB132@psuvm.psu.edu (11/29/90)
In article <11228@pt.cs.cmu.edu>, lindsay@gandalf.cs.cmu.edu (Donald Lindsay) says:
>trend to on-chip cache, and external SRAM is surely going to be
>slower than that. (Aside from an extra clock or two of access, there
>is often a bus width difference.) So, although uncached references

Any Transputer fans out there? I seem to recall (not really sure) that the Transputer had 2K or 4K of static RAM on-chip that was up to software to use (i.e. it wasn't a cache in the usual sense). Your point is well taken, however.
-- Dan Babcock
vestal@SRC.Honeywell.COM (Steve Vestal) (12/12/90)
In article <11190@pt.cs.cmu.edu> lindsay@gandalf.cs.cmu.edu (Donald Lindsay) writes:
>> One of the interesting problems in real-time systems, is finding
>> the guaranteed-not-to-exceed execution time of a piece of code.

There are actually two distinct effects to be considered here: the impact of caching due to preemption, and the impact of caching due to data-dependent memory access patterns within each task.

As far as preemption goes, assuming that worst-case time equals time with caches disabled may be a bit pessimistic (not to mention possibly incorrect in some pathological cases). It may be possible to arrive at a worst-case context swap time that includes cache effects, which in turn can be handled in some scheduling models (e.g. rate monotonic scheduling). The major problem here is accurately determining what the worst-case context swap time will be.

The second problem is determining the worst-case execution time for a (nonpreempted) task. What is required is a determination of some "longest" path through a task, which isn't as simple as finding the longest path through some control flow graph when caches are present. Basic block execution times (arc lengths in the control flow graph) may depend on where control came from and on the contents of variables/registers. (I'm ignoring data-dependent control flow, e.g. real-time Euclid.)

To say that the only execution time measure of interest in hard real-time systems is a verifiable upper bound on execution time (ignore average throughput, ignore measurements based on a few sample inputs) may be excessive. There are techniques to handle occasional "timing faults" in hard real-time systems. However, I would be interested to know of 1) tools/techniques to determine reasonably tight worst-case execution time bounds for specific processors, and 2) comparisons of processors based on worst-case execution time bounds rather than average execution times.
Steve Vestal
Mail: Honeywell S&RC MN65-2100, 3660 Technology Drive, Minneapolis MN 55418
Phone: (612) 782-7049   Internet: vestal@src.honeywell.com
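For the control-flow-graph part of the problem (ignoring the cache dependence Steve describes), the worst-case bound reduces to a longest-path computation over basic blocks. A toy sketch, with invented block names and per-block times:

```python
# Sketch: worst-case path length through an acyclic control-flow graph,
# using per-block worst-case times. Block names and cycle counts are
# invented. As the post notes, real bounds also depend on cache state
# and on where control came from, which this simple model ignores.

from functools import lru_cache

# Control flow graph: block -> list of successor blocks (a DAG).
CFG = {
    "entry": ["if_then", "if_else"],
    "if_then": ["join"],
    "if_else": ["join"],
    "join": [],
}
# Worst-case execution time of each basic block, in cycles (invented).
WCET = {"entry": 10, "if_then": 50, "if_else": 30, "join": 5}

@lru_cache(maxsize=None)
def longest_path(block):
    """Worst-case cycles from `block` to any exit."""
    tail = max((longest_path(s) for s in CFG[block]), default=0)
    return WCET[block] + tail

assert longest_path("entry") == 10 + 50 + 5   # the then-branch dominates
```

With a cache, the arc lengths stop being constants, which is exactly why the longest path is no longer this simple.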
jesup@cbmvax.commodore.com (Randell Jesup) (01/24/91)
In article <11228@pt.cs.cmu.edu> lindsay@gandalf.cs.cmu.edu (Donald Lindsay) writes:
>I'm not sure that the SRAM idea completely overshadows the real-time-
>cache idea. With some current machines, one can put memory SRAM in
>parallel with the cache SRAM, or in place of it. But there's a big
>trend to on-chip cache, and external SRAM is surely going to be
>slower than that. (Aside from an extra clock or two of access, there
>is often a bus width difference.) So, although uncached references
>out to SRAM are faster than uncached references out to DRAM, we're
>still looking at the cache giving better average-case performance.

Instead of making on-chip cache, for real-time work on-chip FAST memory may well be better, especially if it's managed properly. The new AT&T DSP uses this sort of memory (the 3210 has 2K of on-chip multi-port SRAM).

The cache does nothing for you in a realtime application, since you have to assume worst case anyway (well, not nothing, but not far from it - non-realtime tasks on the same machine might get an advantage).
-- 
Randell Jesup, Keeper of AmigaDos, Commodore Engineering.
{uunet|rutgers}!cbmvax!jesup, jesup@cbmvax.commodore.com  BIX: rjesup
The compiler runs
Like a swift-flowing river
I wait in silence.
(From "The Zen of Programming")  ;-)
vestal@SRC.Honeywell.COM (Steve Vestal) (01/29/91)
In article mumble jesup@cbmvax.commodore.com (Randell Jesup) writes:
>> The cache does nothing for you in a realtime application since you
>> have to assume worst-case anyways (well, not nothing, but not far from it -
>> non-realtime tasks on the same machine might get an advantage).

My sympathies lie with Randell, but things aren't always this dire.

First, it has been argued that some real-time code is sufficiently deterministic (i.e. memory access patterns are largely independent of input data) that cache behavior will be largely deterministic. Even if they can't be easily predicted from an examination of the code, execution time measurements taken during sample executions are representative of true worst-case execution times.

Second, even though missing a deadline is a fault, this needn't be a fatal fault in all cases. There are scheduling techniques that can allow some applications to tolerate occasional timing faults.

My impression is that most hardware/software trade-off studies have assumed that the goal was to maximize average throughput. This is not the same goal as minimizing verifiable upper bounds on execution time, which is desirable for safety-critical real-time applications.

Steve Vestal
Mail: Honeywell S&RC MN65-2100, 3660 Technology Drive, Minneapolis MN 55418
Phone: (612) 782-7049   Internet: vestal@src.honeywell.com
davidb@brac.inmos.co.uk (David Boreham) (01/29/91)
In article <17999@cbmvax.commodore.com> jesup@cbmvax.commodore.com (Randell Jesup) writes:
> Instead of making on-chip cache, for real-time work on-chip FAST
>memory may well be better, especially if it's managed properly. The new
>AT&T DSP uses this sort of memory (the 3210 has 2K of on-chip multi-port
>SRAM).
>
> The cache does nothing for you in a realtime application since you
>have to assume worst-case anyways (well, not nothing, but not far from it -
>non-realtime tasks on the same machine might get an advantage).

Old hat. All our processors have had 4K of on-chip SRAM for years. Why not make a processor which has the facility to use the on-chip SRAM as either a cache, or SRAM, or half-and-half?

David Boreham, INMOS Limited | mail(uk): davidb@inmos.co.uk or ukc!inmos!davidb
Bristol, England             | (us): uunet!inmos.com!davidb
+44 454 616616 ex 547        | Internet: davidb@inmos.com
gessel@masada.cs.swarthmore.edu (Daniel Mark Gessel) (01/30/91)
In article <17999@cbmvax.commodore.com> jesup@cbmvax.commodore.com (Randell Jesup) writes:
> Instead of making on-chip cache, for real-time work on-chip FAST
>memory may well be better, especially if it's managed properly. The new
>AT&T DSP uses this sort of memory (the 3210 has 2K of on-chip multi-port
>SRAM).

Aren't there SRAMs fast enough to allow 50 million accesses per second (less than 20 nanoseconds)? Couldn't a 50 MHz processor use this memory without a cache or wait states? I just wondered why processors are designed with a cache instead of using these kinds of memory chips. I have always just assumed it was expense.

Dan
-- 
Daniel Mark Gessel                          Independent Software Consultant
Internet: gessel@cs.swarthmore.edu                            and Developer
I do not represent Swarthmore College (thank God).
alex@vmars.tuwien.ac.at (Alexander Vrchoticky) (01/30/91)
vestal@SRC.Honeywell.COM (Steve Vestal) writes:
>First, it has been argued that some real-time code is sufficiently
>deterministic (i.e. memory access patterns are largely independent of input
>data) that cache behavior will be largely deterministic.

One should point out that this does not generally hold for systems where context switches can occur at arbitrary points in time. One task's cache hit is another task's cache miss.

>My impression is that most hardware/software trade-off studies have assumed
>that the goal was to maximize average throughput. This is not the same goal
>as minimizing verifiable upper bounds on execution time, which is desirable
>for safety-critical real-time applications.

I wholeheartedly agree. Both performance measures have their place, each in its area. Unfortunately far too many people still tend to equate `fast' and `real-time', without asking which performance measures are relevant for real-time systems. The idea that high average (as opposed to guaranteed) performance is useful for real-time systems must be based on the notion that the processor is somehow more `comfortable' when processing the idle loop than when doing something useful. :-)
-- 
Alexander Vrchoticky            | alex@vmars.tuwien.ac.at
TU Vienna, CS/Real-Time Systems | +43/222/58801-8168
"i cant even remember if we were lovers or if i just wanted to" (violent femmes)
vestal@SRC.Honeywell.COM (Steve Vestal) (01/30/91)
(I'm cross-posting this message on the impacts of caching in real-time systems to comp.realtime. Should discussions of architectural considerations for trustworthy and/or real-time systems be moved to comp.realtime?)

In article <2288@tuvie.UUCP> alex@vmars.tuwien.ac.at (Alexander Vrchoticky) writes:
>vestal@SRC.Honeywell.COM (Steve Vestal) writes:
>>First, it has been argued that some real-time code is sufficiently
>>deterministic (i.e. memory access patterns are largely independent of input
>>data) that cache behavior will be largely deterministic.
>One should point out that this does not generally hold for systems where
>context switches can occur at arbitrary points in time.
>One task's cache hit is another task's cache miss.

Yes, the impact of caching on context switch times is a slightly different issue than the impact of caching on worst-case (unpreempted) execution times for individual tasks. In some cyclically scheduled systems there are no real preemptions. In rate monotonically scheduled systems, it is possible to include context switch times in the performance model. In general, one context switch (or two, depending on how you define a switch) will eventually occur as a result of each task dispatch in a system (but this is effectively arbitrary if your scheduling model doesn't allow you to include it).

My conjecture is that if you include the time required to completely flush/fill the cache, TLB, etc. in the context switch time, then you will be safe. This means that any speed-up due to caching is discounted when bounding context switch times, but not necessarily when bounding task execution times (for those willing to assume the memory access patterns of their tasks are largely data-independent, of course). (Personally, I would be hesitant to accept such an analysis for a safety-critical system, but there are applications where this approach should be reasonable.)
Steve Vestal
Mail: Honeywell S&RC MN65-2100, 3660 Technology Drive, Minneapolis MN 55418
Phone: (612) 782-7049   Internet: vestal@src.honeywell.com
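A small sketch of the conjecture, assuming the classic rate-monotonic utilization bound and charging every task two context switches per period, with the switch cost inflated to cover a complete cache flush/refill; all numbers are invented:

```python
# Rate-monotonic schedulability check where the context switch cost is
# assumed to include a full cache (and TLB) refill, per the conjecture
# above. Each task is charged two switches per period. Task parameters
# and switch costs are invented for illustration.

def rms_schedulable(tasks, switch_cost):
    """tasks: list of (wcet, period) pairs, same time unit throughout.
    switch_cost: one context switch including the worst-case cache refill.
    Returns True if the Liu & Layland utilization bound is met."""
    n = len(tasks)
    util = sum((c + 2 * switch_cost) / t for c, t in tasks)
    bound = n * (2 ** (1 / n) - 1)     # ~0.828 for two tasks
    return util <= bound

# Two tasks, times in microseconds: (wcet, period).
tasks = [(1000, 10000), (2000, 20000)]
assert rms_schedulable(tasks, switch_cost=200)       # modest refill cost: fits
assert not rms_schedulable(tasks, switch_cost=5000)  # huge refill cost: doesn't
```

Discounting the cache only in the switch term, as here, is exactly the trade the post describes: pessimistic switch times, but task execution times may still assume warm-cache behavior.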
sjohn@poland.ece.cmu.edu (John Sasinowski) (01/31/91)
Dave Kirk, who just finished his Ph.D. here at Carnegie Mellon University, wrote his dissertation on a cache design for real-time systems. This design, called SMART caching (Strategic Memory Allocation for Real Time), provides predictable execution times for tasks that run in the cache. Let me know if you want more information about this. I can send you a list of references if you wish. John Sasinowski sjohn@poland.ece.cmu.edu
jfc@athena.mit.edu (John F Carr) (02/01/91)
In article <GESSEL.91Jan29100936@masada.cs.swarthmore.edu> gessel@masada.cs.swarthmore.edu (Daniel Mark Gessel) writes:
>I just wondered why processors are designed with a cache instead of
>using these kinds of memory chips. I have always just assumed it was
>expense.

I would guess that it is harder to make a fast data path to main memory than to a small on-chip memory. The choice then becomes whether to make the fast memory a cache or not. For most applications (probably not including real time) it is better to let the hardware take care of copying frequently accessed memory locations into the fast memory.
-- 
John Carr (jfc@athena.mit.edu)
cprice@mips.COM (Charlie Price) (02/02/91)
In article <GESSEL.91Jan29100936@masada.cs.swarthmore.edu> gessel@masada.cs.swarthmore.edu (Daniel Mark Gessel) writes:
>In article <17999@cbmvax.commodore.com> jesup@cbmvax.commodore.com (Randell Jesup) writes:
>> Instead of making on-chip cache, for real-time work on-chip FAST
>>memory may well be better, especially if it's managed properly.
>>...
>
>Aren't there SRAMS fast enough to allow 50 million accesses per
>second? (less than 20 nanoseconds) Couldn't a 50Mhz processor use this
>memory without a cache or wait states?

There are three delays in a memory access: propagating the address from the processor to every chip that needs to hear it, the in-chip time to retrieve the stored bits, and sending the answer back to the processor. For the large memory array needed with a fast processor in a "computer" (rather than an embedded processor doing a specific thing), the address and data transit times are very large. A further problem is that, to run without waiting for memory, a recent-design machine needs an instruction and a data transfer each cycle. The key to a fast machine these days is to have the data you want to use within a handful of centimeters of where you want to use it nearly all the time.
-- 
Charlie Price    cprice@mips.mips.com    (408) 720-1700
MIPS Computer Systems / 928 Arques Ave. / Sunnyvale, CA 94086-23650
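Rough arithmetic for those three delays, with invented component numbers, showing why a nominally fast SRAM chip can still miss a 50 MHz cycle once transit times are added:

```python
# Back-of-envelope budget for one off-chip memory access at 50 MHz.
# The three component delays below are invented illustrations, not
# measurements of any particular system.

CYCLE_NS = 1e3 / 50      # 20 ns per cycle at 50 MHz

address_transit = 5      # drive the address to every chip in a big array, ns
chip_access     = 15     # the SRAM chip's own access time, ns
data_return     = 5      # send the data back to the processor, ns

total = address_transit + chip_access + data_return
assert total > CYCLE_NS  # 25 ns > 20 ns: a "fast" SRAM still needs a wait state
```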
huy@rainbow.asd.sgi.com (Huy Nguyen) (02/02/91)
| Dave Kirk, who just finished his Ph.D. here at Carnegie Mellon University,
| wrote his dissertation on a cache design for real-time systems. This
| design, called SMART caching (Strategic Memory Allocation for Real Time),
| provides predictable execution times for tasks that run in the cache.
|
| Let me know if you want more information about this. I can send you
| a list of references if you wish.
|
| John Sasinowski
| sjohn@poland.ece.cmu.edu

This sounds like an interesting paper. Could you post this information to the net so others can see it?

huy@sgi.com
gillies@cs.uiuc.edu (Don Gillies) (02/03/91)
huy@rainbow.asd.sgi.com (Huy Nguyen) writes:
>| Dave Kirk, who just finished his Ph.D. here at Carnegie Mellon University,
>| wrote his dissertation on a cache design for real-time systems. This
>| design, called SMART caching (Strategic Memory Allocation for Real Time),
>| provides predictable execution times for tasks that run in the cache.

Information on this appears in the IEEE Real-Time Systems Symposium:

"SMART (Strategic Memory Allocation for Real-Time) Cache Design", IEEE RTSS 1989, p. 229.

"SMART (Strategic Memory Allocation for Real-Time) Using the MIPS R3000", IEEE RTSS 1990, p. 322.

These papers are about how to build a cache with different partitions for different processes.
sjohn@poland.ece.cmu.edu (John Sasinowski) (02/04/91)
>Dave Kirk, who just finished his Ph.D. here at Carnegie Mellon University,
>wrote his dissertation on a cache design for real-time systems. This
>design, called SMART caching (Strategic Memory Allocation for Real Time),
>provides predictable execution times for tasks that run in the cache.
>
>Let me know if you want more information about this. I can send you
>a list of references if you wish.

After receiving many requests for the information, here is a brief summary of the SMART caching strategy, along with the list of references.

The SMART cache design strategy is a software-controlled partitioning technique which provides predictable cache performance in real-time systems. The cache is divided into segments which are assigned to tasks to form private partitions. The partitions are regions of cache that can be accessed by only one task. For efficiency considerations, the number of segments in a partition should be a power of two. Since only one task has access to each partition, no other task can overwrite the working set of that task. The partitions need not be the same size for every task, so the division of the cache among the tasks can be chosen to provide the best performance for the set of tasks.

This work was done by Dave Kirk under the guidance of Jay Strosnider as part of the ART (Advanced Real-Time Technology) project. Currently, I am working on compiler techniques to improve the predictable performance of caches in real-time systems.

References:

David B. Kirk
Process Dependent Static Cache Partitioning for Real-Time Systems
Proceedings of the Real-Time Systems Symposium, pages 181-190.
IEEE, Huntsville, AL, December 1988

David B. Kirk
SMART (Strategic Memory Allocation for Real-Time) Cache Design
Proceedings of the Real-Time Systems Symposium, pages 229-237.
IEEE, Santa Monica, CA, December 1989

David B. Kirk, Jay K. Strosnider
SMART (Strategic Memory Allocation for Real-Time) Cache Design Using the R3000
Proceedings of the Real-Time Systems Symposium.
IEEE, Orlando, FL, December 1990

David B. Kirk
Predictable Cache Design for Real-Time Systems
PhD Thesis, Carnegie Mellon University, December 1990
(Copies available by written request to:
Jay K. Strosnider
Department of Electrical and Computer Engineering
Carnegie Mellon University
Pittsburgh, PA 15213-3890)

John Sasinowski
sjohn@poland.ece.cmu.edu
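A sketch of the partition bookkeeping such a scheme needs, with each task receiving a private, power-of-two number of segments; the greedy first-fit policy and the task names are my own illustration, not Kirk's actual algorithm:

```python
# Toy allocator in the spirit of software-controlled cache partitioning:
# a cache of N segments, each task granted a private power-of-two count
# of segments so no other task can overwrite its working set.
# The policy and numbers below are invented for illustration.

def allocate(total_segments, requests):
    """requests: {task: segments wanted}.  Rounds each request up to a
    power of two and hands out segments first-fit.
    Returns {task: (first_segment, count)}; raises if the cache is too small."""
    alloc, next_free = {}, 0
    for task, want in requests.items():
        count = 1
        while count < want:              # round up to a power of two
            count *= 2
        if next_free + count > total_segments:
            raise ValueError("cache too small for requested partitions")
        alloc[task] = (next_free, count)
        next_free += count
    return alloc

a = allocate(16, {"control_loop": 3, "fft": 4, "logger": 1})
assert a["control_loop"] == (0, 4)       # 3 rounded up to 4 segments
assert a["fft"] == (4, 4)
assert a["logger"] == (8, 1)
```

Because each task's partition is private, its hit rate (and hence its worst-case time) depends only on its own behavior, which is the predictability property the summary describes.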
russell@apple.com (Russell Williams) (02/09/91)
The Elxsi gamma CPU (running in the lab but never shipped before Elxsi U.S. quit R&D) implemented cache partitioning for real-time support. Its 1 MB I and D caches could be independently partitioned up to 8 ways each. The simple implementation (replacing a variable number of the high-order address bits presented to the direct-mapped cache) meant that partitions had to be power-of-two sizes, and that the performance cost was negligible. A small piece of software handled partition allocation, sharing the largest unallocated partition among all the tasks that hadn't explicitly requested a partition. Minor modifications were required to areas of the system software which handled cache flushing (e.g. for I/O).