ed@iitmax.IIT.EDU (Ed Federmeyer) (12/31/89)
One of the things that seems to characterize RISC chips is the relatively
large number of registers.  This makes me wonder what happens during a
context switch.  After all, moving 256 (or more) registers to memory, and
then another 256 (or more!) back in for each context switch seems like an
awful lot of overhead.  I can think of a few ways around this:

1) Do nothing special... Suffer.

2) Have each register "tagged" like a cache, so only the "dirty" registers
   need to be moved out.  You'd still have to load in all the old ones.

3) Have a few register "sets".  I.e., a context switch really moves a
   pointer to a bank of registers (of which there are several on-chip).

4) Like 3, but only have 2 sets.  While context 2 is processing, drain out
   context 1's set so it's ready by the next switch.  Since a RISC chip
   seems to execute 1 instruction in 1 cycle, I can't see that there are
   a lot of extra bus cycles.  (Unlike in a CISC, where you might have
   hundreds of clock cycles free, in which the bus is not being used,
   while the processor executes an instruction.)  Unless of course you
   have a second bus going to memory dedicated to just shuttling registers
   in and out.

Are any of these sorts of techniques being used in real RISC chips?  Or am
I just overestimating the overhead of RISC context switching?  If I get
some Email responses to this, I'll post a summary.

-- 
+-------------------------------------+--------------------------------------+
| Ed Federmeyer                       | Internet:  ed@iitMax.iit.edu         |
|    "Unauthorized access is          | Bitnet:    sysed@iitVax              |
|     strictly unauthorized."         | Office:    (312) 567-5981            |
+-------------------------------------+--------------------------------------+
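[Option 3 above, banked register sets, can be sketched in a few lines of C.
This is a hedged illustration only: the names NBANKS, NREGS, and
bank_switch are invented, and no real chip is being described.]

```c
/* Sketch of option 3: a "context switch" just moves a bank pointer;
   no register contents travel to or from memory.  All names are
   illustrative, not taken from any real RISC design. */
#include <stdint.h>

#define NBANKS 4
#define NREGS  32

static uint32_t regfile[NBANKS][NREGS];  /* several on-chip banks */
static int current_bank = 0;

/* Select the bank owned by the new context; return the old bank so
   the OS can hand it back later. */
int bank_switch(int new_bank)
{
    int old = current_bank;
    current_bank = new_bank;
    return old;
}

uint32_t read_reg(int r)
{
    return regfile[current_bank][r];
}

void write_reg(int r, uint32_t v)
{
    regfile[current_bank][r] = v;
}
```

The cost of a switch is constant and tiny, but so is the number of
contexts that fit on-chip; past NBANKS processes you are back to draining
a bank to memory.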
hankd@pur-ee.UUCP (Hank Dietz) (01/02/90)
In article <3167@iitmax.IIT.EDU> ed@iitmax.iit.edu (Ed Federmeyer) writes:
>One of the things that seems to characterize RISC chips is the relatively
>large number of registers.  This makes me wonder what happens during a
>context switch.  After all, moving 256 (or more) registers to memory, and
>then another 256 (or more!) back in for each context switch seems like an
>awful lot of overhead.  I can think of a few ways around this:
>
>1) Do nothing special... Suffer
>2) Have each register "tagged" like a cache, so only the "dirty" registers
>   need to be moved out.  You'd still have to load in all the old ones.
>3) Have a few register "sets".  Ie, a context switch really moves a pointer
>   to a bank of registers (of which there are several on-chip).
>4) Like 3, but only have 2 sets.  While context 2 is processing, drain out
...

First of all, it isn't just the register file which has gotten big -- it's
the complete localized process state.  This includes registers, caches,
even process-specific page tables and disk buffers.  Second, it has
NOTHING TO DO WITH BEING RISC -- chips are fast, talking with other chips
is slow, talking with other boards is even slower, so ANY high-performance
architecture naturally tends toward a larger, longer-lived, localized
process state.

As for your list of choices, you left out two favorites:

5) Initiate the context switch before it is needed... this looks a lot
   like the incremental checkpointing done by the fault-tolerance folk.

6) Since nobody builds uniprocessors anymore (;-), NEVER interrupt a
   processor: simply let another processor which happens to be free at
   that moment (or the next processor to become free) handle the
   interrupt.

Number 6 is the really interesting one.  Suppose you have an MIMD
architecture with perhaps 64 processors and typically only about 5-10
programs running simultaneously... you should use execution-time changes
in the parallelism width of each program to implement priorities.
The OS scheduler (which is partly hardware) would simply ensure that there
is always at least one processor free to service any time-critical
interrupt which might arrive (e.g., "incoming ICBMs detected").  Less
critical interrupts (e.g., "Joe Luser types the letter Q") can simply be
buffered awaiting a free processor... one will become free as soon as a
program either terminates or reaches a point at which it can change
parallelism width.  With "enough" processors, this stochastic wait for a
free processor can be made arbitrarily short....

BTW, you might say that processes can require context switches for
synchronous events (e.g., loading a value from memory which is far away),
but IMHO the use of a context switch is usually overkill in such cases
(sorry, Burton ;-).  This is because, with the right architecture,
synchronous delay events can be hidden using static (compile-time)
scheduling (e.g., code motions to hide delayed loads).

Besides that, it doesn't bother me very much if a few of many processors
are needlessly idle.  Why?  Because adding context-switching ability adds
a surprisingly large degree of circuit complexity (as you have implied
above); my wild guesstimate is that it would be at least 30%.  Suppose I
have the choice between having 100 context-switchable processors or 130
otherwise-equivalent non-interruptable processors... about 30
non-interruptable processors would have to be idle before I'd get worried.
Oh yes... the non-interruptable processors will probably also have fewer
gate delays per basic operation -- they'll run with a faster clock.

						-hankd@ecn.purdue.edu
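[The buffering of less-critical interrupts described above amounts to a
shared queue that free processors drain instead of anyone ever being
interrupted.  A hedged sketch, with every name (QLEN, irq_post, irq_take)
invented for illustration:]

```c
/* Sketch of option 6: interrupts are posted to a shared queue; a
   processor that becomes free pulls the next one.  No processor is
   ever preempted.  All names here are illustrative. */
#include <stddef.h>

#define QLEN 64

static int    queue[QLEN];
static size_t head, tail;          /* head == tail means empty */

/* A device (or the hardware scheduler) posts an interrupt number;
   returns 0 if the buffer is full, 1 on success. */
int irq_post(int irq)
{
    size_t next = (tail + 1) % QLEN;
    if (next == head)
        return 0;                  /* buffer full: interrupt lost/retried */
    queue[tail] = irq;
    tail = next;
    return 1;
}

/* Called by a processor when it becomes free, instead of ever taking
   an asynchronous interrupt; returns -1 if nothing is pending. */
int irq_take(void)
{
    if (head == tail)
        return -1;
    int irq = queue[head];
    head = (head + 1) % QLEN;
    return irq;
}
```

In a real machine the queue would need atomic operations (many processors
poll it), and a reserved processor would still field the time-critical
class directly, as the post argues.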
tim@nucleus.amd.com (Tim Olson) (01/02/90)
In article <3167@iitmax.IIT.EDU> ed@iitmax.iit.edu (Ed Federmeyer) writes:
| One of the things that seems to characterize RISC chips is the relatively
| large number of registers.  This makes me wonder what happens during a
| context switch.  After all, moving 256 (or more) registers to memory, and
| then another 256 (or more!) back in for each context switch seems like an
| awful lot of overhead.  I can think of a few ways around this:
|
| 1) Do nothing special... Suffer
|
| 2) Have each register "tagged" like a cache, so only the "dirty" registers
|    need to be moved out.  You'd still have to load in all the old ones.
| 3) Have a few register "sets".  Ie, a context switch really moves a pointer
|    to a bank of registers (of which there are several on-chip).
| 4) Like 3, but only have 2 sets.  While context 2 is processing, drain out
|    context 1's set so it's ready by the next switch.  Since a RISC chip
|    seems to execute 1 instruction in 1 cycle, I can't see that there are
|    a lot of extra bus cycles.  (Unlike in a CISC, where you might have
|    hundreds of clock cycles free while the processor executes an
|    instruction in which the bus is not being used.)  Unless of course you
|    have a second bus going to memory dedicated to just shuttling
|    registers in and out.

Here are three methods that can be used to reduce context-switch time on
the Am29000 RISC processor:

1) Register Banking (like your #3).

   Because the large local register file can be addressed at an offset
   from a stack-pointer value, it can be divided into separate register
   banks.  A context switch then consists of saving and restoring a few
   special registers (to the on-chip global registers) and changing the
   stack pointer to use the new register bank.  Register banks can be
   protected from user-mode access in groups of 16 registers.  A full
   context switch takes about 1 microsecond at 25MHz.  This method is
   supported in the Am29000 hardware, but current software tools do not
   make use of it.

2) Reduced Stack Cache Size.
   The standard Am29000 calling convention uses the local register file
   as a runtime stack cache.  The maximum number of registers used in the
   cache can be regulated by a user-defined register spill/fill trap
   handler, which is invoked whenever a portion of the cached stack must
   be saved to or restored from memory.  The number of registers used can
   range from ~32 to 128, allowing the system designer to make the
   correct trade-off between context-switch time and individual task
   performance.  Using this technique, context-switch times can be varied
   from ~9 microseconds to ~17 microseconds with 2-cycle first-access,
   single-cycle burst memory at 25MHz.

3) Saving/Restoring Live Registers Only (like your #2).

   In systems where security (covert channels) is not an issue (as in
   most real-time embedded control systems), the context-switch time can
   be reduced by saving only the live local registers found between the
   current stack pointer and the top of the stack cache.  This is
   typically about half of the total local registers.  In addition, only
   the current procedure frame of the new context need be restored; the
   rest of the stack cache will be "faulted in" if necessary.  This
   results in context-switch times of ~10 microseconds using the full
   128-register stack cache.

-- 
	Tim Olson
	Advanced Micro Devices
	(tim@amd.com)
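[Method 3 above can be sketched directly: only the region between the
stack pointer and the top of the cache is copied out.  A hedged
illustration, not the Am29000's actual mechanism; CACHE_SIZE, stack_cache,
and save_live are invented names.]

```c
/* Sketch of "save live registers only": the stack cache grows downward
   from the top, so everything in [sp, CACHE_SIZE) is live and anything
   below sp is dead and never saved.  Names are illustrative. */
#include <stdint.h>
#include <string.h>

#define CACHE_SIZE 128

static uint32_t stack_cache[CACHE_SIZE];

/* Save the live region to save_area and return how many registers
   were actually written out -- typically about half the file. */
size_t save_live(uint32_t *save_area, size_t sp)
{
    size_t live = CACHE_SIZE - sp;
    memcpy(save_area, &stack_cache[sp], live * sizeof stack_cache[0]);
    return live;
}
```

With sp halfway down the cache, a switch moves 64 words instead of 128;
the dead half is exactly the covert-channel risk discussed later in the
thread, since its old contents survive into the next context.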
mat@uts.amdahl.com (Mike Taylor) (01/03/90)
In article <14007@pur-ee.UUCP>, hankd@pur-ee.UUCP (Hank Dietz) writes:
> 6) Since nobody builds uniprocessors anymore (;-), NEVER interrupt a
>    processor: simply let another processor which happens to be free at
>    that moment (or the next processor to become free) handle the
>    interrupt.
>
> Number 6 is the really interesting one.

A slight variation on this is used by some mainframe operating systems:
in a multiprocessor, only one processor may be enabled for I/O
interruptions.  The OS will enable another processor if the interrupt
load on the enabled processor reaches a certain threshold (and so on).
Of course, any processor can start an I/O operation.  A nice side effect
is that the enabled processor can test for pending interrupts when it
finishes interrupt processing and handle the next one in the queue
without bothering to process-switch.  This batching effect increases
performance further.

-- 
Mike Taylor                               ...!{hplabs,amdcad,sun}!amdahl!mat
[ This may not reflect my opinion, let alone anyone else's. ]
mash@mips.COM (John Mashey) (01/03/90)
In article <3167@iitmax.IIT.EDU> ed@iitmax.iit.edu (Ed Federmeyer) writes:
>One of the things that seems to characterize RISC chips is the relatively
>large number of registers.  This makes me wonder what happens during a
>context switch.  After all, moving 256 (or more) registers to memory, and
>then another 256 (or more!) back in for each context switch seems like an
>awful lot of overhead.  I can think of a few ways around this:

It is good to avoid over-generalizing:

a) Many RISCs do not have as many registers, i.e., in approximate numbers
   of 32-bit registers, ignoring system registers:
	88K:          32
	MIPS & HP PA: 64

b) The registers are only one part of the state / cost for context
   switching.  Others include:
	- system and internal registers (consider the amount of state
	  pushed on the stack of a 68K on some exceptions)
	- MMU/TLBs [some designs require state save if there are more than
	  N processes; others pay the cost incrementally, but some cost is
	  paid, sometime]
	- cache structure, especially if flushes are needed, as in the i860

c) In fact, in OSs like UNIX, moderate changes in numbers of registers
   have fairly minimal effects on context-switching time, compared with
   the above issues, plus something even more important: the OS structure
   and its design tradeoffs.  In particular, the definition of "context
   switch" is pretty fuzzy, because it really matters how you measure it.
   Is it:
	- time to save/restore registers?
	- time to save/restore entire state?
	- time to do all of that, plus the OS scheduling code?
	- measured by switching between 2 processes, or between N?

d) The amount of state changed is likely to be more important in such
   operating systems as:
	1) Switching-machine OSs, which have large numbers of processes
	   that execute for fairly short periods of time.
	2) Real-time operating systems, especially at the "hard" edge,
	   where guaranteed predictability is required.  (Mechanisms that
	   increase throughput for some things often increase the
	   variability.)
Given all of this, it's sometimes pretty hard to make comparisons, except
by benchmarking an entire hardware + operating system combination, ideally
with the same OS.

I have a tiny piece of data, from which large conclusions should not be
drawn, but which is useful.  JMI Software Consultants has a fairly
portable real-time kernel called C Executive (TM), and they publish some
numbers on different environments.  Here are times in microseconds for
context switch, version 2.3, published 10/89:

	CPU              R2000    68020        29000     SPARC
	System           M/120    FORCE        AMD       Sun-4/2xx
	MHz              16.7     25           20        16.7
	C Compiler       MIPS     Whitesmiths  MetaWare  Sun
	Cntxt Switch,
	  microseconds   10       17           29        20

Now, that's all that I know.  I don't know what the board designs are for
the 68020 and 29000, and I have no idea of the internal structure of the
C Executive and how it matches the different chips; I have no idea how
they measure context switching, just that it's consistent.  It is pretty
clear that there is at least a gross correlation with number of registers
here, but that's about all.
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
shankar@SRC.Honeywell.COM (Subash Shankar) (01/03/90)
In article <28573@amdcad.AMD.COM> tim@amd.com (Tim Olson) writes:
# 3) Saving/Restoring Live Registers Only (like your #2). In systems
# where security (covert channels) is not an issue (as in most
# real-time embedded control systems), the context-switch time can
# be reduced by saving only the live local registers found between
# the current stack pointer and the top of the stack cache.
Why does security affect things here?
---
Subash Shankar Honeywell Systems & Research Center
voice: (612) 782 7558 US Snail: 3660 Technology Dr., Minneapolis, MN 55418
shankar@src.honeywell.com srcsip!shankar
honig@ics.uci.edu (David A. Honig) (01/03/90)
In article <52104@srcsip.UUCP> shankar@src.honeywell.com (Subash Shankar) writes:
>In article <28573@amdcad.AMD.COM> tim@amd.com (Tim Olson) writes:
>
># 3) Saving/Restoring Live Registers Only (like your #2).  In systems
>#    where security (covert channels) is not an issue (as in most
>#    real-time embedded control systems), the context-switch time can
>#    be reduced by saving only the live local registers found between
>#    the current stack pointer and the top of the stack cache.
>
>Why does security affect things here?

Isn't it obvious?  The contents of registers, if not overwritten, provide
a 'hidden' communication path between processes.  If you worry about that,
then your context switches will have to do more work.

And I don't do either architecture or security professionally!

-- 
David A. Honig
"As these houses are built, the lions continue to wander around those
areas, and they bump into a poodle or a German shepherd.  That causes
some trouble."  ---Prof. R. Barrett, UCB
melvin@ucbarpa.Berkeley.EDU (Steve Melvin) (01/03/90)
In article <14007@pur-ee.UUCP> hankd@pur-ee.UUCP (Hank Dietz) writes:
>First of all, it isn't just the register file which has gotten big -- it's
>the complete localized process state.  This includes registers, caches,
>even process-specific page tables and disk buffers.  Second, it has
>NOTHING TO DO WITH BEING RISC -- chips are fast, talking with other chips
>is slow, talking with other boards is even slower, so ANY high-performance
>architecture naturally tends toward a larger, longer-lived, localized
>process state.

Your point that the localized process state includes more than just
registers is well taken, but I'd like to take issue with your second
statement.  There are indeed architectural features which affect the size
of the process state, independent of the speed differential between the
processor and the rest of the world.

The key is that temporary results which are computed and used within a
basic block need not be stored in "named" registers.  That is, they may be
internal registers but NOT part of the process state if, for example,
context switches are never initiated except on basic-block boundaries
(interrupts can either have long latencies, or work can be discarded).  Of
course, how much the state can be reduced while preserving the advantages
of the same number of internal registers depends on the application and
the compiler.  The point is that by increasing the size of the unit of
work which the processor considers "atomic" (I call this an Execution
Atomic Unit), fewer *architectural* registers are required.  The main
reason one would want to do this is to allow more parallelism to be
exploited, but that's another issue.

>BTW, you might say that processes can require context switches for
>synchronous events (e.g., loading a value from memory which is far away),
>but IMHO the use of a context switch is usually overkill in such cases
>(sorry, Burton ;-).
>This is because, with the right architecture,
>synchronous delay events can be hidden using static (compile-time)
>scheduling (e.g., code motions to hide delayed loads).

Well, first of all, the use of the term "context switch" is a little
misleading as applied to the HEP-1.  That machine allows the context for
up to 50 processes to be simultaneously present in the processor, eight of
which are actively in the pipeline at any time.  When a process needs to
wait for memory, it is simply not returned to the group (of 50) which are
candidates for being scheduled.

Secondly, I disagree that with the "right architecture" synchronous delay
events can be hidden.  If there is no variance in memory latency (i.e., if
you have no cache, or if you have a 100% hit rate), then the compiler can
do a pretty good job (branch predictability is also important here, static
vs. dynamic, but that's another issue).  However, when memory latency
becomes more variable, dynamic scheduling starts becoming more important
(also, this effect is increasingly important as the number of parallel
operations increases).

-------
Steve Melvin
University of California, Berkeley
-------
firth@sei.cmu.edu (Robert Firth) (01/03/90)
In article <28573@amdcad.AMD.COM> tim@amd.com (Tim Olson) writes:
># 3) Saving/Restoring Live Registers Only (like your #2).  In systems
>#    where security (covert channels) is not an issue

In article <52104@srcsip.UUCP> shankar@src.honeywell.com (Subash Shankar) writes:
>Why does security affect things here?

Here's a simple example.  (Mnemonics have been changed to protect the
innocent.)  You get hold of the login processing code of the Mugwump IV
multi-user operating system.  That code reads an 8-character password,
scrambles it, and checks it.  It just so happens that, on exit from the
code, the clear password is still lying around in <R6,R7>.

You write a short program that does this:

	loop
	    get myself time-sliced
	    read <R6,R7>
	    write the value to a file
	end loop

Every so often, by pure chance, the OS scheduler will resume your program
immediately after executing the login process.  Every time that happens,
you will have captured a password.

The registers <R6,R7> are the covert channel by means of which information
leaks from the login process to your program.  An OS that saved and
restored only live registers would not save and restore them, so leaving
the channel open.  (The registers aren't live because there is no write to
them anywhere in the program.)
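[One way to keep most of the lazy-save speedup while closing this channel
is to scrub the registers the restore did not overwrite.  This is my own
hedged sketch, not anything from the posts above; NREGS, regs, and
switch_scrubbed are invented names.]

```c
/* Sketch: a "live registers only" switch that zeroes every register
   the incoming context does not restore, so stale values (such as the
   cleartext password above) cannot leak.  Names are illustrative. */
#include <stdint.h>
#include <string.h>

#define NREGS 16

struct context {
    uint32_t saved[NREGS];   /* saved live registers */
    int      nlive;          /* how many of them are live */
};

static uint32_t regs[NREGS];           /* the physical register file */

void switch_scrubbed(struct context *old, const struct context *new_ctx)
{
    /* Save only the old context's live registers... */
    memcpy(old->saved, regs, (size_t)old->nlive * sizeof regs[0]);
    /* ...then scrub the whole file before restoring, closing the
       covert channel through the dead registers... */
    memset(regs, 0, sizeof regs);
    /* ...and restore only the new context's live registers. */
    memcpy(regs, new_ctx->saved, (size_t)new_ctx->nlive * sizeof regs[0]);
}
```

The memset costs far less than a full save/restore, since zeroing a
register needs no memory traffic; whether that trade is worth it depends
on how hard the hardware makes bulk register clears.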
rec@dg.dg.com (Robert Cousins) (01/04/90)
In article <7fWJ02qY7aoo01@amdahl.uts.amdahl.com> mat@uts.amdahl.com (Mike Taylor) writes:
>In article <14007@pur-ee.UUCP>, hankd@pur-ee.UUCP (Hank Dietz) writes:
>> 6) Since nobody builds uniprocessors anymore (;-), NEVER interrupt a
>>    processor: simply let another processor which happens to be free at
>>    that moment (or the next processor to become free) handle the
>>    interrupt.
>> Number 6 is the really interesting one.
>A slight variation on this is used by some mainframe operating systems -
>in a multiprocessor only one processor may be enabled for I/O
>interruptions.  The OS will enable another processor if the interrupt
>load on the enabled processor reaches a certain threshold (and so on).
>Of course, any processor can start an I/O operation.  A nice side-effect
>is that the enabled processor can test for pending interrupts when it
>finishes interrupt processing and handle the next one in the queue
>without bothering to process switch.  This batching effect increases
>performance further.

This can bring about unacceptable performance degradations under some
circumstances.  A number of fully symmetric operating systems allow any
processor to field an interrupt from any peripheral (DG/UX does).  One of
the reasons for this is that single-threading interrupt service (which is
what you are talking about) introduces two sets of delays: first, a
peripheral must wait longer on average for interrupt service, which
reduces throughput; second, a sleeping task may not wake up until a
cross-processor interrupt occurs.  Both of these are statistical problems
that occur to some degree on all MP OSs.  However, if you do the queueing
computations, it is indeed preferable to allow multiple CPUs to handle
interrupts.

This does not mean, however, that batching of interrupt processing is a
bad idea -- just the opposite.  When a CPU completes one ISR, it should be
able to enter another without penalty.
Conclusion: administering an interrupt policy for a high-performance
multiprocessor is more complex than normally considered.

Robert Cousins
Dept. Mgr, Workstation Dev't.
Data General Corp.

Speaking for myself alone.
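[The "queueing computations" alluded to above can be illustrated with the
standard M/M/c (Erlang C) formulas.  This is my worked example, not from
the posts; the rates and function names are invented.]

```c
/* Illustrative M/M/c computation: mean time an interrupt waits for
   service when c CPUs may field interrupts, with arrival rate lam
   and per-CPU service rate mu (requires lam/mu < c for stability). */

/* Erlang C: probability an arriving interrupt must queue, where
   a = lam/mu is the offered load. */
double erlang_c(int c, double a)
{
    double sum = 0.0, term = 1.0;      /* term = a^k / k! */
    for (int k = 0; k < c; k++) {
        sum += term;
        term *= a / (k + 1);
    }
    /* after the loop, term == a^c / c! */
    double tail = term * c / (c - a);
    return tail / (sum + tail);
}

/* Mean queueing delay before interrupt service begins. */
double mean_wait(int c, double lam, double mu)
{
    return erlang_c(c, lam / mu) / (c * mu - lam);
}
```

At 50% load (lam = 0.5, mu = 1), one enabled CPU gives a mean queueing
delay of 1.0 service times, while letting two CPUs field interrupts cuts
it to about 0.067 -- which is the sense in which allowing multiple CPUs to
handle interrupts wins, even though batching on each CPU remains a good
idea.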
davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr) (01/04/90)
While this topic is going on, does any (production) CPU use a register
dirty bit?  What I envision is that there would be a load-em-all
instruction which would load all registers from memory and clear the dirty
bits.  On a context switch, the save-em-all instruction would save only
those which were modified.

I can't envision a system working which didn't preserve the registers in a
context switch, and I think the proposed loop

	sleep a bit
	check the registers

is fanciful.  If the registers change while my process is running, I would
expect the process to die sooner rather than later.  I *can* envision that
on first startup the registers might not be set to a known state.

With memory it's another story.  We used to allocate memory and search it
all the time in the "good old days."  We ran (perhaps crawled) 32 users on
a single MB of core, too, and I don't want to go back to that, either.
-- 
bill davidsen	(davidsen@crdos1.crd.GE.COM -or- uunet!crdgw1!crdos1!davidsen)
"The world is filled with fools. They blindly follow their so-called
'reason' in the face of the church and common sense. Any fool can see
that the world is flat!" - anon
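[The proposed dirty-bit scheme can be sketched as follows.  This is a
hedged software model of the idea, not any production CPU; load_em_all,
save_em_all, and the bit layout are invented for illustration.]

```c
/* Sketch of the dirty-bit proposal: load-em-all fills the file and
   clears the dirty bits; every write sets its register's bit (in
   hardware, a side effect of the write port); save-em-all stores
   back only what was modified.  All names are illustrative. */
#include <stdint.h>
#include <string.h>

#define NREGS 32

static uint32_t regs[NREGS];
static uint32_t dirty;            /* bit r set => regs[r] was modified */

/* "load-em-all" instruction: load every register, clear dirty bits. */
void load_em_all(const uint32_t *mem)
{
    memcpy(regs, mem, sizeof regs);
    dirty = 0;
}

void write_reg(int r, uint32_t v)
{
    regs[r] = v;
    dirty |= (uint32_t)1 << r;
}

/* "save-em-all" instruction: store only dirty registers; the return
   value is the number of memory stores the switch actually cost. */
int save_em_all(uint32_t *mem)
{
    int saved = 0;
    for (int r = 0; r < NREGS; r++)
        if (dirty & ((uint32_t)1 << r)) {
            mem[r] = regs[r];
            saved++;
        }
    return saved;
}
```

Note the asymmetry Ed's original post pointed out: the save side shrinks
with the number of dirty registers, but the reload of the next context is
still a full load-em-all.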