lamaster@pioneer.UUCP (02/28/87)
In article <5763@amdahl.UUCP> chuck@amdahl.UUCP (Charles Simmons) writes:
>In article <448@cpocd2.UUCP> howard@cpocd2.UUCP (Howard A. Landman) writes:
>>In article <4376@columbia.UUCP> eppstein@tom.columbia.edu (David Eppstein) writes:
>>>[David implies that an interrupt requires a complete context switch
>>>of writing all active windows to memory and reloading them after
>>>handling the interrupt]
>>
>>[Howard points out that for an interrupt, the operating system code
>>that handles the interrupt can promise not to trash active windows,
>>and continue to use the existing stack.  He also points out that
>>context switches between user processes occur relatively infrequently
>>and thus the overhead of writing out all active windows and reading in
>>new windows on a context switch is not important.]
>
>I agree with this, so I'm going to change the subject slightly.  Lately
>I've been thinking about operating systems which would perform a complete
>context switch on every interrupt.  In particular, code for handling
>interrupts would run in a restricted environment (i.e. user mode) where
>their ability to accidentally trash other interrupt handlers, the operating
>system, and device drivers would be minimized.
>
>My biggest problem with this concept is that context switching, even
>on a machine with "only" 16 registers, is extremely expensive.
>I'd be interested in hearing about architectures that minimize the
>expense of a context switch.

Actually, if you have a fast, pipelined memory interface, it is not so
expensive after all to save the complete context automatically.  CDC
and Cray have been doing various forms of this for many years.  On the
CDC 6000/7000 and Cyber 70 and 170 machines this is known as "exchange
jump".  The Cyber 205 actually swaps 256 registers in less than 200
CPU cycles (not to be confused with the longer memory cycles).  It can
do this because of the way the memory interface is structured.  It may
look "expensive", but it simplifies register management so much that it
is actually cheap.

On the Cray-1 type machines, this automatic exchange jump feature
includes only the small register sets, not the vector registers,
because of the expense of saving 8 vector registers.  So the
complexity of making operating system kernel code aware of the state
of the vector registers, and of who they belong to, is reintroduced.

Anyway, automatic context saving is not pie in the sky.  It works well
and probably provides a net performance/price improvement on machines
which are going to have a fast, pipelined memory interface anyway; many
machines intended for engineering/scientific/floating point intensive
use DO, and more SHOULD.

  Hugh LaMaster, m/s 233-9,  UUCP {seismo,topaz,lll-crg}!ames!pioneer!lamaster
  NASA Ames Research Center  ARPA lamaster@ames-pioneer.arpa
  Moffett Field, CA 94035    ARPA lamaster@pioneer.arc.nasa.gov
  Phone:  (415)694-6117      ARPA lamaster@ames.arc.nasa.gov

"In order to promise genuine progress, the acronym RISC should stand
for REGULAR (not reduced) instruction set computer." - Wirth

("Any opinions expressed herein are solely the responsibility of the
author and do not represent the opinions of NASA or the U.S. Government")
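A back-of-the-envelope reading of the Cyber 205 figure quoted above, as a sketch only. It assumes that a "swap" stores 256 registers and loads 256 others, and it ignores the hidden registers; the function name and the two-direction assumption are illustrative and do not come from any CDC manual.

```python
def min_words_per_cycle(registers, cycles, both_directions=True):
    """Lower bound on average memory traffic, in words per CPU cycle.

    Assumption: an exchange moves every register once in each direction
    (old context out, new context in) unless both_directions is False.
    """
    words = registers * (2 if both_directions else 1)
    return words / cycles


# 256 registers swapped in (at most) 200 CPU cycles, per the post above.
rate = min_words_per_cycle(256, 200)
print(f"at least {rate:.2f} words per cycle")
```

Under those assumptions the memory interface has to sustain more than two words per CPU cycle on average, which is consistent with the point that the exchange is only cheap when the memory interface is fast and pipelined.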
rdt@houxv.UUCP (03/02/87)
> Actually, if you have a fast, pipelined memory interface, it is not so
> expensive after all to save the complete context automatically.  CDC
> and Cray have been doing various forms of this for many years.  On the
> CDC 6000/7000 and Cyber 70 and 170 machines this is known as "exchange
> jump".  The Cyber 205 actually swaps 256 registers in less than 200
> CPU cycles (not to be confused with the longer memory cycles).  It can
> do this because of the way the memory interface is structured.  It may
> look "expensive", but simplifies register management so much that it
> is actually cheap.  It works well and probably provides a net
> performance/price improvement on machines which are going to have a
> fast, pipelined memory interface anyway

Could anyone share with us how the CPU hardware and operating
system coordinate their response to a faulted copyback midway
through an exchange jump?

Let 16 registers be copied back to memory (actually a stack in
main memory) and let a fault occur at register copyback in word
5 of the 16 total.  Consider 2 cases of fault: a fault due to a
page not present (stack overflow) and a fault due to a memory
parity error.

I assume the OS must immediately respond to the fault by moving
the registers out to a temporary safe area of memory before
switching to the new process.  (Recall that part of the old
process context still resides in the shared registers.)

How does the hardware differentiate this kind of fault from
non-context-switch memory faults when calling the OS fault
handler?  How does the hardware get the base address of
the temporary safe area?  What happens if the copy to the
safe area faults as well?

If the context being saved is the contents of a copyback
stack cache rather than those of a register file, the
situation gets more interesting, since one cannot typically
fault the instruction which wrote the word into the cache
until cache copyback discovers the problem with main memory.

Any comments?

Richard Trauben
ATT Information Systems
WE32x00 Processor Development
Holmdel, NJ.
billw@navajo.UUCP (03/03/87)
So has anyone done research to see if it might be worthwhile to
implement a machine with multiple sets of registers (say 16 sets
of 64 registers, with register windows)?  This would cut down on
the overhead of context switching a lot, assuming that not too many
processes were "active" at once.  It would probably add a lot to
the complexity of the system, though...

BillW
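The multiple-register-set idea can be sketched as a toy model. This is a minimal illustration, not any real machine: the class name, the least-recently-used eviction policy, and the dictionary standing in for main memory are all assumptions. A switch to a process that already owns a bank only changes an index; a switch to a non-resident process has to spill some other bank first, which is where the "not too many active processes" condition comes from.

```python
class BankedRegisterFile:
    """Toy model: one register bank per resident process (hypothetical)."""

    def __init__(self, num_banks=16, regs_per_bank=64):
        self.regs_per_bank = regs_per_bank
        self.banks = [[0] * regs_per_bank for _ in range(num_banks)]
        self.owner = [None] * num_banks      # pid resident in each bank
        self.last_used = [0] * num_banks     # for LRU eviction (an assumption)
        self.memory = {}                     # pid -> spilled register contents
        self.current = 0                     # index of the active bank
        self.clock = 0
        self.spills = 0                      # banks written out to memory

    def switch_to(self, pid):
        """Make pid's registers current; return the bank index used."""
        self.clock += 1
        if pid in self.owner:
            bank = self.owner.index(pid)     # fast path: just change the index
        else:
            bank = self._allocate(pid)       # slow path: may spill a bank
        self.current = bank
        self.last_used[bank] = self.clock
        return bank

    def _allocate(self, pid):
        if None in self.owner:
            bank = self.owner.index(None)
        else:
            # Evict the least recently used bank and save it to memory.
            bank = min(range(len(self.owner)), key=self.last_used.__getitem__)
            self.memory[self.owner[bank]] = list(self.banks[bank])
            self.spills += 1
        # Reload the incoming process's registers if it was spilled earlier.
        self.banks[bank] = self.memory.pop(pid, [0] * self.regs_per_bank)
        self.owner[bank] = pid
        return bank

    def read(self, reg):
        return self.banks[self.current][reg]

    def write(self, reg, value):
        self.banks[self.current][reg] = value
```

With 16 banks, switching among up to 16 processes never touches memory in this model; the 17th active process forces a spill and a reload on some later switch.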
lamaster@pioneer.UUCP (03/04/87)
In article <542@houxv.UUCP> rdt@houxv.UUCP (R.TRAUBEN) writes:
> Could anyone share with us how the CPU hardware and operating
> system coordinate their response to a faulted copyback midway
> thru an exchange jump?
>
> Let 16 registers be copied back to memory (actually stack in
> main memory) and a fault occurs at register copyback in word
> 5 of the 16 total.  Consider 2 cases of fault: fault due to a
> page not present (stack overflow) and fault due to a memory
> parity error.
>
> I assume the os must immediately respond to the fault by moving
> the registers out to a temporary safe area of memory before
> switching to the new process.  (Recall that part of the old
> process context still resides in the shared registers)
>
> How does the hardware differentiate this kind of fault from
> non-context switch memory faults when calling the OS fault
> handler?  How does the hardware get the base address of
> the temporary safe area?  What happens if the copy to the
> safe area faults as well?
>
> If the context being saved is the contents of a copyback
> stack cache rather than those of a register file, the
> situation gets more interesting since one can not typically
> fault the instruction which wrote the word into the cache
> until cache copyback discovers the problem with main memory.
>
> Any comments?
>
>Richard Trauben
>ATT Information Systems
>WE32x00 Processor Development
>Holmdel, NJ.

The best answer to this question is to refer you to the Cyber 205
Hardware Reference Manual, CDC #60256020.  However, there are some
things that could be said briefly:

You are certainly correct that the hardware must know what physical
memory to swap the registers to if a page fault occurs.  I glossed
over the fact that the Cyber 205 and the lower Cybers have a somewhat
different structure.  Because memory is always resident on a lower
Cyber (no page faults), any memory address can be used to swap.  The
lower Cyber exchange jump need not be to the operating system kernel,
although there is a tree-like structure.  On the Cyber 205, the
exchange jump is referred to as "exit force" and it is always between
the operating system kernel and another process.  On the lower Cybers,
a "job supervisor" could exchange to/from a job without going through
the kernel.  On the Cyber 205, to get to the equivalent of a job
supervisor, first the process exits to the kernel, and the kernel then
exits to the supervisor.

On both types of machines, there are a number of hidden registers in
addition to the visible registers.  These registers, when written out
to memory, are referred to as the "exchange package".  One of these
registers holds the physical memory address of the exchange package
that is to be read in if an exchange is done.

So, to make a long story short, in both cases the registers are
exchanged with a physical memory location.  Therefore, there is no
possibility of a page fault.  [If a fatal SECDED fault occurred (one
of the other questions), the CPU would stop on a fatal error, whether
an exchange was taking place or not.  These only happen about once a
year and are not a problem.]

In the case of the Cyber 205, it is the kernel's job to manage the
exchange packages for all the processes in the system.  Since it
doesn't take very long to exchange to/from the kernel, the overhead
is small.

On the Cyber 205, there is an additional user state instruction which
exchanges registers (though not the hidden registers) and runs just
as fast.  If it encounters a page fault, the instruction is continued
(not restarted) when the task is restarted.  The Cyber 205 has
instruction continuation for many other instructions as well, since it
has memory-to-memory vector instructions which may cause page faults.
Although much of the machine is "hardwired", instruction continuation
is "all handled in the microcode".

There is no data cache.  There is an instruction cache.  The 256
registers were a conscious design decision.  In practice, the 256
registers are just about the right size to make sure that for almost
all Fortran modules, local scalar variables are already in registers.
And since the vector instructions are memory to memory, and run through
very large amounts of memory very fast, it would be counterproductive
to read vector data into a cache.  (With a somewhat different
organization, the Alliant mini-super does this, I seem to recall.)

I did not mean to mislead anyone into thinking that the Cyber 205 has
a message-passing structure.  It simply has a very fast way to load
and unload its large register file, and uses it to produce a very fast
context switch between a process and the kernel.  The kernel does all
the message passing (actually, that is almost all it does; it also
schedules).

With a high performance memory interface, it is not catastrophically
expensive to save registers.

  Hugh LaMaster, m/s 233-9,  UUCP {seismo,topaz,lll-crg,ucbvax}!
  NASA Ames Research Center                ames!pioneer!lamaster
  Moffett Field, CA 94035    ARPA lamaster@ames-pioneer.arpa
  Phone:  (415)694-6117      ARPA lamaster@pioneer.arc.nasa.gov

"In order to promise genuine progress, the acronym RISC should stand
for REGULAR (not reduced) instruction set computer." - Wirth

("Any opinions expressed herein are solely the responsibility of the
author and do not represent the opinions of NASA or the U.S. Government")
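The exchange with a fixed physical memory location described above can be sketched roughly as follows. This is a simplified model under stated assumptions, not the Cyber 205's actual mechanism: physical memory is a flat Python list, only the 256 visible registers are swapped, and the hidden register holding the package address (`exchange_addr`, a made-up name) stays fixed instead of being part of the exchange package.

```python
NUM_REGS = 256  # visible register count quoted for the Cyber 205


class Cpu:
    """Toy CPU whose context switch is a swap with physical memory."""

    def __init__(self, memory, exchange_addr):
        self.regs = [0] * NUM_REGS
        self.memory = memory                 # flat list standing in for physical memory
        self.exchange_addr = exchange_addr   # hypothetical hidden register

    def exchange_jump(self):
        """Swap the register file with the package at exchange_addr.

        The address is physical, so no translation happens here and
        there is nothing that could raise a page fault mid-swap.
        """
        base = self.exchange_addr
        incoming = self.memory[base:base + NUM_REGS]
        self.memory[base:base + NUM_REGS] = self.regs
        self.regs = incoming


# A user process running with all registers set to 1, and a kernel
# context (all 7s) parked in memory at physical address 1000.
memory = [0] * 4096
memory[1000:1000 + NUM_REGS] = [7] * NUM_REGS
cpu = Cpu(memory, exchange_addr=1000)
cpu.regs = [1] * NUM_REGS

cpu.exchange_jump()   # kernel context is now in the registers
cpu.exchange_jump()   # and the user context is back again
```

Because the swap is symmetric in this model, two exchanges in a row return to the original context, which mirrors the process-to-kernel-and-back pattern described for the Cyber 205.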
devoz@encore.UUCP (03/04/87)
[flames are not encouraged]

The "PRIME"ates out there on the net should have something to say
about multiple register sets.  They have been using multiple register
sets for years.  They call them "current register sets", if memory
serves me correctly.

devoz@encore
"Are you all part of tonight's dream or what?"
{linus,talcott,decvax,ihnp4,allegra,necis,compass}!encore!devoz
ron@brl-sem.UUCP (03/04/87)
In article <542@houxv.UUCP>, rdt@houxv.UUCP (R.TRAUBEN) writes:
> Could anyone share with us how the CPU hardware and operating
> system coordinate their response to a faulted copyback midway
> thru an exchange jump?
>
> Consider 2 cases of fault: fault due to a
> page not present (stack overflow) and fault due to a memory
> parity error.

Easy.  First, the machines don't page.
Second, use better error correction; parity is for farmers.
ian@loral.UUCP (03/05/87)
In article <1430@navajo.STANFORD.EDU> billw@navajo.STANFORD.EDU (William E. Westfield) writes:
>
>So has anyone done research to see if it might be worthwhile to
>implement a machine with multiple sets of registers (say 16 sets
>of 64 registers, with register windows)?  This would cut down on
>the overhead of context switching a lot assuming that not too many
>processes were "active" at once.  It would probably add a lot to
>the complexity of the system though...
>
>BillW

The Xerox Dorado has multiple register sets and as a result has a very
fast context switch.  For information on the Dorado see

    "The Dorado: A High-Performance Personal Computer.  Three Papers"
    Xerox CSL-81-1, January 1981
    Palo Alto Research Center
    3333 Coyote Hill Rd., Palo Alto, CA 94304

These papers are well worth reading.  To quote from the first paper,
"A Processor for a High-Performance Personal Computer" by Butler W.
Lampson and Kenneth A. Pier, page 7:

    5.3 Task Specific State

    In order to allow the immediate task switching described above,
    the processor must be able to save and restore state within one
    microcycle.  This is accomplished by keeping the vital state
    information throughout the processor not in a single rank of
    registers but in task specific registers.  These are actually
    implemented with high speed memory that is addressed by a task
    number.  Examples of task specific registers are the microcode
    program counter, the branch condition register, the microcode
    subroutine link register, the memory data register, and a
    temporary storage register for each task.  The number of the
    task which will execute in the next microcycle is broadcast
    throughout the processor and used to address the task specific
    registers.  Thus, data can be fetched from high speed task
    specific memories and be available for use in the next cycle.

Ian Kaplan
Loral Dataflow Group
Loral Instrumentation
USENET: {ucbvax,decvax,ihnp4}!sdcsvax!loral!ian
ARPA:   sdcc6!loral!ian@UCSD
USPS:   8401 Aero Dr. San Diego, CA 92123
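The task-specific registers described in the quoted passage can be modeled as a small memory indexed by task number. This is only a sketch of the idea: the field names follow the registers listed in the quote, while the class names, the default of 16 tasks, and the `select` method are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class TaskState:
    """One row of task-specific state, named after the quoted examples."""
    pc: int = 0                 # microcode program counter
    branch_condition: int = 0   # branch condition register
    link: int = 0               # microcode subroutine link register
    memory_data: int = 0        # memory data register
    temp: int = 0               # temporary storage register


class TaskSpecificRegisters:
    """State kept in a memory addressed by task number (hypothetical model)."""

    def __init__(self, num_tasks=16):    # task count is an assumption here
        self.state = [TaskState() for _ in range(num_tasks)]
        self.task = 0                     # task number "broadcast" to the processor

    def select(self, task):
        """A task switch only changes the index; nothing is copied."""
        self.task = task

    @property
    def current(self):
        return self.state[self.task]


regs = TaskSpecificRegisters()
regs.select(3)
regs.current.pc = 100      # task 3 runs and advances its own PC
regs.select(5)
regs.current.pc = 7        # task 5 has completely separate state
regs.select(3)             # switching back needs no save or restore
```

The design choice this illustrates is that no state is ever saved or restored on a switch; each task's registers simply stay where they are, which is what makes a switch within one microcycle plausible.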
lamaster@pioneer.UUCP (03/05/87)
In article <670@brl-sem.ARPA> ron@brl-sem.ARPA (Ron Natalie <ron>) writes:
>In article <542@houxv.UUCP>, rdt@houxv.UUCP (R.TRAUBEN) writes:
> > Could anyone share with us how the CPU hardware and operating
> > system coordinate their response to a faulted copyback midway
> > thru an exchange jump?
> >
>Easy, first, the machines don't page.
>Second, use better error correction, parity is for farmers.

The CDC 6000's (e.g. 6600), 7600, Cyber 70's, and 170's don't page.
The CDC Cyber 205 pages.  The 205 has virtual memory with 2 page sizes:
small pages of 512, 2048, or 8192 words (the small page size is
selectable) and large pages of 64K words.  Words are 64 bits.  Except
for the memory-to-memory vector instructions, which are microprogrammed,
the 205 is generally a somewhat RISCy machine.  The Cyber 205 is a
completely different architecture from the older Cybers.

  Hugh LaMaster, m/s 233-9,  UUCP {seismo,topaz,lll-crg,ucbvax}!
  NASA Ames Research Center                ames!pioneer!lamaster
  Moffett Field, CA 94035    ARPA lamaster@ames-pioneer.arpa
  Phone:  (415)694-6117      ARPA lamaster@pioneer.arc.nasa.gov

"In order to promise genuine progress, the acronym RISC should stand
for REGULAR (not reduced) instruction set computer." - Wirth

("Any opinions expressed herein are solely the responsibility of the
author and do not represent the opinions of NASA or the U.S. Government")
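For a sense of scale, the page sizes given above can be converted from 64-bit words to bytes. A quick calculation, assuming that "64K" means 65,536 words and that a byte is 8 bits; the constant and function names are illustrative only.

```python
WORD_BITS = 64                      # word size stated in the post
BYTES_PER_WORD = WORD_BITS // 8     # assumes 8-bit bytes

SMALL_PAGE_WORDS = (512, 2048, 8192)   # selectable small page sizes
LARGE_PAGE_WORDS = 64 * 1024           # "64K words", read as 65,536


def page_bytes(words):
    """Size of a page of the given word count, in bytes."""
    return words * BYTES_PER_WORD


for words in SMALL_PAGE_WORDS + (LARGE_PAGE_WORDS,):
    print(f"{words:6d} words = {page_bytes(words):7d} bytes")
```

Under these assumptions the small pages come to 4 KB, 16 KB, or 64 KB, and a large page to 512 KB.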
roger@celtics.UUCP (03/13/87)
In article <1430@navajo.STANFORD.EDU> billw@navajo.STANFORD.EDU (William E. Westfield) writes:
>
>So has anyone done research to see if it might be worthwhile to
>implement a machine with multiple sets of registers (say 16 sets
>of 64 registers, with register windows)?

The Celerity Accel processor in the C1200 series has implemented exactly
this approach: a two-dimensional register stack cache, one dimension
being the sliding-window implementation, the second dimension being
an index by context ID, allowing bank-switching and fast context switch
between most active processes.

The register configuration is as follows:

Each window is composed of 16 parameter registers, 16 local variable
registers, and a window into the next 16 registers (a callee's parameter
registers).  Registers are configured as follows on each model:

Model          Number of windows (depth)   Number of contexts (width)
=====          =========================   ==========================
C1200          16                          8
C1230          32                          16
C1260 Dyadic   32                          32 (16 per processor)

The Extended Arithmetic Unit tightly coupled co-processor also has its
own stack cache of 64-bit registers: 15 frames of 15 64-bit registers
on the C1200, 8 banks of 15 frames of 15 64-bit registers on the C1230,
and 8 banks of 15 frames of 15 on the C1260.

(The more observant of you might have guessed that the C1260 is a
symmetrical dual processor based on the C1230...)

Use of static, on-chip registers to augment the register stack caches
has been discussed earlier in this group by JJ Whelan of Celerity's
engineering group, who is a more informed speaker than I (but I'm in
Sales Support, so I just love to shoot my mouth off... :-)
-- 
 ///==\\   (No disclaimer - nobody's listening anyway.)
///        Roger B.A. Klorese, CELERITY (Northeast Area)
\\\        40 Speen St., Framingham, MA 01701 +1 617 872-1552
 \\\==//   celtics!roger@seismo.CSS.GOV - seismo!celtics!roger
faustus@ucbcad.berkeley.edu (Wayne A. Christopher) (03/16/87)
In article <1477@celtics.UUCP>, roger@celtics.UUCP (Roger Klorese) writes:
> Each window is composed of 16 parameter registers, 16 local variable
> registers, and a window into the next 16 registers (a callee's parameter
> registers).  Registers are configured as follows on each model:
>
> Model   Number of windows (depth)   Number of contexts (width)
> =====   =========================   ==========================
> C1230   32                          16

The processor has 16K registers???  On chip?  Is it implemented as
a RAM?  Is that fast enough?

	Wayne
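The 16K figure can be reproduced from the quoted table with a little arithmetic. The sketch below assumes that each window contributes 32 distinct registers (16 parameter plus 16 local), because the "window into the next 16" overlaps the callee's parameter registers and so is not counted again. That reading is an assumption; it is not confirmed anywhere in the thread, and it says nothing about whether the registers are on chip.

```python
PARAM_REGS = 16
LOCAL_REGS = 16
# Assumption: the callee's 16 parameter registers belong to the next
# window, so each window adds only its own parameters and locals.
UNIQUE_PER_WINDOW = PARAM_REGS + LOCAL_REGS

# (windows, contexts) per model, taken from the table quoted above.
MODELS = {
    "C1200": (16, 8),
    "C1230": (32, 16),
    "C1260": (32, 32),   # dyadic: 16 contexts per processor
}


def total_registers(windows, contexts):
    """Registers in the two-dimensional stack cache under the assumption above."""
    return windows * contexts * UNIQUE_PER_WINDOW


for model, (windows, contexts) in MODELS.items():
    print(model, total_registers(windows, contexts))
```

For the C1230 this gives 32 x 16 x 32 = 16,384 registers, which matches the 16K in the question.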