[comp.arch] register saving on context switch

lamaster@pioneer.UUCP (02/28/87)

In article <5763@amdahl.UUCP> chuck@amdahl.UUCP (Charles Simmons) writes:
>In article <448@cpocd2.UUCP> howard@cpocd2.UUCP (Howard A. Landman) writes:
>>In article <4376@columbia.UUCP> eppstein@tom.columbia.edu (David Eppstein) writes:
>>>[David implies that an interrupt requires a complete context switch
>>>of writing all active windows to memory and reloading them after
>>>handling the interrupt]
>>
>>[Howard points out that for an interrupt, the operating system code
>>that handles the interrupt can promise not to trash active windows,
>>and continue to use the existing stack.  He also points out that
>>context switches between user processes occur relatively infrequently
>>and thus the overhead of writing out all active windows and reading in
>>new windows on a context switch is not important.]
>
>I agree with this, so I'm going to change the subject slightly.  Lately
>I've been thinking about operating systems which would perform a complete
>context switch on every interrupt.  In particular, code for handling
>interrupts would run in a restricted environment (i.e., user mode) where its
>ability to accidentally trash other interrupt handlers, the operating
>system, and device drivers would be minimized.
>
>My biggest problem with this concept is that context switching, even
>on a machine with "only" 16 registers is extremely expensive.
>I'd be interested in hearing about architectures that minimize the
>expense of a context switch.
> 
Actually, if you have a fast, pipelined memory interface, it is not so
expensive after all to save the complete context automatically.  CDC
and Cray have been doing various forms of this for many years.  On the
CDC 6000/7000 and Cyber 70 and 170 machines this is known as "exchange
jump".  The Cyber 205 actually swaps 256 registers in less than 200 
CPU cycles (not to be confused with the longer memory cycles).  It can
do this because of the way the memory interface is structured.  It may
look "expensive", but it simplifies register management so much that it
is actually cheap.  On the Cray-1 type machines, this automatic
exchange jump feature includes only the small register sets, not the
vector registers, because of the expense of saving 8 vector registers.
That reintroduces the complexity of making the operating system kernel
aware of the state of the vector registers and who owns them.
Anyway, automatic context saving is not pie in the
sky.  It works well and probably provides a net performance/price
improvement on machines which are going to have a fast, pipelined
memory interface anyway; many machines intended for engineering/
scientific/floating point intensive use DO, and more SHOULD.
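
As a rough sanity check on the Cyber 205 figure above, the sketch below
works out the memory traffic it implies.  The 64-bit word size is the one
given later in this thread; counting a swap as one store plus one load per
register is an assumption of this sketch, not something stated in the post.

```python
# Back-of-the-envelope bandwidth implied by "256 registers in < 200 cycles".
REGISTERS = 256    # registers swapped (from the post)
MAX_CYCLES = 200   # upper bound on CPU cycles (from the post)
WORD_BITS = 64     # Cyber 205 word size

# Assumption: a swap stores the old value and loads a new one per register.
TRANSFERS_PER_REGISTER = 2

registers_per_cycle = REGISTERS / MAX_CYCLES
words_per_cycle = registers_per_cycle * TRANSFERS_PER_REGISTER
bits_per_cycle = words_per_cycle * WORD_BITS

print(f"at least {registers_per_cycle:.2f} registers swapped per cycle")
print(f"at least {words_per_cycle:.2f} words ({bits_per_cycle:.0f} bits) moved per cycle")
```

These are lower bounds only, since the post gives "less than 200" cycles
and the sketch ignores startup latency.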




  Hugh LaMaster, m/s 233-9,  UUCP {seismo,topaz,lll-crg}!ames!pioneer!lamaster 
  NASA Ames Research Center  ARPA lamaster@ames-pioneer.arpa
  Moffett Field, CA 94035    ARPA lamaster@pioneer.arc.nasa.gov
  Phone:  (415)694-6117      ARPA lamaster@ames.arc.nasa.gov

"In order to promise genuine progress, the acronym RISC should stand 
for REGULAR (not reduced) instruction set computer." - Wirth

("Any opinions expressed herein are solely the responsibility of the
author and do not represent the opinions of NASA or the U.S. Government")

rdt@houxv.UUCP (03/02/87)

> 
> Actually, if you have a fast, pipelined memory interface, it is not so
> expensive after all to save the complete context automatically.  CDC
> and Cray have been doing various forms of this for many years.  
> On the CDC 6000/7000 and Cyber 70 and 170 machines this is known as "exchange
> jump".  The Cyber 205 actually swaps 256 registers in less than 200 
> CPU cycles (not to be confused with the longer memory cycles).  It can
> do this because of the way the memory interface is structured.  It may
> look "expensive", but simplifies register management so much that it
> is actually cheap. It works well and probably provides a net 
> performance/price improvement on machines which are going to have a 
> fast, pipelined memory interface anyway

 	Could anyone share with us how the CPU hardware and operating
	system coordinate their response to a faulted copyback midway
	thru an exchange jump?

	Let 16 registers be copied back to memory (actually stack in
	main memory) and a fault occurs at register copyback in word
	5 of the 16 total. Consider 2 cases of fault:  fault due to a
	page not present (stack overflow) and fault due to a memory
	parity error. 

	I assume the os must immediately respond to the fault by moving
        the registers out to a temporary safe area of memory before 
	switching to the new process. (Recall that part of the old
	process context still resides in the shared registers)

	How does the hardware differentiate this kind of fault from
	non-context switch memory faults when calling the OS fault
	handler? How does the hardware get the base address of
	the temporary safe area? What happens if the copy to the
	safe area faults as well?

	If the context being saved is the contents of a copyback
	stack cache rather than those of a register file, the
	situation gets more interesting since one can not typically
	fault the instruction which wrote the word into the cache
	until cache copyback discovers the problem with main memory.

	Any comments?

Richard Trauben
ATT Information Systems
WE32x00 Processor Development
Holmdel, NJ.

billw@navajo.UUCP (03/03/87)

So has anyone done research to see if it might be worthwhile to
implement a machine with multiple sets of registers (say 16 sets
of 64 registers, with register windows)?  This would cut down on
the overhead of context switching a lot, assuming that not too many
processes were "active" at once.  It would probably add a lot to
the complexity of the system though...

BillW

lamaster@pioneer.UUCP (03/04/87)

In article <542@houxv.UUCP> rdt@houxv.UUCP (R.TRAUBEN) writes:
>
> 	Could anyone share with us how the CPU hardware and operating
>	system coordinate their response to a faulted copyback midway
>	thru an exchange jump?
>
>	Let 16 registers be copied back to memory (actually stack in
>	main memory) and a fault occurs at register copyback in word
>	5 of the 16 total. Consider 2 cases of fault:  fault due to a
>	page not present (stack overflow) and fault due to a memory
>	parity error. 
>
>	I assume the os must immediately respond to the fault by moving
>        the registers out to a temporary safe area of memory before 
>	switching to the new process. (Recall that part of the old
>	process context still resides in the shared registers)
>
>	How does the hardware differentiate this kind of fault from
>	non-context switch memory faults when calling the OS fault
>	handler? How does the hardware get the base address of
>	the temporary safe area? What happens if the copy to the
>	safe area faults as well?
>
>	If the context being saved is the contents of a copyback
>	stack cache rather than those of a register file, the
>	situation gets more interesting since one can not typically
>	fault the instruction which wrote the word into the cache
>	until cache copyback discovers the problem with main memory.
>
>	Any comments?
>
>Richard Trauben
>ATT Information Systems
>WE32x00 Processor Development
>Holmdel, NJ.
>
>
>	

The best answer to this question is to refer you to the Cyber 205
Hardware Reference Manual, CDC #60256020.  However, there are
some things that could be said briefly:

You are certainly correct that the hardware must know what
physical memory to swap the registers to if a page fault occurs.

I glossed over the fact that the Cyber 205 and the lower Cybers
have somewhat different structures.  Because memory is always
resident on a lower Cyber (no page faults), any memory address
can be used for the swap.  The lower Cyber exchange jump need not be to the
operating system kernel, although there is a tree-like
structure.  On the Cyber 205, the exchange jump is referred to
as "exit force" and it is always between the operating system
kernel and another process.  On the lower Cybers, a "job 
supervisor" could exchange to/from a job without going through
the kernel.  On the Cyber 205, to get to the equivalent of a
job supervisor, first the process exits to the kernel, and the
kernel then exits to supervisor.  

On both types of machines, there are a number of hidden
registers in addition to the visible registers.  These registers,
when written out to memory, are referred to as the "exchange
package".  One of these registers holds the physical memory 
address of the exchange package that is to be read in if an
exchange is done.  So, to make a long story short, in both 
cases the registers are exchanged with a physical memory
location.  Therefore, there is no possibility of a page fault.
[If a fatal SECDED fault occurred (one of the other questions)
then the CPU would stop on a fatal error (whether an exchange
was taking place or not).  These only happen about once a year
and are not a problem.]  In the case of the Cyber 205, it is
the kernel's job to manage the exchange packages for all the
processes in the system.  Since it doesn't take very long to
exchange to/from the kernel, the overhead is small.
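
A minimal sketch of the exchange idea described above, as a toy model
rather than the real Cyber hardware.  The names (Package, Cpu, exchange)
and the addresses are made up for illustration.  It assumes the outgoing
state is stored at the same physical address the incoming package is read
from, and that the exchange-address register itself is not swapped.

```python
from dataclasses import dataclass


@dataclass
class Package:
    """An exchange package: visible registers plus hidden state."""
    visible: list
    hidden: dict


@dataclass
class Cpu:
    state: Package
    exchange_addr: int  # physical address of the package to read in


def exchange(cpu, physical_memory):
    """Swap the running state with the package at cpu.exchange_addr.

    The address is physical, so this model has no page-fault path at all.
    """
    addr = cpu.exchange_addr
    incoming = physical_memory[addr]
    physical_memory[addr] = cpu.state  # save the outgoing context
    cpu.state = incoming               # load the incoming context


kernel = Package(visible=[0, 0, 0, 0], hidden={"pc": 0o100})
user = Package(visible=[1, 2, 3, 4], hidden={"pc": 0o2000})

physical_memory = {0o4000: user}       # kernel-managed package area
cpu = Cpu(state=kernel, exchange_addr=0o4000)

exchange(cpu, physical_memory)         # kernel -> user process
print(cpu.state.hidden["pc"] == 0o2000)
```

In this model a second exchange() call brings the kernel back, which is
the kernel-to-process-and-back pattern described for the Cyber 205.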

On the Cyber 205, there is an additional user state instruction
which exchanges registers (though not the hidden registers) 
which runs just as fast.  If it encounters a page fault, the
instruction is continued (not restarted) when the task is
restarted.  The Cyber 205 has instruction continuation for
many other instructions as well, since it has memory to memory
vector instructions which may cause page faults.  Although much of
the machine is "hardwired", instruction continuation is "all
handled in the microcode".
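
To illustrate why continuation (rather than restart) matters for a
register swap, here is a small sketch.  It is a toy model with invented
names (PagedMemory, swap_registers) and says nothing about how the 205
microcode actually does it.  The point is that once some words have been
swapped, starting over from word 0 would swap them back.

```python
class PageFault(Exception):
    """Raised when an access touches a non-resident page."""


class PagedMemory:
    def __init__(self, words, page_words, resident_pages):
        self.words = list(words)
        self.page_words = page_words
        self.resident = set(resident_pages)

    def _check(self, addr):
        if addr // self.page_words not in self.resident:
            raise PageFault(addr)

    def read(self, addr):
        self._check(addr)
        return self.words[addr]

    def write(self, addr, value):
        self._check(addr)
        self.words[addr] = value


def swap_registers(regs, mem, base, start=0):
    """Swap regs[i] with memory word base+i, beginning at index `start`.

    Returns the index of the first word NOT yet swapped.  A return value
    below len(regs) means a page fault stopped the swap at that word.
    """
    i = start
    try:
        while i < len(regs):
            incoming = mem.read(base + i)   # faults before word i changes
            mem.write(base + i, regs[i])
            regs[i] = incoming
            i += 1
    except PageFault:
        pass
    return i


regs = list(range(8))
mem = PagedMemory(words=range(100, 108), page_words=4, resident_pages={0})

first_stop = swap_registers(regs, mem, base=0)   # stops at the page boundary
mem.resident.add(1)                              # the "OS" pages in page 1
done = swap_registers(regs, mem, base=0, start=first_stop)  # continue
```

Calling swap_registers again with start=0 instead of start=first_stop
would undo the first four swaps, which is why the saved progress is
part of the task state here.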

There is no data cache.  There is an instruction cache.  The
256 registers were a conscious design decision.  In practice,
the 256 registers are just about the right size to make sure that
for almost all Fortran modules, local scalar variables are already
in registers.  And since the vector instructions are memory to
memory, and run through very large amounts of memory very fast, it
would be counterproductive to read vector data into a cache.
(With a somewhat different organization, the Alliant mini-super 
does this, I seem to recall).

I did not mean to mislead anyone into thinking that the Cyber 205
has a message-passing structure.  It simply has a very fast way
to load and unload its large register file, and uses it to 
produce a very fast context switch between a process and the kernel.
The kernel does all the message passing (actually, that is almost
all it does.  It also schedules.)  With a high performance memory
interface, it is not catastrophically expensive to save registers.

  Hugh LaMaster, m/s 233-9,  UUCP {seismo,topaz,lll-crg,ucbvax}!
  NASA Ames Research Center                ames!pioneer!lamaster
  Moffett Field, CA 94035    ARPA lamaster@ames-pioneer.arpa
  Phone:  (415)694-6117      ARPA lamaster@pioneer.arc.nasa.gov

devoz@encore.UUCP (03/04/87)

[flames are not encouraged]

The "PRIME"ates out  there on  the net  should have  something to say
about multiple register sets.  They have been using multiple register
sets for years.  

They call them "current register sets" if memory serves me correctly.

						devoz@encore

"Are you all part of tonights dream or what?"

{linus,talcott,decvax,ihnp4,allegra,necis,compass}!encore!devoz

ron@brl-sem.UUCP (03/04/87)

In article <542@houxv.UUCP>, rdt@houxv.UUCP (R.TRAUBEN) writes:
 >  	Could anyone share with us how the CPU hardware and operating
 > 	system coordinate their response to a faulted copyback midway
 > 	thru an exchange jump?
 > 
 > 	Consider 2 cases of fault:  fault due to a
 > 	page not present (stack overflow) and fault due to a memory
 > 	parity error. 

Easy, first, the machines don't page.
Second, use better error correction, parity is for farmers.

ian@loral.UUCP (03/05/87)

In article <1430@navajo.STANFORD.EDU> billw@navajo.STANFORD.EDU (William E. Westfield) writes:
>
>So has anyone done research to see if it might be worthwhile to
>implement a machine with multiple sets of registers (say 16 sets
>of 64 registers, with register windows)?  This would cut down on
>the overhead of context switching a lot, assuming that not too many
>processes were "active" at once.  It would probably add a lot to
>the complexity of the system though...
>
>BillW


  The Xerox Dorado has multiple register sets and as a result has
  a very fast context switch.  For information on the Dorado see

    "The Dorado: A High-Performance Personal Computer.  Three Papers"
    Xerox CSL-81-1 January 1981, Palo Alto Research Center
    3333 Coyote Hill Rd., Palo Alto, CA 94304

  These papers are well worth reading.  To quote from the first paper
  "A Processor for a High-Performance Personal Computer" by Butler W.
  Lampson and Kenneth A. Pier, page 7

    5.3 Task Specific State

    In order to allow the immediate task switching described above, the
    processor must be able to save and restore state within one microcycle.
    This is accomplished by keeping the vital state information throughout
    the processor not in a single rank of registers but in task specific
    registers.  These are actually implemented with high speed memory that
    is addressed by a task number.  Examples of task specific registers are
    the microcode program counter, the branch condition register, the
    microcode subroutine link register, the memory data register, and a
    temporary storage register for each task.  The number of the task which
    will execute in the next microcycle is broadcast throughout the
    processor and used to address the task specific registers.  Thus, data
    can be fetched from high speed task specific memories and be available
    for use in the next cycle.
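
  The scheme in the quoted passage can be sketched as below.  This is a
  toy model only: the class and field names are invented, only three of
  the per-task registers are modeled, and the task count of 16 is an
  assumption of the sketch.  The point is that a task switch changes an
  index and copies nothing.

```python
class TaskSwitchedProcessor:
    """Per-task state lives in small memories indexed by task number."""

    def __init__(self, num_tasks):
        self.pc = [0] * num_tasks     # microcode program counter per task
        self.link = [0] * num_tasks   # subroutine link register per task
        self.temp = [0] * num_tasks   # temporary storage register per task
        self.current_task = 0

    def run_cycle(self, next_task):
        # Execute one microinstruction for the current task ...
        self.pc[self.current_task] += 1
        # ... then select the task for the next cycle.  Nothing is saved
        # or restored; the task number simply addresses different state.
        self.current_task = next_task


cpu = TaskSwitchedProcessor(num_tasks=16)  # 16 is an assumed value
for next_task in (5, 0, 5, 5, 0):
    cpu.run_cycle(next_task)

print(cpu.pc[0], cpu.pc[5])
```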

		     Ian Kaplan
		     Loral Dataflow Group
		     Loral Instrumentation
	     USENET: {ucbvax,decvax,ihnp4}!sdcsvax!loral!ian
	     ARPA:   sdcc6!loral!ian@UCSD
	     USPS:   8401 Aero Dr. San Diego, CA 92123

lamaster@pioneer.UUCP (03/05/87)

In article <670@brl-sem.ARPA> ron@brl-sem.ARPA (Ron Natalie <ron>) writes:
>In article <542@houxv.UUCP>, rdt@houxv.UUCP (R.TRAUBEN) writes:
> >  	Could anyone share with us how the CPU hardware and operating
> > 	system coordinate their response to a faulted copyback midway
> > 	thru an exchange jump?
> > 
>Easy, first, the machines don't page.
>Second, use better error correction, parity is for farmers.

The CDC 6000's (e.g. 6600), 7600, Cyber 70's, and 170's don't page.

The CDC Cyber 205 pages.

The 205 has virtual memory with two page sizes: small pages of 512, 2048,
or 8192 words (the small page size is selectable) and large pages of 64K
words.  Words are 64 bits.  Except for the memory-to-memory vector
instructions, which are microprogrammed, it is generally a somewhat RISCy
machine.
The Cyber 205 is a completely different architecture from the
older Cybers.
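
For readers who think in bytes, the page sizes above convert as follows.
This assumes 8-bit bytes and reads "64K words" as 65536 words.

```python
WORD_BYTES = 64 // 8                  # 64-bit words, assuming 8-bit bytes

SMALL_PAGE_WORDS = (512, 2048, 8192)  # selectable small page sizes
LARGE_PAGE_WORDS = 64 * 1024          # "64K words", read as 65536

page_kib = {
    words: words * WORD_BYTES // 1024
    for words in SMALL_PAGE_WORDS + (LARGE_PAGE_WORDS,)
}

for words, kib in page_kib.items():
    print(f"{words:6d} words = {kib:4d} KiB")
```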

  Hugh LaMaster, m/s 233-9,  UUCP {seismo,topaz,lll-crg,ucbvax}!
  NASA Ames Research Center                ames!pioneer!lamaster
  Moffett Field, CA 94035    ARPA lamaster@ames-pioneer.arpa
  Phone:  (415)694-6117      ARPA lamaster@pioneer.arc.nasa.gov

roger@celtics.UUCP (03/13/87)

In article <1430@navajo.STANFORD.EDU> billw@navajo.STANFORD.EDU (William E. Westfield) writes:
>
>So has anyone done research to see if it might be worthwhile to
>implement a machine with multiple sets of registers (say 16 sets
>of 64 registers, with register windows)?

The Celerity Accel processor in the C1200 series has implemented exactly
this approach: a two-dimensional register stack cache, one dimension being
the sliding-window implementation, the second dimension being an index by
context ID, allowing bank-switching and fast context switches between the
most active processes.

Each window is composed of 16 parameter registers, 16 local variable
registers, and a window into the next 16 registers (a callee's parameter
registers).  Registers are configured as follows on each model:

Model         Number of windows (depth)     Number of contexts (width)
=====         =========================     ==========================

C1200                  16                             8
C1230                  32                            16
C1260 Dyadic           32                            32 (16 per processor)
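
One way to picture the two-dimensional layout is as an address
calculation, sketched below.  The function name, the linear layout, and
the wrap-around at the deepest window are all assumptions made for
illustration; real hardware would presumably spill to memory instead of
wrapping, and nothing here is taken from Celerity documentation.

```python
PARAMS = 16
LOCALS = 16
OWNED_PER_WINDOW = PARAMS + LOCALS             # registers each window owns
VISIBLE_PER_WINDOW = OWNED_PER_WINDOW + PARAMS  # plus the callee's parameters


def physical_index(context, window, reg, depth):
    """Map (context ID, window number, register number) to a register cell.

    reg 0..15 are parameters, 16..31 locals, and 32..47 alias the next
    window's parameters 0..15.
    """
    if context < 0:
        raise ValueError("context must be non-negative")
    if not 0 <= window < depth:
        raise ValueError("window out of range")
    if not 0 <= reg < VISIBLE_PER_WINDOW:
        raise ValueError("register out of range")
    if reg >= OWNED_PER_WINDOW:
        window = (window + 1) % depth   # assumption: wrap instead of spill
        reg -= OWNED_PER_WINDOW
    return (context * depth + window) * OWNED_PER_WINDOW + reg


# A caller's outgoing register 32 is the callee's parameter register 0.
caller_view = physical_index(context=0, window=0, reg=32, depth=32)
callee_view = physical_index(context=0, window=1, reg=0, depth=32)
print(caller_view == callee_view)
```

Switching context in this model only changes the context argument, so no
registers move; that is the bank-switching effect described above.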

The Extended Arithmetic Unit tightly coupled co-processor also has its
own stack cache of 64-bit registers: 15 frames of 15 64-bit registers
on the C1200, 8 banks of 15 frames of 15 64-bit registers on the C1230,
and 8 banks of 15 frames of 15 on the C1260.  (The more observant of you
might have guessed that the C1260 is a symmetrical dual processor based
on the C1230...)

Use of static, on-chip registers to augment the register stack caches has
been discussed earlier in this group by JJ Whelan of Celerity's engineering
group, who is a more informed speaker than I (but I'm in Sales Support, so
I just love to shoot my mouth off... :-)

-- 
 ///==\\   (No disclaimer - nobody's listening anyway.)
///        Roger B.A. Klorese, CELERITY (Northeast Area)
\\\        40 Speen St., Framingham, MA 01701  +1 617 872-1552
 \\\==//   celtics!roger@seismo.CSS.GOV - seismo!celtics!roger

faustus@ucbcad.berkeley.edu (Wayne A. Christopher) (03/16/87)

In article <1477@celtics.UUCP>, roger@celtics.UUCP (Roger Klorese) writes:
> Each window is composed of 16 parameter registers, 16 local variable
> registers, and a window into the next 16 registers (a callee's parameter
> registers).  Registers are configured as follows on each model:
> 
> Model         Number of windows (depth)     Number of contexts (width)
> =====         =========================     ==========================
> C1230                  32                            16

The processor has 16K registers???  On chip?  Is it implemented as a RAM?
Is that fast enough?
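
The 16K figure follows from the quoted numbers if each window contributes
32 registers of its own (16 parameter + 16 local) and the third group of
16 is shared with the next window.  That reading is an assumption; a
quick check:

```python
def total_registers(windows, contexts, owned_per_window=16 + 16):
    """Register cells needed if every window owns 32 registers."""
    return windows * contexts * owned_per_window


c1230 = total_registers(windows=32, contexts=16)
print(c1230, "registers =", c1230 // 1024, "K")
```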

	Wayne