[comp.arch] Context switching on RISC chips

ed@iitmax.IIT.EDU (Ed Federmeyer) (12/31/89)

One of the things that seems to characterize RISC chips is the relatively
large number of registers.  This makes me wonder what happens during a
context switch.  After all, moving 256 (or more) registers to memory, and
then another 256 (or more!) back in for each context switch seems like an
awful lot of overhead.  I can think of a few ways around this:

1) Do nothing special... Suffer
2) Have each register "tagged" like a cache, so only the "dirty" registers
   need to be moved out.  You'd still have to load in all the old ones.
3) Have a few register "sets".  I.e., a context switch really moves a pointer
   to a bank of registers (of which there are several on-chip).
4) Like 3, but only have 2 sets.  While context 2 is processing, drain out
   context 1's set so it's ready by the next switch.  Since a RISC chip
   seems to execute 1 instruction in 1 cycle, I can't see that there are
   a lot of extra bus cycles. (Unlike in a CISC, where you might have hundreds
   of clock cycles free while the processor executes an instruction in which
   the bus is not being used)  Unless of course you have a second bus
   going to memory dedicated to just shuttling registers in and out.
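
To make the difference between 1) and 3) concrete, here's a rough C sketch of
what the two approaches amount to from the software side.  The register count,
bank count, and names are all made up for illustration -- this isn't modeled
on any real chip:

#define NREGS  256   /* hypothetical register-file size       */
#define NBANKS 4     /* hypothetical number of on-chip banks  */

struct context {
    unsigned long regs[NREGS];     /* memory image of one context's registers */
};

/* Option 1: do nothing special -- copy the whole register file out and back. */
void switch_naive(struct context *old, struct context *next,
                  volatile unsigned long hw_regs[NREGS])
{
    int i;
    for (i = 0; i < NREGS; i++)    /* 256 stores ...                          */
        old->regs[i] = hw_regs[i];
    for (i = 0; i < NREGS; i++)    /* ... and 256 loads                       */
        hw_regs[i] = next->regs[i];
}

/* Option 3: several register banks on chip.  As long as each runnable
 * context owns a bank, a "switch" is just a change of the bank pointer
 * (in hardware, a write to one control register).                      */
static int current_bank;
void switch_banked(int new_bank)
{
    current_bank = new_bank % NBANKS;
}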

Are any of these sorts of techniques being used in real RISC chips?  Or
am I just overestimating the overhead of RISC context switching?  If I get
some Email responses to this, I'll post a summary.
-- 
+-------------------------------------+--------------------------------------+
|  Ed Federmeyer                      |    Internet:  ed@iitMax.iit.edu      |
|      "Unauthorized access is        |    Bitnet:    sysed@iitVax           |
|         strictly unauthorized."     |    Office:    (312) 567-5981         |
+-------------------------------------+--------------------------------------+

hankd@pur-ee.UUCP (Hank Dietz) (01/02/90)

In article <3167@iitmax.IIT.EDU> ed@iitmax.iit.edu (Ed Federmeyer) writes:
>One of the things that seems to characterize RISC chips is the relatively
>large number of registers.  This makes me wonder what happens during a
>context switch.  After all, moving 256 (or more) registers to memory, and
>then another 256 (or more!) back in for each context switch seems like an
>awful lot of overhead.  I can think of a few ways around this:
>
>1) Do nothing special... Suffer
>2) Have each register "tagged" like a cache, so only the "dirty" registers
>   need to be moved out.  You'd still have to load in all the old ones.
>3) Have a few register "sets".  I.e., a context switch really moves a pointer
>   to a bank of registers (of which there are several on-chip).
>4) Like 3, but only have 2 sets.  While context 2 is processing, drain out
...

First of all, it isn't just the register file which has gotten big -- it's
the complete localized process state.  This includes registers, caches, even
process-specific page tables and disk buffers.  Second, it has NOTHING TO DO
WITH BEING RISC -- chips are fast, talking with other chips is slow, talking
with other boards is even slower, so ANY high-performance architecture
naturally tends toward a larger, longer lived, localized process state.

As for your list of choices, you left out two favorites:

5) Initiate context switch before it is needed...  this looks a lot like the
   incremental checkpointing done by fault-tolerance folk.

6) Since nobody builds uniprocessors anymore (;-), NEVER interrupt a
   processor:  simply let another processor which happens to be free at that
   moment (or the next processor to become free) handle the interrupt.

Number 6 is the really interesting one.

Suppose you have an MIMD architecture with perhaps 64 processors and
typically only about 5-10 programs running simultaneously...  you should use
execution-time changes in the parallelism width of each program to implement
priorities.  The OS scheduler (which is partly hardware) would simply ensure
that there is always at least one processor free to service any
time-critical interrupt which might arrive (e.g., "incoming ICBMs
detected").  Less critical interrupts (e.g., "Joe Luser types the letter Q")
can simply be buffered awaiting a free processor...  one will become free as
soon as a program either terminates or reaches a point at which it can change
parallelism-width.  With "enough" processors, this stochastic wait for a free
processor can be made arbitrarily short....
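
In rough C terms the dispatch policy might look like the sketch below; every
name and number here is invented just to illustrate the idea, not taken from
any real machine:

#include <stdbool.h>

#define NPROC 64                       /* processors in the MIMD box          */
#define QLEN  1024                     /* buffered, less-critical interrupts  */

struct irq { int source; };

extern void run_handler_on(int proc, struct irq *i);  /* hypothetical primitive */

static bool       proc_free[NPROC];    /* which processors are idle right now */
static struct irq pending[QLEN];
static int        head, tail;

/* Called when an interrupt arrives or a processor frees up: hand the oldest
 * buffered interrupt to a free processor rather than interrupting a busy one.
 * The scheduler elsewhere keeps at least one processor free for anything
 * time-critical.                                                             */
void dispatch(void)
{
    while (head != tail) {
        int p;
        for (p = 0; p < NPROC && !proc_free[p]; p++)
            ;
        if (p == NPROC)
            return;                    /* nobody free -- leave it buffered    */
        proc_free[p] = false;
        run_handler_on(p, &pending[head]);
        head = (head + 1) % QLEN;
    }
}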

BTW, you might say that processes can require context switches for
synchronous events (e.g., loading a value from memory which is far away),
but IMHO the use of a context switch is usually overkill in such cases
(sorry, Burton ;-).  This is because, with the right architecture,
synchronous delay events can be hidden using static (compile-time)
scheduling (e.g., code motions to hide delayed loads).
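
A contrived C-level illustration of that kind of code motion (a real compiler
does this at the instruction level, but the idea is the same):

/* x[i] lives in distant memory, so the load has a multi-cycle latency.
 * A statically scheduling compiler hoists the load above independent
 * work so the latency is hidden, instead of stalling -- or switching
 * contexts -- at the point of use.                                     */

long unscheduled(long *x, long a, long b, int i)
{
    long t = a * b + 3;        /* independent work                      */
    return t + x[i];           /* load issued right where it's needed   */
}

long scheduled(long *x, long a, long b, int i)
{
    long v = x[i];             /* load issued early ...                 */
    long t = a * b + 3;        /* ... its delay hidden behind this work */
    return t + v;
}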

Besides that, it doesn't bother me very much if a few of many processors are
needlessly idle.  Why?  Because adding context-switching ability adds a
surprisingly large degree of circuit complexity (as you have implied above);
my wild guesstimate is that it would be at least 30%.  Suppose I have the
choice between having 100 context-switchable processors or 130
otherwise-equivalent non-interruptable processors...  about 30
non-interruptable processors would have to be idle before I'd get worried.
Oh yes...  the non-interruptable processors will probably also have fewer
gate delays per basic operation -- they'll run with a faster clock.

						-hankd@ecn.purdue.edu

tim@nucleus.amd.com (Tim Olson) (01/02/90)

In article <3167@iitmax.IIT.EDU> ed@iitmax.iit.edu (Ed Federmeyer) writes:
| One of the things that seems to characterize RISC chips is the relatively
| large number of registers.  This makes me wonder what happens during a
| context switch.  After all, moving 256 (or more) registers to memory, and
| then another 256 (or more!) back in for each context switch seems like an
| awful lot of overhead.  I can think of a few ways around this:
| 
| 1) Do nothing special... Suffer
|
| 2) Have each register "tagged" like a cache, so only the "dirty" registers
|    need to be moved out.  You'd still have to load in all the old ones.
| 3) Have a few register "sets".  I.e., a context switch really moves a pointer
|    to a bank of registers (of which there are several on-chip).
| 4) Like 3, but only have 2 sets.  While context 2 is processing, drain out
|    context 1's set so it's ready by the next switch.  Since a RISC chip
|    seems to execute 1 instruction in 1 cycle, I can't see that there are
|    a lot of extra bus cycles. (Unlike in a CISC, where you might have hundreds
|    of clock cycles free while the processor executes an instruction in which
|    the bus is not being used)  Unless of course you have a second bus
|    going to memory dedicated to just shuttling registers in and out.

Here are 3 methods that can be used to reduce context switch time
on the Am29000 RISC processor:

  1) Register Banking (like your #3).  Because the large local
     register file can be addressed offset from a stack-pointer value,
     it can be divided into separate register banks.  A context
     switch then consists of saving and restoring a few special
     registers (to the on-chip global registers) and changing the
     stack pointer to use the new register bank.  Register banks can
     be protected from user-mode access in groups of 16 registers.
     A full context switch takes about 1 microsecond at 25MHz.
     This method is supported in the Am29000 hardware, but current
     software tools do not make use of it.

  2) Reduced Stack Cache Size.  The standard Am29000 calling
     convention uses the local register file as a runtime stack cache.
     The maximum number of registers used in the cache can be
     regulated by a user-defined register spill/fill trap handler
     which is invoked whenever a portion of the cached stack must be
     saved/restored to memory.  The number of registers used can range
     from ~32 to 128, allowing the system designer to make the correct
     trade-off between context-switch time and individual task
     performance.  Using this technique, context-switch times can be
     varied from ~9 microseconds to ~17 microseconds with 2-cycle
     first-access, single-cycle burst memory at 25MHz.

  3) Saving/Restoring Live Registers Only (like your #2).  In systems
     where security (covert channels) is not an issue (as in most
     real-time embedded control systems), the context-switch time can
     be reduced by saving only the live local registers found between
     the current stack pointer and the top of the stack cache.  This
     is typically about half of the total local registers.  In
     addition, only the current procedure frame of the new context
     need be restored; the rest of the stack cache will be "faulted
     in" if necessary.  This results in context switch times of ~10
     microseconds using the full 128-register stack cache.
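
For the curious, method 3 boils down to something like the following C sketch.
The names and conventions here (sp_index, CACHE_SIZE, the direction the stack
grows) are invented for illustration and are not the actual Am29000 registers
or calling convention:

#define CACHE_SIZE 128                 /* local registers used as a stack cache */

struct task_save {
    unsigned long live_regs[CACHE_SIZE];
    int           nlive;               /* how many were live at switch time     */
};

/* Save only the registers between the current register stack pointer and the
 * top of the stack cache -- typically about half of them.  The dead ones hold
 * no useful data, which is exactly why this is a covert channel if security
 * matters: they are left holding the previous task's values.                  */
void save_live_only(struct task_save *t,
                    volatile unsigned long lregs[CACHE_SIZE], int sp_index)
{
    int i;
    t->nlive = CACHE_SIZE - sp_index;
    for (i = 0; i < t->nlive; i++)
        t->live_regs[i] = lregs[sp_index + i];
}

/* On restore, only the current procedure frame need be copied back; the rest
 * of the stack cache is "faulted in" later by the fill trap handler.          */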


	-- Tim Olson
	Advanced Micro Devices
	(tim@amd.com)

mat@uts.amdahl.com (Mike Taylor) (01/03/90)

In article <14007@pur-ee.UUCP>, hankd@pur-ee.UUCP (Hank Dietz) writes:
> 
> 6) Since nobody builds uniprocessors anymore (;-), NEVER interrupt a
>    processor:  simply let another processor which happens to be free at that
>    moment (or the next processor to become free) handle the interrupt.
> 
> Number 6 is the really interesting one.
> 
A slight variation on this is used by some mainframe operating systems - in
a multiprocessor only one processor may be enabled for I/O interruptions. The
OS will enable another processor if the interrupt load on the enabled processor
reaches a certain threshold (and so on).  Of course, any processor can start 
an I/O operation. A nice side-effect is that the enabled processor can test
for pending interrupts when it finishes interrupt processing and handle the
next one in the queue without bothering to do a process switch. This batching effect
increases performance further.
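
In outline (the primitives below are invented, not any particular OS's):

extern void service_one_interrupt(void);    /* handle the head of the queue    */
extern int  interrupt_pending(void);        /* test only; no process switch    */
extern void resume_interrupted_work(void);  /* the single, batched switch back */

/* Interrupt entry on the one processor enabled for I/O interruptions:
 * drain everything that has queued up before paying for a process switch. */
void io_interrupt_entry(void)
{
    do {
        service_one_interrupt();
    } while (interrupt_pending());
    resume_interrupted_work();
}
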
-- 
Mike Taylor                               ...!{hplabs,amdcad,sun}!amdahl!mat

[ This may not reflect my opinion, let alone anyone else's.  ]

mash@mips.COM (John Mashey) (01/03/90)

In article <3167@iitmax.IIT.EDU> ed@iitmax.iit.edu (Ed Federmeyer) writes:
>One of the things that seems to characterize RISC chips is the relatively
>large number of registers.  This makes me wonder what happens during a
>context switch.  After all, moving 256 (or more) registers to memory, and
>then another 256 (or more!) back in for each context switch seems like an
>awful lot of overhead.  I can think of a few ways around this:

It is good to avoid over-generalizing:
	a) Many RISCs do not have as many registers, i.e. in approximate
	numbers of 32-bit registers, ignoring system registers:
		88K: 32
		MIPS & HP PA: 64
	b) The registers are only one part of the state / cost for context
	switching.  Others include:
		system and internal registers (consider the amount of state
			pushed on stack of 68K on some exceptions)
		MMU/TLBs [some designs require state save if more than
			N processes; others pay the cost incrementally,
			but some cost is paid, sometime]
		cache structure, especially if flushes are needed, as in i860
	c) In fact, in OS's like UNIX, moderate changes in numbers of registers
		have fairly minimal effects on context-switching time,
		compared with the above issues and, even more important, the
		OS structure and its design tradeoffs.  In particular,
		the definition of "Context switch" is pretty fuzzy, because
		it really matters how you measure it.  Is it:
			time to save/restore registers?
			time to save/restore entire state?
			time to do all of that, plus the OS scheduling code?
			measured by switching between 2 processes, or
				between N?
	d) Amount of state changed is likely to be more important in such
	operating systems as:
		1) Switching machine OSs, which have large numbers of
		processes that execute for fairly short periods of time.
		2) Real-time operating systems,  especially at the "hard"
		edge, where guaranteed predictability is required.
		(Mechanisms that increase throughput for some things often
		increase the variability.)
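
For what it's worth, one common way to put a number on the "2 processes, whole
state plus scheduler" case is a pipe ping-pong between two processes on an
otherwise idle uniprocessor -- roughly two switches per round trip, pipe
overhead included.  This is a generic sketch, not how any of the figures below
were obtained:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/time.h>

#define ROUNDS 100000L

int main(void)
{
    int p2c[2], c2p[2];
    char tok = 'x';
    struct timeval t0, t1;
    long i;
    double usec;

    if (pipe(p2c) < 0 || pipe(c2p) < 0) { perror("pipe"); exit(1); }

    if (fork() == 0) {                       /* child: echo the token back  */
        for (i = 0; i < ROUNDS; i++) {
            read(p2c[0], &tok, 1);
            write(c2p[1], &tok, 1);
        }
        exit(0);
    }

    gettimeofday(&t0, NULL);
    for (i = 0; i < ROUNDS; i++) {           /* parent: send, then wait     */
        write(p2c[1], &tok, 1);
        read(c2p[0], &tok, 1);
    }
    gettimeofday(&t1, NULL);

    usec = (t1.tv_sec - t0.tv_sec) * 1e6 + (t1.tv_usec - t0.tv_usec);
    printf("%.2f microseconds per switch\n", usec / (2.0 * ROUNDS));
    return 0;
}
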
Given all of this, it's sometimes pretty hard to make comparisons, except
by benchmarking an entire hardware+operating system combination, ideally
with the same OS.
I have a tiny piece of data, from which large conclusions
should not be drawn, but which is useful.  JMI Software Consultants has a
fairly portable real-time kernel called C Executive (TM), and they publish
some numbers on different environments.  Here are times in microseconds for
context-switch, version 2.3, published 10/89:

CPU		R2000		68020		29000		SPARC
System		M/120		FORCE		AMD		Sun-4/2xx
MHz		16.7		25		20		16.7
C Compiler	MIPS		Whitesmiths	MetaWare	Sun

Cntxt Switch,
microseconds	10		17		29		20
-------------
Now, that's all that I know.  I don't know what the board designs are for the
68020 and 29000, and I have no idea of the internal structure of the C 
Executive and how it matches the different chips; I have no idea how they
measure context-switching, just that it's consistent.  It is pretty clear
that there is at least a gross correlation with number of registers here,
but that's about all.
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

shankar@SRC.Honeywell.COM (Subash Shankar) (01/03/90)

In article <28573@amdcad.AMD.COM> tim@amd.com (Tim Olson) writes:

#  3) Saving/Restoring Live Registers Only (like your #2).  In systems
#     where security (covert channels) is not an issue (as in most
#     real-time embedded control systems), the context-switch time can
#     be reduced by saving only the live local registers found between
#     the current stack pointer and the top of the stack cache.  

Why does security affect things here?
---
Subash Shankar             Honeywell Systems & Research Center
voice: (612) 782 7558      US Snail: 3660 Technology Dr., Minneapolis, MN 55418
shankar@src.honeywell.com  srcsip!shankar

honig@ics.uci.edu (David A. Honig) (01/03/90)

In article <52104@srcsip.UUCP> shankar@src.honeywell.com (Subash Shankar) writes:
>In article <28573@amdcad.AMD.COM> tim@amd.com (Tim Olson) writes:
>
>#  3) Saving/Restoring Live Registers Only (like your #2).  In systems
>#     where security (covert channels) is not an issue (as in most
>#     real-time embedded control systems), the context-switch time can
>#     be reduced by saving only the live local registers found between
>#     the current stack pointer and the top of the stack cache.  
>
>Why does security affect things here?
>Subash Shankar             Honeywell Systems & Research Center

Isn't it obvious?  The contents of registers, if not overwritten, provide a
'hidden' communication path between processes.  If you worry about
that, then your context switches will have to do more work.

And I don't do either architecture or security professionally!

--
David A. Honig		
 "As these houses are built, the lions continue to wander around those areas,
 and they bump into a poodle or a German shepherd. That causes some trouble."
	---Prof. R. Barrett, UCB

melvin@ucbarpa.Berkeley.EDU (Steve Melvin) (01/03/90)

In article <14007@pur-ee.UUCP> hankd@pur-ee.UUCP (Hank Dietz) writes:
>First of all, it isn't just the register file which has gotten big -- it's
>the complete localized process state.  This includes registers, caches, even
>process-specific page tables and disk buffers.  Second, it has NOTHING TO DO
>WITH BEING RISC -- chips are fast, talking with other chips is slow, talking
>with other boards is even slower, so ANY high-performance architecture
>naturally tends toward a larger, longer lived, localized process state.
>

Your point that the localized process state includes more than just registers
is well taken, but I'd like to take issue with your second statement.  There
are indeed architectural features which affect the size of the process state,
independent of the speed differential between the processor and the rest
of the world.  The key is that temporary results which are computed and used
within a basic block need not be stored in "named" registers.  That is, they
may be internal registers but NOT part of the process state if, for example,
context switches are never initiated except on basic block boundaries
(interrupts can either have long latencies or work can be discarded).

Of course, how much the state can be reduced while preserving the advantages
of the same number of internal registers depends on the application and the
compiler.  The point is that by increasing the size of the unit of work which
the processor considers "atomic" (I call this an Execution Atomic Unit),
fewer *architectural* registers are required.  The main reason one would want
to do this is to allow more parallelism to be exploited, but that's
another issue.
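
A tiny illustration of the point about temporaries (hypothetical, of course):

/* Inside this basic block the compiler needs two temporaries, but both are
 * dead by the end of the block; only r outlives it.  If the processor is
 * never interrupted mid-block, t1 and t2 can live in unnamed internal
 * registers that are not part of the saved process state.                 */
long r, a, b;
void f(void) { long t1 = a + b, t2 = a - b; r = t1 * t2; }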

>
>BTW, you might say that processes can require context switches for
>synchronous events (e.g., loading a value from memory which is far away),
>but IMHO the use of a context switch is usually overkill in such cases
>(sorry, Burton ;-).  This is because, with the right architecture,
>synchronous delay events can be hidden using static (compile-time)
>scheduling (e.g., code motions to hide delayed loads).
>

Well, first of all, the use of the term "context switch" is a little
misleading as applied to the HEP-1.  That machine allows the context for
up to 50 processes to simultaneously be present in the processor, eight
of which are at any time actively in the pipeline.  When a process 
needs to wait for memory, it is simply not returned to the group (of 50)
which are candidates for being scheduled.

Secondly, I disagree that with the "right architecture" synchronous delay
events can be hidden.  If there is no variance in memory latency (i.e. if
you have no cache or if you have a 100% hit rate), then the compiler can
do a pretty good job (branch predictability is also important here, static
vs. dynamic, but that's another issue).  However, when memory latency becomes
more variable, dynamic scheduling starts becoming more important (also,
this effect is increasingly important as the number of parallel operations
increases).

-------
Steve Melvin
University of California, Berkeley
-------

firth@sei.cmu.edu (Robert Firth) (01/03/90)

In article <28573@amdcad.AMD.COM> tim@amd.com (Tim Olson) writes:

>#  3) Saving/Restoring Live Registers Only (like your #2).  In systems
>#     where security (covert channels) is not an issue

In article <52104@srcsip.UUCP> shankar@src.honeywell.com (Subash Shankar) writes:

>Why does security affect things here?

Here's a simple example.  (Mnemonics have been changed to protect
the innocent.)

You get hold of the login processing code of the Mugwump IV multi-user
operating system.  That code reads an 8-character password, scrambles
it, and checks it.  It just so happens that, on exit from the code,
the clear password is still lying around in <R6,R7>.

You write a short program that does this

	loop
	   get myself time sliced
	   read <R6,R7>
	   write the value to a file
	end loop

Every so often, by pure chance, the OS scheduler will resume your
program immediately after executing the login process.  Every time
that happens, you will have captured a password.  The registers
<R6,R7> are the covert channel by means of which information leaks
from the login process to your program.  An OS that saved and restored
only live registers would not save and restore them, so leaving
the channel open.  (The registers aren't live because there is no
write to them anywhere in the program.)

rec@dg.dg.com (Robert Cousins) (01/04/90)

In article <7fWJ02qY7aoo01@amdahl.uts.amdahl.com> mat@uts.amdahl.com (Mike Taylor) writes:
>In article <14007@pur-ee.UUCP>, hankd@pur-ee.UUCP (Hank Dietz) writes:
>> 6) Since nobody builds uniprocessors anymore (;-), NEVER interrupt a
>>    processor:  simply let another processor which happens to be free at that
>>    moment (or the next processor to become free) handle the interrupt.
>> Number 6 is the really interesting one.
>A slight variation on this is used by some mainframe operating systems - in
>a multiprocessor only one processor may be enabled for I/O interruptions. The
>OS will enable another processor if the interrupt load on the enabled processor
>reaches a certain threshold (and so on).  Of course, any processor can start 
>an I/O operation. A nice side-effect is that the enabled processor can test
>for pending interrupts when it finishes interrupt processing and handle the
>next one in the queue without bothering to process switch. This batching effect
>increases performance further.
>Mike Taylor                               ...!{hplabs,amdcad,sun}!amdahl!mat

This can bring about unacceptable performance degradations under some 
circumstances. A number of fully symmetric operating systems allow any
processor to field an interrupt from any peripheral (DG/UX does). One
of the reasons for this is that single-threading interrupt service
(which is what you are talking about) introduces two sets of delays:
first, a peripheral must wait longer on average for interrupt service,
which reduces throughput; second, a sleeping task may not wake up until
a cross-processor interrupt occurs.  Both of these are statistical
problems that occur to some degree on all MP OSs.  Still, if you do the
queueing computations, it is indeed preferable to allow multiple CPUs
to handle interrupts.  That does not mean, however, that batching of
interrupt processing is a bad idea.  Just the opposite: when a CPU
completes one ISR, it should be able to enter another without penalty.
Conclusion: administering an interrupt policy for a high-performance
multiprocessor is more complex than normally considered.
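
For anyone who wants to see the queueing computation rather than take it on
faith, here is a back-of-the-envelope sketch comparing the mean wait for
interrupt service with one enabled CPU (M/M/1) against several enabled CPUs
sharing one queue (M/M/c), assuming Poisson arrivals, exponential service,
and the same total interrupt stream in both cases.  The arrival and service
rates are invented numbers, purely for illustration:

#include <stdio.h>

/* Mean time an arrival waits before service in an M/M/c queue (Erlang C). */
static double mean_wait(int c, double lambda, double mu)
{
    double a = lambda / mu;            /* offered load; must be < c        */
    double term = 1.0, sum = 1.0;      /* running a^k/k!, and its sum      */
    double last, p_wait;
    int k;

    for (k = 1; k < c; k++) {
        term *= a / k;
        sum  += term;
    }
    last   = (term * a / c) / (1.0 - a / c);
    p_wait = last / (sum + last);      /* probability an arrival must wait */
    return p_wait / (c * mu - lambda);
}

int main(void)
{
    double lambda = 8000.0;            /* interrupts per second (made up)  */
    double mu     = 10000.0;           /* services per second per CPU      */

    printf("1 CPU enabled : %6.1f usec mean wait\n", 1e6 * mean_wait(1, lambda, mu));
    printf("4 CPUs enabled: %6.1f usec mean wait\n", 1e6 * mean_wait(4, lambda, mu));
    return 0;
}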

Robert Cousins
Dept. Mgr, Workstation Dev't.
Data General Corp.

Speaking for myself alone.

davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr) (01/04/90)

  While this topic is going on, does any (production) CPU use a register
dirty bit? What I envision is that there would be a load-em-all
instruction which would load all registers from memory and clear the
dirty bits. On context switch the save-em-all instruction would only
save those which were modified.
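
In software terms, the idea is something like this (a sketch only -- the
register count, the dirty flags, and the instruction names are hypothetical):

#define NREGS 64

static unsigned long regs[NREGS];      /* the register file                    */
static unsigned char dirty[NREGS];     /* one flag per register, set by the
                                          hardware on any write to it          */

/* "load-em-all": load every register from the saved image and clear the
 * dirty bits.                                                                 */
void load_em_all(const unsigned long image[NREGS])
{
    int i;
    for (i = 0; i < NREGS; i++) {
        regs[i]  = image[i];
        dirty[i] = 0;
    }
}

/* "save-em-all": on a context switch, write back only the registers that
 * were modified since the last load-em-all.                                   */
void save_em_all(unsigned long image[NREGS])
{
    int i;
    for (i = 0; i < NREGS; i++)
        if (dirty[i])
            image[i] = regs[i];
}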

  I can't envision a system working which didn't preserve the registers
in a context switch, and I think the proposed loop
	sleep a bit
	check the registers
is fanciful.  If the registers change while my process is running, I would
expect the process to die rather sooner than later. I *can* envision
that on first startup the registers might not be set to a known state.

  With memory it's another story. We used to allocate memory and search
it all the time in the "good old days." We ran (perhaps crawled) 32
users on a single MB of core, too, and I don't want to go back to that,
either. 
-- 
bill davidsen	(davidsen@crdos1.crd.GE.COM -or- uunet!crdgw1!crdos1!davidsen)
"The world is filled with fools. They blindly follow their so-called
'reason' in the face of the church and common sense. Any fool can see
that the world is flat!" - anon