[net.arch] Task switching or LDPCTX/SVPCTX revisited again

mash@mips.UUCP (06/19/86)
In article <415@ur-tut.UUCP> tuba@ur-tut.UUCP (Jon Krueger) writes:
>In article <506@mips.UUCP> mash@mips.UUCP (John Mashey) writes:
>> . . .
>>3) Let's try some back-of-the-envelope numbers: ....
>>4) Now, let's try published data: Clark & Levy, ....
>>	a) LDPCTX and SVPCTX aren't on the top 25 in usage of CPU time,
>>	even in VMS Kernel mode.....
>>5) All of this is consistent in bounding the problem: for time-sharing
>>systems like VAXen, the special context save/restore instructions contribute
>>at most half a percent to performance. . . .
>
>Thanks for the numbers and calculations.  I can't argue with your numbers,
>but I arrive at different conclusions.
>
>I agree that the VAX architecture, as implemented on the 780, including the
>presence and performance of those instructions, limits overhead due to
>context switching to about 5 percent of processor time.  So the performance
>increase attainable by decreasing this overhead is only 5 percent.  The
>numbers you present don't tell us how much of that 5 percent is spent
>actually executing LDPCTX/SVPCTX.  So we can only estimate the performance
>aspects of increasing their speed.  I accept your estimate of at most half a
>percent processor time spent, so we can only save about half a percent.
>
>What we can't say is how much context switching overhead would rise to if
>the instructions didn't exist.  For instance, if the functionality
>implemented in the microcode of LDPCTX/SVPCTX were performed by a system
>routine, overhead might be 90% of processor time at 60 switches per second.
----------------------------^^----Unlikely, see below.-------
>In this case, we could say that the instructions contribute about 85% to
>system performance.  Similarly, if hardware on the 780 autosaved
>and restored registers as needed by processor modes and subroutine
>instructions, overhead might be 0% of processor time, but cycles
>would take longer.
>
>In other words, I think the numbers you present prove that only about half a
>percent performance increase can be attained by tweaking the special
>instructions.  They don't prove that the special instructions contribute
>only 10% to context switching or only half a percent to system performance
>related to context switching.  Suppose 50 percent of system time was spent
>executing them.  Would you conclude that they contribute 50 percent to
>performance?  I would conclude that they subtract 50 percent from
>performance.
>
>In other other words, you look at measurements of context switching on 780's
>and since the special instructions represent so little processor time, you
>conclude they don't contribute much to performance.  I wonder how much more
>processor time would be spent achieving the same functionality in different
>ways if the instructions didn't exist and didn't execute at their measured
>speeds.  I conclude that we don't know enough to assess the contribution of
>the special instructions to a 780's ability to keep context switching
>overhead down to about 5 percent.  Therefore, we don't know how important
>the special instructions are to timesharing, or how clever it is to put them
>into your architecture.
----------------------------
1. Jon's premise is correct, in general: "Just because something isn't used
much doesn't mean that omitting it wouldn't drastically slow a system down".
2. However, I believe my original conclusion is also correct, i.e., that
the VAX LDPCTX/SVPCTX don't contribute a lot to performance.
3. The point of all the numbers was to show that the time actually spent doing
LDPCTX/SVPCTX is small, i.e. that one might well afford a less efficient
implementation without seeing much impact on system performance.  
4. Unfortunately, there were a whole bunch of assumptions necessary to
get to the correct conclusion, and they were buried in the sentence in the
original posting [shortly before where Jon started quoting]:

"1) Register save/restores speed are almost entirely dominated by memory
system time anyway."

They were buried because I was treating the issue as a continuation of
familiar discussions, which I woefully neglected to describe.
Here are the hidden assumptions used to reach the conclusion:

a] I assumed people understand exactly what LDPCTX & SVPCTX do:
	LDPCTX:
		Invalidates the per-process half of the TLB
			[sort of half-way between TBIS and TBIA]
		Reloads general purpose registers from Process Control Block
		Reloads mapping regs and a few others
		Saves PSL & PC onto stack (so REI can use them soon)
	SVPCTX:
		Saves general regs into a PCB
		Grabs PSL & PC from stack and saves them in PCB
		Switches to interrupt stack
b] I assumed it was clear that:
	LDPCTX is essentially a TLB operation, followed by a load-multiple
	SVPCTX is essentially a store-multiple
	(in each case, with a few tweaks)
	The timings for these things should be dominated by:
	LDPCTX: cache misses doing the loads [likely if actually used
		for context switch, not so likely if for system call/ clock
		interrupt, etc]
	SVPCTX: write-stall activity, i.e., where the CPU can write faster
		than memory can take it.
c] The fundamental point is that these operations are mostly 32-bit word
data movers, with some carefully crafted tweaks to keep the machine in a
reasonable state. These are NOT like (for example) floating point instrs,
which are always expensive to simulate in the integer instruction set.

d] Consider what it would take to simulate the effects of these things,
either with existing instructions, or with a slightly different partitioning
of function:
	LDPCTX:
		need to add a TLB Invalidate Per-Process instruction
		Copy PSL & PC to stack
		Use POPR, after tweaking the SP to point at (something like)
			a PCB, to reload regs
			OR
			Use a sequence of MOVQs to reload the regs, 2 at a time
	SVPCTX:
		Save SP somewhere, set SP -> PCB (or something like it)
		Use PUSHR to save regs, or MOVQs
		Copy PSL & PC from stack to PCB and pop them
	[as can be seen, what these things mainly do for you is provide
	implicit access to the current PCB, without needing a GP
	register to point there, at an inconvenient place.  Note that this
	pair cannot be too much more expensive than a full function call/
	return that saves most of the registers.]

e] Now: how much longer will the times for the above be? ANS: not much:
	e1) A longer sequence of instructions may well take more
	instruction-cache misses.  This is especially true if there are no
	load/store multiple instructions of any sort available.  However,
	the VAX has them already.
	e2) A longer sequence of instructions may not be able to get at the
	machine's parallelism (for things like auto-increments, for example).
	e3) [MOST IMPORTANT]: It may be that SVPCTX/LDPCTX is the fastest
	way to save/load a large block of registers.  However, this is
	generally unlikely, because it would probably be a design mistake:
	it essentially means that the subroutine call mechanism (or
	PUSHR/POPR) is not as fast as possible, which is silly, since
	a frequently occurring operation would then be less well-tuned than
	an infrequently-occurring one.  In particular, think how odd it would be
	to have loads/stores [of any reasonable size] that cannot run the
	memory system full blast.  If you're going to spend microcode effort
	to tune memory-related operations, where would you put that effort
	on a VAX?  ANS: MOV* [esp MOVC3], CALLS/RET [or PUSHR/POPR].
f] Thus, the fundamental assumption is that the bulk of the time in these
operations is spent pushing memory, with a little bit of control around them.
If you have a fast, general way to save/restore registers, you'll want it
elsewhere even more, so it might as well be a separate instruction that can be
used elsewhere.  What this says is that e3) is unlikely, e2) is possible
but (typically) not a big deal, and e1) is not that bad a problem, unless
you have no multiple loads/stores of any sort.  Even then, it's not a big
deal, if you consider that you'll almost certainly have substantial cache
misses inside the newly-started user process.
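
As a rough illustration of the decomposition described in d], here is a toy
Python model of the two operations as "block move plus a few tweaks".
Everything in it is an assumption made for the sketch: the PCB and CPU
classes, the 14-register count, the TLB keyed by "proc"/"sys", and the
function names are hypothetical, and none of it is VAX microcode or VMS code.

```python
from dataclasses import dataclass, field

NUM_GPRS = 14  # assumption for this model: general registers kept in the PCB


@dataclass
class PCB:
    """Hypothetical process control block: saved registers plus PSL/PC."""
    regs: list = field(default_factory=lambda: [0] * NUM_GPRS)
    psl: int = 0
    pc: int = 0
    map_regs: tuple = (0, 0)  # stand-in for the mapping registers


@dataclass
class CPU:
    """Hypothetical processor state, just enough to show the data movement."""
    regs: list
    map_regs: tuple
    stack: list   # top of stack is the end of the list
    tlb: dict     # keys are ("proc" or "sys", virtual page) -> page frame
    on_interrupt_stack: bool = False


def svpctx_emulated(cpu, pcb):
    """Essentially a store-multiple, with a few tweaks."""
    pcb.regs[:] = cpu.regs          # the bulk of the work: a block store
    pcb.pc = cpu.stack.pop()        # grab PC and PSL from the stack ...
    pcb.psl = cpu.stack.pop()       # ... and save them in the PCB
    cpu.on_interrupt_stack = True   # switch to the interrupt stack


def ldpctx_emulated(cpu, pcb):
    """Essentially a TLB operation followed by a load-multiple."""
    # Invalidate only the per-process half of the TLB.
    cpu.tlb = {k: v for k, v in cpu.tlb.items() if k[0] != "proc"}
    cpu.regs[:] = pcb.regs          # the bulk of the work: a block load
    cpu.map_regs = pcb.map_regs     # reload mapping registers
    cpu.stack.append(pcb.psl)       # leave PSL and PC on the stack
    cpu.stack.append(pcb.pc)        # so a following REI can use them
```

In this model almost all of the work is the two block copies of `regs`; the
rest is a handful of single-word moves, which is the point made in c] and f].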

Thus, putting all this together, we have the two pieces:
	a] LDPCTX/SVPCTX don't account for that much time.
	b] They don't do very much that couldn't be simulated by existing
	instructions, or minor variants thereof, and the fundamental nature
	of the replacements is such that the times should be fairly comparable,
	because, if they're not, then the machine has been designed to NOT
	allow good memory performance from normally-usable instructions.
	c] Thus, it's not clear that wrapping up the entire functionality of
	these things, in VAX-time-sharing use, really contributes much to
	performance, versus partitioning them differently.
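
The bounding argument can also be written as back-of-the-envelope
arithmetic.  The sketch below assumes the figures quoted in this thread
(about 5% of processor time in context switching, at most 0.5% in
LDPCTX/SVPCTX) and assumes that any extra cost comes only from replacing
the two instructions; the function names are made up for illustration.

```python
def relative_slowdown(pair_fraction, factor):
    """Extra run time, as a fraction of the original run time, if the
    LDPCTX/SVPCTX work takes `factor` times as long and nothing else
    changes."""
    return pair_fraction * (factor - 1.0)


def factor_needed(pair_fraction, old_overhead, new_overhead):
    """How many times slower the replacement for the pair must be for
    context-switch overhead to grow from `old_overhead` to `new_overhead`
    (both as fractions of total processor time)."""
    useful = 1.0 - old_overhead                  # non-switching work, fixed
    new_total = useful / (1.0 - new_overhead)    # run time at the new ratio
    other_overhead = old_overhead - pair_fraction  # switching work kept as-is
    new_pair_time = new_total - useful - other_overhead
    return new_pair_time / pair_fraction


if __name__ == "__main__":
    for k in (2, 5, 10):
        print(k, relative_slowdown(0.005, k))
    print(factor_needed(0.005, 0.05, 0.90))
```

With these inputs, a replacement sequence twice as slow costs about half a
percent, and one ten times as slow costs about 4.5 percent.  Reaching the
hypothetical 90% overhead would need a replacement roughly 1700 times
slower than the instructions, which is hard to reconcile with operations
that are mostly block moves.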

I apologize for not explaining all of this more in the first place, because,
in retrospect, it's not particularly obvious.  Of course, one can make all
sorts of assumptions that invalidate the conclusion, such as expecting
that process-switch instructions have to do much more work, or that
context-switch rates are high, or that context-switch latencies must be
guaranteed low, etc, etc.  However, I still think the conclusion, as
originally stated, actually holds in this case.
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{decvax,ucbvax,ihnp4}!decwrl!mips!mash, DDD:  	408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086