mash@mips.UUCP (06/19/86)
In article <415@ur-tut.UUCP> tuba@ur-tut.UUCP (Jon Krueger) writes:
>In article <506@mips.UUCP> mash@mips.UUCP (John Mashey) writes:
>> . . .
>>3) Let's try some back-of-the-envelope numbers: ....
>>4) Now, let's try published data: Clark & Levy, ....
>> a) LDPCTX and SVPCTX aren't on the top 25 in usage of CPU time,
>> even in VMS Kernel mode.....
>>5) All of this is consistent in bounding the problem: for time-sharing
>>systems like VAXen, the special context save/restore instructions contribute
>>at most half a percent to performance. . . .
>
>Thanks for the numbers and calculations. I can't argue with your numbers,
>but I arrive at different conclusions.
>
>I agree that the VAX architecture, as implemented on the 780, including the
>presence and performance of those instructions, limits overhead due to
>context switching to about 5 percent of processor time. So the performance
>increase attainable by decreasing this overhead is only 5 percent. The
>numbers you present don't tell us how much of that 5 percent is spent
>actually executing LDPCTX/SVPCTX. So we can only estimate the performance
>aspects of increasing their speed. I accept your estimate of at most half a
>percent processor time spent, so we can only save about half a percent.
>
>What we can't say is how much context switching overhead would rise to if
>the instructions didn't exist. For instance, if the functionality
>implemented in the microcode of LDPCTX/SVPCTX were performed by a system
>routine, overhead might be 90% of processor time at 60 switches per second.
----------------------------^^----Unlikely, see below.-------
>In this case, we could say that the instructions contribute about 85% to
>system performance. Similarly, if hardware on the 780 autosaved
>and restored registers as needed by processor modes and subroutine
>instructions, overhead might be 0% of processor time, but cycles
>would take longer.
>
>In other words, I think the numbers you present prove that only about half a
>percent performance increase can be attained by tweaking the special
>instructions. They don't prove that the special instructions contribute
>only 10% to context switching or only half a percent to system performance
>related to context switching. Suppose 50 percent of system time was spent
>executing them. Would you conclude that they contribute 50 percent to
>performance? I would conclude that they subtract 50 percent from
>performance.
>
>In other other words, you look at measurements of context switching on 780's
>and since the special instructions represent so little processor time, you
>conclude they don't contribute much to performance. I wonder how much more
>processor time would be spent achieving the same functionality in different
>ways if the instructions didn't exist and didn't execute at their measured
>speeds. I conclude that we don't know enough to assess the contribution of
>the special instructions to a 780's ability to keep context switching
>overhead down to about 5 percent. Therefore, we don't know how important
>the special instructions are to timesharing, or how clever it is to put them
>into your architecture.
----------------------------
1. Jon's premise is correct, in general: "Just because something isn't used
much doesn't mean that omitting it wouldn't drastically slow a system down".

2. However, I believe my original conclusion is also correct, i.e., that the
VAX LDPCTX/SVPCTX don't contribute a lot to performance.

3. The point of all the numbers was to show that the time actually spent
doing LDPCTX/SVPCTX is small, i.e., that one might well afford a less
efficient implementation without seeing much impact on system performance.
(A quick worked example of that bound follows below.)
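To make the bound concrete, here is a back-of-the-envelope sketch in C. The
0.5% share is the figure from the measurements cited above; the slowdown
factors are assumptions picked purely for illustration, not measurements of
any real replacement sequence.

    /* Bounding argument: if LDPCTX/SVPCTX account for at most 0.5% of total
     * CPU time, then a replacement sequence that is k times slower can add
     * at most (k - 1) * 0.5% of total CPU time.  The 0.5% figure is from the
     * measurements discussed above; the slowdown factors are made up. */
    #include <stdio.h>

    int main(void)
    {
        double pctx_share = 0.005;                  /* <= 0.5% of CPU time */
        double slowdown[] = { 1.0, 1.5, 2.0, 4.0 }; /* assumed cost ratios */
        int i;

        for (i = 0; i < 4; i++) {
            double added = (slowdown[i] - 1.0) * pctx_share;
            printf("replacement %.1fx as slow -> at most %.2f%% extra CPU time\n",
                   slowdown[i], added * 100.0);
        }
        return 0;
    }

Even a replacement that took four times as long would cost on the order of
1.5% of total CPU time, which is the scale of effect being argued about here.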
4. Unfortunately, there were a whole bunch of assumptions necessary to get to
the correct conclusion, and they were buried in the sentence in the original
posting [shortly before where Jon started quoting]: "1) Register save/restores
speed are almost entirely dominated by memory system time anyway." They were
buried because I was treating the issue as a continuation of familiar
discussions, which I woefully neglected to describe. Here are the hidden
assumptions used to reach the conclusion:

a] I assumed people understand exactly what LDPCTX & SVPCTX do:
   LDPCTX: Invalidates the per-process half of the TLB
               [sort of half-way between TBIS and TBIA]
           Reloads general purpose registers from the Process Control Block
           Reloads mapping regs and a few others
           Saves PSL & PC onto stack (so REI can use them soon)
   SVPCTX: Saves general regs into a PCB
           Grabs PSL & PC from stack and saves them in PCB
           Switches to the interrupt stack

b] I assumed that people understood that it was clear that:
   LDPCTX is essentially a TLB operation, followed by a load-multiple
   SVPCTX is essentially a store-multiple
   (in each case, with a few tweaks)
   The timings for these things should be dominated by:
   LDPCTX: cache misses doing the loads [likely if actually used for
           context switch, not so likely if for system call / clock
           interrupt, etc.]
   SVPCTX: write-stall activity, i.e., where the CPU can write faster
           than memory can take it.

c] The fundamental point is that these operations are mostly 32-bit word data
movers, with some carefully crafted tweaks to keep the machine in a reasonable
state. These are NOT like (for example) floating point instructions, where
simulating them in the integer instruction set is always expensive.

d] Consider what it would take to simulate the effects of these things, either
with existing instructions, or with a slightly different partitioning of
function (a rough sketch follows below):
   LDPCTX: need to add a TLB Invalidate Per-Process instruction
           Copy PSL & PC to stack
           Use POPR, after tweaking the SP to point at (something like)
           a PCB, to reload regs
           OR use a sequence of MOVQs to reload the regs, 2 at a time
   SVPCTX: Save SP somewhere, set SP -> PCB (or something like it)
           Use PUSHR to save regs, or MOVQs
           Copy PSL & PC from stack to PCB and pop them
[As can be seen, what these things are mainly doing for you is letting you
implicitly access the current PCB without needing a GP register to point
there, at an inconvenient place. Note that this pair cannot be too much more
expensive than a full function call/return that saves most of the registers.]
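As promised in d], here is a rough C-level sketch of what such a software
replacement has to do. It is illustrative only: the pcb layout and the helper
names (save_general_regs, load_general_regs, tbi_process, load_map_regs) are
invented for this sketch, not VMS or hardware interfaces; on a real VAX each
helper would be a short PUSHR/POPR or MOVQ sequence, or an MTPR, as described
above.

    /* Sketch (hypothetical) of the replacement partitioning described in d]. */
    struct pcb {
        unsigned long regs[14];               /* R0..R13 */
        unsigned long sp, pc, psl;
        unsigned long p0br, p0lr, p1br, p1lr; /* per-process mapping regs */
    };

    /* Placeholders for what would really be short instruction sequences. */
    static void save_general_regs(unsigned long *dst)       { (void)dst; } /* PUSHR or MOVQs */
    static void load_general_regs(const unsigned long *src) { (void)src; } /* POPR or MOVQs */
    static void tbi_process(void)  { }          /* the one new piece: flush per-process TLB entries */
    static void load_map_regs(const struct pcb *p)          { (void)p; }   /* reload the mapping registers */

    /* What a software SVPCTX has to do: mostly stores into the PCB. */
    void soft_svpctx(struct pcb *cur, unsigned long pc, unsigned long psl)
    {
        save_general_regs(cur->regs);  /* time dominated by write traffic */
        cur->pc  = pc;                 /* PC/PSL were left on the stack by the trap */
        cur->psl = psl;
        /* ...then switch to the interrupt stack... */
    }

    /* What a software LDPCTX has to do: a per-process TLB flush, then mostly loads. */
    void soft_ldpctx(const struct pcb *next)
    {
        tbi_process();
        load_map_regs(next);
        load_general_regs(next->regs); /* time dominated by cache misses on the loads */
        /* ...then push next->pc and next->psl so REI can resume the process... */
    }

The point of the sketch is simply that, apart from the per-process TLB
invalidate, everything here is ordinary register/memory traffic that existing
instructions already handle.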
e) Now: how much longer will the times for the above be? ANS: not much:
   e1) A longer sequence of instructions may well take more instruction-cache
       misses. This is especially true if there are no load/store-multiple
       instructions of any sort available. However, the VAX has them already.
   e2) A longer sequence of instructions may not be able to get at the
       machine's parallelism (for things like auto-increments, for example).
   e3) [MOST IMPORTANT] It may be that SVPCTX/LDPCTX is the fastest way to
       save/load a large block of registers. However, this is generally
       unlikely, because it would mean that the subroutine call mechanism
       (or PUSHR/POPR) is not as fast as possible, which is silly: it would
       mean that a frequently-occurring operation is less well tuned than an
       infrequently-occurring one. In particular, think how odd it would be
       to have loads/stores [of any reasonable size] that cannot run the
       memory system full blast. If you're going to spend microcode effort to
       tune memory-related operations, where would you put that effort on a
       VAX? ANS: MOV* [esp. MOVC3], CALLS/RET [or PUSHR/POPR].

f) Thus, the fundamental assumption is that the bulk of the time in these
operations is spent pushing memory, with a little bit of control around them.
If you have a fast, general way to save/restore registers, you'll want it
elsewhere even more, so it might as well be a separate instruction that can be
used elsewhere. What this says is that e3) is unlikely, e2) is possible but
typically not a big deal, and e1) is not that bad a problem, unless you have
no multiple loads/stores of any sort. Even then, it's not a big deal, if you
consider that you'll almost certainly have substantial cache misses inside the
newly-started user process.

Thus, putting all this together, we have the two pieces:
a] LDPCTX/SVPCTX don't account for that much time.
b] They don't do very much that couldn't be simulated by existing
   instructions, or minor variants thereof, and the fundamental nature of the
   replacements is such that the times should be fairly comparable; if they're
   not, then the machine has been designed to NOT allow good memory
   performance from normally-usable instructions.
c] Thus, it's not clear that wrapping up the entire functionality of these
   things, in VAX time-sharing use, really contributes much to performance,
   versus partitioning them differently.

I apologize for not explaining all of this in the first place; in retrospect,
it's not particularly obvious. Of course, one can make all sorts of
assumptions that invalidate the conclusion, such as expecting that
process-switch instructions have to do much more work, or that context-switch
rates are high, or that context-switch latencies must be guaranteed low, etc.,
etc. However, I still think the conclusion, as originally stated, actually
holds in this case.
-- 
-john mashey    DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP:   {decvax,ucbvax,ihnp4}!decwrl!mips!mash, DDD: 408-720-1700, x253
USPS:   MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086