mash@mips.UUCP (06/15/86)
In article <4138@sun.uucp> guy@sun.uucp (Guy Harris) writes:
>...
>The 68000 does, indeed, not have a single "switch task" instruction, but who
>cares? The fact that operation X is performed by a single instruction in no
>way implies that operation X is exceptionally fast. Furthermore, I have no
>idea how much of the task-switch time on VMS or UNIX is spent doing what the
>"load process context" instruction does; it has to figure out which task to
>run, for instance, which adds a few more instructions.

Guy is right on; furthermore:

1) Register save/restore speed is almost entirely dominated by memory system
   time anyway.

2) When measured by the "2 processes writing 1 byte circularly thru pipes"
   benchmark, each complete UNIX context switch takes on the order of 700
   microseconds on a 780. Actual register save/restore time is dominated by
   write-stalls and data cache misses, which are a function of the memory
   system, not of the instruction set. The only real difference is in extra
   instruction-cache misses one may hit by having to do a sequence of
   loads/stores instead of single micro-coded instructions. Having looked at
   the code, I guarantee that most of the code is doing other things than
   saving/restoring registers.

3) Let's try some back-of-the-envelope numbers:
   a) At 60 cs/second (typical) and 700 usec/cs, the VAX would spend
      60*700 = 42,000 usecs of every second, or about 4.2% of the time,
      doing context switches.
   b) Supposing that 10% of this time is actually in save/restore,
      about .4% of the machine might be spent in save/restore
      (SVPCTX/LDPCTX). Of course, they might be used for other things also.

4) Now, let's try published data: Clark & Levy, "Measurement and Analysis of
   Instruction Use in the VAX 11/780", 9th Ann. Symp. on Comp. Arch,
   April 1982.
   a) LDPCTX and SVPCTX aren't on the top 25 in usage of CPU time,
      even in VMS Kernel mode. The top 25 instructions use 62% of the
      total kernel time, and the smallest shown is REMQUE with 1.31%.
      This was for multi-user workloads.
   b) MTPR (Move to Processor Register) used 5.27% of the kernel time,
      and 1.15% of the total CPU time for all processor modes. From this,
      I infer that the kernel was using 21% of the CPU (1.15/5.27).
      Hence, the more time-consuming of LDPCTX/SVPCTX could be consuming
      no more than 1.31% of the kernel, or .27% of the total CPU. Even
      both together could account for no more than .54% of the total CPU.

5) All of this is consistent in bounding the problem: for time-sharing
   systems like VAXen, the special context save/restore instructions
   contribute at most half a percent to performance.

[Reminder: this says nothing about whether such instructions are important
for real-time systems or other environments. Also, some forms of these
instructions have important structural properties or other rationales,
but NOT SPEED IN THIS DOMAIN.]
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{decvax,ucbvax,ihnp4}!decwrl!mips!mash, DDD:  	408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
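
The arithmetic in items 3 and 4 of the post above can be rechecked with a short sketch. Every input is a figure quoted in the post; the function and constant names are illustrative assumptions, not anything from the post or from Clark & Levy.

```python
# Recomputation of the estimates quoted in the post above.
# Inputs are the post's figures; names are hypothetical, chosen for this sketch.

SWITCHES_PER_SEC = 60          # "60 cs/second (typical)"
USEC_PER_SWITCH = 700          # "700 usec/cs" on a VAX-11/780
SAVE_RESTORE_SHARE = 0.10      # supposed share of switch time in save/restore

MTPR_KERNEL_PCT = 5.27         # MTPR share of kernel-mode time
MTPR_TOTAL_PCT = 1.15          # MTPR share of total CPU time
SMALLEST_TOP25_KERNEL_PCT = 1.31   # REMQUE, last entry in the kernel top 25


def envelope_estimate():
    """Item 3: fraction of the CPU in context switches and in save/restore."""
    switch_usec_per_sec = SWITCHES_PER_SEC * USEC_PER_SWITCH
    switch_fraction = switch_usec_per_sec / 1_000_000
    return switch_usec_per_sec, switch_fraction, switch_fraction * SAVE_RESTORE_SHARE


def published_bound():
    """Item 4: upper bound (percent of total CPU) for one and for both instructions."""
    # Kernel share of the CPU, inferred from the two MTPR percentages.
    kernel_share = MTPR_TOTAL_PCT / MTPR_KERNEL_PCT
    # Neither instruction made the top 25, so each is below REMQUE's 1.31%.
    one_instr_pct = SMALLEST_TOP25_KERNEL_PCT * kernel_share
    return kernel_share, one_instr_pct, 2 * one_instr_pct


if __name__ == "__main__":
    usec, frac, sr = envelope_estimate()
    print(f"context switching: {usec} usec/sec = {frac:.1%}; save/restore = {sr:.2%}")
    share, one, both = published_bound()
    print(f"kernel share = {share:.1%}; one instr <= {one:.2f}%; both <= {both:.2f}%")
```

Carried through without intermediate rounding, the bounds come out near 0.29% and 0.57%, slightly above the post's .27% and .54%, which follow from rounding the kernel share down to 21%. Either way the result is on the order of half a percent.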
ken@njitcccc.UUCP (Kenneth Ng) (06/15/86)
In article <506@mips.UUCP>, mash@mips.UUCP writes:
>    a) LDPCTX and SVPCTX aren't on the top 25 in usage of CPU time,
>    even in VMS Kernel mode. The top 25 instructions use 62% of the
>    total kernel time, and the smallest shown is REMQUE with 1.31%.
>    This was for multi-user workloads.

Out of curiosity, what were the top 5 or so instructions and the
percentage of cpu time used? All in all, that was an impressive
item though.
-- 
Kenneth Ng: uucp(unreliable) ihnp4!allegra!bellcore!njitcccc!ken
	soon uucp:ken@rigel.cccc.njit.edu
	bitnet(preferred) ken@njitcccc.bitnet
	soon bitnet: ken@orion.cccc.njit.edu
(Yes, we are slowly moving to RFC 920, kicking and screaming)
New Jersey Institute of Technology
Computerized Conferencing and Communications Center
Newark, New Jersey 07102

Vulcan jealousy: "I fail to see the logic in preferring Stonn over me"
Movie "Short Circuit": Number 5: "I need input"
tuba@ur-tut.UUCP (Jon Krueger) (06/16/86)
In article <506@mips.UUCP> mash@mips.UUCP (John Mashey) writes:
> . . .
>3) Let's try some back-of-the-envelope numbers:
>   a) At 60 cs/second (typical) and 700 usec/cs, the VAX would spend
>      60*700 = 42,000 usecs, or about 4.2% of the time doing context
>      switches.
>   b) Supposing that 10% of this time is actually in save/restore,
>      about .4% of the machine might be spent in save/restore
>      (SVPCTX/LDPCTX). Of course, they might be used for other things also.
>4) Now, let's try published data: Clark & Levy, "Measurement and Analysis of
>   Instruction Use in the VAX 11/780", 9th Ann. Symp. on Comp. Arch,
>   April 1982.
>   a) LDPCTX and SVPCTX aren't on the top 25 in usage of CPU time,
>      even in VMS Kernel mode. The top 25 instructions use 62% of the
>      total kernel time, and the smallest shown is REMQUE with 1.31%.
>      This was for multi-user workloads.
>   b) MTPR (Move to Processor Register) used 5.27% of the kernel time,
>      and 1.15% of the total CPU time for all processor modes. From this,
>      I infer that the kernel was using 21% of the CPU (1.15/5.27).
>      Hence, the most time-consuming of LDPCTX/SVPCTX could be consuming
>      no more than 1.31% of the kernel, or .27% of the total CPU. Even
>      both together could account for no more than .54% of the total CPU.
>5) All of this is consistent in bounding the problem: for time-sharing
>systems like VAXen, the special context save/restore instructions contribute
>at most half a percent to performance.
> . . .

Thanks for the numbers and calculations. I can't argue with your
numbers, but I arrive at different conclusions.

I agree that the VAX architecture, as implemented on the 780, including
the presence and performance of those instructions, limits overhead due
to context switching to about 5 percent of processor time. So the
performance increase attainable by decreasing this overhead is only 5
percent.

The numbers you present don't tell us how much of that 5 percent is
spent actually executing LDPCTX/SVPCTX. So we can only estimate the
performance effect of increasing their speed. I accept your estimate
of at most half a percent of processor time spent, so we can only save
about half a percent.

What we can't say is how high context switching overhead would rise if
the instructions didn't exist. For instance, if the functionality
implemented in the microcode of LDPCTX/SVPCTX were performed by a system
routine, overhead might be 90% of processor time at 60 switches per
second. In this case, we could say that the instructions contribute
about 85% to system performance. Similarly, if hardware on the 780
autosaved and restored registers as needed by processor modes and
subroutine instructions, overhead might be 0% of processor time, but
cycles would take longer.

In other words, I think the numbers you present prove that only about
half a percent performance increase can be attained by tweaking the
special instructions. They don't prove that the special instructions
contribute only 10% to context switching or only half a percent to
system performance related to context switching. Suppose 50 percent of
system time was spent executing them. Would you conclude that they
contribute 50 percent to performance? I would conclude that they
subtract 50 percent from performance.

In other other words, you look at measurements of context switching on
780's and, since the special instructions represent so little processor
time, you conclude they don't contribute much to performance. I wonder
how much more processor time would be spent achieving the same
functionality in different ways if the instructions didn't exist and
didn't execute at their measured speeds.

I conclude that we don't know enough to assess the contribution of the
special instructions to a 780's ability to keep context switching
overhead down to about 5 percent. Therefore, we don't know how
important the special instructions are to timesharing, or how clever it
is to put them into your architecture.
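
The distinction drawn in the post above, between the time spent in the instructions and what they contribute relative to a system without them, can be put as a small sketch. The 90% figure is the post's own made-up hypothetical, not a measurement, and the function names are assumptions introduced only for illustration.

```python
# Two different questions about LDPCTX/SVPCTX, using the figures from the thread.
# Function names are hypothetical; the 90% value is the post's invented example.

MEASURED_OVERHEAD_PCT = 5.0        # context-switch overhead on the 780, as rounded in the post
TIME_IN_INSTR_PCT = 0.5            # accepted upper bound on time inside the instructions
HYPOTHETICAL_SOFTWARE_PCT = 90.0   # supposed overhead if a system routine did the same work


def max_gain_from_tuning(time_in_instr_pct):
    """Most processor time that speeding up the instructions could recover:
    at best, all of the time currently spent executing them."""
    return time_in_instr_pct


def contribution(overhead_without_pct, overhead_with_pct):
    """Processor time (percentage points) saved by having the instructions,
    measured against a hypothetical system that lacks them."""
    return overhead_without_pct - overhead_with_pct


if __name__ == "__main__":
    print(max_gain_from_tuning(TIME_IN_INSTR_PCT))                           # 0.5
    print(contribution(HYPOTHETICAL_SOFTWARE_PCT, MEASURED_OVERHEAD_PCT))    # 85.0
```

The first number comes from measurement alone; the second depends entirely on the assumed overhead without the instructions, which is the quantity the post says is unknown.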
mash@mips.UUCP (06/17/86)
In article <217@njitcccc.UUCP> ken@njitcccc.UUCP (Kenneth Ng) writes:
>In article <506@mips.UUCP>, mash@mips.UUCP writes:
>>    a) LDPCTX and SVPCTX aren't on the top 25 in usage of CPU time,
>>    even in VMS Kernel mode. The top 25 instructions use 62% of the
>>    total kernel time, and the smallest shown is REMQUE with 1.31%.
>>    This was for multi-user workloads.
>Out of curiosity, what were the top 5 or so instructions and the
>percentage of cpu time used?

1) Instruction Distributions - Multi-user workload, VMS Kernel Mode

     Frequency Order      Time Order
                   %                   %
1    MOVL      10.21      MOVL      8.19
2    BEQL       5.59      MTPR      5.27
3    RSB        4.33      BBC       3.39
4    BNEQ       3.99      REI       3.29
5    MOVZWL     3.25      BSBW      3.08

2) Instruction Distributions - Multi-user workload, All Modes

     Frequency Order      Time Order
                   %                   %
1    MOVL      11.40      MOVC3    13.14
2    BEQL       5.85      CALLS     7.80
3    BNEQ       3.07      MOVL      6.60
4    MOVZBL     3.07      RET       4.07
5    BBS        2.77      MULF3     3.59

I won't attempt to summarize the paper; it's well worth reading.
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{decvax,ucbvax,ihnp4}!decwrl!mips!mash, DDD:  	408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086