[net.arch] AT&T MIPS claim really task-switching

mash@mips.UUCP (06/15/86)

In article <4138@sun.uucp> guy@sun.uucp (Guy Harris) writes:
>...
>The 68000 does, indeed, not have a single "switch task" instruction, but who
>cares?  The fact that operation X is performed by a single instruction in no
>way implies that operation X is exceptionally fast.  Furthermore, I have no
>idea how much of the task-switch time on VMS or UNIX is spent doing what the
>"load process context" instruction does; it has to figure out which task to
>run, for instance, which adds a few more instructions.
Guy is right on; furthermore:
1) Register save/restore speeds are almost entirely dominated by memory
system time anyway.
2) When measured by the "2 processes writing 1 byte circularly thru
pipes" benchmark, each complete UNIX context switch takes on the order of
700 microseconds on a 780.  Actual register save/restore time is dominated by
write-stalls and data cache misses, which are a function of the memory
system, not of the instruction set.  The only real difference is in
extra instruction-cache misses one may hit by having to do a sequence
of loads/stores instead of single micro-coded instructions.
Having looked at the code, I guarantee that most of the code is doing other
things than saving/restoring registers.
3) Let's try some back-of-the-envelope numbers:
	a) At 60 cs/second (typical) and 700 usec/cs, the VAX would spend
	60*700 = 42,000 usecs of each second, or about 4.2% of the time,
	doing context switches.
	b) Supposing that 10% of this time is actually in save/restore,
	about .4% of the machine might be spent in save/restore
	(SVPCTX/LDPCTX).  Of course, they might be used for other things also.
4) Now, let's try published data: Clark & Levy, "Measurement and Analysis of
	Instruction Use in the VAX 11/780", 9th Ann. Symp. on Comp. Arch,
	April 1982.
	a) LDPCTX and SVPCTX aren't on the top 25 in usage of CPU time,
	even in VMS Kernel mode. The top 25 instructions use 62% of the
	total kernel time, and the smallest shown is REMQUE with 1.31%.
	This was for multi-user workloads.
	b) MTPR (Move to Processor Register) used 5.27% of the kernel time,
	and 1.15% of the total CPU time for all processor modes.  From this,
	I infer that the kernel was using 21% of the CPU (1.15/5.27).
	Hence, the most time-consuming of LDPCTX/SVPCTX could be consuming
	no more than 1.31% of the kernel, or .27% of the total CPU.  Even
	both together could account for no more than .54% of the total CPU.
5) All of this is consistent in bounding the problem: for time-sharing
systems like VAXen, the special context save/restore instructions contribute
at most half a percent to performance.  [Reminder: this says nothing about
whether such instructions are important for real-time systems or other
environments.  Also, some forms of these instructions have important
structural properties or other rationales, but NOT SPEED IN THIS DOMAIN.]
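
The arithmetic in 3) and 4) can be checked with a short Python sketch.  The
inputs are only the figures quoted above; the variable names are
illustrative, and the 10% save/restore share is the supposition from 3b),
not a measurement.  Carrying the unrounded kernel share (1.15/5.27, about
21.8%) gives bounds slightly above the rounded .27% and .54%, but still on
the order of half a percent.

```python
# Sketch of the bounds in 3) and 4). All inputs are figures quoted in
# the post; the names are illustrative only.

# 3) Back-of-the-envelope estimate
switches_per_sec = 60        # "typical" rate from 3a)
usec_per_switch = 700        # pipe-benchmark figure for a 780
cs_fraction = switches_per_sec * usec_per_switch / 1_000_000
save_restore_fraction = 0.10 * cs_fraction   # the 10% is a supposition, per 3b)

# 4) Bound from the Clark & Levy percentages
mtpr_kernel_pct = 5.27       # MTPR share of kernel time
mtpr_total_pct = 1.15        # MTPR share of total CPU time, all modes
kernel_share = mtpr_total_pct / mtpr_kernel_pct   # inferred kernel share of CPU
remque_kernel_pct = 1.31     # smallest entry in the kernel top 25
one_instr_bound_pct = remque_kernel_pct * kernel_share   # either LDPCTX or SVPCTX
both_instr_bound_pct = 2 * one_instr_bound_pct           # both together

print(f"context switching:     {cs_fraction:.1%} of CPU")
print(f"save/restore estimate: {save_restore_fraction:.2%} of CPU")
print(f"kernel share of CPU:   {kernel_share:.1%}")
print(f"one instruction:       at most {one_instr_bound_pct:.2f}% of CPU")
print(f"both instructions:     at most {both_instr_bound_pct:.2f}% of CPU")
```

The two approaches are independent (one from benchmark timing, one from
instruction histograms) and land in the same range.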
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{decvax,ucbvax,ihnp4}!decwrl!mips!mash, DDD:  	408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

ken@njitcccc.UUCP (Kenneth Ng) (06/15/86)

In article <506@mips.UUCP>, mash@mips.UUCP writes:
> 	a) LDPCTX and SVPCTX aren't on the top 25 in usage of CPU time,
> 	even in VMS Kernel mode. The top 25 instructions use 62% of the
> 	total kernel time, and the smallest shown is REMQUE with 1.31%.
> 	This was for multi-user workloads.
Out of curiosity, what were the top 5 or so instructions and the
percentage of cpu time used?

All in all, that was an impressive item though. 

-- 
Kenneth Ng: uucp(unreliable) ihnp4!allegra!bellcore!njitcccc!ken
	    soon uucp:ken@rigel.cccc.njit.edu
	    bitnet(preferred) ken@njitcccc.bitnet
	    soon bitnet: ken@orion.cccc.njit.edu
(Yes, we are slowly moving to RFC 920, kicking and screaming)

New Jersey Institute of Technology
Computerized Conferencing and Communications Center
Newark, New Jersey 07102

Vulcan jealousy: "I fail to see the logic in preferring Stonn over me"
Movie "Short Circuit": Number 5: "I need input"

tuba@ur-tut.UUCP (Jon Krueger) (06/16/86)

In article <506@mips.UUCP> mash@mips.UUCP (John Mashey) writes:
> . . .
>3) Let's try some back-of-the-envelope numbers:
>	a) At 60 cs/second (typical) and 700 usec/cs, the VAX would spend
>	60*700 = 42,000 usecs of each second, or about 4.2% of the time,
>	doing context switches.
>	b) Supposing that 10% of this time is actually in save/restore,
>	about .4% of the machine might be spent in save/restore
>	(SVPCTX/LDPCTX).  Of course, they might be used for other things also.
>4) Now, let's try published data: Clark & Levy, "Measurement and Analysis of
>	Instruction Use in the VAX 11/780", 9th Ann. Symp. on Comp. Arch,
>	April 1982.
>	a) LDPCTX and SVPCTX aren't on the top 25 in usage of CPU time,
>	even in VMS Kernel mode. The top 25 instructions use 62% of the
>	total kernel time, and the smallest shown is REMQUE with 1.31%.
>	This was for multi-user workloads.
>	b) MTPR (Move to Processor Register) used 5.27% of the kernel time,
>	and 1.15% of the total CPU time for all processor modes.  From this,
>	I infer that the kernel was using 21% of the CPU (1.15/5.27).
>	Hence, the most time-consuming of LDPCTX/SVPCTX could be consuming
>	no more than 1.31% of the kernel, or .27% of the total CPU.  Even
>	both together could account for no more than .54% of the total CPU.
>5) All of this is consistent in bounding the problem: for time-sharing
>systems like VAXen, the special context save/restore instructions contribute
>at most half a percent to performance. . . .

Thanks for the numbers and calculations.  I can't argue with your numbers,
but I arrive at different conclusions.

I agree that the VAX architecture, as implemented on the 780, including the
presence and performance of those instructions, limits overhead due to
context switching to about 5 percent of processor time.  So the performance
increase attainable by decreasing this overhead is only 5 percent.  The
numbers you present don't tell us how much of that 5 percent is spent
actually executing LDPCTX/SVPCTX.  So we can only estimate the performance
aspects of increasing their speed.  I accept your estimate of at most half a
percent processor time spent, so we can only save about half a percent.

What we can't say is how high context switching overhead would rise if
the instructions didn't exist.  For instance, if the functionality
implemented in the microcode of LDPCTX/SVPCTX were performed by a system
routine, overhead might be 90% of processor time at 60 switches per second.
In this case, we could say that the instructions contribute about 85% to
system performance.  Similarly, if hardware on the 780 autosaved
and restored registers as needed by processor modes and subroutine
instructions, overhead might be 0% of processor time, but cycles
would take longer.
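
To put numbers on the distinction, here is a small Python sketch.  The 90%
figure is the made-up counterfactual from the paragraph above, not a
measurement, and the function and variable names are illustrative only.

```python
def contribution(overhead_without_pct, overhead_with_pct):
    """Processor time (in percentage points) saved by having the
    instructions, relative to a counterfactual system without them."""
    return overhead_without_pct - overhead_with_pct

measured_overhead_pct = 5.0      # roughly what the 780 shows
time_in_instructions_pct = 0.5   # upper bound accepted above

# What tuning the instructions could save: at most the time spent in them.
max_tuning_gain_pct = time_in_instructions_pct

# What they contribute: depends on the unmeasured counterfactual.
hypothetical_without_pct = 90.0  # invented for the sake of argument
print(contribution(hypothetical_without_pct, measured_overhead_pct))  # 85.0
```

The first quantity comes from measurements of the existing machine; the
second needs a measurement of a machine that does not exist.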

In other words, I think the numbers you present prove that only about half a
percent performance increase can be attained by tweaking the special
instructions.  They don't prove that the special instructions contribute
only 10% to context switching or only half a percent to system performance
related to context switching.  Suppose 50 percent of system time was spent
executing them.  Would you conclude that they contribute 50 percent to
performance?  I would conclude that they subtract 50 percent from
performance.

In other other words, you look at measurements of context switching on 780's
and since the special instructions represent so little processor time, you
conclude they don't contribute much to performance.  I wonder how much more
processor time would be spent achieving the same functionality in different
ways if the instructions didn't exist and didn't execute at their measured
speeds.  I conclude that we don't know enough to assess the contribution of
the special instructions to a 780's ability to keep context switching
overhead down to about 5 percent.  Therefore, we don't know how important
the special instructions are to timesharing, or how clever it is to put them
into your architecture.

mash@mips.UUCP (06/17/86)

In article <217@njitcccc.UUCP> ken@njitcccc.UUCP (Kenneth Ng) writes:
>In article <506@mips.UUCP>, mash@mips.UUCP writes:
>> 	a) LDPCTX and SVPCTX aren't on the top 25 in usage of CPU time,
>> 	even in VMS Kernel mode. The top 25 instructions use 62% of the
>> 	total kernel time, and the smallest shown is REMQUE with 1.31%.
>> 	This was for multi-user workloads.
>Out of curiosity, what were the top 5 or so instructions and the
>percentage of cpu time used?

1) Instruction Distributions - Multi-user workload, VMS Kernel Mode
Rank  Frequency Order      Time Order
      Instr.      %        Instr.      %
1     MOVL    10.21        MOVL     8.19
2     BEQL     5.59        MTPR     5.27
3     RSB      4.33        BBC      3.39
4     BNEQ     3.99        REI      3.29
5     MOVZWL   3.25        BSBW     3.08

2) Instruction Distributions - Multi-user workload, All Modes
Rank  Frequency Order      Time Order
      Instr.      %        Instr.      %
1     MOVL    11.40        MOVC3   13.14
2     BEQL     5.85        CALLS    7.80
3     BNEQ     3.07        MOVL     6.60
4     MOVZBL   3.07        RET      4.07
5     BBS      2.77        MULF3    3.59

I won't attempt to summarize the paper; it's well worth reading.
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{decvax,ucbvax,ihnp4}!decwrl!mips!mash, DDD:  	408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086