[comp.arch] CISC instructions

chris@mimsy.UUCP (Chris Torek) (08/27/88)

Someone recently mentioned something that reminded me to try
open-coding the VAX `It sings, it dances!  It leaps and it prances!
It's a dessert topping *and* a floor wax!' subroutine call instruction
(`calls').  The results of a trivial test on a VAX-11/785, calling a
null function with no arguments 1 million times: `calls' takes 13.5
user seconds; the open-coded version takes 14.6 user seconds.

Here they are:

	/* calls version */
		.globl	_null
	_null:	.word	0
		ret

		.globl	_main
	_main:	.word	0
		movl	$1000000,r11
	0:	calls	$0,_null
		sobgtr	r11,0b
		ret

	/* open-coded version */
		.globl	_null
	_null:	.word	0
		ret

		.globl	_main
	_main:	.word	0
		movl	$1000000,r11

	0:	pushl	$0			# nargs
		movl	sp,r2			# new ap
		moval	_null,r0		# routine to call
		movzwl	(r0)+,r1		# get register mask
		pushr	r1			# save registers
		movab	1f,-(sp)		# fr_savpc
		movq	ap,-(sp)		# fr_savfp, fr_savap
		bisw2	$0x2000,r1		# fake a `calls': flag lands in bit 29
		ashl	$16,r1,-(sp)		# save mask in 16..27: psw=0
		pushl	$0			# fr_handler (none)
		movl	r2,ap			# set ap
		movl	sp,fp			# set fp
		jmp	(r0)			# `call'
	1:
		sobgtr	r11,0b
		ret

Some notes about the open coded version:

	It conforms to the `calls' frame format (although using some
	other format could make it much faster: see below).

	It does not save the current psw, losing the condition codes,
	and perhaps more importantly, the trace bit, and some trap bits
	(which as it happens are zero anyway, at least during all
	normal operation).

	It does not align the stack.  (The stack should never become
	unaligned anyway.)

	It clobbers registers r0 through r2.  (These are always free at
	subroutine call boundaries in the Berkeley VAX compilers.)

	I found that `pushl $0' is faster than `clrl -(sp)' (though
	both are two bytes long, which is most peculiar).
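
To make the frame layout concrete, here is a minimal C sketch of how the
longword stored just above the handler slot is put together from the
routine's entry mask.  The helper name and the restriction of the mask
to r0..r11 are assumptions made for illustration; they are not part of
the assembly above.

```c
#include <assert.h>
#include <stdint.h>

/* Bit 13 of the 16-bit value becomes bit 29 after the shift; this is
 * the flag that tells `ret' the frame came from `calls' (not `callg'). */
#define CALLS_FLAG 0x2000u

/* Hypothetical helper: build the mask/PSW longword of a `calls'-style
 * frame, with a zero PSW in the low 16 bits. */
uint32_t frame_mask_longword(uint16_t entry_mask)
{
	/* Assumption: only r0..r11 (bits 0..11) may appear in the mask. */
	assert((entry_mask & 0xF000u) == 0);
	return ((uint32_t)entry_mask | CALLS_FLAG) << 16;
}
```

For the null function, whose entry mask is 0, this yields 0x20000000,
the value that `ashl $16,r1,-(sp)' pushes on each trip through the loop.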

If I am allowed to avoid the standard stack frame format, I can cut the
time to 7.1 seconds:

	/* modified open coded call */
		.globl	_null
	_null:	movl	sp,fp		# build frame
		rsb

		.globl	_main
	_main:	.word	0
		movl	$1000000,r11
	0:	movq	ap,-(sp)	# save ap, fp
		moval	4(sp),ap	# new ap
		jsb	_null		# call
		movq	(sp)+,ap	# restore ap, fp
		sobgtr	r11,0b
		ret

This is somewhat less realistic as no registers are saved, and none
restored; if `null' were to use some, it would have to read:

	_null:	pushr	$mask		# save local registers
		movl	sp,fp		# build frame
		/* body */
		movl	fp,sp		# set up for return
		popr	$mask		# restore registers
		rsb

which adds three instructions, two of them relatively slow (pushr and
popr), changing the time (for mask=0) to 8.7 seconds.
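
The timings are easier to compare as a cost per iteration and as
ratios.  The following is only a sketch of that arithmetic; the function
names are made up for illustration, and the per-iteration figure
includes the `sobgtr' loop overhead as well as the call itself.

```c
#include <assert.h>

/* Hypothetical helper: average cost of one loop iteration, in
 * microseconds, given total user time and iteration count. */
double usec_per_iteration(double user_seconds, long iterations)
{
	assert(iterations > 0);
	return user_seconds * 1e6 / (double)iterations;
}

/* Hypothetical helper: how many times faster `fast' is than `slow'. */
double speedup(double slow_seconds, double fast_seconds)
{
	assert(fast_seconds > 0.0);
	return slow_seconds / fast_seconds;
}
```

With 1 million iterations, 13.5 user seconds works out to 13.5
microseconds per iteration for `calls'.  Against that baseline, the
7.1-second version is roughly 1.9 times faster, and the 8.7-second
version with pushr and popr is roughly 1.55 times faster.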

Summary: the fancy VAX call instruction is severe overkill.
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163)
Domain:	chris@mimsy.umd.edu	Path:	uunet!mimsy!chris

earl@mips.COM (Earl Killian) (08/28/88)

In article <13254@mimsy.UUCP>, chris@mimsy.UUCP (Chris Torek) writes:
[much analysis of VAX CALLS instruction -- a classic CISC mistake]
...
> If I am allowed to avoid the standard stack frame format, I can cut the
> time to 7.1 seconds:
> 
> 	/* modified open coded call */
> 		.globl	_null
> 	_null:	movl	sp,fp		# build frame
> 		rsb
> 
> 		.globl	_main
> 	_main:	.word	0
> 		movl	$1000000,r11
> 	0:	movq	ap,-(sp)	# save ap, fp
> 		moval	4(sp),ap	# new ap
> 		jsb	_null		# call
> 		movq	(sp)+,ap	# restore ap, fp
> 		sobgtr	r11,0b
> 		ret
> 
> This is somewhat less realistic as no registers are saved, and none
> restored; if `null' were to use some, it would have to read:
> 
> 	_null:	pushr	$mask		# save local registers
> 		movl	sp,fp		# build frame
> 		/* body */
> 		movl	fp,sp		# set up for return
> 		popr	$mask		# restore registers
> 		rsb
> 
> which adds three instructions, two of them relatively slow (pushr and
> popr), changing the time (for mask=0) to 8.7 seconds.

You can do better than this.  Here's the output from a compiler that's
been around for 4 years or so, using its -fast_call option:

/* pastel 2.3 compiled test.p on 27 August 88 22:16 PST by earl
   4 statements, 8 instructions in 29 bytes, 0 static bytes */
.globl _a
.globl _pascal_runtime__main_end
.globl _pascal_runtime__main_start

_a:
 #   1	procedure a();
	rsb

 #   5	program test;
.globl _main
_main:
	jsb _pascal_runtime__main_start/* ?, main_start */
 #   9	    for i := 1000000 downto 1 do begin
	movl $1000000,r2		/* ?, i */
L107:
 #  10	      a();
	bsbb _a
	decl r2				/* i */
	bneq L107
	jsb _pascal_runtime__main_end/* ?, main_end */
	rsb

On a VAX 780 this is 3.3 seconds, whereas yours is 9.3 seconds, or
2.8x slower.  You mistakenly assumed you need a frame pointer, 
and used an argument pointer, both of which the compiler-generated
code avoided.  No registers saved is actually quite realistic, when
the call protocol uses a mix of both callee and caller-saved
registers.
-- 
UUCP: {ames,decwrl,prls,pyramid}!mips!earl
USPS: MIPS Computer Systems, 930 Arques Ave, Sunnyvale CA, 94086

chris@mimsy.UUCP (Chris Torek) (08/28/88)

In article <2912@wright.mips.COM> earl@mips.COM (Earl Killian) writes:
>[2.8x faster version].  You mistakenly assumed you need a frame pointer, 
>and used an argument pointer ....

Well, actually, I was thinking in terms of what I could do with PCC
and the current 4.3BSD-tahoe, without doing serious surgery.  I need
ap to make sigtramp work as is (otherwise I could just use positive
offsets from fp, a la the Tahoe); and I need fp to keep PCC happy,
and to make alloca() work (you do not expect me to force everyone
to give up GNU Emacs! :-) ).

>No registers saved is actually quite realistic, when the call protocol
>uses a mix of both callee and caller-saved registers.

Not possible `without doing serious surgery'.

casey@admin.cognet.ucla.edu (Casey Leedom) (08/29/88)

In article <13266@mimsy.UUCP> chris@mimsy.UUCP (Chris Torek) writes:
| In article <2912@wright.mips.COM> earl@mips.COM (Earl Killian) writes:
| > [2.8x faster version].  You mistakenly assumed you need a frame pointer, 
| > and used an argument pointer ....
| 
| Well, actually, I was thinking in terms of what I could do with PCC
| and the current 4.3BSD-tahoe, without doing serious surgery.  I need
| ap to make sigtramp work as is (otherwise I could just use positive
| offsets from fp, a la the Tahoe); and I need fp to keep PCC happy,
| and to make alloca() work.

  Actually, Chris, the only thing you have to give up if you stop using AP
is nargs() (which isn't even documented, and only the Fortran library uses
it anymore).  We have both sigtramp and alloca() in 2.10BSD, all without an
AP.

Casey

earl@mips.COM (Earl Killian) (08/29/88)

In article <13266@mimsy.UUCP>, chris@mimsy.UUCP (Chris Torek) writes:
> Well, actually, I was thinking in terms of what I could do with PCC
> and the current 4.3BSD-tahoe, without doing serious surgery.  I need
> ap to make sigtramp work as is (otherwise I could just use positive
> offsets from fp, a la the Tahoe); and I need fp to keep PCC happy,
> and to make alloca() work (you do not expect me to force everyone
> to give up GNU Emacs! :-) ).

Yeah, I understand, but do you realize how unfortunate this is?  Every
VAX Unix on the planet could be made an average of 25% faster if
someone would just use the Pastel compiler's procedure call protocol.
Perhaps this is hard with PCC; what about GCC?

As for alloca, you don't need to give it up.  Pastel generates a frame
pointer iff one is really needed.  For C, the equivalent would be for
your compiler to detect calls to alloca and use a frame pointer in
that one procedure.
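
As a rough illustration of why alloca is the case that needs a frame
pointer, consider the C sketch below.  The function is hypothetical, and
it assumes a system that provides <alloca.h>.  Because the buffer size
is known only at run time, sp moves by a variable amount inside the
function, so a compiler would typically keep a frame pointer here even
if it omits one everywhere else.

```c
#include <alloca.h>	/* assumption: the system provides this header */
#include <assert.h>
#include <stddef.h>

/* Hypothetical example: sum twice each element, using a scratch
 * buffer carved out of the stack frame at run time. */
long sum_doubled(const int *v, size_t n)
{
	int *scratch;
	long total = 0;
	size_t i;

	assert(v != NULL || n == 0);
	/* Ask for at least one element so the request is never zero. */
	scratch = alloca((n ? n : 1) * sizeof *scratch);

	for (i = 0; i < n; i++)
		scratch[i] = 2 * v[i];
	for (i = 0; i < n; i++)
		total += scratch[i];
	return total;	/* scratch goes away when the function returns */
}
```

A procedure with only fixed-size locals has no such problem, which is
why the frame pointer can be limited to procedures like this one.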

> >No registers saved is actually quite realistic, when the call protocol
> >uses a mix of both callee and caller-saved registers.
> 
> Not possible `without doing serious surgery'.

Actually, 4.3BSD VAX Unix has 8 callee-saved registers (r6-r11, fp, ap)
and 6 caller-saved registers (r0-r5), so it already works.  People
just don't think of the registers this way; they think of them as
variables and temps.  But a real register allocator (as in Pastel or
gcc) wouldn't need to think of them that way.
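
One way to picture this split is as two bit masks over r0..r11, in the
same bit-per-register form as a VAX entry mask.  This is only a sketch:
the macro and function names are invented for illustration, and ap and
fp are left out because they have no bits in the entry mask.

```c
#include <assert.h>
#include <stdint.h>

#define CALLER_SAVED 0x003Fu	/* r0..r5: free at call boundaries */
#define CALLEE_SAVED 0x0FC0u	/* r6..r11: preserved across calls */

/* Hypothetical helper: given the set of registers a procedure writes
 * (bit n = rn), return the ones it must save and restore itself. */
uint16_t registers_to_save(uint16_t used)
{
	assert((used & 0xF000u) == 0);	/* only r0..r11 are modelled */
	return (uint16_t)(used & CALLEE_SAVED);
}
```

A procedure that fits entirely in r0..r5 gets an empty save set, which
is the sense in which "no registers saved" is realistic; one that also
writes r6 and r11 must save exactly those two.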