chris@mimsy.UUCP (Chris Torek) (08/27/88)
Someone recently mentioned something that reminded me to try open-coding
the VAX `It sings, it dances! It leaps and it prances! It's a dessert
topping *and* a floor wax!' subroutine call instruction (`calls'). The
results of a trivial test, on a VAX-11/785:

        calls, no arguments, null function (1 million iterations):
                13.5 user seconds;
        open-coded, no arguments, null function:
                14.6 user seconds.

Here they are:

        /* calls version */
                .globl  _null
        _null:  .word   0
                ret

                .globl  _main
        _main:  .word   0
                movl    $1000000,r11
        0:      calls   $0,_null
                sobgtr  r11,0b
                ret

        /* open-coded version */
                .globl  _null
        _null:  .word   0
                ret

                .globl  _main
        _main:  .word   0
                movl    $1000000,r11
        0:      pushl   $0              # nargs
                movl    sp,r2           # new ap
                moval   _null,r0        # routine to call
                movzwl  (r0)+,r1        # get register mask
                pushr   r1              # save registers
                movab   1f,-(sp)        # fr_savpc
                movq    ap,-(sp)        # fr_savfp, fr_savap
                bisw2   $0x2000,r1      # fake a `calls'
                ashl    $16,r1,-(sp)    # save mask in 16..27: psw=0
                pushl   $0
                movl    r2,ap           # set ap
                movl    sp,fp           # set fp
                jmp     (r0)            # `call'
        1:      sobgtr  r11,0b
                ret

Some notes about the open coded version:

        It conforms to the `calls' frame format (although using some
        other format could make it much faster: see below).

        It does not save the current psw, losing the condition codes,
        and perhaps more importantly, the trace bit, and some trap bits
        (which as it happens are zero anyway, at least during all
        normal operation).

        It does not align the stack. (The stack should never become
        unaligned anyway.)

        It clobbers registers r0 through r2. (These are always free at
        subroutine call boundaries in the Berkeley VAX compilers.)

I found that `pushl $0' is faster than `clrl -(sp)' (though both are two
bytes long---most peculiar).
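To make the frame layout easier to follow, here is a small Python model of what the open-coded sequence leaves on the stack. It is only a sketch: the function names are invented for this illustration, the offsets are read off the order of the pushes in the listing, and the reading of the unlabelled final `pushl $0' as the condition handler slot and of 0x2000 as the `calls' flag (bit 29 after the shift) is an assumption based on the usual VAX frame layout rather than something the post states.

```python
# Hypothetical model of the frame built by the open-coded call sequence.
# All names here are illustrative; they do not come from the original post.

CALLS_FLAG = 0x2000  # set by `bisw2 $0x2000,r1`; lands in bit 29 after the shift


def frame_mask_word(entry_mask):
    """Longword pushed by `ashl $16,r1,-(sp)`: mask in bits 16..27, psw = 0."""
    return ((entry_mask | CALLS_FLAG) << 16) & 0xFFFFFFFF


def open_coded_frame(entry_mask, return_pc, old_ap, old_fp, reg_values):
    """Return the frame as longwords, lowest address (the new fp) first."""
    # pushr leaves the lowest-numbered saved register at the lowest address.
    saved = [reg_values[r] for r in range(12) if entry_mask & (1 << r)]
    return [
        0,                            # 0(fp): final `pushl $0` (assumed handler slot)
        frame_mask_word(entry_mask),  # 4(fp): mask plus flag, psw = 0
        old_ap,                       # 8(fp): fr_savap
        old_fp,                       # 12(fp): fr_savfp
        return_pc,                    # 16(fp): fr_savpc
        *saved,                       # 20(fp) onward: registers saved by pushr
        0,                            # nargs; the new ap points at this longword
    ]


if __name__ == "__main__":
    # Null function: empty entry mask, so no registers are saved.
    for word in open_coded_frame(0, 0x1234, 0x7FF0, 0x7FE0, {}):
        print(f"{word:08x}")
```

Even for a null function with an empty mask, the sequence writes six longwords per call, which gives some feel for where the time goes.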
If I am allowed to avoid the standard stack frame format, I can cut the
time to 7.1 seconds:

        /* modified open coded call */
                .globl  _null
        _null:  movl    sp,fp           # build frame
                rsb

                .globl  _main
        _main:  .word   0
                movl    $1000000,r11
        0:      movq    ap,-(sp)        # save ap, fp
                moval   4(sp),ap        # new ap
                jsb     _null           # call
                movq    (sp)+,ap        # restore ap, fp
                sobgtr  r11,0b
                ret

This is somewhat less realistic as no registers are saved, and none
restored; if `null' were to use some, it would have to read:

        _null:  pushr   $mask           # save local registers
                movl    sp,fp           # build frame
                /* body */
                movl    fp,sp           # set up for return
                popr    $mask           # restore registers
                rsb

which adds three instructions, two of them relatively slow (pushr and
popr), changing the time (for mask=0) to 8.7 seconds.

Summary: the fancy VAX instruction call is severe overkill.
--
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163)
Domain: chris@mimsy.umd.edu    Path: uunet!mimsy!chris
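For a quick feel for the numbers in this post, the following Python sketch converts the four reported timings into microseconds per loop iteration and into speedups relative to `calls'. The timings are the ones quoted above (VAX-11/785, one million iterations); the dictionary keys and function names are made up for this example, and each figure still includes the loop overhead (`sobgtr'), which the post does not measure separately.

```python
# Arithmetic on the timings reported in the post; helper names are illustrative.

ITERATIONS = 1_000_000

TIMINGS = {  # user seconds for one million iterations
    "calls": 13.5,
    "open-coded calls frame": 14.6,
    "jsb, no register save": 7.1,
    "jsb, pushr/popr with mask=0": 8.7,
}


def usec_per_iteration(seconds, iterations=ITERATIONS):
    """Average cost of one loop iteration, in microseconds."""
    return seconds * 1e6 / iterations


def speedup(baseline_seconds, seconds):
    """How many times faster `seconds` is than `baseline_seconds`."""
    return baseline_seconds / seconds


if __name__ == "__main__":
    base = TIMINGS["calls"]
    for name, secs in TIMINGS.items():
        print(f"{name:30s} {usec_per_iteration(secs):5.1f} us/iter "
              f"{speedup(base, secs):.2f}x vs calls")
```

With a million iterations, seconds of run time and microseconds per iteration are numerically the same, so the empty pushr/popr pair costs roughly 1.6 microseconds per call in this test.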
earl@mips.COM (Earl Killian) (08/28/88)
In article <13254@mimsy.UUCP>, chris@mimsy.UUCP (Chris Torek) writes:
[much analysis of VAX CALLS instruction -- a classic CISC mistake]
...
> If I am allowed to avoid the standard stack frame format, I can cut the
> time to 7.1 seconds:
>
>       /* modified open coded call */
>               .globl  _null
>       _null:  movl    sp,fp           # build frame
>               rsb
>
>               .globl  _main
>       _main:  .word   0
>               movl    $1000000,r11
>       0:      movq    ap,-(sp)        # save ap, fp
>               moval   4(sp),ap        # new ap
>               jsb     _null           # call
>               movq    (sp)+,ap        # restore ap, fp
>               sobgtr  r11,0b
>               ret
>
> This is somewhat less realistic as no registers are saved, and none
> restored; if `null' were to use some, it would have to read:
>
>       _null:  pushr   $mask           # save local registers
>               movl    sp,fp           # build frame
>               /* body */
>               movl    fp,sp           # set up for return
>               popr    $mask           # restore registers
>               rsb
>
> which adds three instructions, two of them relatively slow (pushr and
> popr), changing the time (for mask=0) to 8.7 seconds.

You can do better than this. Here's the output from a compiler that's
been around for 4 years or so, using its -fast_call option:

/* pastel 2.3 compiled test.p on 27 August 88 22:16 PST by earl
   4 statements, 8 instructions in 29 bytes, 0 static bytes */
        .globl  _a
        .globl  _pascal_runtime__main_end
        .globl  _pascal_runtime__main_start
_a:
# 1 procedure a();
        rsb
# 5 program test;
        .globl  _main
_main:
        jsb     _pascal_runtime__main_start     /* ?, main_start */
# 9 for i := 1000000 downto 1 do begin
        movl    $1000000,r2                     /* ?, i */
L107:
# 10 a();
        bsbb    _a
        decl    r2                              /* i */
        bneq    L107
        jsb     _pascal_runtime__main_end       /* ?, main_end */
        rsb

On a VAX 780 this is 3.3 seconds, whereas yours is 9.3 seconds, or 2.8x
slower. You mistakenly assumed you need a frame pointer, and used an
argument pointer, both of which the compiler-generated code avoided.

No registers saved is actually quite realistic, when the call protocol
uses a mix of both callee and caller-saved registers.
--
UUCP: {ames,decwrl,prls,pyramid}!mips!earl
USPS: MIPS Computer Systems, 930 Arques Ave, Sunnyvale CA, 94086
chris@mimsy.UUCP (Chris Torek) (08/28/88)
In article <2912@wright.mips.COM> earl@mips.COM (Earl Killian) writes:
>[2.8x faster version]. You mistakenly assumed you need a frame pointer,
>and used an argument pointer ....

Well, actually, I was thinking in terms of what I could do with PCC
and the current 4.3BSD-tahoe, without doing serious surgery. I need
ap to make sigtramp work as is (otherwise I could just use positive
offsets from fp, a la the Tahoe); and I need fp to keep PCC happy,
and to make alloca() work (you do not expect me to force everyone
to give up GNU Emacs! :-) ).

>No registers saved is actually quite realistic, when the call protocol
>uses a mix of both callee and caller-saved registers.

Not possible `without doing serious surgery'.
--
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163)
Domain: chris@mimsy.umd.edu    Path: uunet!mimsy!chris
casey@admin.cognet.ucla.edu (Casey Leedom) (08/29/88)
In article <13266@mimsy.UUCP> chris@mimsy.UUCP (Chris Torek) writes:
| In article <2912@wright.mips.COM> earl@mips.COM (Earl Killian) writes:
| > [2.8x faster version]. You mistakenly assumed you need a frame pointer,
| > and used an argument pointer ....
|
| Well, actually, I was thinking in terms of what I could do with PCC
| and the current 4.3BSD-tahoe, without doing serious surgery. I need
| ap to make sigtramp work as is (otherwise I could just use positive
| offsets from fp, a la the Tahoe); and I need fp to keep PCC happy,
| and to make alloca() work.

Actually Chris, the only thing you have to give up if you stop using AP
is nargs() (which isn't even documented and only the fortran library
uses it anymore). We have both sigtramp and alloca() in 2.10BSD all
without an AP.

Casey
earl@mips.COM (Earl Killian) (08/29/88)
In article <13266@mimsy.UUCP>, chris@mimsy.UUCP (Chris Torek) writes:
> Well, actually, I was thinking in terms of what I could do with PCC
> and the current 4.3BSD-tahoe, without doing serious surgery. I need
> ap to make sigtramp work as is (otherwise I could just use positive
> offsets from fp, a la the Tahoe); and I need fp to keep PCC happy,
> and to make alloca() work (you do not expect me to force everyone
> to give up GNU Emacs! :-) ).

Yeah, I understand, but do you realize how unfortunate this is? Every
VAX Unix on the planet could be made an average of 25% faster if someone
would just use the Pastel compiler's procedure call protocol. Perhaps
this is hard with PCC, what about GCC?

As for alloca, you don't need to give it up. Pastel generates a frame
pointer iff one is really needed. For C, the equivalent would be for
your compiler to detect calls to alloca and use a frame pointer in that
one procedure.

> >No registers saved is actually quite realistic, when the call protocol
> >uses a mix of both callee and caller-saved registers.
>
> Not possible `without doing serious surgery'.

Actually 4.3bsd VAX Unix has 8 callee-saved registers (r6-r11, fp, ap)
and 6 caller-saved registers (r0-r5), so it already works. People just
don't think of the registers this way; they think of them as variables
and temps. But a real register allocator (as in Pastel or gcc) wouldn't
need to think of them that way.
--
UUCP: {ames,decwrl,prls,pyramid}!mips!earl
USPS: MIPS Computer Systems, 930 Arques Ave, Sunnyvale CA, 94086
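The register split described in this post can be written down as two bit sets, which makes the "no registers saved" point concrete: a procedure that confines itself to r0-r5 has nothing to save. The Python below is a sketch only. The helper names are hypothetical, the register numbers for ap and fp (12 and 13) are the standard VAX numbering rather than something stated in the thread, and the masks are plain bit sets for illustration, not `calls' entry masks.

```python
# Illustrative partition of the VAX registers as described in the post.
# Helper names are hypothetical; ap = r12 and fp = r13 is the usual VAX numbering.

AP, FP = 12, 13

CALLER_SAVED = frozenset(range(0, 6))               # r0-r5: free at call boundaries
CALLEE_SAVED = frozenset(range(6, 12)) | {FP, AP}   # r6-r11, fp, ap


def reg_mask(regs):
    """Bit set with bit n set for each register n in regs."""
    mask = 0
    for r in regs:
        mask |= 1 << r
    return mask


def callee_must_save(regs_used):
    """Registers a procedure has to preserve, given the ones it modifies."""
    return sorted(set(regs_used) & CALLEE_SAVED)


if __name__ == "__main__":
    print(f"caller-saved mask: {reg_mask(CALLER_SAVED):#06x}")
    print(f"callee-saved mask: {reg_mask(CALLEE_SAVED):#06x}")
    print("leaf using r0-r2 saves:", callee_must_save([0, 1, 2]))
    print("routine using r2, r6, r11 saves:", callee_must_save([2, 6, 11]))
```

A register allocator that prefers the caller-saved set for short-lived values would leave many small procedures with an empty save list, which is the situation the Pastel output earlier in the thread shows.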