pardo@cs.washington.edu (David Keppel) (12/06/89)
[This has nothing to do with the m88k, as such. It is about an
implementation of save-only-the-needed-registers that could be done on
the 88k. If you're not interested, hit `n' now.]
There's been an ongoing discussion about register saves. Somebody
asked me to justify my assertion ``save-by-mask probably isn't worth
it on a RISC''. I posted an analysis a while. Paul Wilson has
suggested an alternative implementation that might be faster, but
still probably doesn't make it worth it.
The basic idea is to use the register save mask as an index in to a
table. The table contains the addresses of routines that save
(restore) registers. There is one routine for each of the 2**n
possible mask values.
This method looks like its average case is marginally better than the
worst case for split caller saves/callee saves. Here's my writeup.
>[Use the mask as an index in to a jump table.]
# Call function `foo'.
movi $mask, r16
call foo
# Foo:
foo:
andl $mask, r16, r16
load jump_table(r16)
movi return, r17
jump (r16)
return:
...
load return_table(r16)
jump (r16)
# NOTREACHED
# Transfer tables
jump_table:
.address save_none
.address save_0
.address save_1
.address save_01
.address save_2
...
#
# A bunch of save routines, one for each of the 2**n possible
# mask combinations. For 2**16, that might take about 1/2Mb.
# Assumes that the mask is storead in r16 and the function
# address is stored in r17.
# The registers that can be saved-by-mask are in r0-r15.
#
save_none:
jbr (r17)
save_0: # save register 0
subl $4, sp, sp
store r0, 4(sp)
jbr (r17)
save_1: # save register 1
subl $4, sp, sp
store r0, 4(sp)
jbr (r17)
...
save_e: # save register 15
subl $4, sp, sp
store r0, 4(sp)
jbr (r17)
save_01:
# save registers 0 and 1
subl $8, sp, sp
store r0, 4(sp)
store r1, 8(sp)
jbr (r17)
save_02:
# save registers 0 and 2
subl $8, sp, sp
store r0, 4(sp)
store r2, 8(sp)
jbr (r17)
...
#
# Note: some compaction is possible at some expense:
#
# ...
# save_012:
# adjust sp
# save r2
# save_01:
# adjust sp
# save r1
# save_0:
# adjust sp
# save r0
# save_none:
# jbr ...
#
# A bunch of restore routines, one for each 2**n possible
# mask combinations. (~1/2Mb.)
# Assumes that the mask is stored in r16 and the return
# address is in r17.
#
restore_none:
ret
restore_0:
load 4(sp), r0
addl $4, sp, sp
ret
So I count
1 mask instruction at caller site.
4 mask/indirect instructions at callee site.
1-18 (average 4?) in jump table
1 indirect instruction at callee site (plus possible reload
of r16)
1-18 (average same as jump table) at return table.
Adds up to 14 instructions.
If you belive Robert Firth's claim of average 2 registers
saved with caller saves, you're *way* at a loss.
The worst case for caller/callee saves (according to my
previous analysis) is 8 saves, 8 restores. If each takes 1
cycle, then you've got 16 cycles. If you believe my claim
(see also last posting) that you might normally need to save
only, say, r1 and r3, then you're kicking 14 cycles with a
bunch of possible branch penalties vs. 16 cycles of
straight-line code.
There, wasn't that satisfying?
By the time this has spread across the USENET backbone, I will have
saved about 1 billion processor cycles :-)
;-D on ( Counting rings on a forest ) Pardo
--
pardo@cs.washington.edu
{rutgers,cornell,ucsd,ubc-cs,tektronix}!uw-beaver!june!pardo