[comp.arch] bizarre instructions really subroutine overhead

mash@mips.com (John Mashey) (03/05/91)

In article <3062@charon.cwi.nl> dik@cwi.nl (Dik T. Winter) writes:
>In article <658@spim.mips.COM> mash@mips.com (John Mashey) writes:
...
> > 	I cannot think of any popular architecture designed in the
> > 	last 5-8 years that hasn't paid at least some attention to
> > 	making fast function calls, one way or another.

>I have extensive figures about subroutine call overhead (in this case about

>Below follows a table for a number of machines that indicates the speedup
>when optimization level 2 is used.  Also in the last two columns it
>indicates whether the base multiplication and division routines are small,
>medium size or large.  I use small when a single instruction could be used,
>medium when the hardware provides step instructions (multiply step or divide
>step) and large when bit fiddling should be done.  Medium is also used if FP
>operations are used.
 CPU ISA ARCHITECTURE DESIGNED IN WHAT YEAR?  WHOLE BUNCH OF GUESSES, within
1-2 YEARS, MAYBE.

 ?	Gould NP1				32.3%	small	small
 ?	Gould PowerNode				32.1%	small	small
 1979	Alliant FX/4 (68020 look-alike)		27.5%	small	small
 1979	Sony (68030)				27.1%	small	small
 1975	Vax 780					26.0%	small	small
 1979	Encore (32532)				24.9%	small	small
 1979	Sequent Balance (32032)			21.9%	small	small
 1979	Sun3 (68020)				20.4%	small	small
 1981	IBM RT/PC				18.6%	medium	medium
 1985	SGI 4D/25 (MIPS) FP			14.5%	medium	medium
 1986	Alliant FX2800 (i860)			13.8%	medium	medium
 1985	SGI 4D/25 (MIPS) INT			10.0%	small	large==>WHY?
<1984	Harris HCX7 (Tahoe)			 3.8%	small	small
 1985	Sun4 (SPARC)				 3.1%	medium	large
Just out of curiosity, say more about the nature of the routines for divide,
given than MIPS does have an integer divide, and SPARC doesn't, but both are
called large.  Also, just out of curiosity, could we see the two columns
of numbers from which the % was computed, i.e., the measured times?

>We see indeed that the subroutine call overhead decreases when subroutines
>become larger.  What is extremely surprising is the position of the Tahoe
>contradictionary to John Mashey's remark that modern machines have lower
>subroutine call overhead.  That depends on your definition of modern.

Please re-read what I posted.  I asserted that one could NOT assume
function calls were expensive these days, that on current RISCs, they
were cheap, and popular architectures designed in the last 5-8 years
had paid attention to function calls.  I did NOT assert that NO earlier
machines had fast function calls, nor that modern machines always have
lower function call overhead....  (I would (almost) never assert such a
thing, since all generalizations are false :-)
On the other hand, given the data above,
I think I would claim that MOST modern machines have less subroutine
overhead than MOST older ones....

>430 for the divide).  In all, it would be good to look at how the Tahoe
>does subroutine calls.
Can you post something on that? also, How about cycle counts for
the mul/div code used on the Tahoe?
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	 mash@mips.com OR {ames,decwrl,prls,pyramid}!mips!mash 
DDD:  	408-524-7015, 524-8253 or (main number) 408-720-1700
USPS: 	MIPS Computer Systems MS 1/05, 930 E. Arques, Sunnyvale, CA 94086

dik@cwi.nl (Dik T. Winter) (03/08/91)

In article <712@spim.mips.COM> mash@mips.com (John Mashey) writes:
(in response to me, adjusting my table)
 >  CPU ISA ARCHITECTURE DESIGNED IN WHAT YEAR?  WHOLE BUNCH OF GUESSES, within
 > 1-2 YEARS, MAYBE.
 > 
 >  ?	Gould NP1				32.3%	small	small
 >  ?	Gould PowerNode				32.1%	small	small
I would say early 1980's.  However, the ISA's are similar but not the same.
NP1 leaves quite a few things from the PN.  The NP1 is from 1987.  (PN is SEL
compatible, NP1 is not.)
...
 >  1975	Vax 780					26.0%	small	small
...
 > <1984	Harris HCX7 (Tahoe)			 3.8%	small	small
Depends. If you put the Gould NP1 and PN ISA in the same year, the Tahoe ISA
must be put in the same year as the VAX.  (The similarities and differences
are of the same scale.)  Anyhow, the first printing of the Harris manual (for
which I have the second revision) dates from 1985.  So in that case 1983/1984
is about right.

 > Just out of curiosity, say more about the nature of the routines for divide,
 > given than MIPS does have an integer divide, and SPARC doesn't, but both are
 > called large.
Multiply is 30*30->60 bits multiply; divide is 60/30->30+30 divide.  In that
case having a 32 over 32 bits divide does not help very much.  It is possible
to use it, but than you need to do double word shifts (to normalize numbers
to get a reasonable approximation of the quotient).  I felt lazy (and seeing
that a double word shift is also not easy on a MIPS...) so I did everything
by tentative subtract cq. add etc. using two threads, one where subtracts
are done, one where adds are done, optionally jumping from one thread to the
other.  Actually similar code is used on the SPARC.  But because the SPARC
has status bits and add with carry it could be coded on the SPARC with
four instructions per bit while I needed on the MIPS 8 instructions per bit
(and all duplicated for both threads).

 >                Also, just out of curiosity, could we see the two columns
 > of numbers from which the % was computed, i.e., the measured times?
I did not post them on purpose.  They are not relevant to this discussion
as there may be multiple reasons why one machine is faster on this program
than another.  I will give some figures at the end.
 > 
 > Please re-read what I posted.
Sorry that I misread your posting.

 > On the other hand, given the data above,
 > I think I would claim that MOST modern machines have less subroutine
 > overhead than MOST older ones....
I agree about that.
 > 
 > Can you post something on that? also, How about cycle counts for
 > the mul/div code used on the Tahoe?
My manual does not give cycle counts.  However, to show the similarity
between VAX and Tahoe, below the code for the division operation on those
machines:

Tahoe:
	.globl	_li_xdiv
_li_xdiv:
	.word	0x0000			# mask no registers
	emul	$base,4(fp),8(fp),r0	# a*base + b in r0
 #	tahoe bug (or undocumented feature); remainder must be quadword aligned!
	ediv	12(fp),r0,*20(fp),r0	# r0/c, rem in r0 quo in e
	movl	r0,*16(fp)		# rem in d
	ret

VAX:
	.globl	_li_xdiv
_li_xdiv:
	.word	0x0000			# mask no registers
	emul	$base,4(ap),8(ap),r0 	# a*base + b in r0
	ediv	12(ap),r0,*20(ap),*16(ap) # r0/c, rem in d quo in e
	ret


Now some figures from which the percentages where derived.  Before I create
more misunderstanding (but without doubt I will create some): the before
figures come from the situation where a single multiply and a single divide
routine is used; the after figures come from the situation where a loop over
multiply and divide sequences is used.  In the latter case many routine calls
are replaced by a single one, reducing the call overhead.  NOTE: these
figures should *NOT* be used as a basis to compare machines because they
rely very much on my programming skills.  However, the speedup is something
you can find from inlining, because the loop code was essentially taken over
from compiler output for similar loops.
All figures are times for a complete program in milliseconds (a bit
approximate on some machines, but reproducable within reasonable bounds).
I give figures for the more saliant systems:  VAX, Tahoe, Gould NP1, Sun
and SGI.  (To tease I also include the RT/PC timings, and no, I am not a
IBM fan.)  The figures from the column 'plain' are when plain HLL code is
used (implying fewer number of bits in a word).
	before	 after	 plain
VAX:	 93550	 58983	211600
Tahoe:	 25117	 22583	 60517
NP1:	 12440	  6900	 39340
RTPC:	 17317	 14100	 47067
SS1:	 16320	 15780	 57060
SGI:	  7740	  6270	 11430
SGI:	  7810	  5580	   *	(using FP; 26 bits/word, no plain timings)

I believe the SGI and SS1 are both 20Mhz machines, I am not sure about the
others.  These figures are from Fortran source, I found very similar figures
for the C source.  What do we find here?  Going from the 'plain' column to
the 'before' column we see the effect in the program of using more bits in
an integer (as Peter Montgomery and Bob Silverman want).  But going from
'before' to 'after' we see the effect of reducing the number of subroutine
calls.  As I said before, the Tahoe does extremely well while the NP1 is
extremely bad for subroutine timings.  (I have similar timings for 14 other
machines, if you want them send me mail.  But not all timings I have are as
complete as these ones.)  (So John, even if the divide on the MIPS requires
several hundreds of instructions, it still comes out at the top!  But I
think that also the quality of compilation of the remainder of the program
plays a role.  It is only some 6000 lines of Fortran source code.)
--
dik t. winter, cwi, amsterdam, nederland
dik@cwi.nl