mash@mips.com (John Mashey) (03/05/91)
In article <3062@charon.cwi.nl> dik@cwi.nl (Dik T. Winter) writes: >In article <658@spim.mips.COM> mash@mips.com (John Mashey) writes: ... > > I cannot think of any popular architecture designed in the > > last 5-8 years that hasn't paid at least some attention to > > making fast function calls, one way or another. >I have extensive figures about subroutine call overhead (in this case about >Below follows a table for a number of machines that indicates the speedup >when optimization level 2 is used. Also in the last two columns it >indicates whether the base multiplication and division routines are small, >medium size or large. I use small when a single instruction could be used, >medium when the hardware provides step instructions (multiply step or divide >step) and large when bit fiddling should be done. Medium is also used if FP >operations are used. CPU ISA ARCHITECTURE DESIGNED IN WHAT YEAR? WHOLE BUNCH OF GUESSES, within 1-2 YEARS, MAYBE. ? Gould NP1 32.3% small small ? Gould PowerNode 32.1% small small 1979 Alliant FX/4 (68020 look-alike) 27.5% small small 1979 Sony (68030) 27.1% small small 1975 Vax 780 26.0% small small 1979 Encore (32532) 24.9% small small 1979 Sequent Balance (32032) 21.9% small small 1979 Sun3 (68020) 20.4% small small 1981 IBM RT/PC 18.6% medium medium 1985 SGI 4D/25 (MIPS) FP 14.5% medium medium 1986 Alliant FX2800 (i860) 13.8% medium medium 1985 SGI 4D/25 (MIPS) INT 10.0% small large==>WHY? <1984 Harris HCX7 (Tahoe) 3.8% small small 1985 Sun4 (SPARC) 3.1% medium large Just out of curiosity, say more about the nature of the routines for divide, given than MIPS does have an integer divide, and SPARC doesn't, but both are called large. Also, just out of curiosity, could we see the two columns of numbers from which the % was computed, i.e., the measured times? >We see indeed that the subroutine call overhead decreases when subroutines >become larger. What is extremely surprising is the position of the Tahoe >contradictionary to John Mashey's remark that modern machines have lower >subroutine call overhead. That depends on your definition of modern. Please re-read what I posted. I asserted that one could NOT assume function calls were expensive these days, that on current RISCs, they were cheap, and popular architectures designed in the last 5-8 years had paid attention to function calls. I did NOT assert that NO earlier machines had fast function calls, nor that modern machines always have lower function call overhead.... (I would (almost) never assert such a thing, since all generalizations are false :-) On the other hand, given the data above, I think I would claim that MOST modern machines have less subroutine overhead than MOST older ones.... >430 for the divide). In all, it would be good to look at how the Tahoe >does subroutine calls. Can you post something on that? also, How about cycle counts for the mul/div code used on the Tahoe? -- -john mashey DISCLAIMER: <generic disclaimer, I speak for me only, etc> UUCP: mash@mips.com OR {ames,decwrl,prls,pyramid}!mips!mash DDD: 408-524-7015, 524-8253 or (main number) 408-720-1700 USPS: MIPS Computer Systems MS 1/05, 930 E. Arques, Sunnyvale, CA 94086
dik@cwi.nl (Dik T. Winter) (03/08/91)
In article <712@spim.mips.COM> mash@mips.com (John Mashey) writes: (in response to me, adjusting my table) > CPU ISA ARCHITECTURE DESIGNED IN WHAT YEAR? WHOLE BUNCH OF GUESSES, within > 1-2 YEARS, MAYBE. > > ? Gould NP1 32.3% small small > ? Gould PowerNode 32.1% small small I would say early 1980's. However, the ISA's are similar but not the same. NP1 leaves quite a few things from the PN. The NP1 is from 1987. (PN is SEL compatible, NP1 is not.) ... > 1975 Vax 780 26.0% small small ... > <1984 Harris HCX7 (Tahoe) 3.8% small small Depends. If you put the Gould NP1 and PN ISA in the same year, the Tahoe ISA must be put in the same year as the VAX. (The similarities and differences are of the same scale.) Anyhow, the first printing of the Harris manual (for which I have the second revision) dates from 1985. So in that case 1983/1984 is about right. > Just out of curiosity, say more about the nature of the routines for divide, > given than MIPS does have an integer divide, and SPARC doesn't, but both are > called large. Multiply is 30*30->60 bits multiply; divide is 60/30->30+30 divide. In that case having a 32 over 32 bits divide does not help very much. It is possible to use it, but than you need to do double word shifts (to normalize numbers to get a reasonable approximation of the quotient). I felt lazy (and seeing that a double word shift is also not easy on a MIPS...) so I did everything by tentative subtract cq. add etc. using two threads, one where subtracts are done, one where adds are done, optionally jumping from one thread to the other. Actually similar code is used on the SPARC. But because the SPARC has status bits and add with carry it could be coded on the SPARC with four instructions per bit while I needed on the MIPS 8 instructions per bit (and all duplicated for both threads). > Also, just out of curiosity, could we see the two columns > of numbers from which the % was computed, i.e., the measured times? I did not post them on purpose. They are not relevant to this discussion as there may be multiple reasons why one machine is faster on this program than another. I will give some figures at the end. > > Please re-read what I posted. Sorry that I misread your posting. > On the other hand, given the data above, > I think I would claim that MOST modern machines have less subroutine > overhead than MOST older ones.... I agree about that. > > Can you post something on that? also, How about cycle counts for > the mul/div code used on the Tahoe? My manual does not give cycle counts. However, to show the similarity between VAX and Tahoe, below the code for the division operation on those machines: Tahoe: .globl _li_xdiv _li_xdiv: .word 0x0000 # mask no registers emul $base,4(fp),8(fp),r0 # a*base + b in r0 # tahoe bug (or undocumented feature); remainder must be quadword aligned! ediv 12(fp),r0,*20(fp),r0 # r0/c, rem in r0 quo in e movl r0,*16(fp) # rem in d ret VAX: .globl _li_xdiv _li_xdiv: .word 0x0000 # mask no registers emul $base,4(ap),8(ap),r0 # a*base + b in r0 ediv 12(ap),r0,*20(ap),*16(ap) # r0/c, rem in d quo in e ret Now some figures from which the percentages where derived. Before I create more misunderstanding (but without doubt I will create some): the before figures come from the situation where a single multiply and a single divide routine is used; the after figures come from the situation where a loop over multiply and divide sequences is used. In the latter case many routine calls are replaced by a single one, reducing the call overhead. NOTE: these figures should *NOT* be used as a basis to compare machines because they rely very much on my programming skills. However, the speedup is something you can find from inlining, because the loop code was essentially taken over from compiler output for similar loops. All figures are times for a complete program in milliseconds (a bit approximate on some machines, but reproducable within reasonable bounds). I give figures for the more saliant systems: VAX, Tahoe, Gould NP1, Sun and SGI. (To tease I also include the RT/PC timings, and no, I am not a IBM fan.) The figures from the column 'plain' are when plain HLL code is used (implying fewer number of bits in a word). before after plain VAX: 93550 58983 211600 Tahoe: 25117 22583 60517 NP1: 12440 6900 39340 RTPC: 17317 14100 47067 SS1: 16320 15780 57060 SGI: 7740 6270 11430 SGI: 7810 5580 * (using FP; 26 bits/word, no plain timings) I believe the SGI and SS1 are both 20Mhz machines, I am not sure about the others. These figures are from Fortran source, I found very similar figures for the C source. What do we find here? Going from the 'plain' column to the 'before' column we see the effect in the program of using more bits in an integer (as Peter Montgomery and Bob Silverman want). But going from 'before' to 'after' we see the effect of reducing the number of subroutine calls. As I said before, the Tahoe does extremely well while the NP1 is extremely bad for subroutine timings. (I have similar timings for 14 other machines, if you want them send me mail. But not all timings I have are as complete as these ones.) (So John, even if the divide on the MIPS requires several hundreds of instructions, it still comes out at the top! But I think that also the quality of compilation of the remainder of the program plays a role. It is only some 6000 lines of Fortran source code.) -- dik t. winter, cwi, amsterdam, nederland dik@cwi.nl