djones@awesome.Berkeley.EDU (12/27/89)
I was faced with a program which ran as fast on a SUN 3/60 as it did on a SUN 4/280, when there should be a factor of 2-3 difference if you believe the MIPS rating. Profiling with "cc -pg" made it evident that the source is integer division on the SPARC -- I gather there is no divide instruction. This is, of course, part of the RISC strategy. I'm still just a bit surprised that SUN/SPARC hasn't figured out a way to get integer divisions done a little faster on a SUN 4/280 than on a SUN 3/60!

I was amused to see some of the "functions" that gprof found using up all my CPU time. I gather the code checks to see if the numbers are "not_really_big", or "not_too_big" to do the division (ahem) faster.

So are we stuck with this poor multiply/divide performance in SPARC, or is this shortcoming being addressed? Heck, would it be faster to hand off these operations to the Floating Point chip?

    % cumulative     self              self    total
 time    seconds  seconds    calls  ms/call  ms/call  name
 13.9     106.87    36.19                             divloop [4]
 13.8     142.71    35.84                             divloop [5]
  3.3     162.28     8.69                             divide [10]
  3.3     170.84     8.56                             not_really_big [11]
  3.2     179.13     8.29                             divide [12]
  3.1     187.27     8.14                             not_really_big [13]
  3.0     203.11     7.71                             end_regular_divide [15]
  2.9     210.67     7.56                             end_regular_divide [16]
  2.5     223.95     6.50  9326374     0.00     0.00  .rem [18]
  2.3     229.85     5.91  9326374     0.00     0.00  .div [20]
  1.6     239.22     4.27                             got_result [23]
  1.4     242.88     3.66                             got_result [24]
  0.6     248.88     1.69                             do_regular_divide [25]
  0.6     250.43     1.55                             do_regular_divide [26]
  0.5     251.65     1.22                             end_single_divloop [27]
  0.5     254.04     1.19                             end_single_divloop [29]
  0.2     256.09     0.62        4   155.02   155.02  .urem [33]
  0.1     257.94     0.38                             do_single_div [36]
  0.1     258.32     0.38                             do_single_div [37]
  0.1     259.03     0.36        5    71.01    71.01  .udiv [39]
  0.1     259.38     0.35                             not_too_big [40]
  0.1     259.64     0.27                             not_too_big [41]
  0.1     260.28     0.17                             single_divloop [45]
  0.0     260.35     0.07                             single_divloop [48]
  0.0     260.51     0.01                             zero_divide [55]
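For readers wondering why a software divide shows up so heavily in a profile like this, here is a minimal sketch in C of the two approaches the post touches on. It is not Sun's actual .div/.rem code, which the post does not show; the names soft_udiv and fp_div are hypothetical, and the code uses modern C99 headers rather than anything available on SunOS at the time. The first function is the generic bit-at-a-time shift-and-subtract method, which needs one loop iteration per quotient bit. The second illustrates the "hand it to the floating point unit" idea: converting both operands to double and truncating the quotient gives the exact integer result for 32-bit operands, because a double carries a 53-bit significand.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch: bit-at-a-time (restoring) unsigned division.
   One shift, one compare and at most one subtract per quotient bit,
   so 32 loop iterations for a 32-bit dividend. */
uint32_t soft_udiv(uint32_t n, uint32_t d, uint32_t *rem)
{
    assert(d != 0);
    uint32_t q = 0;
    uint64_t r = 0;                 /* 64 bits so the shift cannot overflow */
    for (int i = 31; i >= 0; i--) {
        r = (r << 1) | ((n >> i) & 1u);   /* bring down the next dividend bit */
        if (r >= d) {
            r -= d;
            q |= (uint32_t)1 << i;        /* this quotient bit is 1 */
        }
    }
    if (rem != NULL)
        *rem = (uint32_t)r;
    return q;
}

/* Hypothetical sketch: signed division done in floating point.
   Both operands convert to double exactly, and the cast truncates
   toward zero, matching C integer division for 32-bit operands. */
int32_t fp_div(int32_t n, int32_t d)
{
    assert(d != 0);
    assert(!(n == INT32_MIN && d == -1));  /* quotient would not fit */
    return (int32_t)((double)n / (double)d);
}
```

Whether the floating point route actually wins depends on the cost of moving values between the integer and floating point register files on the machine in question, which this sketch says nothing about.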
sritacco@hpdml93.hp.com (Steve Ritacco) (01/11/90)
Well, this really isn't the RISC strategy. Check into some other RISC chips and you will see better multiply and divide performance. I think the R2000/R2000 takes 12 cycles for multiply and 30 cycles for divide. The multiply/divide unit is separate from the rest of the ALU, so you can also perform other operations while a multiply or divide is in progress.