[net.unix] timing study of uVAX II instructions

stuart@wheaton (Stuart Ceilous Williams) (04/23/86)

................................................................................

Here are the results of a small timing study we did on some of the
instructions of our MicrovaxII (with FPU):

We are running Ultrix 1.0; the times were computed by executing a tight
loop 10^7 times, timing it with a call to getrusage, and subtracting the
time it took to execute an empty loop 10^7 times.  The first set of
instructions was executed with register mode addressing.
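
The method described above can be sketched in C roughly as follows.
This is only an illustration under stated assumptions, not the code used
for the study: the function names are hypothetical, the loop body is a
plain increment standing in for the instruction under test, and the
volatile qualifiers are there only to keep a modern compiler from
deleting the loops.

```c
#include <sys/time.h>
#include <sys/resource.h>

/* User CPU time consumed so far by this process, in microseconds
 * (taken from ru_utime, as in the post). */
double user_time_us(void)
{
    struct rusage ru;
    getrusage(RUSAGE_SELF, &ru);
    return (double)ru.ru_utime.tv_sec * 1e6 + (double)ru.ru_utime.tv_usec;
}

/* Per-iteration cost once the empty-loop time has been subtracted. */
double per_iteration_us(double loop_us, double empty_us, long n)
{
    return (loop_us - empty_us) / (double)n;
}

/* Time n iterations of an empty loop (hypothetical helper). */
double time_empty_loop_us(long n)
{
    volatile long i;
    double start = user_time_us();
    for (i = 0; i < n; i++)
        ;
    return user_time_us() - start;
}

/* Time n iterations of a loop holding one increment, a stand-in
 * for the instruction being measured (hypothetical helper). */
double time_incr_loop_us(long n)
{
    volatile long i;
    volatile long x = 0;
    double start = user_time_us();
    for (i = 0; i < n; i++)
        x++;
    return user_time_us() - start;
}
```

A caller would pass the two loop timings and the iteration count to
per_iteration_us(), e.g.
per_iteration_us(time_incr_loop_us(n), time_empty_loop_us(n), n).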

Instruction Time (microseconds)
-------------------------------
bitb  :     0.347
decl  :     0.348
bisw3 :     0.349
subb2 :     0.349
bicb3 :     0.352
bicl2 :     0.354
movw  :     0.354
incb  :     0.355
movb  :     0.355
incw  :     0.356
bicw2 :     0.357
movl  :     0.357
bisl2 :     0.358
subb3 :     0.358
addw3 :     0.359
mnegb :     0.359
movl  :     0.359
movzwl:     0.359
movl  :     0.360
bisl2 :     0.361
bicb2 :     0.362
cmpw  :     0.362
xorl2 :     0.362
bitl  :     0.363
bicl2 :     0.364
bisw2 :     0.365
bitw  :     0.365
mnegw :     0.365
movzbw:     0.366
addw2 :     0.367
movw  :     0.367
movw  :     0.368
cmpb  :     0.369
subl2 :     0.371
cmpl  :     0.375
mnegl :     0.375
bitl  :     0.378
clrw  :     0.378
movb  :     0.378
subw3 :     0.378
subw2 :     0.379
subl3 :     0.384
movb  :     0.424
movb  :     0.439
movq  :     0.521
subw3 :     0.524
cvtwl :     0.529
addl2 :     0.530
decb  :     0.532
bisb2 :     0.533
cvtbw :     0.542
clrd  :     0.543
clrb  :     0.548
cvtbl :     0.562
bicw3 :     0.563
incl  :     0.762
decw  :     0.764
cvtlw :     0.766
movq  :     0.769
clrl  :     0.786
movf  :     0.960
cvtlb :     0.999
bisb3 :     1.008
addl3 :     1.012
cvtwb :     1.214
mnegf :     1.230
brb   :     1.278
movd  :     1.401
mnegd :     1.604
blbc  :     1.824
jmp   :     1.972
cvtfd :     2.022
cvtfl :     2.037
cvtfb :     2.043
cvtfl :     2.044
cvtfw :     2.052
mulf2 :     2.313
divf3 :     2.314
divf2 :     2.319
mulf3 :     2.323
cvtdb :     2.440
cvtdl :     2.443
cvtdw :     2.459
cvtdf :     2.466
cvtdl :     2.478
cmpf  :     2.502
bbcc  :     2.688
addf3 :     2.921
rotl  :     3.058
cvtwf :     3.077
muld3 :     3.099
cvtlf :     3.108
subf3 :     3.130
cmpd  :     3.137
addf2 :     3.267
divd2 :     3.268
cvtwd :     3.271
muld2 :     3.273
subf2 :     3.275
bsbb  :     3.277
cvtld :     3.290
cvtbf :     3.543
divd3 :     3.703
subd3 :     3.726
cvtbd :     3.733
addd3 :     3.739
ashl  :     3.776
jsb   :     3.857
subd2 :     3.906
addd2 :     3.914
ashq  :     4.237
mulb2 :     4.397
mulb3 :     4.757
mull2 :     4.999
mull3 :     5.410
mulw2 :     5.625
mulw3 :     5.830
divb2 :     6.440
acbf  :     6.475
divb3 :     6.657
divl2 :     7.931
divl3 :     8.110
emul  :     8.313
divw2 :     9.414
divw3 :     9.602
ediv  :    11.632
callg :    12.676
calls :    13.318
movp  :    94.498
cvtlp :   104.533
cvtpl :   128.586
subp4 :   285.064
mulp  :   389.623
-------------------

Comments:
	The packed instructions above are implemented as operating system
calls, which explains the large execution times.  We found that some of
these were not working in Ultrix 1.1 (they should be fixed by now).  It
is interesting to note that printf uses some of the packed instructions.
	Also, the time for the calls instruction should give some idea of when
it is better to use a macro than a function call.
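
As a rough illustration of the macro-versus-function point, here is a
small C sketch.  The names square_fn, SQUARE and call_overhead_ratio
are hypothetical; the two constants are copied from the table above.
Note that the table gives no time for the matching ret, so the real
cost of a call is higher than the calls figure alone.

```c
/* Function form: a compiler using the standard VAX calling sequence
 * would pay for a calls (and a ret) on every use. */
long square_fn(long x)
{
    return x * x;
}

/* Macro form: expanded in place, so there is no call overhead.
 * The argument is evaluated twice, so avoid uses like SQUARE(i++). */
#define SQUARE(x) ((x) * (x))

#define CALLS_US 13.318   /* calls, from the table above */
#define MULL3_US  5.410   /* mull3, from the table above */

/* How many times more expensive the calls instruction alone is
 * than the multiply that does the useful work. */
double call_overhead_ratio(void)
{
    return CALLS_US / MULL3_US;
}
```

With these figures the calls instruction alone costs a bit under 2.5
times as much as the multiply, which suggests that a body this small
is a reasonable candidate for a macro.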

(For the following instructions, l1 is the beginning of an array of longs).

Instruction         Addressing Modes                     Time (microseconds)
-----------------------------------------------------------------------------
movl $1,(r1)[r2]    # deferred index mode                   1.538
movl $1,r1          # register mode                          .557
movl $1,l1[r2]      # index mode                            1.246
movl $1,-(r1)       # autodecrement mode                     .567
movl $1,(r1)+       # autoincrement mode                    1.195
movl $1,*(r1)+      # autoincrement deferred                1.624
movl $1,4(r1)       # displacement mode                      .612
movl $1,*4(r1)      # displacement deferred mode            1.241
movl $1,r1          # immediate mode                         .376
movl $1,l1          # relative mode                         1.240
movl $1,*l1         # relative deferred                     1.884
movl $1,(r2)[r2]    # index mode                             .834
movl $1,(r1)+[r2]   # autoincrement indexed mode            1.577    
movl $1,-(r1)[r2]   # autodecrement indexed mode             .992
movl $1,4(r1)[r2]   # displacement indexed                  1.220
movl $1,*4(r1)[r2]  # displacement deferred indexed         1.875
movl $1,*(r1)+[r2]  # autoincrement deferred indexed        2.267
movl (r1),(r2)      # double deferred                       1.128
moval l1,r1         # hmmmm....(initialization)              .887
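
For readers who think in C rather than VAX assembler, the sketch below
shows C statements that correspond to a few of the addressing modes
timed above.  The function names are hypothetical, and the comments only
say which mode each statement resembles; whether a given compiler
actually emits that mode is an assumption, not something the post
states.  The displacement of 4 assumes 4-byte longs and pointers, as on
the VAX.

```c
long l1[4];   /* stand-in for the array of longs called l1 in the post */

/* cf. movl $1,l1[r2]   (index mode) */
void store_indexed(long i)          { l1[i] = 1; }

/* cf. movl $1,(r1)+    (autoincrement mode) */
long *store_autoinc(long *p)        { *p++ = 1; return p; }

/* cf. movl $1,-(r1)    (autodecrement mode) */
long *store_autodec(long *p)        { *--p = 1; return p; }

/* cf. movl $1,4(r1)    (displacement mode) */
void store_displacement(long *p)    { p[1] = 1; }

/* cf. movl $1,*4(r1)   (displacement deferred mode):
 * store through the pointer found one slot past pp. */
void store_disp_deferred(long **pp) { *pp[1] = 1; }
```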

--------------------
Queries:
	  What are the possible sources of error for this study as briefly
described?  Does the getrusage call provide an accurate accounting of the
time used (we used ru_utime)?  (Mail me responses)
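
One source of error that can be bounded with simple arithmetic is the
resolution of the clock behind ru_utime.  The sketch below assumes,
purely for illustration, that CPU time is charged in whole clock ticks
and that each of the two timings (test loop and empty loop) may be off
by at most one tick; the tick length on this system is not stated in
the post, and the function name is hypothetical.

```c
/* Worst-case per-iteration uncertainty, in microseconds, when the
 * test-loop and empty-loop timings are each off by at most one tick. */
double tick_error_us(double tick_us, long iterations)
{
    return 2.0 * tick_us / (double)iterations;
}
```

With an assumed 10 ms tick and the 10^7 iterations used here,
tick_error_us(10000.0, 10000000L) comes to 0.002 microseconds per
iteration, which is small next to the fastest entries in the table
(about 0.35 microseconds).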

--------------------------------------------------------------------------------
Stuart Williams		 UUCP: ...ihnp4!wheaton!stuart
(Computer Science / Physics student at Wheaton College, near Chicago)

jbs@mit-eddie.MIT.EDU (Jeff Siegal) (04/25/86)

In article <90@wheaton> stuart@wheaton.UUCP (Stuart Ceilous Williams) writes:
>[...]
>brb   :     1.278
>[...]
>jmp   :     1.972
>[...]
>bsbb  :     3.277
>[...]
>jsb   :     3.857
>[...]
>callg :    12.676
>calls :    13.318
>[...]
>	Also, the time for the calls instruction should give some idea of when
>it is better to use a macro than a function call.
>[...]

Of course, this has been said before, but the time for the call
instructions also gives some idea of how useful it is to have a
compiler which:

1) Expands function calls inline when appropriate.

2) Uses a faster, but less general, way to get to a function when
possible (usually only for "static" functions).
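
A minimal C sketch of both points, with hypothetical function names.
The inline keyword is from later C standards and was not available in
the compilers of this era; it is used here only to make the intent
explicit.  Because the helper is static, the compiler can see every
call site and is free either to expand it in place or to reach it with
a cheaper sequence (in the table quoted above, jsb takes 3.857
microseconds against 13.318 for calls).

```c
/* File-local helper: a candidate for inline expansion or for a
 * cheaper, non-standard call sequence. */
static inline long clamp_nonneg(long x)
{
    return x < 0 ? 0 : x;
}

/* Calls the helper once per element, so call overhead would
 * otherwise be paid n times. */
long sum_nonneg(const long *v, int n)
{
    long total = 0;
    int i;
    for (i = 0; i < n; i++)
        total += clamp_nonneg(v[i]);
    return total;
}
```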

Jeff Siegal - MIT EECS

operators@watmath.UUCP (M.F.C.F. Operators) (05/04/86)

In article <1730@mit-eddie.MIT.EDU> jbs@mit-eddie.UUCP (Jeff Siegal) writes:
>Of course, this has been said before, but the time for the call
>instructions also gives some idea of how useful it is to have a
>compiler which:
>
>1) Expands function calls inline when appropriate.
>
>2) Uses a faster, but less general, way to get to a function when
>possible (usually only for "static" functions).
>
or 3) Has had its designers spend a significant amount of time
     thinking about and designing a call convention
     that is fast and flexible.

It has been observed a number of times that the manufacturer-suggested
call convention is seldom even close to being optimal for the machine.
Similarly, the ones used by most C compilers are simply quick adaptations
of the one used on the last machine the compiler implementor saw.
I have often wondered how much of the "RISC" performance comes simply
from the fact that registers were used to pass arguments, instead
of blindly having garbage values "saved" and restored on every call.