news@daver.bungi.com (09/12/90)
Let me begin by saying that I am no dhrystone wizard.

Out of the box I am consistently getting 7692.3 dhrystones / second for both 'dry2' and 'dry2reg' (i.e. dhrystone 2.1, 500000 iterations). As far as I can determine, the only meaningful libc functions that are linked in are strcmp.o and strcpy.o. So next I wrote assembly versions of these routines using the string instructions. With this change I am consistently getting 8771.9 dhrystones / second.

Included with the dhry2.1 source code was a compilation of various dhrystone results. Summarizing the results for NS32000 series processors:

                         dry2   dry2reg
                        -----   -------
    Encore 32032 10MHz   1323      1323
    Encore 32332 15MHz   3059      3071
    Aeon   32332 15MHz   3413      3413
    Aeon   32532 25MHz   9998      9998
    Encore 32532 25MHz  11117     11223

The Encore 32532 results are in line with those reported by Dave Rand to this newsgroup.

Now, for my question. Is the difference between the 8771 that I am getting and the reported 11000 figure due only to compiler efficiency, or are there other factors entering the picture? My naive assumption is that there must be other factors, since the code generated by GCC looks damn good to me.

BTW: As soon as I get motivated to clean up / double check the code, I will post assembly versions of the following string functions:

    memchr.s   memcmp.s   memcpy.s   memmove.s  memset.s
    strchr.s   strcmp.s   strcpy.s   strlen.s   strncat.s
    strncmp.s  strncpy.s  strrchr.s

As you can see below, the performance improvement is generally 2X - 3X.
Function                New time as percentage of old time
---------------------   -----------------------------------------
memcpy(s1, s2, n):      [n=4]: 50    [n=25]: 38    [n=1024]: 32
memmove(s1, s2, n):     [n=4]: 31    [n=25]: 31    [n=1024]: 32
strcpy(s1, s2):         [s2=ATOE]: 71    [s2=ATOZ]: 56
strncpy(s1, s2, n):     [s2=ATOZ,n=10]: 58
memcmp(buf, buf2, n):   [n=4]: 38    [n=25]: 10    [n=1024]: 2
strcmp(s1, s2):         [2*ATOE]: 67    [2*ATOZ]: 66
strncmp(s1, s2, n):     [n=4]: 67    [n=25]: 56
memchr(ATOZ, c, 25):    [c='E']: 63    [c='Z']: 42
strchr(ATOZ, c):        [c='E']: 75    [c='Z']: 50
strrchr(ATOZ, c):       [c='A']: 83    [c='E']: 82    [c='Z']: 53
memset(buf, 0, n):      [n=4]: 180   [n=1024]: 29
strlen(s):              [s=ATOE]: 71    [s=ATOZ]: 59

Best regards,
johnc

------------------------------------------------------

DHRYSTONE 2.n RESULTS SORTED BY MANUFACTURER
Sun Apr 29 12:37:27 EDT 1990

|----------------------|-------------------------|----------|-----|----------------------------------|-------------|-------------------------------------------|-----|----------------|-----|-----|------------------------------------------------|--------|-------------------------------------|
|manuf                 |model                    |proc      |clock|os                                |osver        |compiler                                   |cver |options         |noreg|  reg|notes                                           |date    |submit                               |
|----------------------|-------------------------|----------|-----|----------------------------------|-------------|-------------------------------------------|-----|----------------|-----|-----|------------------------------------------------|--------|-------------------------------------|
|AEON Technologies     |332/AT                   |NS32332   |15.00|GENIX                             |V.3          |NS CTP                                     |2.4  |-O -KC332       | 3413| 3413|                                                |03/12/88|John Behrs (boulder!fesk!ativax!john)|
|AEON Technologies     |532/AT                   |NS32532-A1|25.00|GENIX                             |V.3          |NS CTP                                     |2.4  |-O -KC532       | 9998| 9998|pipelining disabled, chip restrictions in effect|03/12/88|John Behrs (boulder!fesk!ativax!john)|
|Encore Computer       |MULTIMAX                 |32032     |10.00|Mach                              |             |                                           |     |-O -q novolatile| 1323| 1323|1 of 16 processors                              |03/14/88|Lawrence Butcher                     |
|Encore Computer       |MULTIMAX                 |32332     |15.00|Mach                              |             |                                           |     |-O -q novolatile| 3059| 3071|1 of 16 processors                              |04/16/88|Lawrence Butcher                     |
|Encore Computer       |Multimax 320             |NS32532   |25.00|Umax4.2                           |A3.3         |C-32000 Green Hills Software               |1.8.4|-O -q o=t       |11117|11223|Alpha HW. Production will be 30MHz.             |12/03/88|James R. Grier                       |
|National Semiconductor|NS32GX32 Evaluation Board|32GX32    |30.00|Compiled and down loaded under GNX|GNX Version 3|GNX - Version 3 C Optimizing Compiler (CTP)|3.4  |-O -KC532       |16087|16087|0 wait state 2-way interleave static ram.       |03/09/89|Jonathan Levy (nsc!levy)             |
|Sequent               |BALANCE 8000             |32032     |     |Mach                              |             |                                           |     |-O              | 1058| 1110|1 of 32 processors                              |03/14/88|Lawrence Butcher                     |
|----------------------|-------------------------|----------|-----|----------------------------------|-------------|-------------------------------------------|-----|----------------|-----|-----|------------------------------------------------|--------|-------------------------------------|

--
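
For readers who want to relate the "new time as percentage of old time" column to the "2X - 3X" claim, the conversion is simply 100 divided by the percentage. The following is a minimal sketch in C; the function names `speedup_from_percent` and `throughput_ratio` are made up for illustration and do not come from the posted code.

```c
/* Hypothetical helpers, not part of the posted routines. */

/* Speedup factor implied by "new time as a percentage of old time".
   50 means the new routine takes half the time, i.e. a 2x speedup;
   a value above 100 (memset with n=4 is listed at 180) is a slowdown. */
double speedup_from_percent(double new_time_percent)
{
    return 100.0 / new_time_percent;
}

/* Ratio of two dhrystones/second figures (higher rate is faster). */
double throughput_ratio(double new_rate, double old_rate)
{
    return new_rate / old_rate;
}
```

For example, `speedup_from_percent(32.0)` gives 3.125 for memcpy with n=1024, and `throughput_ratio(8771.9, 7692.3)` gives roughly 1.14 for the dhrystone figures quoted at the top of the post.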
dlr@daver.bungi.com (Dave Rand) (09/12/90)
[In the message entitled "Dhrystone 2.1" on Sep 11, 23:35, John Connin writes:]
>
> Now, for my question. Is the difference between 8771 that I am getting
> and the reported 11000 figure due only to compiler efficiency, or
> are the other factors entering the picture. My naive assumption
> is there must be other factors since the code generated by GCC
> looks damn good to me.
>

The gcc code is not bad, but is not outstanding. The code from the National Semiconductor CTP compiler beats it, as do the Green Hills compilers. You were correct in looking at strcmp/strcpy, but you didn't go far enough in optimizing them... there are some _real_ hairy things you can do to get dhrystone numbers way up there. While at National, I had a "challenge" to beat 20,000 dhrystones/sec (1.1 version). I was happy when I hit 19,400. I got tired of optimizing when I hit 30,000 :-)

<dig, shuffle>

Here is the gist of what you need to look at. I wrote these routines at home, from a neat article in a C programming journal I subscribe to (I can find the original reference, if you are interested). The magic here is to look at a complete double-word at a time, storing it if none of the bytes are zero. I think this worked out to 5 clocks per byte, which was as good as I could get it. The assembler format is System V. Perhaps someone else can make it a bit better?

Central to the routine is the concept of treating a 32 bit register as 4 byte values, with "borrows" between them. Subtracting 1 from each of the byte values (0x01010101) will change a 0x00 to 0xff. The original byte value is then masked off with a BIC instruction (see - I told you CISC's are good for something :-), and then the implied borrow is tested for with a simple AND instruction (against 0x80808080). If the result is non-zero, at least one of the bytes must have been zero, and we exit the loop.

Exiting the loop is interesting code, too... I do a pseudo binary search to find the zero byte.
I haven't tested this code real well, so do let me know if you find bugs. Neat stuff, all considered!

        .file   "mstrcpy.s"
#
# Very fast replacement for regular C string copy / strcmp routines
#
# Dave Rand
# 09/01/88
#
#
        .globl  _strcpy,_strcmp
        .align 4
_strcpy:
        movd    8(sp),r1                # get source pointer
        movd    4(sp),r2                # get destination pointer
        movd    r5,tos
        movd    0(r1),r5                # get source data
        movd    r5,r0                   # save it for storage
        subd    $0x01010101,r0          # subtract magic number 1
        bicd    r5,r0                   # clear off original bits
        andd    $0x80808080,r0          # test for borrow
        cmpqd   0,r0                    # is it zero?
        blo     mvby1:b                 # if any byte was zero, ex
        .align 4
lp:     movd    4(r1),r0                # get source data
        movd    r5,0(r2)                # save double in dest.
        addqd   4,r1                    # increment source pointer
        addqd   4,r2                    # increment destination pointer
        movd    r0,r5                   # save it for storage
        subd    $0x01010101,r0          # subtract magic number 1
        bicd    r5,r0                   # clear off original bits
        andd    $0x80808080,r0          # test for borrow
        cmpqd   0,r0                    # is it zero?
        beq     lp:b                    # no, loop
#
        .align 4
# r5 contains the data, at least one byte is zero
# r0 contains the bit mask of the byte that contains the zero
mvby1:  cmpqw   0,r0                    # was it in the first two bytes?
        blo     mv1a:b                  # yes, exit now
        cmpd    $0x80000000,r0          # Check the last two bytes
        bls     mv4:b                   # it was byte 2
        movw    r5,0(r2)                # save the word
        movb    2(r1),2(r2)             # copy the null bytes
        movd    tos,r5                  # restore register
        movd    4(sp),r0                # return dest. ptr.
        ret     $(0)
        .align 4
mv4:    movd    r5,0(r2)                # no, it was the last byte
        movd    tos,r5                  # restore register
        movd    4(sp),r0                # return dest. ptr.
        ret     $(0)
        .align 4
mv1a:   cmpb    $0x80,r0                # was it byte zero?
        beq     mv1:b                   # yes, exit
mv2:    movw    r5,0(r2)                # save the word
        movd    tos,r5                  # restore register
        movd    4(sp),r0                # return dest. ptr.
        ret     $(0)
        .align 4
mv1:    movb    r5,0(r2)                # save the byte
        movd    tos,r5                  # restore register
        movd    4(sp),r0                # return dest. ptr.
        ret     $(0)

        .align 4
_strcmp:
        movd    8(sp),r1                # get source pointer
        movd    4(sp),r2                # get destination pointer
        movd    r5,tos
        movd    0(r1),r5                # get source data
        movd    r5,r0                   # save it for storage
        subd    $0x01010101,r0          # subtract magic number 1
        bicd    r5,r0                   # clear off original bits
        andd    $0x80808080,r0          # test for borrow
        cmpqd   0,r0                    # is it zero?
        blo     cpxit:w                 # if any byte was zero, ex
        .align 4
cplp:   movd    4(r1),r0                # get source data
        cmpd    r5,0(r2)                # compare the two
        bne     cpxit1a:b               # exit if not equal
        addqd   4,r1                    # increment source pointer
        addqd   4,r2                    # increment destination pointer
        movd    r0,r5                   # save it for storage
        subd    $0x01010101,r0          # subtract magic number 1
        bicd    r5,r0                   # clear off original bits
        andd    $0x80808080,r0          # test for borrow
        cmpqd   0,r0                    # is it zero?
        beq     cplp:b                  # no, loop
        br      cpxit:w
        .align 4
cpxit1a:
# the 4 current bytes don't match. Find out why.
# There is no zero byte in the current 4 bytes of source
cpxit1: cmpw    r5,0(r2)                # is it the first word?
        bne     cpx3a1:b                # yes, exit now
        cmpb    2(r1),2(r2)             # next?
        bne     cpx3c:b                 # yes...
        movb    3(r2),r0                # get s2
        subb    3(r1),r0                # subtract s1
        movxbd  r0,r0                   # sign extended return value
        movd    tos,r5                  # pop saved register
        ret     $(0)
cpx3a1: cmpb    r5,0(r2)                # is it the first byte?
        bne     cpx3a:b                 # yes, exit now
        movb    1(r2),r0                # get destination
        subb    1(r1),r0                # subtract source
        movxbd  r0,r0                   # sign extend return value
        movd    tos,r5                  # pop saved register
        ret     $0
        .align 4
cpx3a:  movb    0(r2),r0                # get destination
        subb    r5,r0                   # subtract source
        movxbd  r0,r0                   # sign extend return value
        movd    tos,r5                  # pop saved register
        ret     $0
        .align 4
cpx3c:  movb    2(r2),r0                # get destination
        subb    2(r1),r0                # subtract source
        movxbd  r0,r0                   # sign extend return value
        movd    tos,r5                  # pop saved register
        ret     $0
        .align 4
# 1 of the 4 current bytes is zero.
# check to see what it means
cpxit:  cmpqb   0,r5                    # is lsb zero?
        beq     cpx2:w                  # exit now if it is
        movb    0(r2),r0                # get s2
        cmpb    r0,r5                   # does it match?
        bne     cpx1:b                  # no, exit now
        movb    1(r1),r5                # get s1
        cmpqb   0,r5                    # is lsb zero?
        beq     cpx2:b                  # exit now if it is
        movb    1(r2),r0                # get s2
        cmpb    r0,r5                   # does it match?
        bne     cpx1:b                  # no, exit now
        movb    2(r1),r5                # get s1
        cmpqb   0,r5                    # is lsb zero?
        beq     cpx2:b                  # exit now if it is
        movb    2(r2),r0                # get s2
        cmpb    r5,r0                   # does it match?
        bne     cpx1:b                  # no, exit now
        movb    3(r1),r5                # get s1
        cmpqb   0,r5                    # is s1 zero?
        beq     cpx2:b                  # exit now if it is
        movb    3(r2),r0                # get s2
        .align 4
cpx1:   subb    r5,r0                   # subtract to get diff
        movxbd  r0,r0                   # sign extended return value
        movd    tos,r5                  # pop saved register
        ret     $(0)
        .align 4
cpx2:   movqd   0,r0                    # strings are equal
        movd    tos,r5                  # pop saved register
        ret     $0                      # and return

--
Dave Rand
{pyramid|mips|bct|vsi1}!daver!dlr       Internet: dlr@daver.bungi.com
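
For readers without an NS32000 assembler at hand, the zero-byte test described in the post (subtract 0x01010101, clear the original bits, AND with 0x80808080) and the search for the zero byte can be sketched in portable C. This is an illustration only, not a translation of the posted routines: the names `zero_byte_mask`, `first_zero_byte` and `load_le32` are made up here, and the word is assembled in little-endian order on the assumption that byte 0 is the lowest address, as on the NS32000.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: build a 32-bit word from 4 bytes in little-endian
   order, so byte index 0 corresponds to the lowest memory address. */
uint32_t load_le32(const unsigned char *p)
{
    assert(p != NULL);
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Same arithmetic as the subd / bicd / andd sequence:
   the result is non-zero exactly when some byte of x is 0x00.
   Note: a borrow out of a zero byte can also flag a 0x01 byte above it,
   so only the lowest flagged byte is guaranteed to be a real zero. */
uint32_t zero_byte_mask(uint32_t x)
{
    return (x - UINT32_C(0x01010101)) & ~x & UINT32_C(0x80808080);
}

/* Index (0 = least significant byte) of the first zero byte in x,
   or -1 if there is none. Tests the low half first, then picks the
   byte within the half, checking lower bytes before higher ones. */
int first_zero_byte(uint32_t x)
{
    uint32_t m = zero_byte_mask(x);
    if (m == 0)
        return -1;
    if (m & UINT32_C(0x0000FFFF))
        return (m & UINT32_C(0x00000080)) ? 0 : 1;
    return (m & UINT32_C(0x00800000)) ? 2 : 3;
}
```

A word-at-a-time copy loop would call `zero_byte_mask` on each loaded word, store the whole word while the mask is zero, and use `first_zero_byte` to decide how many bytes of the final word to store. Checking the lower bytes first matters because of the borrow effect noted in the comment: for the word 0x01004443 the mask is 0x80800000 even though only byte 2 is zero.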