[comp.sys.nsc.32k] Dhrystone 2.1

news@daver.bungi.com (09/12/90)

Let me begin by saying that I am no dhrystone wizard.

Out of the box I am consistently getting 7692.3 dhrystones/second for both
'dry2' and 'dry2reg' (i.e. Dhrystone 2.1, 500000 iterations).

As far as I can determine, the only meaningful libc functions that are
linked in are strcmp.o and strcpy.o.  So next I wrote assembly versions
of these routines using the string instructions.  With this change I
am consistently getting 8771.9 dhrystones/second.
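
For reference, the two rates above are exactly what 500000 iterations give when divided by a whole number of seconds (65 and 57). The sketch below only redoes that arithmetic. The elapsed times are inferred from the reported rates, not taken from the post, and the function name is made up for illustration.

```c
#include <assert.h>

/* Dhrystones per second = iterations / elapsed seconds.
 * (Hypothetical helper; the elapsed times used with it are inferred.) */
double dhrystones_per_sec(long iterations, double seconds)
{
    assert(seconds > 0.0);
    return (double)iterations / seconds;
}
```

With these assumed timings, dhrystones_per_sec(500000, 65.0) comes to about 7692.3 and dhrystones_per_sec(500000, 57.0) to about 8771.9, an improvement of roughly 14%.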

Included with the dhry2.1 source code was a compilation of various dhrystone
results.  Summarizing the results for NS32000 series processors:

                        dry2    dry2reg
                        -----   -------
 Encore 32032  10MHz     1323     1323
 Encore 32332  15MHz     3059     3071
 Aeon   32332  15MHz     3413     3413
 Aeon   32532  25MHz     9998     9998
 Encore 32532  25MHz    11117    11223


The Encore 32532 results are in line with those reported by Dave Rand
to this newsgroup.

Now, for my question.  Is the difference between the 8771 that I am getting
and the reported 11000 figure due only to compiler efficiency, or
are there other factors entering the picture?  My naive assumption
is that there must be other factors, since the code generated by GCC
looks damn good to me.

BTW: As soon as I get motivated to clean up / double-check the code, I will 
post assembly versions of the following string functions:

  memchr.s memcmp.s memcpy.s memmove.s memset.s
  strchr.s strcmp.s strcpy.s strlen.s strncat.s strncmp.s strncpy.s strrchr.s

As the table below shows, the performance improvement is generally 2X - 3X.
The one exception is memset with n=4, which takes 180% of the old time.

Function		New time as percentage of old time
---------------------	-----------------------------------------
memcpy(s1, s2, n):	[n=4]:     50	[n=25]:    38	[n=1024]:  32	
memmove(s1, s2, n):	[n=4]:     31	[n=25]:    31	[n=1024]:  32	
strcpy(s1, s2):		[s2=ATOE]: 71	[s2=ATOZ]: 56	
strncpy(s1, s2, n):	[s2=ATOZ,n=10]: 58	
memcmp(buf, buf2, n):	[n=4]:     38	[n=25]:    10	[n=1024]:   2	
strcmp(s1, s2):		[2*ATOE]:  67	[2*ATOZ]:  66	
strncmp(s1, s2, n):	[n=4]:     67	[n=25]:    56	
memchr(ATOZ, c, 25):	[c='E']:   63	[c='Z']:   42	
strchr(ATOZ, c):	[c='E']:   75	[c='Z']:   50	
strrchr(ATOZ, c):	[c='A']:   83	[c='E']:   82	[c='Z']:   53	
memset(buf, 0, n):	[n=4]:     180	[n=1024]:  29	
strlen(s):		[s=ATOE]:  71	[s=ATOZ]:  59	
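
To read the table as speedup factors rather than percentages, divide 100 by the table entry. A minimal sketch of that conversion follows; the function name is invented for illustration.

```c
#include <assert.h>

/* Speedup factor implied by "new time as a percentage of old time".
 * A result below 1.0 means the new routine is slower than the old one. */
double speedup_from_percent(double new_pct_of_old)
{
    assert(new_pct_of_old > 0.0);
    return 100.0 / new_pct_of_old;
}
```

For example, the memcpy entry of 32 for n=1024 corresponds to about 3.1X, the entry of 50 for n=4 to 2X, and the memset entry of 180 for n=4 to about 0.56X.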

Best regards,
johnc


	------------------------------------------------------


       DHRYSTONE 2.n RESULTS SORTED BY MANUFACTURER Sun Apr 29 12:37:27 EDT 1990

|--------------------------------------------------------------------------------------------------------------------------------------------------------------|
|manuf     |model           |proc  |clock|os      |osver   |compiler    |cver    |options     |  noreg|    reg|notes             |date    |submit              |
|--------------------------------------------------------------------------------------------------------------------------------------------------------------|

|AEON      |332/AT          |NS3233|15.00|GENIX   |V.3     |NS CTP      |2.4     |-O -KC332   |   3413|   3413|                  |03/12/88|John Behrs          |
|Technologi|                |2     |     |        |        |            |        |            |       |       |                  |        |(boulder!fesk!      |
|es        |                |      |     |        |        |            |        |            |       |       |                  |        |ativax!john)        |
|__________|________________|______|_____|________|________|____________|________|____________|_______|_______|__________________|________|____________________|

|AEON      |532/AT          |NS3253|25.00|GENIX   |V.3     |NS CTP      |2.4     |-O -KC532   |   9998|   9998|pipelining        |03/12/88|John Behrs          |
|Technologi|                |2-A1  |     |        |        |            |        |            |       |       |disabled, chip    |        |(boulder!fesk!      |
|es        |                |      |     |        |        |            |        |            |       |       |restrictions in   |        |ativax!john)        |
|          |                |      |     |        |        |            |        |            |       |       |effect            |        |                    |
|__________|________________|______|_____|________|________|____________|________|____________|_______|_______|__________________|________|____________________|




|Encore    |MULTIMAX        |32032 |10.00|Mach    |        |            |        |-O -q       |   1323|   1323|1 of 16 processors|03/14/88|Lawrence Butcher    |
|Computer  |                |      |     |        |        |            |        |novolatile  |       |       |                  |        |                    |
|__________|________________|______|_____|________|________|____________|________|____________|_______|_______|__________________|________|____________________|
|Encore    |MULTIMAX        |32332 |15.00|Mach    |        |            |        |-O -q       |   3059|   3071|1 of 16 processors|04/16/88|Lawrence Butcher    |
|Computer  |                |      |     |        |        |            |        |novolatile  |       |       |                  |        |                    |
|__________|________________|______|_____|________|________|____________|________|____________|_______|_______|__________________|________|____________________|
|Encore    |Multimax 320    |NS3253|25.00|Umax4.2 |A3.3    |C-32000     |1.8.4   |-O -q o=t   |  11117|  11223|Alpha HW.         |12/03/88|James R. Grier      |
|Computer  |                |2     |     |        |        |Green Hills |        |            |       |       |Production will be|        |                    |
|          |                |      |     |        |        |Software    |        |            |       |       |30Mhz.            |        |                    |
|__________|________________|______|_____|________|________|____________|________|____________|_______|_______|__________________|________|____________________|


|National  |NS32GX32        |32GX32|30.00|Compiled|GNX     |GNX -       |3.4     |-O -KC532   |  16087|  16087| 0 wait state     |03/09/89|Jonathan Levy       |
|Semiconduc|Evaluation Board|      |     |and down|Version |Version 3 C |        |            |       |       |2-way interleave  |        |(nsc!levy)          |
|tor       |                |      |     |loaded  |3       |Optimizing  |        |            |       |       |static ram.       |        |                    |
|          |                |      |     |under   |        |Compiler    |        |            |       |       |                  |        |                    |
|          |                |      |     |GNX     |        |(CTP)       |        |            |       |       |                  |        |                    |
|__________|________________|______|_____|________|________|____________|________|____________|_______|_______|__________________|________|____________________|


|Sequent   |BALANCE 8000    |32032 |     |Mach    |        |            |        |-O          |   1058|   1110|1 of 32 processors|03/14/88|Lawrence Butcher    |
|__________|________________|______|_____|________|________|____________|________|____________|_______|_______|__________________|________|____________________|



-- 

dlr@daver.bungi.com (Dave Rand) (09/12/90)

[In the message entitled "Dhrystone 2.1" on Sep 11, 23:35, John Connin writes:]
> 
> Now, for my question.  Is the difference between the 8771 that I am getting
> and the reported 11000 figure due only to compiler efficiency, or
> are there other factors entering the picture?  My naive assumption
> is that there must be other factors, since the code generated by GCC
> looks damn good to me.
> 

The gcc code is not bad, but is not outstanding. The code from the 
National Semiconductor CTP compiler beats it, as do the Green Hills
compilers.

You were correct in looking at strcmp/strcpy, but you didn't go far
enough in optimizing them... there are some _real_ hairy things you
can do to get dhrystone numbers way up there. While at National,
I had a "challenge" to beat 20,000 dhrystones/sec (1.1 version).
I was happy when I hit 19,400. I got tired of optimizing when
I hit 30,000 :-)

<dig, shuffle> Here is the gist of what you need to look at. I wrote
these routines at home, based on a neat article in a C programming journal
I subscribe to (I can find the original reference, if you are interested).
The magic here is to look at a complete double-word at a time, storing
it if none of the bytes are zero. I think this worked out to 5 clocks
per byte, which was as good as I could get it. The assembler format
is System V. Perhaps someone else can make it a bit better?
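
For readers who do not speak NS32000 assembler, here is a rough C rendering of the double-word-at-a-time copy. It is a sketch, not a translation of the listing below: the names are invented, the tail is finished with a plain byte loop rather than the binary search, and it assumes the source buffer is readable up to the next multiple of four bytes past the terminator (whole words are loaded before they are tested).

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Nonzero if any of the four bytes in w is 0x00. */
uint32_t has_zero_byte(uint32_t w)
{
    return (w - 0x01010101u) & ~w & 0x80808080u;
}

/* Copy src to dst four bytes at a time (hypothetical name).
 * Assumption: src may be read up to 3 bytes past its terminator. */
char *word_strcpy(char *dst, const char *src)
{
    char *d = dst;
    uint32_t w;

    assert(dst != NULL && src != NULL);
    for (;;) {
        memcpy(&w, src, sizeof w);      /* load one double-word */
        if (has_zero_byte(w))
            break;                      /* the terminator is in this word */
        memcpy(d, &w, sizeof w);        /* no zero byte: store it whole */
        src += sizeof w;
        d += sizeof w;
    }
    while ((*d++ = *src++) != '\0')     /* finish the last 1..4 bytes */
        ;
    return dst;
}
```

The memcpy calls stand in for unaligned 32-bit loads and stores; a compiler will normally turn each one into a single move.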

Central to the routine is the concept of treating a 32-bit register
as 4 byte values, with "borrows" between them. Subtracting 1 from
each of the byte values (0x01010101) will change a 0x00 to 0xff.
The original byte value is then masked off with a BIC instruction
(see - I told you CISCs are good for something :-), and then
the implied borrow is tested for with a simple AND instruction
(0x80808080). If the result is non-zero, at least one of the bytes
must have been zero, and we exit the loop. Exiting the loop is
interesting code, too... I do a pseudo binary search to find the
zero byte. I haven't tested this code very thoroughly, so do let me
know if you find bugs.
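
The borrow trick and the search for the zero byte can be written out in C roughly as follows. This is a sketch with invented names, in the same spirit as the assembler but not a line-for-line copy. Byte 0 is the least significant byte, which is the first byte in memory on a little-endian machine such as the 32532.

```c
#include <assert.h>
#include <stdint.h>

/* Borrow mask: bit 7 of a byte position is set when that byte of w is
 * zero. Only the lowest set bit is exact: the borrow out of a zero byte
 * can also flag a 0x01 byte directly above it. */
uint32_t zero_byte_mask(uint32_t w)
{
    return (w - 0x01010101u) & ~w & 0x80808080u;
}

/* Index of the first zero byte, found by halving the mask.
 * The mask must be nonzero. */
int first_zero_index(uint32_t mask)
{
    assert(mask != 0);
    if (mask & 0x0000FFFFu)                   /* low half: byte 0 or 1 */
        return (mask & 0x00000080u) ? 0 : 1;
    return (mask & 0x00800000u) ? 2 : 3;      /* high half: byte 2 or 3 */
}
```

Worked through by hand: "ABCD" loads as 0x44434241 and gives a mask of 0; "ABC\0" loads as 0x00434241 and gives 0x80000000 (byte 3); "AB\0\1" loads as 0x01004241 and gives 0x80800000, where byte 2 is the real terminator and byte 3 is only flagged because of the borrow.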

Neat stuff, all considered!

	.file	"mstrcpy.s"
#
# Very fast replacement for regular C string copy / strcmp routines
#
# Dave Rand
# 09/01/88
#
#

	.globl	_strcpy,_strcmp
	.align 4
_strcpy:
	movd	8(sp),r1		# get source pointer
	movd	4(sp),r2		# get destination pointer
	movd	r5,tos

	movd	0(r1),r5		# get source data
	movd	r5,r0			# save it for storage
	subd	$0x01010101,r0		# subtract magic number 1
	bicd	r5,r0			# clear off original bits
	andd	$0x80808080,r0		# test for borrow
	cmpqd	0,r0			# is it zero?
	blo	mvby1:b			# if any byte was zero, exit

	.align	4
lp:	movd	4(r1),r0		# get source data
	movd	r5,0(r2)		# save double in dest.
	addqd	4,r1			# increment source pointer
	addqd	4,r2			# increment destination pointer
	movd	r0,r5			# save it for storage
	subd	$0x01010101,r0		# subtract magic number 1
	bicd	r5,r0			# clear off original bits
	andd	$0x80808080,r0		# test for borrow
	cmpqd	0,r0			# is it zero?
	beq	lp:b			# no, loop

#	.align	4
# r5 contains the data, at least one byte is zero
# r0 contains the bit mask of the byte that contains the zero
mvby1:	cmpqw	0,r0			# was it in the first two bytes?
	blo	mv1a:b			# yes, exit now
	cmpd	$0x80000000,r0		# Check the last two bytes
	bls	mv4:b			# zero is in byte 3, store the double
	movw	r5,0(r2)		# zero is in byte 2: save the word
	movb	2(r1),2(r2)		# copy the null byte
	movd	tos,r5			# restore register
	movd	4(sp),r0		# return dest. ptr.
	ret	$(0)

	.align	4
mv4:	movd	r5,0(r2)		# no, it was the last byte
	movd	tos,r5			# restore register
	movd	4(sp),r0		# return dest. ptr.
	ret	$(0)

	.align	4
mv1a:	cmpb	$0x80,r0		# was it byte zero?
	beq	mv1:b			# yes, exit
mv2:	movw	r5,0(r2)		# save the word
	movd	tos,r5			# restore register
	movd	4(sp),r0		# return dest. ptr.
	ret	$(0)

	.align	4
mv1:	movb	r5,0(r2)		# save the byte
	movd	tos,r5			# restore register
	movd	4(sp),r0		# return dest. ptr.
	ret	$(0)




	.align 4
_strcmp:
	movd	8(sp),r1		# second argument ("s1"/source below)
	movd	4(sp),r2		# first argument ("s2"/dest. below)
	movd	r5,tos
	
	movd	0(r1),r5		# get source data
	movd	r5,r0			# save it for storage
	subd	$0x01010101,r0		# subtract magic number 1
	bicd	r5,r0			# clear off original bits
	andd	$0x80808080,r0		# test for borrow
	cmpqd	0,r0			# is it zero?
	blo	cpxit:w			# if any byte was zero, exit

	.align	4
cplp:	movd	4(r1),r0		# get source data
	cmpd	r5,0(r2)		# compare the two
	bne	cpxit1a:b		# exit if not equal
	addqd	4,r1			# increment source pointer
	addqd	4,r2			# increment destination pointer
	movd	r0,r5			# save it for storage
	subd	$0x01010101,r0		# subtract magic number 1
	bicd	r5,r0			# clear off original bits
	andd	$0x80808080,r0		# test for borrow
	cmpqd	0,r0			# is it zero?
	beq	cplp:b			# no, loop
	br	cpxit:w

	.align	4
cpxit1a:
# the 4 current bytes don't match. Find out why.
# There is no zero byte in the current 4 bytes of source
cpxit1:	cmpw	r5,0(r2)		# is it the first word?
	bne	cpx3a1:b		# yes, exit now
	cmpb	2(r1),2(r2)		# next?
	bne	cpx3c:b			# yes...
	movb	3(r2),r0		# get s2
	subb	3(r1),r0		# subtract s1
	movxbd	r0,r0			# sign extended return value
	movd	tos,r5			# pop saved register
	ret	$(0)

cpx3a1: cmpb	r5,0(r2)		# is it the first byte?
	bne	cpx3a:b			# yes, exit now
	movb	1(r2),r0		# get destination
	subb	1(r1),r0		# subtract source
	movxbd	r0,r0			# sign extend return value
	movd	tos,r5			# pop saved register
	ret	$0

	.align	4
cpx3a:	movb	0(r2),r0		# get destination
	subb	r5,r0			# subtract source
	movxbd	r0,r0			# sign extend return value
	movd	tos,r5			# pop saved register
	ret	$0
	.align	4
cpx3c:	movb	2(r2),r0		# get destination
	subb	2(r1),r0		# subtract source
	movxbd	r0,r0			# sign extend return value
	movd	tos,r5			# pop saved register
	ret	$0
	.align	4
# 1 of the 4 current bytes is zero.
# check to see what it means
# Each byte is compared before the terminator test, so a string that
# ends here never compares equal to a longer one.
cpxit:	movb	0(r2),r0		# get s2
	cmpb	r0,r5			# does it match s1?
	bne	cpx1:b			# no, return the difference
	cmpqb	0,r5			# matched; was it the terminator?
	beq	cpx2:w			# yes, strings are equal
	movb	1(r1),r5		# get s1
	movb	1(r2),r0		# get s2
	cmpb	r0,r5			# does it match?
	bne	cpx1:b			# no, return the difference
	cmpqb	0,r5			# matched; was it the terminator?
	beq	cpx2:b			# yes, strings are equal
	movb	2(r1),r5		# get s1
	movb	2(r2),r0		# get s2
	cmpb	r5,r0			# does it match?
	bne	cpx1:b			# no, return the difference
	cmpqb	0,r5			# matched; was it the terminator?
	beq	cpx2:b			# yes, strings are equal
	movb	3(r1),r5		# get s1 (must be the terminator)
	movb	3(r2),r0		# get s2; cpx1 returns 0 if it is zero too
	.align	4
cpx1:	subb	r5,r0			# subtract to get diff
	movxbd	r0,r0			# sign extended return value
	movd	tos,r5			# pop saved register
	ret	$(0)
	.align	4
cpx2:	movqd	0,r0			# strings are equal
	movd	tos,r5			# pop saved register
	ret	$0			# and return
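
Since the routines above are only lightly tested, a small harness along these lines could be used to check a replacement strcmp against a plain byte-wise reference. It is a sketch: every name in it is invented, only the sign of the result is compared, and each test string is padded to 8 bytes so that word-sized reads stay inside the array. Pairs where one string is a prefix of the other (such as "ABCD" and "ABCDX") are included because the terminator handling is the easiest part to get wrong.

```c
#include <assert.h>
#include <string.h>

typedef int (*cmp_fn)(const char *, const char *);

/* Reduce a strcmp-style result to -1, 0 or +1. */
int sign_of(int v)
{
    return (v > 0) - (v < 0);
}

/* Plain byte-at-a-time comparison used as the reference. */
int reference_cmp(const char *a, const char *b)
{
    while (*a != '\0' && *a == *b) {
        a++;
        b++;
    }
    return (unsigned char)*a - (unsigned char)*b;
}

/* Count the string pairs on which cmp disagrees in sign with the
 * reference (hypothetical helper). */
int count_mismatches(cmp_fn cmp)
{
    static const char cases[][8] = {
        "", "A", "AB", "ABC", "ABCD", "ABCDE", "ABCDX", "ABD"
    };
    const int n = (int)(sizeof cases / sizeof cases[0]);
    int bad = 0;

    assert(cmp != NULL);
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            if (sign_of(cmp(cases[i], cases[j])) !=
                sign_of(reference_cmp(cases[i], cases[j])))
                bad++;
    return bad;
}
```

Passing the routine under test to count_mismatches should return 0; the C library's own strcmp does.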




-- 
Dave Rand
{pyramid|mips|bct|vsi1}!daver!dlr	Internet: dlr@daver.bungi.com