[comp.sys.m68k] 68000 Tricks

kent@rose3.rosemount.com (Kent Schnaith) (02/24/88)

Sounds like the beginning of a contest ?
Here is my two bits, a couple of routines in a non-standard
version of 68000 asm.

!
!	Kent Schnaith
!	Rosemount Inc.
!	...!ihnp4!rosevax!rose3!kent


!**
!*	pop0cnt: Count 0 bits in a long.
!*	pop1cnt: Count 1 bits in a long.
!*
!*	Entry	-	d0 = long to test
!*	Exit	-	d0 = count (word)
!*	Uses	-	d0, d1, d2
!*

pop0cnt:
	movl	d0,d1		! d1 holds input
	moveq	#32,d0		! assume ALL zeros
	tstl	d1
	bras	popl2

pop1cnt:
	movl	d0,d1
	moveq	#32,d0		! assume ALL ones
	notl	d1		! want to count 1's, not 0's
	bras	popl2

popl1:	movl	d1,d2		! mask off lowest 1 bit using
	subql	#1,d2		!	d1 = d1 & (d1 - 1)
	andl	d2,d1		! thanks to Gene H. Olson

popl2:	dbeq	d0,popl1	! count zero's

	rts

!**
!*	strlen: Find length of a 0 terminated string.
!*
!*	Entry	-	a0 = string address
!*	Exit	-	d0 = length (word)
!*	Uses	-	a0, d0
!*

strlen:
	moveq	#-1,d0		! assume a big string
	
strl1:	tstb	(a0)+		! words would be faster, less trivial
	dbeq	d0,strl1	! 68010 cache is nice here

	negw	d0		! d0 = $FFFF - d0
	addw	#$FFFF,d0

	rts

jangr@microsoft.UUCP (Jan Gray) (02/26/88)

In article <326@rose3.rosemount.com> kent@rose3.rosemount.com (Kent Schnaith) writes:
>Sounds like the beginning of a contest ?
>!*	pop1cnt: Count 1 bits in a long.

(Assuming you can't afford a 256 byte lookup table...)

Bitcount using an adder tree.  (My thanks to some clever people here for
their various improvements to the basic algorithm, which was posted to the
net years ago.)

; 32 bit bitcount, input in d0.l
	movl	#0xaaaaaaaa,d1
	andl	d0,d1
	xorl	d1,d0
	lsrl	#1,d1
	addl	d1,d0		; d0.l = 16 two bit counters.
	movl	#0xcccccccc,d1
	andl	d0,d1
	xorl	d1,d0
	lsrl	#2,d1
	addl	d1,d0		; d0.l = 8 four bit counters.
	movl	#0xf0f0f0f0,d1
	andl	d0,d1
	xorl	d1,d0
	lsrl	#4,d1
	addl	d1,d0		; d0.l = 4 eight bit counters.
	movl	d0,d1
	swap	d1
	addw	d1,d0		; d0.w = 2 eight bit counters.
	movw	d0,-(sp)	; trick to extract upper byte of d0.w
	addb	(sp)+,d0	; d0.b = 1 eight bit sum of all bits.


>!*	strlen: Find length of a 0 terminated string.

On a '020, the four bytes at a time strlen should be about twice as fast as
an unrolled "tst (a0)+; beq done" loop.

; strlen, string in a0, 68020 only (a0 need not be longword aligned)
	movl	a0,a1
	movl	#$01010101,d2
	movl	#$80808080,d3

next4:	movl	(a0)+,d0	; ((l - 0x01010101) & ~l & 0x80808080) != 0 iff
	movl	d0,d1		; l contains a zero byte.
	subl	d2,d0
	notl	d1
	andl	d1,d0
	andl	d3,d0
	beq	next4

	subq	#4,a0
	tstb	(a0)+
	beq	adj
	tstb	(a0)+
	beq	adj
	tstb	(a0)+
	bne	noadj
adj:	subq	#1,a0
noadj:	subl	a1,a0
	movl	a0,d0
; result in d0

Jan Gray   uunet!microsoft!jangr   Microsoft Corp., Redmond Wash.   206-882-8080

cs661s01@uhccux.UUCP (Cs661s01) (03/03/88)

Odd that Kent Schnaith would bring up bit-counting.  I spent part of
the weekend doing a bit-counter.  Of course, I cheated and used a 64K-byte
table, so there's not much point in including the code here.

In Kent's strlen, shouldn't that "add.w #$ffff, d0" be just "subq.w #1, d0"?
And while we're shaving cycles, doesn't
        neg.w d0
        subq.w #1, d0
just become "not.w d0"?  (Take the hint from his comment "d0 = $FFFF - d0").

Here's a puzzle, unsolved as far as I know -- I need a fast way to compare
two bitmaps, to see which has more bits set.  Is there a faster way than
counting each one's bits and comparing?  If it helps, the two maps have
no 1's in common.  (This is to evaluate an endgame in Othello.)  A 68000
solution is preferable, but anything is interesting...

Jan Gray's strlen seems preferable, especially since it looks right for
strings longer than 64K.  Has it actually been timed on an '020?  Seems
like a 7-instruction loop for every four bytes isn't ALL that much
better than a 2-instruction loop for every byte...  I like the simpler
solution adapted to just compare pointers, so the loop is just:
    loop:   tst.b (An)+
            bne.s loop
...and then the final pointer has the initial one subtracted (probably
give or take 1).  Good especially if strings are likely to be short.

 -- Mike Morton // P.O. Box 11378, Honolulu, HI  96828, (808) 456-8455 HST
      Internet: msm@uhmanoa.ics.hawaii.edu
          UUCP: {ihnp4,uunet,dcdwest,ucbvax}!ucsd!nosc!uhmanoa!msm
        BITnet: msm%uhmanoa.ics.hawaii.edu@uhccux
    (anagrams): Mr. Machine Tool; Ethical Mormon; Chosen Immortal; etc.

jangr@microsoft.UUCP (Jan Gray) (03/05/88)

>Jan Gray's strlen seems preferable, especially since it looks right for
>strings longer than 64K.  Has it actually been timed on an '020?  Seems
>like a 7-instruction loop for every four bytes isn't ALL that much
>better than a 2-instruction loop for every byte...  I like the simpler
>solution adapted to just compare pointers, so the loop is just:
>    loop:   tst.b (An)+
>            bne.s loop

I haven't actually timed it, but here are the "from the book" timings:

; traditional, string in a0, unrolled 4 times
				; 68000   68020 b/c/w (best/cache/worst)
loop:	movb	(a0)+,d0	; 8		4/6/7
	beq	end		; 8		1/4/5
	movb	(a0)+,d0	; 8		4/6/7
	beq	end		; 8		1/4/5
	movb	(a0)+,d0	; 8		4/6/7
	beq	end		; 8		1/4/5
	movb	(a0)+,d0	; 8		4/6/7
	bne	loop		; 10		3/6/9
	; total for 4 chars:	; 66		22/42/50


; identity, string in a0, must be longword aligned on 68000
				; 68000   68020 b/c/w
	movl	#$01010101,d2	; 12		0/6/5
	movl	#$80808080,d3	; 12		0/6/5

loop:	movl	(a0)+,d0	; 12		4/6/7
	movl	d0,d1		; 4		0/2/3
	subl	d2,d0		; 8		0/2/3
	notl	d1		; 6		0/2/3
	andl	d1,d0		; 8		0/2/3
	andl	d3,d0		; 8		0/2/3
	beq	loop		; 10		3/6/9
	; total for 4 chars:	; 56		7/22/31

As you can see, byte-at-a-time is slower because it reads from memory four
times more often.  The speedup should be even more pronounced if your memory
subsystem adds wait states.

Note that if you hope to recode strlen on your typical 68020 UNIX box, you
will have to longword align a0 or it may cause a segmentation fault if you
strlen an unaligned string placed at the end of the last valid page of your
process.

If you try the analogue of this code on a 386, you will be amazed to discover
that it scans and copies strings faster than the 386 string instructions
themselves!  (Actually, this is not quite ALWAYS true, the extra startup
costs dominate for short unaligned strings).  However, the string instructions
are usually preferable because they cause less bus traffic.

[To the person doing Othello: consider converting from a base 4 (white, black,
empty, unused) representation to base 3 (white, black, empty) using lookup
tables.  Then many of your tables only use 3^x instead of 4^x bytes.]

Jan Gray   uunet!microsoft!jangr   Microsoft Corp., Redmond Wash.   206-882-8080