kent@rose3.rosemount.com (Kent Schnaith) (02/24/88)
Sounds like the beginning of a contest ? Here is my two bits, a couple of routines in a non-standard version of 68000 asm. ! ! Kent Schnaith ! Rosemount Inc. ! ...!ihnp4!rosevax!rose3!kent !** !* pop0cnt: Count 0 bits in a long. !* pop1cnt: Count 1 bits in a long. !* !* Entry - d0 = long to test !* Exit - d0 = count (word) !* Uses - d0, d1, d2 !* pop0cnt: movl d0,d1 ! d1 holds input moveq #32,d0 ! assume ALL zeros tstl d1 bras popl2 pop1cnt: movl d0,d1 moveq #32,d0 ! assume ALL ones notl d1 ! want to count 1's, not 0's bras popl2 popl1: movl d1,d2 ! mask off lowest 1 bit using subql #1,d2 ! d1 = d1 & (d1 - 1) andl d2,d1 ! thanks to Gene H. Olson popl2: dbeq d0,popl1 ! count zero's rts !** !* strlen: Find length of a 0 terminated string. !* !* Entry - a0 = string address !* Exit - d0 = length (word) !* Uses - a0, d0 !* strlen: moveq #-1,d0 ! assume a big string strl1: tstb (a0)+ ! words would be faster, less trivial dbeq d0,strl1 ! 68010 cache is nice here negw d0 ! d0 = $FFFF - d0 addw #$FFFF,d0 rts
jangr@microsoft.UUCP (Jan Gray) (02/26/88)
In article <326@rose3.rosemount.com> kent@rose3.rosemount.com (Kent Schnaith) writes: >Sounds like the beginning of a contest ? >!* pop1cnt: Count 1 bits in a long. (Assuming you can't afford a 256 byte lookup table...) Bitcount using an adder tree. (My thanks to some clever people here for their various improvements to the basic algorithm, which was posted to the net years ago.) ; 32 bit bitcount, input in d0.l movl #0xaaaaaaaa,d1 andl d0,d1 xorl d1,d0 lsrl #1,d1 addl d1,d0 ; d0.l = 16 two bit counters. movl #0xcccccccc,d1 andl d0,d1 xorl d1,d0 lsrl #2,d1 addl d1,d0 ; d0.l = 8 four bit counters. movl #0xf0f0f0f0,d1 andl d0,d1 xorl d1,d0 lsrl #4,d1 addl d1,d0 ; d0.l = 4 eight bit counters. movl d0,d1 swap d1 addw d1,d0 ; d0.w = 2 eight bit counters. movw d0,-(sp) ; trick to extract upper byte of d0.w addb (sp)+,d0 ; d0.b = 1 eight bit sum of all bits. >!* strlen: Find length of a 0 terminated string. On a '020, the four bytes at a time strlen should be about twice as fast as an unrolled "tst (a0)+; beq done" loop. ; strlen, string in a0, 68020 only (a0 need not be longword aligned) movl a0,a1 movl #$01010101,d2 movl #$80808080,d3 next4: movl (a0)+,d0 ; ((l - 0x01010101) & ~l & 0x80808080) != 0 iff movl d0,d1 ; l contains a zero byte. subl d2,d0 notl d1 andl d1,d0 andl d3,d0 beq next4 subq #4,a0 tstb (a0)+ beq adj tstb (a0)+ beq adj tstb (a0)+ bne noadj adj: subq #1,a0 noadj: subl a1,a0 movl a0,d0 ; result in d0 Jan Gray uunet!microsoft!jangr Microsoft Corp., Redmond Wash. 206-882-8080
cs661s01@uhccux.UUCP (Cs661s01) (03/03/88)
Odd that Kent Schnaith would bring up bit-counting. I spent part of the weekend doing a bit-counter. Of course, I cheated and used a 64K-byte table, so there's not much point in including the code here. In Kent's strlen, shouldn't that "add.w #$ffff, d0" be just "subq.w #1, d0"? And while we're shaving cycles, doesn't neg.w d0 subq.w #1, d0 just become "not.w d0"? (Take the hint from his comment "d0 = $FFFF - d0"). Here's a puzzle, unsolved as far as I know -- I need a fast way to compare two bitmaps, to see which has more bits set. Is there a faster way than counting each one's bits and comparing? If it helps, the two maps have no 1's in common. (This is to evaluate an endgame in Othello.) A 68000 solution is preferable, but anything is interesting... Jan Gray's strlen seems preferable, especially since it looks right for strings longer than 64K. Has it actually been timed on an '020? Seems like a 7-instruction loop for every four bytes isn't ALL that much better than a 2-instruction loop for every byte... I like the simpler solution adapted to just compare pointers, so the loop is just: loop: tst.b (An)+ bne.s loop ...and then the final pointer has the initial one subtracted (probably give or take 1). Good especially if strings are likely to be short. -- Mike Morton // P.O. Box 11378, Honolulu, HI 96828, (808) 456-8455 HST Internet: msm@uhmanoa.ics.hawaii.edu UUCP: {ihnp4,uunet,dcdwest,ucbvax}!ucsd!nosc!uhmanoa!msm BITnet: msm%uhmanoa.ics.hawaii.edu@uhccux (anagrams): Mr. Machine Tool; Ethical Mormon; Chosen Immortal; etc.
jangr@microsoft.UUCP (Jan Gray) (03/05/88)
>Jan Gray's strlen seems preferable, especially since it looks right for >strings longer than 64K. Has it actually been timed on an '020? Seems >like a 7-instruction loop for every four bytes isn't ALL that much >better than a 2-instruction loop for every byte... I like the simpler >solution adapted to just compare pointers, so the loop is just: > loop: tst.b (An)+ > bne.s loop I haven't actually timed it, but here are the "from the book" timings: ; traditional, string in a0, unrolled 4 times ; 68000 68020 b/c/w (best/cache/worst) loop: movb (a0)+,d0 ; 8 4/6/7 beq end ; 8 1/4/5 movb (a0)+,d0 ; 8 4/6/7 beq end ; 8 1/4/5 movb (a0)+,d0 ; 8 4/6/7 beq end ; 8 1/4/5 movb (a0)+,d0 ; 8 4/6/7 bne loop ; 10 3/6/9 ; total for 4 chars: ; 66 22/42/50 ; identity, string in a0, must be longword aligned on 68000 ; 68000 68020 b/c/w movl #$01010101,d2 ; 12 0/6/5 movl #$80808080,d3 ; 12 0/6/5 loop: movl (a0)+,d0 ; 12 4/6/7 movl d0,d1 ; 4 0/2/3 subl d2,d0 ; 8 0/2/3 notl d1 ; 6 0/2/3 andl d1,d0 ; 8 0/2/3 andl d3,d0 ; 8 0/2/3 beq loop ; 10 3/6/9 ; total for 4 chars: ; 56 7/22/31 As you can see, byte-at-a-time is slower because it reads from memory four times more often. The speedup should be even more pronounced if your memory subsystem adds wait states. Note that if you hope to recode strlen on your typical 68020 UNIX box, you will have to longword align a0 or it may cause a segmentation fault if you strlen an unaligned string placed at the end of the last valid page of your process. If you try the analogue of this code on a 386, you will be amazed to discover that it scans and copies strings faster than the 386 string instructions themselves! (Actually, this is not quite ALWAYS true, the extra startup costs dominate for short unaligned strings). However, the string instructions are usually preferable because they cause less bus traffic. [To the person doing Othello: consider converting from a base 4 (white, black, empty, unused) representation to base 3 (white, black, empty) using lookup tables. Then many of your tables only use 3^x instead of 4^x bytes.] Jan Gray uunet!microsoft!jangr Microsoft Corp., Redmond Wash. 206-882-8080