[net.bugs.uucp] Further improvements to pk0.c - chksum

bb@wjh12.UUCP (byer) (07/11/83)

[ Reference to article umcp-cs.674 ]

Ferst, we shuddn't be so hastey (* see below) to criticize (destructively,
no less) other people's code (``... that routine ... sucks''); as Chris
is surely aware, during development of a large & complex program, the
author is correct to concentrate on the larger problem and leave the
`trivial' details until later.  After all, the routine works properly
on the 11 & Vax, and those are the architectures for which Bell licenses
it.  If anyone wishes to sling stones (or feces), they should aim at
Unisoft, for they did not bother to verify the compatibility of their
implementation before shipping it.  Enough dirt; on to the meat...

I have made further improvements to Chris Torek's submission; the
resulting routine is about 17% faster.  Since I don't believe in
`fast enough', I have fiddled with the Vax assembly code for an
additional 25% savings.  Considering the many megabytes passing
through this routine daily, such anality might be warranted.
-----
* 3 spelling errors - prerequisite for a news submission, no??
-----

For those with 11's, there are savings to be gained, but smaller.
Only with mods to the assembly code could I get anything really
worthwhile (20%).

Timings:  microsecs per call of chksum(buf, 128)

              original   C. Torek's   below (C)   below (as)
VAX-11/780      2460        2050        1720        1300
PDP-11/44       2500         --         2400        2140
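
The savings implied by the table can be checked with a few lines of C. This is only a sketch: the helper name saving_pct is made up for the illustration, and the inputs are the table entries above.

```c
#include <assert.h>

/* Percentage of time saved going from `before` to `after` microseconds.
 * Hypothetical helper for this illustration, not part of pk0.c. */
double saving_pct(double before, double after)
{
    assert(before > 0.0);
    return 100.0 * (before - after) / before;
}
```

On the VAX row, saving_pct(2050, 1720) comes to roughly 16 and saving_pct(1720, 1300) to roughly 24. On the PDP-11 row, saving_pct(2500, 2400) is 4 and saving_pct(2400, 2140) is a little under 11.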


The modified code follows:

For the Vax:

chksum (s, n)
register char *s;
register n;
{
	register sum, x;
	register unsigned t;

	sum = -1;
	x = 0;
	do {
		/* Rotate left, copying bit 15 to bit 0 */
		sum <<= 1;
/* NOTE #1 */	if (sum & 0x10000) {
			sum &= 0xffff;
			sum++;
		}
		t = sum;
/* NOTE #2 */	sum += (*s++ & 0377);
/* NOTE #2 */	sum &= 0xffff;
		x += sum ^ n;
		if ((unsigned)sum <= t)	/* (unsigned) not necessary, but doesn't hurt */
			sum ^= x & 0xffff;	/* x is 32 bits here; keep sum (and t) within 16 */
	} while (--n > 0);
/* NOTE #3 ^^ */

	return (int) (short) sum;
}

NOTES:
	1. All the savings are here; you figure it out.
	2. This simplification didn't make any difference on the Vax,
	   but probably would on other architectures/optimizers.
	3. Surprisingly, this very common loop terminator generated
	   less than optimal code:
		decl r10; jgtr Lxx	instead of
		sobgtr r10,Lxx
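
One possible reading of NOTE #1, offered as a guess and not as the author's own explanation: with the 16-bit sum held in a full-width register, the rotate becomes a shift followed by a test of bit 16, and the wrap-around fix-up runs only when that bit is set. A minimal sketch comparing that form with a textbook rotate follows. The function names are invented here, and fixed-width types stand in for the Vax register.

```c
#include <assert.h>
#include <stdint.h>

/* Textbook 16-bit rotate left by one, for comparison. */
uint16_t rotl16_plain(uint16_t v)
{
    return (uint16_t)(((uint32_t)v << 1) | (v >> 15));
}

/* The NOTE #1 form: shift in a wider register, then fold bit 16 back
 * into bit 0 only when it is set.  The argument must already fit in
 * 16 bits. */
uint32_t rotl16_wide(uint32_t v)
{
    assert(v <= 0xffff);
    v <<= 1;
    if (v & 0x10000) {
        v &= 0xffff;
        v++;
    }
    return v;
}

/* Count the 16-bit inputs on which the two forms disagree. */
int rotl16_mismatches(void)
{
    uint32_t v;
    int bad = 0;

    for (v = 0; v <= 0xffff; v++)
        if (rotl16_wide(v) != rotl16_plain((uint16_t)v))
            bad++;
    return bad;
}
```

Calling rotl16_mismatches() walks all 65536 inputs; a result of zero means the shift-and-test form is a faithful rotate as long as its input stays within 16 bits.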

Warning: hard-core hacking below -- (but it's worth an extra 25%)
  [ cf:    cc -S -O chksum.c  (extracted from pk0.c) ]

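	.align	1
	.globl	_chksum
	.set	L25,0xf80		# entry mask: save r7-r11
	.data
	.text
_chksum:.word	L25
	movl	4(ap),r11		# r11 = s
	movl	8(ap),r10		# r10 = n
	mnegl	$1,r9			# r9 = sum = -1
	clrl	r8			# r8 = x = 0
L31:	ashl	$1,r9,r9		# sum <<= 1
	jbc	$16,r9,L32		# old bit 15 clear: nothing to wrap
	movzwl	r9,r9			# sum &= 0xffff
	incl	r9			# old bit 15 -> bit 0
L32:	movl	r9,r7			# r7 = t = sum
	movzbl	(r11)+,r0		# r0 = *s++ & 0377
	addw2	r0,r9			# 16-bit add; high word of r9 untouched
	xorl3	r10,r9,r0		# r0 = sum ^ n
	addl2	r0,r8			# x += sum ^ n
	cmpl	r9,r7
	jgtru	L30			# skip if sum > t (unsigned)
	xorl2	r8,r9			# sum ^= x
L30:	sobgtr	r10,L31			# while (--n > 0)
	cvtwl	r9,r0			# return (int) (short) sum
	ret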

( Can you squeeze another microsecond out? )
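
Anyone taking up that challenge will want a way to confirm that a tuned routine still returns the same checksum. A minimal sketch of such a cross-check in portable C follows. It assumes the intended arithmetic is 16 bits wide throughout, which is inferred from the C routine above and not from the pk0.c source itself. The names chksum_ref, chksum_wide and chksum_agree are invented for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Reference: every quantity is held to 16 bits. */
int chksum_ref(const unsigned char *s, int n)
{
    uint16_t sum = 0xffff, x = 0, t;

    assert(n > 0);
    do {
        sum = (uint16_t)(((uint32_t)sum << 1) | (sum >> 15)); /* rotate left */
        t = sum;
        sum = (uint16_t)(sum + *s++);
        x = (uint16_t)(x + (sum ^ (uint16_t)n));
        if (sum <= t)
            sum ^= x;
    } while (--n > 0);
    /* sign-extend the 16-bit result, like (int)(short)sum */
    return sum >= 0x8000 ? (int)sum - 0x10000 : (int)sum;
}

/* Wide-register variant in the style of the routine above. */
int chksum_wide(const unsigned char *s, int n)
{
    uint32_t sum = 0xffff, x = 0, t;

    assert(n > 0);
    do {
        sum <<= 1;
        if (sum & 0x10000) {        /* old bit 15 was set: wrap it around */
            sum &= 0xffff;
            sum++;
        }
        t = sum;
        sum = (sum + *s++) & 0xffff;
        x += sum ^ (uint32_t)n;
        if (sum <= t)
            sum ^= x & 0xffff;      /* x can exceed 16 bits; keep sum clean */
    } while (--n > 0);
    return sum >= 0x8000 ? (int)sum - 0x10000 : (int)sum;
}

/* Return 1 if both routines agree on an arbitrary fixed pattern
 * for every length from 1 to 128, else 0. */
int chksum_agree(void)
{
    unsigned char buf[128];
    int i, n;

    for (i = 0; i < 128; i++)
        buf[i] = (unsigned char)((i * 37) & 0xff);
    for (n = 1; n <= 128; n++)
        if (chksum_ref(buf, n) != chksum_wide(buf, n))
            return 0;
    return 1;
}
```

A new variant can be dropped in beside chksum_wide and compared against chksum_ref in the same way before any timing is done.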
------
For the 11:

In C, the only change to be made is in the loop control, so as to
execute the sob (an instruction, not a cry for capital punishment).
Replace		do {
			...
		} while (--n > 0);

with		for (n++; --n > 0; ) {
			...
		}
That's worth a measly 4%
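
The rewritten loop has to run the body exactly as many times as the original. A small sketch that counts iterations for the do-while form and for two spellings of the for-loop test follows. The counter functions are illustrative only. Note that sob itself branches while the decremented register is nonzero, which agrees with a `> 0` test whenever n starts out positive.

```c
#include <assert.h>

/* Body executions of:  do { ... } while (--n > 0); */
int count_do_while(int n)
{
    int runs = 0;

    assert(n >= 0);
    do {
        runs++;
    } while (--n > 0);
    return runs;
}

/* Body executions of:  for (n++; --n > 0; ) { ... } */
int count_for_gt(int n)
{
    int runs = 0;

    assert(n >= 0);
    for (n++; --n > 0; )
        runs++;
    return runs;
}

/* Body executions of:  for (n++; --n >= 0; ) { ... } */
int count_for_ge(int n)
{
    int runs = 0;

    assert(n >= 0);
    for (n++; --n >= 0; )
        runs++;
    return runs;
}
```

For n = 128 the do-while and the `> 0` for loop both run the body 128 times, while a `>= 0` test runs it 129 times, touching one byte past the buffer.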


For a 20% saving, use the following assembly code:
  [ cf:   cc -S -O chksum.c ]

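.globl	_chksum
_chksum:
~~chksum:
	jsr	r5,csv			/ standard C entry sequence
	mov	4(r5),r4		/ r4 = s
~s=r4
	mov	6(r5),r3		/ r3 = n
~n=r3
	sub	$4,sp			/ stack space for locals
~sum=r2
~t=r1
~x=177766
	mov	$-1,r2			/ sum = -1
	clr	-12(r5)			/ x = 0
	inc	r3			/ n++, since sob decrements first
	jbr	L13
L20005:	tst	r2			/ rotate sum left by one:
	jge	L15
	asl	r2			/   bit 15 was set, so shift
	inc	r2			/   and carry it into bit 0
	jbr	L16
L15:	asl	r2			/   bit 15 clear: plain shift
L16:	mov	r2,r1			/ t = sum
	movb	(r4)+,r0		/ r0 = *s++ (sign-extended)
	bic	$-400,r0		/ & 0377
	add	r0,r2			/ sum += byte
	mov	r3,r0
	xor	r2,r0			/ r0 = sum ^ n
	add	r0,-12(r5)		/ x += sum ^ n
	cmp	r1,r2
	jlo	L13			/ skip if t < sum (unsigned)
	mov	-12(r5),r0
	xor	r0,r2			/ sum ^= x
L13:	sob	r3,L20005		/ loop while --n != 0
	mov	r2,r0			/ return sum
	jmp	cret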
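
The movb/bic pair in the listing is the machine-level form of `*s++ & 0377`: a byte move into a register sign-extends, and bic then clears the high byte. A minimal C model of that one step follows; the function name fetch_byte is made up for the illustration.

```c
#include <assert.h>

/* Models  movb (r4)+,r0 ; bic $-400,r0  for a single byte. */
int fetch_byte(signed char c)
{
    int r0 = c;          /* movb into a register sign-extends */

    r0 &= ~(-0400);      /* bic clears every bit that is set in -0400 (octal) */
    assert(r0 >= 0 && r0 <= 0377);
    return r0;
}
```

Without the mask, any byte with its top bit set would be added to the sum as a negative number.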
-------

Brent Byer		 ``I think we're all bozos on this bus.''
  Textware Intl.	(decvax!genrad!wjh12!textware!brent)

padpowell@wateng.UUCP (PAD Powell[Admin]) (07/13/83)

When a hacker is stung, he really can do it.  Personally, I wonder if anybody
is going to go out and discover that the checksum thing can be done by peeking
in a register in an HDLC chip, and then building a board...
Sigh.
Patrick Powell