[comp.unix.microport] speeding up compress on 286

david@bdt.UUCP (07/10/88)

Has anyone looked into speeding up compress on the 286?  Under
Microport System V/AT compress runs really slooow.  On my 10MHz
286 with (only) 2MB RAM a 60K test file generally gives times of:

	20.0u 1.0s

or thereabouts.  I assume the slowness is mostly due to the hack
to "simulate" larger than 64K arrays (which Xenix and Microport don't
handle!).  My particular problem may be related to swapping, in which
case the speeds might be better if I had more RAM.
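
For readers who have not seen this kind of hack, here is a minimal
sketch in C of one common way to fake an array larger than 64K: split it
into chunks that each fit in a segment and index in two steps.  This is
not taken from the compress source; the names (big_array, big_init,
big_at, big_free) and the 8192-entry chunk size are assumptions made for
illustration.

```c
#include <stdlib.h>

/* Hypothetical sizes: 2^13 entries per chunk keeps a chunk of
   4-byte entries at 32K, under the 64K segment limit. */
#define CHUNK_BITS 13
#define CHUNK_SIZE (1L << CHUNK_BITS)
#define CHUNK_MASK (CHUNK_SIZE - 1)

typedef struct {
    long **chunks;   /* one separately allocated block per chunk */
    long nchunks;
} big_array;

/* Allocate zeroed room for n entries; returns 0 on success, -1 on failure. */
int big_init(big_array *a, long n)
{
    long i;

    a->nchunks = (n + CHUNK_SIZE - 1) >> CHUNK_BITS;
    a->chunks = malloc((size_t)a->nchunks * sizeof *a->chunks);
    if (a->chunks == NULL)
        return -1;
    for (i = 0; i < a->nchunks; i++) {
        a->chunks[i] = calloc((size_t)CHUNK_SIZE, sizeof **a->chunks);
        if (a->chunks[i] == NULL) {
            while (i-- > 0)
                free(a->chunks[i]);
            free(a->chunks);
            return -1;
        }
    }
    return 0;
}

/* Every access pays for a shift and a mask on top of the lookup. */
long *big_at(big_array *a, long i)
{
    return &a->chunks[i >> CHUNK_BITS][i & CHUNK_MASK];
}

void big_free(big_array *a)
{
    long i;

    for (i = 0; i < a->nchunks; i++)
        free(a->chunks[i]);
    free(a->chunks);
}
```

Note that when the index is a long, the shift in big_at is itself a
32-bit shift on a 16-bit machine, so each table access adds to the
cost.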

Does anybody else suffer similar performance problems and if so, has
anybody looked into speeding it up?
-- 
David Beckemeyer (david@bdt.uucp)	| "Yea I've got medicine..." as the 
Beckemeyer Development Tools		| cookie cocks his Colt, "and if
478 Santa Clara Ave, Oakland, CA 94610	| you don't keep your mouth shut, I'm
UUCP: {unisoft,sun}!hoptoad!bdt!david 	| gonna give you a big dose of it!"

hutch@hubcap.UUCP (David Hutchens) (07/13/88)

From article <347@bdt.UUCP>, by david@bdt.UUCP (David Beckemeyer):
> 
> Has anyone looked into speeding up compress on the 286?  Under
> Microport System V/AT compress runs really slooow.  On my 10MHz
> 286 with (only) 2MB RAM a 60K test file generally gives times of:
> 
> 	20.0u 1.0s
> 
> or thereabouts.  I assume the slowness is mostly due to the hack
> to "simulate" larger than 64K arrays (which Xenix and Microport don't
> handle!).  My particular problem may be related to swapping, in which
> case the speeds might be better if I had more RAM.
> 
> Does anybody else suffer similar performance problems and if so, has
> anybody looked into speeding it up?
> -- 

I don't know about Microport, but I have found that a LOT of time
is spent doing long shifts on my Xenix system when I use a 16-bit
compress.  This is in part because the C compiler generates a call
to a routine to do long shifts.  What is worse, they coded the
routine so that it is space efficient, rather than time efficient (It
uses a total of 3 or 4 286 instructions looping through them as many
times as the number of bits you wish to shift: i.e. it shifts one
bit each time through the loop.)  I found that I could write my
own routine - using a grand total of 50 more bytes or so, and in doing
so I decreased the time required to do a 16-bit compress by about 30%!

I don't have the code in front of me but the basic idea was to use
the 16-bit shift instructions and OR together the appropriate results.
I suspect that for 1 and possibly 2 bit shifts the provided routine is
faster, but compress does a lot of shifts of 10 bits or more, and with
these, my routine wins by a BIG margin.
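
The contrast between the two approaches can be sketched in C.  This is
only an illustration of the idea described above, written for an
unsigned right shift and using 16-bit halves to stand in for the
registers; the function names are made up, and it is not the library
routine or the replacement itself.

```c
#include <stdint.h>

/* One bit per pass, in the style of the library routine described
   above: the cost grows with the shift count. */
uint32_t shr32_loop(uint32_t x, unsigned n)
{
    uint16_t lo = (uint16_t)x;
    uint16_t hi = (uint16_t)(x >> 16);

    while (n-- > 0) {
        /* bit 0 of the high word moves into bit 15 of the low word */
        lo = (uint16_t)((lo >> 1) | ((hi & 1u) << 15));
        hi >>= 1;
    }
    return ((uint32_t)hi << 16) | lo;
}

/* A fixed handful of 16-bit operations for any count from 0 to 31:
   shift each half and OR in the bits that cross the word boundary. */
uint32_t shr32_halves(uint32_t x, unsigned n)
{
    uint16_t lo = (uint16_t)x;
    uint16_t hi = (uint16_t)(x >> 16);

    if (n >= 16) {
        /* the low word is shifted out entirely */
        lo = (uint16_t)(hi >> (n - 16));
        hi = 0;
    } else {
        /* for n == 0 the crossing term truncates to zero */
        lo = (uint16_t)((lo >> n) | ((uint32_t)hi << (16 - n)));
        hi = (uint16_t)(hi >> n);
    }
    return ((uint32_t)hi << 16) | lo;
}
```

For a 10-bit shift the first version makes ten passes through its loop,
while the second does the same work in two shifts and an OR per word.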

		David Hutchens
		hutch@hubcap.clemson.edu
		...!gatech!hubcap!hutch

wes@obie.UUCP (Barnacle Wes) (07/14/88)

In article <347@bdt.UUCP>, david@bdt.UUCP (David Beckemeyer) writes:
> Has anyone looked into speeding up compress on the 286?  Under
> Microport System V/AT compress runs really slooow.

Mine, too.  That's why I stopped running compress - I just get
everything batched but uncompressed.  It's slower, but fast enough, and
works well.

> ................   I assume the slowness is mostly due to the hack
> to "simulate" larger than 64K arrays (which Xenix and Microport don't
> handle!).  My particular problem may be related to swapping, in which
> case the speeds might be better if I had more RAM.

It might be, but I upgraded my system from 1 meg to 3, and it really
didn't help much.  The 16-bit compress is right near the limit for
process size on V/AT, and it seems to swap a lot regardless of how much
memory you have.  You might want to look at the 13-bit compress, I
understand it is much faster, especially on brain-dead architectures
like the '286.
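
A rough back-of-the-envelope check of why the smaller code width might
help, sketched in C.  The table sizes and entry widths below are
assumptions recalled from the compress 4.0 source rather than anything
stated in this thread, so verify them against your own copy; the
function names are made up for the example.

```c
#include <stdio.h>

#define SEGMENT_LIMIT 65536L  /* largest single object on the 286 */
#define HTAB_ENTRY    4L      /* assumed: hash table entries are longs */
#define CODETAB_ENTRY 2L      /* assumed: code table entries are unsigned shorts */

/* Assumed hash table sizes (in entries) for each maximum code width. */
long hsize_for_bits(int bits)
{
    switch (bits) {
    case 16: return 69001L;
    case 13: return 9001L;
    case 12: return 5003L;
    default: return -1L;      /* not covered by this sketch */
    }
}

long htab_bytes(int bits)    { return hsize_for_bits(bits) * HTAB_ENTRY; }
long codetab_bytes(int bits) { return hsize_for_bits(bits) * CODETAB_ENTRY; }

/* Nonzero if each table fits in a single segment, so no chunking is needed. */
int fits_without_chunking(int bits)
{
    return htab_bytes(bits) <= SEGMENT_LIMIT &&
           codetab_bytes(bits) <= SEGMENT_LIMIT;
}

void print_table_sizes(void)
{
    static const int widths[] = { 12, 13, 16 };
    int i;

    for (i = 0; i < 3; i++) {
        int b = widths[i];
        printf("%2d bits: htab %6ld bytes, codetab %6ld bytes, %s\n",
               b, htab_bytes(b), codetab_bytes(b),
               fits_without_chunking(b) ? "fits" : "needs chunking");
    }
}
```

If those figures are right, the 16-bit tables come to roughly 400K and
neither fits in one segment, while the 13-bit tables total under 64K,
which would explain both the swapping and the speed difference.
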
-- 
                     {hpda, uwmcsd1}!sp7040!obie!wes
           "Happiness lies in being privileged to work hard for
           long hours in doing whatever you think is worth doing."
                         -- Robert A. Heinlein --

jsilva@cogsci.berkeley.edu (John Silva) (07/16/88)

I just finished hacking compress to be MUCH faster on my AT system (SCO
Xenix 2.2.0g) by replacing the original 32-bit shift routines with a set
of hand-coded routines.  I managed to speed up compress by about 24%!
(16-bit compressions spend most of their time shifting long integers
around, and the Microsoft compiler uses a one-bit-at-a-time shift routine
for 32-bit shifts.)

If anyone would like a copy of these routines (two 8086 asm sources),
I would be happy to mail them.  However, keep in mind that they may not
function correctly on flavors of Xenix other than SCO.

John P. Silva

---
UUCP:	ucbvax!cogsci!jsilva
DOMAIN:	jsilva@cogsci.berkeley.edu

hutch@hubcap.UUCP (David Hutchens) (07/26/88)

New improved version, now with assembly source.

Earlier I wrote:
> 
> I don't know about Microport, but I have found that a LOT of time
> is spent doing long shifts on my Xenix system when I use a 16-bit
> compress.  This is in part because the C compiler generates a call
> to a routine to do long shifts.  What is worse, they coded the
> routine so that it is space efficient, rather than time efficient (It
> uses a total of 3 or 4 286 instructions looping through them as many
> times as the number of bits you wish to shift: i.e. it shifts one
> bit each time through the loop.)  I found that I could write my
> own routine - using a grand total of 50 more bytes or so, and in doing
> so I decreased the time required to do a 16-bit compress by about 30%!
> 
> I don't have the code in front of me but the basic idea was to use
> the 16-bit shift instructions and OR together the appropriate results.
> I suspect that for 1 and possibly 2 bit shifts the provided routine is
> faster, but compress does a lot of shifts of 10 bits or more, and with
> these, my routine wins by a BIG margin.

I received several replies requesting the source.  Again, I must caution
that these routines are designed to work with Microsoft Xenix 2.0.
I don't have any idea whether they work with any other compiler/os.

It turns out that the Microsoft compiler uses a non-standard calling
sequence for its own built-in routines, including the long shift
operations.  These routines assume that the number to be shifted
is in the AX (lower order bits) and DX (higher order bits) registers at
entry (that is where the Microsoft C compiler I'm using puts them).
They assume that the number of bits to be shifted is in the CL register.
They destroy the CH register (I'm not positive that this is really safe,
but it works for the programs I have tried!).  I assemble the following
with 'as' and link it with the compress source.  Best of luck.  Remember
to test it well before giving it any trust.
 
 		David Hutchens
 		hutch@hubcap.clemson.edu
 		...!gatech!hubcap!hutch


----------  CUT HERE  -----------
;	Static Name Aliases
;
	TITLE   shift

	.287
_TEXT	SEGMENT  BYTE PUBLIC 'CODE'
_TEXT	ENDS
CONST	SEGMENT  WORD PUBLIC 'CONST'
CONST	ENDS
_BSS	SEGMENT  WORD PUBLIC 'BSS'
_BSS	ENDS
DGROUP	GROUP	CONST,	_BSS
	ASSUME  CS: _TEXT, DS: DGROUP, SS: DGROUP, ES: DGROUP
_TEXT      SEGMENT
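	PUBLIC	__lshr
; __lshr: signed (arithmetic) right shift of the long in DX:AX by CL bits.
; Result in DX:AX.  CH is clobbered for counts 0..15; CL is left as
; count-16 for counts above 15.
__lshr	PROC FAR
	cmp	cl,15
	jle	$LSRSMALL	; count 0..15: both words contribute
	sub	cl,16		; count 16..31: low word is shifted out entirely
	xchg	ax,dx		; old high word becomes the low word
	sar	ax,cl		; shift it by the remaining count-16 bits
	cwd			; fill DX with the sign of AX
	ret
$LSRSMALL:
	mov	ch,cl		; save count in CH
	push	dx		; save high word
	shr	ax,cl		; low word shifted right, zeros enter at the top
	sub	cl,16
	neg	cl		; cl = 16 - count
	shl	dx,cl		; high-word bits that cross into the low word
	or	ax,dx		; merge them into the low word
	pop	dx		; restore high word
	mov	cl,ch		; restore count
	sar	dx,cl		; high word shifted, sign preserved
	ret
__lshr	ENDP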

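	PUBLIC	__ulshr
; __ulshr: unsigned (logical) right shift of the long in DX:AX by CL bits.
; Same register usage as __lshr, but zeros are shifted in at the top.
__ulshr	PROC FAR
	cmp	cl,15
	jle	$ULSRSMALL	; count 0..15: both words contribute
	sub	cl,16		; count 16..31: low word is shifted out entirely
	xchg	ax,dx		; old high word becomes the low word
	shr	ax,cl		; shift it by the remaining count-16 bits
	sub	dx,dx		; high word is now all zeros
	ret
$ULSRSMALL:
	mov	ch,cl		; save count in CH
	push	dx		; save high word
	shr	ax,cl		; low word shifted right
	sub	cl,16
	neg	cl		; cl = 16 - count
	shl	dx,cl		; high-word bits that cross into the low word
	or	ax,dx		; merge them into the low word
	pop	dx		; restore high word
	mov	cl,ch		; restore count
	shr	dx,cl		; high word shifted, zeros enter at the top
	ret
__ulshr	ENDP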

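	PUBLIC	__lshl
; __lshl: left shift of the long in DX:AX by CL bits.
; Result in DX:AX.  CH is clobbered for counts 0..15; CL is left as
; count-16 for counts above 15.
__lshl	PROC FAR
	cmp	cl,15
	jle	$LSLSMALL	; count 0..15: both words contribute
	sub	cl,16		; count 16..31: high word is shifted out entirely
	mov	dx,ax		; old low word becomes the high word
	shl	dx,cl		; shift it by the remaining count-16 bits
	sub	ax,ax		; low word is now all zeros
	ret
$LSLSMALL:
	mov	ch,cl		; save count in CH
	push	ax		; save low word
	shl	dx,cl		; high word shifted left
	sub	cl,16
	neg	cl		; cl = 16 - count
	shr	ax,cl		; low-word bits that cross into the high word
	or	dx,ax		; merge them into the high word
	pop	ax		; restore low word
	mov	cl,ch		; restore count
	shl	ax,cl		; low word shifted, zeros enter at the bottom
	ret
__lshl	ENDP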
_TEXT	ENDS
END
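
One way to follow the advice about testing is to model a routine in C
and compare it against a reference shift for every count.  The sketch
below mirrors __lshr instruction by instruction; it is an assumption-laden
model, not a substitute for running the assembly.  In particular it
assumes 286 behavior (shift counts reduced modulo 32, a 16-bit shift by
16 or more clearing the register), and the struct and function names are
made up for the example.

```c
#include <stdint.h>

/* Stand-ins for the registers the routine uses. */
typedef struct { uint16_t ax, dx; uint8_t cl; } regs;

static uint16_t shl16(uint16_t v, uint8_t n)
{
    n &= 31;
    return n > 15 ? 0 : (uint16_t)((uint32_t)v << n);
}

static uint16_t shr16(uint16_t v, uint8_t n)
{
    n &= 31;
    return n > 15 ? 0 : (uint16_t)(v >> n);
}

static uint16_t sar16(uint16_t v, uint8_t n)
{
    uint16_t fill = (v & 0x8000u) ? 0xFFFFu : 0u;
    n &= 31;
    if (n > 15) n = 15;                     /* result is all sign bits */
    return (uint16_t)((v >> n) | ((uint32_t)fill << (16 - n)));
}

/* Mirrors __lshr; the branch matches jle only for counts 0..31. */
void model_lshr(regs *r)
{
    if (r->cl > 15) {
        uint16_t t;
        r->cl = (uint8_t)(r->cl - 16);      /* sub cl,16 */
        t = r->ax; r->ax = r->dx; r->dx = t; /* xchg ax,dx */
        r->ax = sar16(r->ax, r->cl);        /* sar ax,cl */
        r->dx = (r->ax & 0x8000u) ? 0xFFFFu : 0u; /* cwd */
    } else {
        uint8_t ch = r->cl;                 /* mov ch,cl */
        uint16_t saved = r->dx;             /* push dx */
        r->ax = shr16(r->ax, r->cl);        /* shr ax,cl */
        r->cl = (uint8_t)(16 - r->cl);      /* sub cl,16 / neg cl */
        r->dx = shl16(r->dx, r->cl);        /* shl dx,cl */
        r->ax |= r->dx;                     /* or ax,dx */
        r->dx = saved;                      /* pop dx */
        r->cl = ch;                         /* mov cl,ch */
        r->dx = sar16(r->dx, r->cl);        /* sar dx,cl */
    }
}

/* Portable reference: arithmetic right shift of a 32-bit value. */
uint32_t ref_sar32(uint32_t x, unsigned n)
{
    uint32_t r = x >> n;
    if ((x & 0x80000000u) && n > 0)
        r |= ~(0xFFFFFFFFu >> n);
    return r;
}

/* Returns how many (value, count) pairs disagree with the reference. */
int count_lshr_mismatches(void)
{
    static const uint32_t samples[] = {
        0u, 1u, 0x0000FFFFu, 0x00010000u,
        0x7FFFFFFFu, 0x80000000u, 0x89ABCDEFu, 0xFFFFFFFFu
    };
    int bad = 0;
    unsigned i, n;

    for (i = 0; i < sizeof samples / sizeof samples[0]; i++) {
        for (n = 0; n < 32; n++) {
            regs r;
            r.ax = (uint16_t)samples[i];
            r.dx = (uint16_t)(samples[i] >> 16);
            r.cl = (uint8_t)n;
            model_lshr(&r);
            if ((((uint32_t)r.dx << 16) | r.ax) != ref_sar32(samples[i], n))
                bad++;
        }
    }
    return bad;
}
```

The same pattern extends to __ulshr and __lshl by swapping the shift
helpers and the reference function.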