[comp.unix.wizards] Faster C

fish@aati.UUCP (William G. Fish) (02/05/88)

I wish to make the following C code run as fast as possible under 4.1 BSD 
on a VAX 11/750.  I've seen VAX instructions such as movc3 and cmpc3 make
code run 10 to 50 times faster.  Are there any CISC instructions that can
be used in this case?

scan(in, out, c, C, S)
    register short *in;
    register float *out;
    register c;
    int C, S;
{
    register sample;
    register peak;
    register s;

    for (s = 0; s < S; s++, c += C)
    {
	out[s] = sample = in[c];	/* short to float conversion */
	if (sample < 0)
	    sample = -sample;		/* absolute value */
	if (peak < sample)
	    peak = sample;		/* peak detection */
    }
    return peak;
}
--
Bill Fish                       Analysis & Technology
321 River Road                  153 Williams Street
Mystic, CT 06355                New London, CT 06320
(203) 536-3301 (evenings)       (203) 444-7722 (days)
(203) 536-0137 (2nd line)       ihnp4!{hsi,rayssd}!aati!fish

pardo@june.cs.washington.edu (David Keppel) (02/06/88)

In article <473@aati.UUCP> fish@aati.UUCP (William G. Fish) writes:
->I wish to make the following C code run as fast as possible under 4.1 BSD 
->on a VAX 11/750.  I've seen VAX instructions such as movc3 and cmpc3 make
->code run 10 to 50 times faster.  Are there any CISC instructions that can
->be used in this case?
->
->scan(in, out, c, C, S)
->    register short *in;
->    register float *out;
->    register c;
->    int C, S;
->{
->    register sample;
->    register s;
->
->    for (s = 0; s < S; s++, c += C)
->    {
->	out[s] = sample = in[c];	/* short to float conversion */

If you change the out[s] and in[c] to use a pointer that is incremented
each iteration, you may be able to save yourself an ashl each time.

	;-D on  (*gazing)  Pardo

chris@mimsy.UUCP (Chris Torek) (02/06/88)

>In article <473@aati.UUCP> fish@aati.UUCP (William G. Fish) writes:
>>I wish to make the following C code run as fast as possible under 4.1 BSD 
>>on a VAX 11/750.

declarations:
>>    register short *in;
>>    register float *out;
>>    register c;
>>    int C, S;
>>    register sample, s;
>>
loop:
>>    for (s = 0; s < S; s++, c += C) {
>>	out[s] = sample = in[c];	/* short to float conversion */

In article <4177@june.cs.washington.edu> pardo@june.cs.washington.edu
(David Keppel) writes:
>If you change the out[s] and in[c] to use a pointer that is incremented
>each iteration, you may be able to save yourself an ashl each time.

The `out[s]' and `in[c]' should generate VAX `subscript' mode instructions,
something like

	cvtwl	(in)[c],sample
	cvtlf	sample,(out)[s]

and indeed, feeding the equivalent through /lib/ccom produces

	cvtwl	(r11)[r9],r7
	cvtlf	r7,(r10)[r8]

A pointer version might be a wee bit faster for `out':

	for (s = 0; s < S; s++, c += C) {
		*out++ = sample = in[c];

or

	cvtwl	(r11)[r9],r7
	cvtlf	r7,(r10)+

One more tiny gain is to loop down to zero instead of up to S:

loop:
	for (s = S, out += s, c += C * s; c -= C, --s >= 0;) {
		*--out = sample = in[c];
		...

If the loop is short enough (8 instructions or less), the
optimiser (/lib/c2) will turn the decrement/test/branch into
a `sobgeq' instruction.  It looked as though the loop was
not that short.  Still,

	decl	rN
	bneq	loop

will be ever so slightly faster than

	incl	rN
	cmpl	rN,-S(fp)
	blss	loop

on a 750.  Since in the original code fragment `c' did not count
up from zero, we still need a counter like `s'.
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163)
Domain:	chris@mimsy.umd.edu	Path:	uunet!mimsy!chris

trt@rti.UUCP (Thomas Truscott) (02/07/88)

In article <473@aati.UUCP>, fish@aati.UUCP (William G. Fish) writes:
> I wish to make the following C code run as fast as possible under 4.1 BSD 
> on a VAX 11/750.  ...

The code performs two different functions:
It copies an array of short integers, converting to float,
and it computes the largest absolute value in the array.
Alas, the VAX 750 lacks the vector instructions that could do it,
and the code presented was quite good to begin with,
so all that is left are small tweaks.  I think.

Building on what Chris Torek suggested:
1. Change "out[s]" to "*out++", eliminate the "s" variable,
and count "S" down instead.

2. Change 'in[c]; c += C' to '*in; in += <horrible thing>'.
(It would be nice if C provided a cleaner way to do this hack.)

3. Tweak the register declarations a bit.


Here is the revised routine, it might still work!

scan(in, out, c, C, S)
    register short *in;
    register float *out;
    int c;
    register int C;
    register int S;
{
    register int sample;
    register int peak;


    C = (sizeof(short)/sizeof(char))*C;	/* eternal damnation awaits ... */

    /* XXX might be good to initialize peak */

    in += c;
    while (--S >= 0)
    {
	sample = *in;
	*out++ = sample;		/* short to float conversion */
	in = (short *)(((char *)in) + C);	/* ... in the abyss */
	/* vax seems to lack an absolute value instruction, sigh */
	if (sample < 0)
	    sample = -sample;		/* absolute value */
	if (peak < sample)
	   peak = sample;		/* peak detection */
    }
    return peak;
}

4. Touch up the assembler output (Gag me):
18,19c18
< 	cvtld	r7,r0
< 	cvtdf	r0,(r10)+
---
> 	cvtld	r7,(r10)+
27,28c26
< L16:	decl	r8
< 	jgeq	L2000001
---
> L16:	sobgeq	r8,L2000001


5. Learn more about the original problem.

a) What fraction of the time is sample < 0?
Perhaps we should keep "minpeak" and "maxpeak" variables
so we need never negate "sample".

b) What values can "sample" take on? 
If it is in the range -5000..5000 we can
use table lookup to compute absolute value.
If it is in the range -31..31 we could use table lookup
to obtain bit masks that we just OR together,
then locate the highest bit to determine "peak".
(Okay, I'm weird.)

6.  A VAX 750 is slow.  Buy a faster machine.  Brute force is beautiful.
	Tom Truscott

merlyn@rose3.rosemount.com (Brian Westley) (02/08/88)

BZZT!
All the code improvments posted about the peak finding routine have
missed a FATAL ERROR!

The variable peak is NEVER INITIALIZED!

Your answer is quite possibly garbage...

"Make it right before you make it faster"
_The Elements of Programming Style_

Merlyn LeRoy

kent@tifsie.UUCP (Russell Kent) (02/09/88)

in article <473@aati.UUCP>, fish@aati.UUCP (William G. Fish) says:
> I wish to make the following C code run as fast as possible under 4.1 BSD 
> on a VAX 11/750.  I've seen VAX instructions such as movc3 and cmpc3 make
> code run 10 to 50 times faster.  Are there any CISC instructions that can
> be used in this case?
    [ code fragment deleted - RAK ]
> Bill Fish                       Analysis & Technology
> 321 River Road                  153 Williams Street
> Mystic, CT 06355                New London, CT 06320
> (203) 536-3301 (evenings)       (203) 444-7722 (days)
> (203) 536-0137 (2nd line)       ihnp4!{hsi,rayssd}!aati!fish

Using suggestions from David Keppel, Chris Torek, and Thomas Truscott, as
well as generous use of a 1981 vintage VAX Architecture Handbook, and
Ultrix cc's "-S" option, I was able to handcraft this assembler routine.
PLEASE be very careful to check that it does what you _really_ think
you want (I have been known to mis-interpret C code fragments :-).
"``Interpret??'' Hey Mo, I thought C was a compiled language, nyuck nyuck"

The cute little "jump" opcodes (you mean you've never used a jgeq opcode
on a VAX??) are the compiler's love notes to the optimizer.

Write this into a file ending with ".s" and feed it to your compiler.
Caveat emptor.

Russell Kent
(Beware the signature at the bottom)
----8<----8<----8<----8<----8<----8<----8<----8<----8<----8<----8<---
LL0:
	.data
	.text
	.align	2
	.globl	_scan

/* 			    scan(in, out, c, C, S)
       				register short *in;
       				register float *out;
       				register c;
       				int C, S;
*/
_scan:
	.word	0x0FC0		/* save r11-r6 */
/*
    				register sample;
    				register peak;
    				register s;
*/
	jbr 	L14		/* cause possible page fault */
L15:
	movl	4(ap),r11       /* load "in"      into r11 */
	movl	8(ap),r10	/* load "out"     into r10 */
	movl	12(ap),r9	/* load "c"       into r9  */
    				/*      "sample"  isin r8  */
    				/*      "peak"    isin r7  */
    				/*      "s"       isin r6  */
			
	movl	20(ap),r6	/* s = S+1;            */
	incl	r6
	jbr	L17		/* use sobgtr at bottom */
L18:
	cvtwl	(r11)[r9],r8	/* sample = in[c]      */
	cvtlf	r8,(r10)+	/* *out++ = sample     */
		/* this will set the condition bits so no "test" is needed */
	jgeq	L19		/* if (sample < 0)     */
	mnegl	r8,r8		/*     sample= -sample */
L19:
	cmpl	r7,r8		/* if (peak < sample)  */
	jgeq	L16
	movl	r8,r7		/*     peak = sample   */
L16:
	addl2	16(ap),r9	/*       c += C        */
L17:
	sobgtr	r6,L18		/* bottom of loop      */
	movl	r7,r0		/* return peak;        */
	ret
L14:
	jbr 	L15
	.data
----8<----8<----8<----8<----8<----8<----8<----8<----8<----8<----8<---
-- 
Russell Kent                    Phone: +1 214 995 3501
Texas Instruments               UUCP address:
P.O. Box 655012   MS 3635       ...!convex!smu!tifsie!kent
Dallas, TX 75265                ...!ut-sally!im4u!ti-csl!tifsie!kent