[net.unix-wizards] pdp11: C, f77, f4p compile study

CSvax:Physics:suitti (10/21/82)

***Warning: high verbosity (bogosity) level***
Intro:
	Two hours, 52 minutes and 27 seconds of cpu were expended in an
attempt to determine how to shave a few seconds off of an atypical program
run here at Purdue Physics.  The sources are given at the end.  Other
information about the test is given in random order.  Some of the conclusions
may be useful, others are just flames.

			The Numbers
Lang	type	(secs)	text	data	bss	dec	name
average float	 76.4
f77	float	120.7	19180 +	 2578 + 10186 = 31944	benf
f77#	float	107.7	18728 +	 2578 + 10162 = 31468	binf
f4p	float	 39.4	 7634 +	 5318 +	    0 = 12952	bnchf
as	float	 27.6	 3270 +	  320 +	 6948 = 10538	pfa
Cp	float	 60.6	 3310 +	  322 +	 6952 = 10584	pf
Ca	float	102.2	 3430 +	  330 +	 6944 = 10704	af
average double	 74.7
f77	double	117.5	19176 +	 2578 + 14986 = 36740	bend
f77#	double	 98.8	18724 +	 2578 + 14962 = 36264	bind
f4p	double	 53.5	 7638 + 10118 +	    0 = 17756	bnchd
as	double	 36.9	 3308 +	  322 + 11748 = 15378	pda
Cp	double	 47.0	 3302 +	  322 + 11752 = 15376	pd
Ca	double	 94.6	 3424 +	  330 + 11744 = 15498	ad
average int	 38.0
f77	int	 70.2	19132 +	 2578 +	 7786 = 29496	beni
f77#	int	 58.9	18706 +	 2578 +	 7762 = 29046	bini
f4p	int	 21.0	 7602 +	 2918 +	    0 = 10520	bnchi
as	int	 12.5	 3282 +	  322 +	 4548 =	 8152	pia
Cp	int	 16.0	 3292 +	  322 +	 4552 =	 8166	pi
Ca	int	 49.3	 3394 +	  330 +	 4544 =	 8268	ai
average long	 98.4
f77	long	140.7	19282 +	 2578 + 10186 = 32046	benl
f77#	long	130.2	18840 +	 2578 + 10162 = 31580	binl
f4p	long	 93.1	 7730 +	 5322 +	    0 = 13052	bnchl
as	long	 41.1	 3338 +	  322 +	 6948 = 10608	pla
Cp	long	 70.5	 3394 +	  322 +	 6952 = 10668	pl
Ca	long	115.1	 3706 +	  330 +	 6944 = 10980	al
Basic	       1721.5 (run only once)			slow.bas

where
average is the average time used for that data type.  A meaningless number?

f77	is f77 with its default integer*4 i,j,k indexes.

f77#	is f77 using integer*2 for the i,j,k indexes.

f4p	is the CULC ported fortran 66 compiler.	 I consider it to be smarter
	than the average joe.  These were compiled as split I & D.

Ca	is the Ritchie compiler, using the first version of the benchmark,
	which subscripts into arrays as follows:
	c[i][j++] = foo;
	as opposed to the pointer version which would do it as:
	*cp++ = foo;

Cp	is the Ritchie compiler, using the second version of the benchmark,
	which uses pointer arithmetic to speed execution.  This demonstrates
	one of the (now obvious) philosophies of C: a language designed so
	that a programmer can get the program written quickly, but optimization
	is also done by the programmer.	 He can get C to produce something
	very close to what he'd have written in assembler (if he knew it).

as	means that the Cp AS output was modified so as to be truly optimal.
	Temporaries r0, r1 were utilized for variables if possible.  Single
	precision floating point was used for the single precision floating
	version.  Floating registers were used as summers (float and double).
	The code was cleaned up in general.  Things that were pushed on the
	stack were left in registers.  I wouldn't expect a compiler to do
	everything that was done here.	It should be thought of as a basis
	in time for how fast the machine is, and what penalty you get for
	using a higher level language.

basic	A basic interpreter that happens to be lying around here.

Notes on the benchmark.
 1.	Benchmarks, 100 iterations, 20 x 20 matrix.
	Environment: 11/44 with 2 Meg of RAM (not much swapping) running 2.8
	bsd.  Load averages ranged from about 1 to about 6.  Little control
	was exercised.	The times are averaged over 5 runs of each program.
	Lastcomm was used to determine total runtimes.

 2.	The benchmark is mainly designed to test the speed of calculation,
	loop control, and array subscripting, in programs which run for long
	periods of time.  It is relatively simple.  No conditional flow of
	control was tested.  The nesting level was three.  It turns out that
	f4p starts to run out of registers at this point, and has to start
	using temporaries (at absolute locations).  C only has three register
	variables for pointer subscription.  F77 is always at a loss.

 3.	A quick study will show that the init portions (init a, b, c)
	as well as the checksum at the end are all N**2 operations.  The
	matrix multiply is an N**3 operation.  N=20, therefore, we have
	4 * 20 * 20 + 20 * 20 * 20 = 1600 + 8000 = 9600, or some 83% of the
	time is spent in the matrix multiply.  Profiling confirms this,
	more or less.  The point is that most of the time is spent in a
	minority of code.  One could think that too little code is tested,
	and be correct.	 Thus, one may take the following numbers with a
	grain of sodium chloride.

 4.	All of the compiles were done with optimize switches, if any.  F77
	produces bad enough swill even with the option.  No tests of which
	pessimizer is worse (better?) were done here.

Some conclusions:
 1.	When the data type is fixed, the order in speedy languages is:
		as	Cp	f4p	Ca	f77#	f77
	unless using float, in which case it's
		as	f4p	Cp	Ca	f77#	f77
	C uses doubles internally to compute floats, whereas f4p uses floats
	to compute floats.  If this were not the case (as with my assembly
	version), it can be shown that one can write floating point code
	which will run faster in C than (even) f4p.  It will still be slow
	if you use generalized array subscripting.  This benchmark does not
	show how anything compiles truly random access array subscription.
	The f77 code is very similar to the Ca code, only worse.

 2.	When the language utilized is fixed, the order in speedy data types is:
		int	double	float	long
	unless using f4p, in which case it's
		int	float	double	long
	This is because C uses doubles to do floats.  It converts to
	double, computes the value, converts back to float.  This makes
	it necessarily slower than double.  Note that computation in double
	is half the speed of computation in float, at least on an 11/44.
	The numbers in the book also point to this fact.  Since C does
	floats with doubles, f77 does too.  Thus f77 moves like a thundering
	turtle.

 3.	Longs are a lossage on an 11.  There is no hardware which supports
	them.  Doing anything with them ties up several regular registers.
	C likes to put long temporaries on the stack during more complex
	operations.

What can be done to C?
 1.	Make it do floats in single precision.	An enormous amount of code
	is produced to convert back and forth.	Many floating registers are
	used for the converted values during calculation.
	Consider the line:
		*a++ = *b++ + *c++;	/* register float *a, *b, *c */
	C will produce something like:
		movof	(r2)+,r0	/ convert, move into floating r0
		movof	(r3)+,r1	/ convert, move into floating r1
		addf	r1,r0		/ add, double precision
		movfo	r0,(r4)+	/ convert, move answer to destination
	instead of (worst case)
		setf			/ do it in single precision
		movf	(r2)+,r0	/ move 4 bytes to floating r0
		addf	(r3)+,r0	/ add in single precision
		movf	r0,(r4)+	/ move to destination

	Advantages:
a.	The new code is much faster.  The add instruction takes less time.
	A movf is faster than a movfo or movof.  The setf is very fast.
b.	The new code uses only one floating register (r0).  This leaves
	floating r1, r2, r3, and if you are truly mutant: r4 and r5.
c.	With a little work, the setf (which is very fast) would not be
	needed.	 This saves a word.  This starts to be of more value in
	complex statements.  Just have C remember to force the mode at
	the beginning of the subroutine (if you use floating point math, of
	course) and after every subroutine call (only if more math is to
	be done).

 2.	A compiler option to compile long multiplies in line.  This is
	usually just a waste of space.	In this case, however, most of
	the time saved in the AS version was saved by doing just this.
	No csv or cret.	 The arguments didn't have to be pushed on the
	stack, or popped off.

 3.	C, in more complex programs than this, often gets confused between
	floating registers and regular registers.  It often allocates a
	regular register when it wants to do a floating (single precision)
	calculation.  This is a bug which crops up once in a while.  I've
	ported some C that runs on vaxen which wouldn't compile on my 11.
	It wanted to use three floating registers - r0, r1 and r2.  It didn't
	like the fact that the regular register r2 was allocated.  It gave an
	error message that either said that the expression was too complex
	or that too many registers were in use.  C also seems to want to
	do floating multiplies in odd-numbered floating registers (just as
	must be done with ints).

 4.	The C language could be improved.  How about declaration of floating
	point register variables?  That way if you said:
		register float x;
		register int i;
		register float *j;
		float foo[100];
		x = 0.0;
		j = &foo[0];
		i = 100;
		do {
			x += *j++;
		} while (--i);
	it could produce:
		setf			/ set single precision floating
		clrf	r0		/ x = 0.0
		mov	r5,r3		/ calculate addr of foo
		add	$foo,r3
		mov	$100.,r4	/ i = 100.
	L1:	addf	(r3)+,r0	/ x += *j++
		sob	r4,L1		/ while (--i)

	One would have to worry about saving floating registers during
	subroutine calls.  I'd suggest having the caller save them, and
	only if used, rather than having csv do it.  All we need is for
	csv to become more cumbersome.	The kernel uses it all the time,
	after all, and the kernel doesn't use many (if any) floats.

 5.	If you square a value:
		main()
		{
			register i,j,k;
			i = (j-k)*(j-k);
		}
	you get:
	_main:	jsr	r5,csv		/i=r4, j=r3, k=r2
		mov	r3,-(sp)	/could have used r0, why didn't it?
		sub	r2,(sp)		/ j - k
		mov	r3,r1
		sub	r2,r1		/ j - k, again
		mul	(sp)+,r1	/ (j-k) * (j-k)
		mov	r1,r4		/ answer in i
		jmp	cret
	which is not:
	_main:	jsr	r5,csv		/i=r4, j=r3, k=r2
		mov	r3,r1
		sub	r2,r1		/ j - k
		mul	r1,r1		/ squared
		mov	r1,r4		/ answer in i
		jmp	cret

	There are two solutions to this (very common math) problem.  The
	first is that C could try to recognize the operation as one of
	squaring, and only compute the difference once.	 This would be
	difficult, as side effects abound.  How about:
		a = (*b++ - *c++) * (*b++ - *c++);
	which isn't a squaring operation at all?

	The other obvious solution is to put an exponentiation operator
	into the C syntax and support it.  Squaring is done with one multiply.
	Cubing is done with two.  The fourth power is also done with two, as
	follows:
	a ** 0 = 1			0 multiplies
	a ** 1 = a			0 multiplies
	a ** 2 = a * a			1 multiply
	a ** 3 = a * a * a		2 multiplies
	a ** 4 = (a * a) ** 2		2 multiplies
	a ** 5 = a * (a ** 4)		3 multiplies
	a ** 6 = ((a ** 3) ** 2)	3 multiplies
	a ** 7 = a * (a ** 6)		4 multiplies
	a ** 8 = (a ** 4) ** 2		3 multiplies
	There are fortran compilers that I have seen that know up to ** 32.
	If you are raising to a variable, or an unknown constant, then call
	a subroutine.

 6.	A controversial topic:	As far as I can tell, CSV and CRET are not
	needed.	 Thus a stack frame is not absolutely needed.  Indeed a
	frame is an excellent means of figuring out how your program got
	to where it bombed.  It is not, however, foolproof.  Consider an
	array reference out of bounds error.  Off by one.  Local array
	(allocated on the stack).  Wrote to the return address.  Oops.
	The advantages of a frame are:
	a. It eases the compiler implementation. (?)
	b. Call sequence information is available when the program crashes.
	c. Allows quick restore of the stack on weird exit from a subroutine.
	The disadvantages of a stack frame are:
	a. It must always use a cpu register (the 11 only has 6).
	b. Subroutine linkage is slower.
	c. The stack frame always uses more stack.
	d. Implementation with csv & cret means that all three register
	   variables are pushed, needed or not.
	e. C seems to always restore its stack before going to cret anyway.

	Code without stack frame:
		jsr	pc,foo
		.
	foo:	mov	r4,-(sp)	/foo happens to use r4
		mov	r3,-(sp)	/and r3
		mov	r2,-(sp)	/and r2
		.
		.			/ the body of foo
		.
		mov	(sp)+,r2
		mov	(sp)+,r3
		mov	(sp)+,r4
		rts	pc

	the stack frame code is

		jsr	pc,foo
		.
	foo:	jsr	r5,csv		/i=r4, j=r3, k=r2
		.
	csv:	mov	r5,r0
		mov	sp,r5
		mov	r4,-(sp)
		mov	r3,-(sp)
		mov	r2,-(sp)
		jsr	pc,(r0)		/ the extra push
		.
		.			/ the body of foo
		.
		jmp	cret
		.
	cret:	mov	r5,r2
		mov	-(r2),r4
		mov	-(r2),r3
		mov	-(r2),r2
		mov	r5,sp
		mov	(sp)+,r5	/ restore the previous frame
		rts	pc		/ return to first caller

	How do you do locals on the stack?  On the stack!  You've got an
	sp, use it.  Just remember that as you push things on (for
	temporary use) to adjust your index into the stack to get your
	variables.  An added compiler complication, to be sure.
	But, you get the use of four register variables.  Of course,
	you could just buy a vax, and have nine of them.  Infinite resources
	will get you everywhere.

 7.	If you don't care about speed, you can go out and get a basic
	interpreter.

/* Benchmark from Byte, Sept issue.  Converted to C by Stephen Uitti */
/* This is the array version.  The FORTRAN version is similar */
#define ARTYP int	/* ARTYP may be double, float, int, or long */
#define M 20
#define N 20

ARTYP a[M][N], b[N][M], c[M][N];	/* The arrays, global */
double summ;				/* The checksum, global */

main()
{
	register i;
	for (i=0; i < 100; i++) {	/* 100 iterations */
		summ = 0.0;
		filla();
		fillb();
		fillc();
		matmult();
		summit();
	}
	printf("Summ is : %lf\n", summ);	/* should be 465880 */
}

filla()
{
	register int i,j;
	for (i = 0; i < M; i++)
		for (j = 0; j < N; j++)
			a[i][j] = i + j + 2;
}

fillb()
{
	register int i,j;
	long k;
	for (i = 0; i < M; i++)
		for (j = 0; j < N; j++) {
			k = (ARTYP)(i + j + 2) / (ARTYP)(j + 1);
			b[i][j] = k;
		}
}

fillc()
{
	register int i,j;
	for (i = 0; i < M; i++)
		for (j = 0; j < N; j++)
			c[i][j] = 0;
}

matmult()
{
	register int i, j, k;
	for (i = 0; i < M; i++)
		for (j = 0; j < N; j++)
			for (k = 0; k < M; k++)
				c[i][j] += a[i][k] * b[k][j];
}

summit()
{
	register int i,j;
	for (i = 0; i < M; i++)
		for (j = 0; j < M; j++)
			summ += c[i][j];
}

/* ************************************************************ */
#define ARTYP long	/* Pointer version */
#define M 20
#define N 20

ARTYP a[M][N], b[N][M], c[M][N];
long summ;

main()
{
	register i = 100;	/* 100 iterations */
	do {
		summ = 0.0;
		filla();
		fillb();
		fillc();
		matmult();
		summit();
	} while (--i);
	printf("Summ is : %ld\n", summ);
}

filla()
{
	register int i,j;
	register ARTYP *k;
	k = &a[M-1][N];
	i = M;
	do {
		j = N;
		do {
			*--k = i + j;
		} while (--j);
	} while (--i);
}

fillb()
{
	register int i,j;
	register ARTYP *k;
	k = &b[M-1][N];
	i = M;
	do {
		j = N;
		do {
			*--k = (i + j) / j;
		} while (--j);
	} while (--i);
}

fillc()
{
	register int i;
	register ARTYP *k;
	k = c;
	i = M * N;
	do
		*k++ = 0;
	while (--i);
}

matmult()
{
	register ARTYP *ap, *bp;		/* in order of importance */
	register int k, i, j;
	register ARTYP *cp, *aptmp;

	cp = &c[M-1][N-1];
	ap = &a[M-1][N-1];
	i = M;
	do {
		aptmp = ap;
		j = N;
		do {
			bp = &b[M-1][j-1];
			k = M;
			ap = aptmp;
			do {
				*cp += *ap * *bp;
				ap--;
				bp -= M;
			} while (--k);
			cp--;
		} while (--j);
	} while (--i);
}


summit()
{
	register int i,j;
	register ARTYP *k;
	k = c;
	i = M * N;
	do
		summ += *k++;
	while (--i);
}

c A Steve Uitti hackification, fortran aversion
c benchmark from byte, Sept 82 issue, just curiosity
	real*8 a,b,c
	integer*4 summ
	common/xxx/ a(20, 20), b(20, 20), c(20, 20), summ
	do 20 i = 1, 100
	summ = 0.0
	call filla
	call fillb
	call fillc
	call matmul
	call summit
20	continue
	write (6,10) summ
10	format(' ', i8)
	end

	subroutine filla
	real*8 a,b,c
	integer*4 summ
	common/xxx/ a(20, 20), b(20, 20), c(20, 20), summ
	do 100 i = 1, 20
	 do 100 j = 1, 20
	  a(j, i) = i + j
100	continue
	return
	end

	subroutine fillb
	real*8 a,b,c
	integer*4 summ
	common/xxx/ a(20, 20), b(20, 20), c(20, 20), summ
	do 200 i = 1, 20
	 do 200 j = 1, 20
	  b(j, i) = (i + j) / i
200	continue
	return
	end

	subroutine fillc
	real*8 a,b,c
	integer*4 summ
	common/xxx/ a(20, 20), b(20, 20), c(20, 20), summ
	do 300 i = 1, 20
	 do 300 j = 1, 20
	  c(j, i) = 0.0
300	continue
	return
	end

	subroutine matmul
	real*8 a,b,c
	integer*4 summ
	common/xxx/ a(20, 20), b(20, 20), c(20, 20), summ
	do 400 i = 1, 20
	 do 400 j = 1, 20
	  do 400 k = 1, 20
	   c(j, i) = c(j, i) + a(k, i) * b(k, j)
400	continue
	return
	end

	subroutine summit
	real*8 a,b,c
	integer*4 summ
	common/xxx/ a(20, 20), b(20, 20), c(20, 20), summ
	do 500 i = 1, 20
	 do 500 j = 1, 20
	  summ = summ + c(j, i)
500	continue
	return
	end
/*
 * for brevity, I'm not including all four assembly sources. These
 * are available on request.
 */
					I just love to sign my name:
					Stephen Uitti
					...pur-ee!physics:suitti
					...pur-ee!pur-phy!suitti

gwyn@Brl@sri-unix (10/28/82)

From:     Doug Gwyn <gwyn@Brl>
Date:     25 Oct 82 15:11:44-EDT (Mon)
I like the idea of generating in-line code for PDP-11 long multiplication
(it only takes a few instructions).

Whitesmiths (Mark Krieger, I believe) adjusted their PDP-11 code generator
so that it generated a fast (two-instruction) entry sequence instead of
calling csv() and jumped to crets instead of cret, whenever a procedure
didn't use the non-volatile registers (R2,R3,R4).

In my experience, most "real" heavy uses of floating-point computation
do require double precision to avoid loss of accuracy (e.g. when inverting
matrices, linear programming, solving D.E.s, etc.).  Therefore I think the
C default mode is reasonable.  Just use doubles in the first place to avoid
the conversion from/to float.