CSvax:Physics:suitti (10/21/82)
***Warning: high verbosity (bogosity) level***
Intro:
Two hours, 52 minutes and 27 seconds of cpu were expended in an
attempt to determine how to shave a few seconds off an atypical program
run here at Purdue Physics. The sources are given at the end. Other
information about the test is given in random order. Some of the conclusions
may be useful, others are just flames.
The Numbers
Lang type (secs) text data bss dec name
average float 76.4
f77 float 120.7 19180 + 2578 + 10186 = 31944 benf
f77# float 107.7 18728 + 2578 + 10162 = 31468 binf
f4p float 39.4 7634 + 5318 + 0 = 12952 bnchf
as float 27.6 3270 + 320 + 6948 = 10538 pfa
Cp float 60.6 3310 + 322 + 6952 = 10584 pf
Ca float 102.2 3430 + 330 + 6944 = 10704 af
average double 74.7
f77 double 117.5 19176 + 2578 + 14986 = 36740 bend
f77# double 98.8 18724 + 2578 + 14962 = 36264 bind
f4p double 53.5 7638 + 10118 + 0 = 17756 bnchd
as double 36.9 3308 + 322 + 11748 = 15378 pda
Cp double 47.0 3302 + 322 + 11752 = 15376 pd
Ca double 94.6 3424 + 330 + 11744 = 15498 ad
average int 38.0
f77 int 70.2 19132 + 2578 + 7786 = 29496 beni
f77# int 58.9 18706 + 2578 + 7762 = 29046 bini
f4p int 21.0 7602 + 2918 + 0 = 10520 bnchi
as int 12.5 3282 + 322 + 4548 = 8152 pia
Cp int 16.0 3292 + 322 + 4552 = 8166 pi
Ca int 49.3 3394 + 330 + 4544 = 8268 ai
average long 98.4
f77 long 140.7 19282 + 2578 + 10186 = 32046 benl
f77# long 130.2 18840 + 2578 + 10162 = 31580 binl
f4p long 93.1 7730 + 5322 + 0 = 13052 bnchl
as long 41.1 3338 + 322 + 6948 = 10608 pla
Cp long 70.5 3394 + 322 + 6952 = 10668 pl
Ca long 115.1 3706 + 330 + 6944 = 10980 al
Basic 1721.5 (run only once) slow.bas
where
average is the average time used for that data type. A meaningless number?
f77 is f77 with its default integer*4 i,j,k indexes.
f77# is f77 using integer*2 for the i,j,k indexes.
f4p is the CULC ported fortran 66 compiler. I consider it to be smarter
than the average joe. These were compiled as split I & D.
Ca is the Ritchie compiler, using the first version of the benchmark,
which subscripts into arrays as follows:
c[i][j++] = foo;
as opposed to the pointer version which would do it as:
*cp++ = foo;
Cp is the Ritchie compiler, using the second version of the benchmark,
which uses pointer arithmetic to speed execution. This demonstrates
one of the (now obvious) philosophies of C: a language designed so
that a programmer can get the program written quickly, but optimization
is also done by the programmer. He can get C to produce something
very close to what he'd have written in assembler (if he knew it).
as means that the Cp AS output was modified so as to be truly optimal.
Temporaries r0, r1 were utilized for variables if possible. Single
precision floating was used for the single precision floating
version. Floating registers were used as summers (float and double).
The code was cleaned up in general. Things that were pushed on the
stack were left in registers. I wouldn't expect a compiler to do
everything that was done here. It should be thought of as a basis
in time for how fast the machine is, and what penalty you get for
using a higher level language.
basic A basic interpreter that happens to be lying around here.
Notes on the benchmark.
1. Benchmarks, 100 iterations, 20 x 20 matrix.
Environment: 11/44 with 2 Meg of RAM (not much swapping) running 2.8
bsd. Load averages ranged from about 1 to about 6. Little control
was exercised. The times are averaged over 5 runs of each program.
Lastcomm was used to determine total runtimes.
2. The benchmark is mainly designed to test the speed of calculation,
loop control, and array subscripting, in programs which run for long
periods of time. It is relatively simple. No conditional flow of
control was tested. The nesting level was three. It turns out that
f4p starts to run out of registers at this point, and has to start
using temporaries (at absolute locations). C only has three register
variables for pointer subscription. F77 is always at a loss.
3. A quick study will show that the init portions; init a, b, c,
as well as the checksum at the end are all N**2 operations. The
matrix multiply is an n**3 operation. N=20, therefore, we have
4 * 20 * 20 + 20 * 20 * 20 = 1600 + 8000 = 9600, or some 83% of the
time is spent in the matrix multiply. Profile'ing confirms this,
more or less. The point is that most of the time is spent in a
minority of code. One could think that too little code is tested,
and be correct. Thus, one may take the following numbers with a
grain of sodium chloride.
4. All of the compiles were done with optimize switches, if any. F77
produces bad enough swill even with the option. No tests of which
pessimizer is worse (better?) were done here.
Some conclusions:
1. When the data type is fixed, the order in speedy languages is:
as Cp f4p Ca f77# f77
unless using float, in which case it's
as f4p Cp Ca f77# f77
C uses doubles internally to compute floats, whereas f4p uses floats
to compute floats. If this were not the case (as with my assembly
version), it can be shown that one can write floating point code
which will run faster in C than (even) f4p. It will still be slow
if you use generalized array subscripting. This benchmark does not
show how anything compiles truly random access array subscription.
The f77 code is very similar to the Ca code, only worse.
2. When the language utilized is fixed, the order in speedy data types is:
int double float long
unless using f4p, in which case it's
int float double long
This is because C uses doubles to do floats. It converts to
double, computes the value, converts back to float. This makes
it necessarily slower than double. Note that computation in double
is half the speed of computation in float, at least on an 11/44.
The numbers in the book also point to this fact. Since C does
floats with doubles, so does f77. Thus f77 moves like a thundering
turtle.
3. Longs are a lossage on an 11. There is no hardware which supports
them. They take up many regular registers to do anything with them.
C likes to put long temporaries on the stack during more complex
operations.
What can be done to C?
1. Make it do floats in single precision. An enormous amount of code
is produced to convert back and forth. Many floating registers are
used for the converted values during calculation.
Consider the line:
*a++ = *b++ + *c++; /* register float *a, *b, *c */
C will produce something like:
movof (r2)+,r0 / convert, move into floating r0
movof (r3)+,r1 / convert, move into floating r1
addf r1,r0 / add, double precision
movfo r0,(r4)+ / convert, move answer to destination
instead of (worst case)
setf / do it in single precision
movf (r2)+,r0 / move 4 bytes to floating r0
addf (r3)+,r0 / add in single precision
movf r0,(r4)+ / move to destination
Advantages:
a. The new code is much faster. The add instruction takes less time.
A movf is faster than a movfo or movof. The setf is very fast.
b. The new code uses only one floating register (r0). This leaves
floating r1, r2, r3, and if you are truly mutant: r4 and r5.
c. With a little work, the setf (which is very fast) would not be
needed. This saves a word. This starts to be of more value in
complex statements. Just have C remember to force the mode at
the beginning of the subroutine (if you use floating point math, of
course) and after every subroutine call (only if more math is to
be done).
2. A compiler option to compile long multiplies in line. This is
usually just a waste of space. In this case, however, most of
the time saved in the AS version was saved by doing just this.
No csv or cret. The arguments didn't have to be pushed on the
stack, or popped off.
3. C, in more complex programs than this, often gets confused between
floating registers and regular registers. It often allocates a
regular register when it wanted to do a floating (single precision)
calculation. This is a bug which crops up once in a while. I've
ported some C that runs on vaxen which wouldn't compile on my 11.
It wanted to use three floating registers - r0, r1 and r2. It didn't
like the fact that the regular register r2 was allocated. It gave an
error message that either said that the expression was too complex
or that too many registers were in use. C also seems to want to
do floating multiplies in floating odd numbered registers (just as
must be done with ints).
4. The C language could be improved. How about declaration of floating
point register variables? That way if you said:
register float x;
register int i;
register float *j;
float foo[100];
x = 0.0;
j = &foo[0];
i = 100;
do {
x += *j++;
} while (--i);
it could produce:
setf / set single precision floating
clrf r0 / x = 0.0
mov r5,r3 / calculate addr of foo
add $foo,r3
mov $100.,r4 / i = 100.
L1: addf (r3)+,r0 / x += *j++
sob r4,L1 / while (--i)
One would have to worry about saving floating registers during
subroutine calls. I'd suggest having the caller save them, and
only if used, rather than having csv do it. All we need is for
csv to become more cumbersome. The kernel uses it all the time,
after all, and the kernel doesn't use many (if any) floats.
5. If you square a value:
main()
{
register i,j,k;
i = (j-k)*(j-k);
}
you get:
_main: jsr r5,csv /i=r4, j=r3, k=r2
mov r3,-(sp) /could have used r0, why didn't it?
sub r2,(sp) / j - k
mov r3,r1
sub r2,r1 / j - k, again
mul (sp)+,r1 / (j-k) * (j-k)
mov r1,r4 / answer in i
jmp cret
which is not:
_main: jsr r5,csv /i=r4, j=r3, k=r2
mov r3,r1
sub r2,r1 / j - k
mul r1,r1 / squared
mov r1,r4 / answer in i
jmp cret
There are two solutions to this (very common math) problem. The
first is that C could try to recognize the operation as one of
squaring, and only compute the difference once. This would be
difficult, as side effects abound. How about:
a = (*b++ - *c++) * (*b++ - *c++);
which isn't a squaring operation at all?
The other obvious solution is to put an exponentiation operator
into the C syntax and support it. Squaring is done with one multiply.
Cubing is done with two. The fourth power is also done with two, as
follows:
a ** 0 = 1 0 multiplies
a ** 1 = a 0 multiplies
a ** 2 = a * a 1 multiply
a ** 3 = a * a * a 2 multiplies
a ** 4 = (a * a) ** 2 2 multiplies
a ** 5 = a * (a ** 4) 3 multiplies
a ** 6 = ((a ** 3) ** 2) 3 multiplies
a ** 7 = a * (a ** 6) 4 multiplies
a ** 8 = (a ** 4) ** 2 3 multiplies
There are fortran compilers that I have seen that know up to ** 32.
If you are raising to a variable, or an unknown constant, then call
a subroutine.
6. A controversial topic: As far as I can tell, CSV and CRET are not
needed. Thus a stack frame is not absolutely needed. Indeed a
frame is an excellent means of figuring out how your program got
to where it bombed. It is not, however, foolproof. Consider an
array reference out of bounds error. Off by one. Local array
(allocated on the stack). Wrote to the return address. Oops.
The advantages of a frame are:
a. Ease the compiler implementation. (?)
b. Call sequence information is available when program crashes.
c. Allows quick restore of stack on weird exit of subroutine.
The disadvantages of a stack frame are:
a. It must always use a cpu register (the 11 only has 6)
b. Subroutine linkage is slower.
c. The stack frame always uses more stack.
d. Implementation with csv & cret means that all three register
variables are pushed, needed or not.
e. C seems to always restore its stack before going to cret anyway.
Code without stack frame:
jsr pc,foo
.
foo: mov r4,-(sp) /foo happens to use r4
mov r3,-(sp) /and r3
mov r2,-(sp) /and r2
.
. / the body of foo
.
mov (sp)+,r2
mov (sp)+,r3
mov (sp)+,r4
rts pc
the stack frame code is
jsr pc,foo
.
foo: jsr r5,csv /i=r4, j=r3, k=r2
.
csv: mov r5,r0
mov sp,r5
mov r4,-(sp)
mov r3,-(sp)
mov r2,-(sp)
jsr pc,(r0) / the extra push
.
. / the body of foo
.
jmp cret
.
mov r5,r2
mov -(r2),r4
mov -(r2),r3
mov -(r2),r2
mov r5,sp
mov (sp)+,r5 / restore the previous frame
rts pc / return to first caller
How do you do locals without a frame pointer? On the stack! You've got an
sp, use it. Just remember that as you push things on (for
temporary use) to adjust your index into the stack to get your
variables. An added compiler complication, to be sure.
But, you get the use of four register variables. Of course,
you could just buy a vax, and have nine of them. Infinite resources
will get you everywhere.
7. If you don't care about speed, you can go out and get a basic
interpreter.
/* Benchmark from Byte, Sept issue. Converted to C by Stephen Uitti */
/* This is the array version. The FORTRAN version is similar */
#define ARTYP int /* ARTYP may be double float or int */
#define M 20
#define N 20
ARTYP a[M][N], b[N][M], c[M][N]; /* The arrays, global */
double summ; /* The checksum, global */
main()
{
register i;
for (i=0; i < 100; i++) { /* 100 iterations */
summ = 0.0;
filla();
fillb();
fillc();
matmult();
summit();
}
printf("Summ is : %lf\n", summ); /* should be 465880 */
}
filla()
{
register int i,j;
for (i = 0; i < M; i++)
for (j = 0; j < N; j++)
a[i][j] = i + j + 2;
}
fillb()
{
register int i,j;
long k;
for (i = 0; i < M; i++)
for (j = 0; j < N; j++) {
k = (ARTYP)(i + j + 2) / (ARTYP)(j + 1);
b[i][j] = k;
}
}
fillc()
{
register int i,j;
for (i = 0; i < M; i++)
for (j = 0; j < N; j++)
c[i][j] = 0;
}
matmult()
{
register int i, j, k;
for (i = 0; i < M; i++)
for (j = 0; j < N; j++)
for (k = 0; k < M; k++)
c[i][j] += a[i][k] * b[k][j];
}
summit()
{
register int i,j;
for (i = 0; i < M; i++)
for (j = 0; j < M; j++)
summ += c[i][j];
}
/* ************************************************************ */
#define ARTYP long /* Pointer version */
#define M 20
#define N 20
ARTYP a[M][N], b[N][M], c[M][N];
long summ;
main()
{
register i = 100; /* 100 iterations */
do {
summ = 0.0;
filla();
fillb();
fillc();
matmult();
summit();
} while (--i);
printf("Summ is : %ld\n", summ);
}
filla()
{
register int i,j;
register ARTYP *k;
k = &a[M-1][N];
i = M;
do {
j = N;
do {
*--k = i + j;
} while (--j);
} while (--i);
}
fillb()
{
register int i,j;
register ARTYP *k;
k = &b[M-1][N];
i = M;
do {
j = N;
do {
*--k = (i + j) / (j);
} while (--j);
} while (--i);
}
fillc()
{
register int i;
register ARTYP *k;
k = c;
i = M * N;
do
*k++ = 0;
while (--i);
}
matmult()
{
register ARTYP *ap, *bp; /* in order of importance */
register int k, i, j;
register ARTYP *cp, *aptmp;
cp = &c[M-1][N-1];
ap = &a[M-1][N-1];
i = M;
do {
aptmp = ap;
j = N;
do {
bp = &b[M-1][j-1];
k = M;
ap = aptmp;
do {
*cp += *ap * *bp;
ap--;
bp -= M;
} while (--k);
cp--;
} while (--j);
} while (--i);
}
summit()
{
register int i,j;
register ARTYP *k;
k = c;
i = M * N;
do
summ += *k++;
while (--i);
}
c A Steve Uitti hackification, fortran aversion
c benchmark from byte, Sept 82 issue, just curiosity
real*8 a,b,c
integer*4 summ
common/xxx/ a(20, 20), b(20, 20), c(20, 20), summ
do 20 i = 1, 100
summ = 0.0
call filla
call fillb
call fillc
call matmul
call summit
20 continue
write (6,10) summ
10 format(' ', i8)
end
subroutine filla
real*8 a,b,c
integer*4 summ
common/xxx/ a(20, 20), b(20, 20), c(20, 20), summ
do 100 i = 1, 20
do 100 j = 1, 20
a(j, i) = i + j
100 continue
return
end
subroutine fillb
real*8 a,b,c
integer*4 summ
common/xxx/ a(20, 20), b(20, 20), c(20, 20), summ
do 200 i = 1, 20
do 200 j = 1, 20
b(j, i) = (i + j) / i
200 continue
return
end
subroutine fillc
real*8 a,b,c
integer*4 summ
common/xxx/ a(20, 20), b(20, 20), c(20, 20), summ
do 300 i = 1, 20
do 300 j = 1, 20
c(j, i) = 0.0
300 continue
return
end
subroutine matmul
real*8 a,b,c
integer*4 summ
common/xxx/ a(20, 20), b(20, 20), c(20, 20), summ
do 400 i = 1, 20
do 400 j = 1, 20
do 400 k = 1, 20
c(j, i) = c(j, i) + a(k, i) * b(k, j)
400 continue
return
end
subroutine summit
real*8 a,b,c
integer*4 summ
common/xxx/ a(20, 20), b(20, 20), c(20, 20), summ
do 500 i = 1, 20
do 500 j = 1, 20
summ = summ + c(j, i)
500 continue
return
end
/*
* for brevity, I'm not including all four assembly sources. These
* are available on request.
*/
I just love to sign my name:
Stephen Uitti
...pur-ee!physics:suitti
...pur-ee!pur-phy!suitti

gwyn@Brl@sri-unix (10/28/82)
From: Doug Gwyn <gwyn@Brl>
Date: 25 Oct 82 15:11:44-EDT (Mon)
I like the idea of generating in-line code for PDP-11 long multiplication
(it only takes a few instructions). Whitesmiths (Mark Krieger, I believe)
adjusted their PDP-11 code generator so that it generated a fast
(two-instruction) entry sequence instead of calling csv(), and jumped to
crets instead of cret, whenever a procedure didn't use the non-volatile
registers (r2, r3, r4).
In my experience, most "real" heavy uses of floating-point computation do
require double precision to avoid loss of accuracy (e.g. when inverting
matrices, doing linear programming, or solving D.E.s). Therefore I think
the C default mode is reasonable. Just use doubles in the first place to
avoid the conversion from/to float.