CSvax:Physics:suitti (10/21/82)
***Warning: high verbosity (bogosity) level***

Intro: Two hours, 52 minutes and 27 seconds of CPU time were expended in
an attempt to determine how to shave a few seconds off of an atypical
program run here at Purdue Physics.  The sources are given at the end.
Other information about the test is given in random order.  Some of the
conclusions may be useful; others are just flames.

The Numbers

    Lang     type    (secs)    text     data     bss      dec    name
    average  float     76.4
    f77      float    120.7   19180 +  2578 +  10186 =  31944    benf
    f77#     float    107.7   18728 +  2578 +  10162 =  31468    binf
    f4p      float     39.4    7634 +  5318 +      0 =  12952    bnchf
    as       float     27.6    3270 +   320 +   6948 =  10538    pfa
    Cp       float     60.6    3310 +   322 +   6952 =  10584    pf
    Ca       float    102.2    3430 +   330 +   6944 =  10704    af

    average  double    74.7
    f77      double   117.5   19176 +  2578 +  14986 =  36740    bend
    f77#     double    98.8   18724 +  2578 +  14962 =  36264    bind
    f4p      double    53.5    7638 + 10118 +      0 =  17756    bnchd
    as       double    36.9    3308 +   322 +  11748 =  15378    pda
    Cp       double    47.0    3302 +   322 +  11752 =  15376    pd
    Ca       double    94.6    3424 +   330 +  11744 =  15498    ad

    average  int       38.0
    f77      int       70.2   19132 +  2578 +   7786 =  29496    beni
    f77#     int       58.9   18706 +  2578 +   7762 =  29046    bini
    f4p      int       21.0    7602 +  2918 +      0 =  10520    bnchi
    as       int       12.5    3282 +   322 +   4548 =   8152    pia
    Cp       int       16.0    3292 +   322 +   4552 =   8166    pi
    Ca       int       49.3    3394 +   330 +   4544 =   8268    ai

    average  long      98.4
    f77      long     140.7   19282 +  2578 +  10186 =  32046    benl
    f77#     long     130.2   18840 +  2578 +  10162 =  31580    binl
    f4p      long      93.1    7730 +  5322 +      0 =  13052    bnchl
    as       long      41.1    3338 +   322 +   6948 =  10608    pla
    Cp       long      70.5    3394 +   322 +   6952 =  10668    pl
    Ca       long     115.1    3706 +   330 +   6944 =  10980    al

    Basic            1721.5   (run only once)                    slow.bas

where

average is the average time used for that data type.  A meaningless
        number?
f77     is f77 with its default integer*4 i,j,k indexes.
f77#    is f77 using integer*2 for the i,j,k indexes.
f4p     is the CULC ported Fortran 66 compiler.  I consider it to be
        smarter than the average joe.  These were compiled as split
        I & D.
Ca      is the Ritchie compiler, using the first version of the
        benchmark, which subscripts into arrays as follows:

                c[i][j++] = foo;

        as opposed to the pointer version, which would do it as:

                *cp++ = foo;

Cp      is the Ritchie compiler, using the second version of the
        benchmark, which uses pointer arithmetic to speed execution.
        This demonstrates one of the (now obvious) philosophies of C:
        a language designed so that a programmer can get the program
        written quickly, but optimization is also done by the
        programmer.  He can get C to produce something very close to
        what he'd have written in assembler (if he knew it).
as      means that the Cp assembler output was modified so as to be
        truly optimal.  Temporaries r0, r1 were utilized for variables
        if possible.  Single-precision floating point was used in the
        single-precision version.  Floating registers were used as
        summers (float and double).  The code was cleaned up in
        general.  Things that were pushed on the stack were left in
        registers.  I wouldn't expect a compiler to do everything that
        was done here.  It should be thought of as a baseline for how
        fast the machine is, and what penalty you get for using a
        higher level language.
basic   is a basic interpreter that happens to be lying around here.

Notes on the benchmark.

1. Benchmarks: 100 iterations, 20 x 20 matrix.  Environment: 11/44 with
2 Meg of RAM (not much swapping) running 2.8 bsd.  Load averages ranged
from about 1 to about 6.  Little control was exercised.  The times are
averaged over 5 runs of each program.  Lastcomm was used to determine
total runtimes.

2. The benchmark is mainly designed to test the speed of calculation,
loop control, and array subscripting, in programs which run for long
periods of time.  It is relatively simple.  No conditional flow of
control was tested.  The nesting level was three.  It turns out that
f4p starts to run out of registers at this point, and has to start
using temporaries (at absolute locations).  C only has three register
variables for pointer subscription.
F77 is always at a loss.

3. A quick study will show that the init portions (init a, b, c, as
well as the checksum at the end) are all N**2 operations.  The matrix
multiply is an N**3 operation.  N=20, therefore we have
4 * 20 * 20 + 20 * 20 * 20 = 1600 + 8000 = 9600, so some 83% of the
time is spent in the matrix multiply.  Profiling confirms this, more or
less.  The point is that most of the time is spent in a minority of the
code.  One could think that too little code is tested, and be correct.
Thus, one may take the following numbers with a grain of sodium
chloride.

4. All of the compiles were done with optimize switches, if any.  F77
produces bad enough swill with the option.  No tests of which
pessimizer is worse (better?) were done here.

Some conclusions:

1. When the data type is fixed, the order in speedy languages is:

        as Cp f4p Ca f77# f77

unless using float, in which case it's:

        as f4p Cp Ca f77# f77

C uses doubles internally to compute floats, whereas f4p uses floats to
compute floats.  If this were not the case (as with my assembly
version), it can be shown that one can write floating point code which
will run faster in C than (even) f4p.  It will still be slow if you use
generalized array subscripting.  This benchmark does not show how
anything compiles truly random-access array subscription.  The f77 code
is very similar to the Ca code, only worse.

2. When the language utilized is fixed, the order in speedy data types
is:

        int double float long

unless using f4p, in which case it's:

        int float double long

This is because C uses doubles to do floats.  It converts to double,
computes the value, converts back to float.  This makes it necessarily
slower than double.  Note that computation in double is half the speed
of computation in float, at least on an 11/44.  The numbers in the book
also point to this fact.  Since C does floats with doubles, so does
f77.  Thus f77 moves like a thundering turtle.

3. Longs are a lossage on an 11.  There is no hardware which supports
them.
They take up many regular registers to do anything with them.  C likes
to put long temporaries on the stack during more complex operations.

What can be done to C?

1. Make it do floats in single precision.  An enormous amount of code
is produced to convert back and forth.  Many floating registers are
used for the converted values during calculation.  Consider the line:

        *a++ = *b++ + *c++;     /* register float *a, *b, *c */

C will produce something like:

        movof   (r2)+,r0        / convert, move into floating r0
        movof   (r3)+,r1        / convert, move into floating r1
        addf    r1,r0           / add, double precision
        movfo   r0,(r4)+        / convert, move answer to destination

instead of (worst case):

        setf                    / do it in single precision
        movf    (r2)+,r0        / move 4 bytes to floating r0
        addf    (r3)+,r0        / add in single precision
        movf    r0,(r4)+        / move to destination

Advantages:

a. The new code is much faster.  The add instruction takes less time.
A movf is faster than a movfo or movof.  The setf is very fast.

b. The new code uses only one floating register (r0).  This leaves
floating r1, r2, r3, and if you are truly mutant, r4 and r5.

c. With a little work, the setf (which is very fast) would not be
needed.  This saves a word.  This starts to be of more value in complex
statements.  Just have C remember to force the mode at the beginning of
the subroutine (if you use floating point math, of course) and after
every subroutine call (only if more math is to be done).

2. A compiler option to compile long multiplies in line.  This is
usually just a waste of space.  In this case, however, most of the time
saved in the as version was saved by doing just this.  No csv or cret.
The arguments didn't have to be pushed on the stack, or popped off.

3. C, in more complex programs than this, often gets confused between
floating registers and regular registers.  It often allocates a regular
register when it wanted to do a floating (single precision)
calculation.  This is a bug which crops up once in a while.
I've ported some C that runs on vaxen which wouldn't compile on my 11.
It wanted to use three floating registers: r0, r1 and r2.  It didn't
like the fact that the regular register r2 was allocated.  It gave an
error message that said either that the expression was too complex or
that too many registers were in use.  C also seems to want to do
floating multiplies in odd-numbered floating registers (just as must be
done with ints).

4. The C language could be improved.  How about declaration of floating
point register variables?  That way if you said:

        register float x;
        register int i;
        register float *j;
        float foo[100];

        x = 0.0;
        j = &foo[0];
        i = 100;
        do {
                x += *j++;
        } while (--i);

it could produce:

        setf                    / set single precision floating
        clrf    r0              / x = 0.0
        mov     r5,r3           / calculate addr of foo
        add     $foo,r3
        mov     $100.,r4        / i = 100.
L1:     addf    (r3)+,r0        / x += *j++
        sob     r4,L1           / while (--i)

One would have to worry about saving floating registers during
subroutine calls.  I'd suggest having the caller save them, and only if
used, rather than having csv do it.  All we need is for csv to become
more cumbersome.  The kernel uses it all the time, after all, and the
kernel doesn't use many (if any) floats.

5. If you square a value:

        main() {
                register i,j,k;
                i = (j-k)*(j-k);
        }

you get:

_main:  jsr     r5,csv          / i=r4, j=r3, k=r2
        mov     r3,-(sp)        / could have used r0, why didn't it?
        sub     r2,(sp)         / j - k
        mov     r3,r1
        sub     r2,r1           / j - k, again
        mul     (sp)+,r1        / (j-k) * (j-k)
        mov     r1,r4           / answer in i
        jmp     cret

which is not:

_main:  jsr     r5,csv          / i=r4, j=r3, k=r2
        mov     r3,r1
        sub     r2,r1           / j - k
        mul     r1,r1           / squared
        mov     r1,r4           / answer in i
        jmp     cret

There are two solutions to this (very common math) problem.  The first
is that C could try to recognize the operation as one of squaring, and
only compute the difference once.  This would be difficult, as side
effects abound.  How about:

        a = (*b++ - *c++) * (*b++ - *c++);

which isn't a squaring operation at all?
The other obvious solution is to put an exponentiation operator into
the C syntax and support it.  Squaring is done with one multiply.
Cubing is done with two.  The fourth power is also done with two, as
follows:

        a ** 0 = 1                      0 multiplies
        a ** 1 = a                      0 multiplies
        a ** 2 = a * a                  1 multiply
        a ** 3 = a * a * a              2 multiplies
        a ** 4 = (a * a) ** 2           2 multiplies
        a ** 5 = a * (a ** 4)           3 multiplies
        a ** 6 = (a ** 3) ** 2          3 multiplies
        a ** 7 = a * (a ** 6)           4 multiplies
        a ** 8 = (a ** 4) ** 2          3 multiplies

There are fortran compilers that I have seen that know up to ** 32.  If
you are raising to a variable, or an unknown constant, then call a
subroutine.

6. A controversial topic:  As far as I can tell, csv and cret are not
needed.  Thus a stack frame is not absolutely needed.  Indeed, a frame
is an excellent means of figuring out how your program got to where it
bombed.  It is not, however, foolproof.  Consider an array reference
out of bounds.  Off by one.  Local array (allocated on the stack).
Wrote to the return address.  Oops.

The advantages of a frame are:
a. Eases the compiler implementation.  (?)
b. Call sequence information is available when the program crashes.
c. Allows quick restore of the stack on weird exit of a subroutine.

The disadvantages of a stack frame are:
a. It must always use a cpu register (the 11 only has 6).
b. Subroutine linkage is slower.
c. The stack frame always uses more stack.
d. Implementation with csv & cret means that all three register
   variables are pushed, needed or not.
e. C seems to always restore its stack before going to cret anyway.

Code without stack frame:

        jsr     pc,foo
        .
foo:    mov     r4,-(sp)        / foo happens to use r4
        mov     r3,-(sp)        / and r3
        mov     r2,-(sp)        / and r2
        .
        .                       / the body of foo
        .
        mov     (sp)+,r2
        mov     (sp)+,r3
        mov     (sp)+,r4
        rts     pc

The stack frame code is:

        jsr     pc,foo
        .
foo:    jsr     r5,csv          / i=r4, j=r3, k=r2
        .
csv:    mov     r5,r0
        mov     sp,r5
        mov     r4,-(sp)
        mov     r3,-(sp)
        mov     r2,-(sp)
        jsr     pc,(r0)         / the extra push
        .
        .                       / the body of foo
        .
        jmp     cret
        .
cret:   mov     r5,r2
        mov     -(r2),r4
        mov     -(r2),r3
        mov     -(r2),r2
        mov     r5,sp
        mov     (sp)+,r5        / restore the previous frame
        rts     pc              / return to first caller

How do you do locals on the stack?  On the stack!  You've got an sp,
use it.  Just remember, as you push things on (for temporary use), to
adjust your index into the stack to get your variables.  An added
compiler complication, to be sure.  But you get the use of four
register variables.  Of course, you could just buy a vax, and have nine
of them.  Infinite resources will get you everywhere.

7. If you don't care about speed, you can go out and get a basic
interpreter.

/* Benchmark from Byte, Sept issue.  Converted to C by Stephen Uitti */
/* This is the array version.  The FORTRAN version is similar */

#define ARTYP int       /* ARTYP may be double, float or int */
#define M 20
#define N 20
ARTYP a[M][N], b[N][M], c[M][N];        /* The arrays, global */
double summ;                            /* The checksum, global */

main()
{
        register i;

        for (i = 0; i < 100; i++) {     /* 100 iterations */
                summ = 0.0;
                filla();
                fillb();
                fillc();
                matmult();
                summit();
        }
        printf("Summ is : %lf\n", summ);        /* should be 465880 */
}

filla()
{
        register int i, j;

        for (i = 0; i < M; i++)
                for (j = 0; j < N; j++)
                        a[i][j] = i + j + 2;
}

fillb()
{
        register int i, j;
        long k;

        for (i = 0; i < M; i++)
                for (j = 0; j < N; j++) {
                        k = (ARTYP)(i + j + 2) / (ARTYP)(j + 1);
                        b[i][j] = k;
                }
}

fillc()
{
        register int i, j;

        for (i = 0; i < M; i++)
                for (j = 0; j < N; j++)
                        c[i][j] = 0;
}

matmult()
{
        register int i, j, k;

        for (i = 0; i < M; i++)
                for (j = 0; j < N; j++)
                        for (k = 0; k < M; k++)
                                c[i][j] += a[i][k] * b[k][j];
}

summit()
{
        register int i, j;

        for (i = 0; i < M; i++)
                for (j = 0; j < M; j++)
                        summ += c[i][j];
}

/* ************************************************************ */

#define ARTYP long      /* Pointer version */
#define M 20
#define N 20
ARTYP a[M][N], b[N][M], c[M][N];
long summ;

main()
{
        register i = 100;       /* 100 iterations */

        do {
                summ = 0.0;
                filla();
                fillb();
                fillc();
                matmult();
                summit();
        } while (--i);
        printf("Summ is : %ld\n", summ);
}

filla()
{
        register int i, j;
        register ARTYP *k;

        k = &a[M-1][N];
        i = M;
        do {
                j = N;
                do {
                        *--k = i + j;
                } while (--j);
        } while (--i);
}

fillb()
{
        register int i, j;
        register ARTYP *k;

        k = &b[M-1][N];
        i = M;
        do {
                j = N;
                do {
                        *--k = (i + j) / (j);
                } while (--j);
        } while (--i);
}

fillc()
{
        register int i;
        register ARTYP *k;

        k = c;
        i = M * N;
        do
                *k++ = 0;
        while (--i);
}

matmult()
{
        register ARTYP *ap, *bp;        /* in order of importance */
        register int k, i, j;
        register ARTYP *cp, *aptmp;

        cp = &c[M-1][N-1];
        ap = &a[M-1][N-1];
        i = M;
        do {
                aptmp = ap;
                j = N;
                do {
                        bp = &b[M-1][j-1];
                        k = M;
                        ap = aptmp;
                        do {
                                *cp += *ap * *bp;
                                ap--;
                                bp -= M;
                        } while (--k);
                        cp--;
                } while (--j);
        } while (--i);
}

summit()
{
        register int i, j;
        register ARTYP *k;

        k = c;
        i = M * N;
        do
                summ += *k++;
        while (--i);
}

c       A Steve Uitti hackification, fortran aversion
c       benchmark from byte, Sept 82 issue, just curiosity
        real*8 a,b,c
        integer*4 summ
        common/xxx/ a(20, 20), b(20, 20), c(20, 20), summ
        do 20 i = 1, 100
        summ = 0.0
        call filla
        call fillb
        call fillc
        call matmul
        call summit
   20   continue
        write (6,10) summ
   10   format(' ', i8)
        end

        subroutine filla
        real*8 a,b,c
        integer*4 summ
        common/xxx/ a(20, 20), b(20, 20), c(20, 20), summ
        do 100 i = 1, 20
        do 100 j = 1, 20
        a(j, i) = i + j
  100   continue
        return
        end

        subroutine fillb
        real*8 a,b,c
        integer*4 summ
        common/xxx/ a(20, 20), b(20, 20), c(20, 20), summ
        do 200 i = 1, 20
        do 200 j = 1, 20
        b(j, i) = (i + j) / i
  200   continue
        return
        end

        subroutine fillc
        real*8 a,b,c
        integer*4 summ
        common/xxx/ a(20, 20), b(20, 20), c(20, 20), summ
        do 300 i = 1, 20
        do 300 j = 1, 20
        c(j, i) = 0.0
  300   continue
        return
        end

        subroutine matmul
        real*8 a,b,c
        integer*4 summ
        common/xxx/ a(20, 20), b(20, 20), c(20, 20), summ
        do 400 i = 1, 20
        do 400 j = 1, 20
        do 400 k = 1, 20
        c(j, i) = c(j, i) + a(k, i) * b(k, j)
  400   continue
        return
        end

        subroutine summit
        real*8 a,b,c
        integer*4 summ
        common/xxx/ a(20, 20), b(20, 20), c(20, 20), summ
        do 500 i = 1, 20
        do 500 j = 1, 20
        summ = summ + c(j, i)
  500   continue
        return
        end

/*
 * For brevity, I'm not including all four assembly sources.  These
 * are available on request.
 */

I just love to sign my name:
        Stephen Uitti
        ...pur-ee!physics:suitti
        ...pur-ee!pur-phy!suitti
gwyn@Brl@sri-unix (10/28/82)
From: Doug Gwyn <gwyn@Brl>
Date: 25 Oct 82 15:11:44-EDT (Mon)

I like the idea of generating in-line code for PDP-11 long
multiplication (it only takes a few instructions).

Whitesmiths (Mark Krieger, I believe) adjusted their PDP-11 code
generator so that it generated a fast (two-instruction) entry sequence
instead of calling csv() and jumped to crets instead of cret, whenever
a procedure didn't use the non-volatile registers (R2,R3,R4).

In my experience, most "real" heavy uses of floating-point computation
do require double precision to avoid loss of accuracy (e.g. when
inverting matrices, linear programming, solving D.E.s, etc.).
Therefore I think the C default mode is reasonable.  Just use doubles
in the first place to avoid the conversion from/to float.