speck@cit-vax.ARPA (Don Speck) (02/20/85)
Index: lib/mip 4.2

Description:
	Speed up raw I/O by making PCC produce better code.  Kernel
	profiling reveals that large (8K) raw writes spend close to
	half of their time in vslock()/vsunlock().  Much of this is
	spent doing unsigned division by 2 via udiv().  A right shift
	will accomplish the same thing, according to Appendix A
	section 7.5 of K&R.  Two more raw I/O speedups (in
	vslock/vsunlock, and ubasetup) are also given.

Repeat-by:
	Time a full dump before and after applying the fixes; system
	time will be reduced ~20%.

Fix:
	The shift optimization is just as machine-independent as
	changing multiply to shift, so we can do it in the
	machine-independent part of PCC (this should be done in ALL
	pcc's).  Also optimize unsigned modulus by powers of two
	(dump uses this a lot for bitmaps) and addition of zero.

diff -c old/optim.c /usr/src/lib/mip/optim.c
*** /usr/src/lib/mip/old/optim.c	Tue Jun  8 20:43:45 1982
--- /usr/src/lib/mip/optim.c	Sat Jan 26 16:05:43 1985
***************
*** 138,143
  			RV(p) = -RV(p);
  			o = p->in.op = MINUS;
  			}
  		break;
  
  	case DIV:
--- 138,150 -----
  			RV(p) = -RV(p);
  			o = p->in.op = MINUS;
  			}
+ 
+ 		/*FALLTHROUGH*/
+ 	case RS:
+ 	case LS:
+ 		/* Operations with zero -- DAS 1/20/85 */
+ 		if( (o==PLUS || o==MINUS || o==OR || o==ER || o==LS || o==RS)
+ 		 && nncon(p->in.right) && RV(p)==0 ) goto zapright;
  		break;
  
  	case DIV:
***************
*** 142,147
  	case DIV:
  		if( nncon( p->in.right ) && p->in.right->tn.lval == 1 )
  			goto zapright;
  		break;
  
  	case EQ:
--- 149,169 -----
  	case DIV:
  		if( nncon( p->in.right ) && p->in.right->tn.lval == 1 )
  			goto zapright;
+ 
+ 		/* Unsigned division by a power of two -- DAS 1/13/85 */
+ 		if( nncon(p->in.right) && (i=ispow2(RV(p)))>=0
+ 		 && ISUNSIGNED(p->in.type) ){
+ 			o = p->in.op = RS;
+ 			p->in.right->in.type = p->in.right->fn.csiz = INT;
+ 			RV(p) = i;
+ 		}
+ 		break;
+ 
+ 	case MOD:
+ 		/* Unsigned mod by a power of two -- DAS 1/13/85 */
+ 		if( nncon(p->in.right) && (i=ispow2(RV(p)))>=0
+ 		 && ISUNSIGNED(p->in.type) ){
+ 			o = p->in.op = AND;
+ 			RV(p)--;
+ 		}
  		break;
  
  	case EQ:
--------end of diff--------
The second biggest speedup again comes in vslock/vsunlock, by in-line
expansion of mlock()/munlock().  This is extremely easy and obvious to
do, so no diff is given.  [/sys/sys/vm_mem.c]

The third biggest speedup is to optimize the main loop in ubasetup().
The loop that sets the map registers unnecessarily repeats a bit-field
extraction.  Bit-field operations are rather expensive even with the
fancy VAX field instructions; PCC needs to learn to use simpler
instructions (BIC/BIS) when possible.

diff old/uba.c /sys/vaxuba/uba.c
162c162,163
< 		if (pte->pg_pfnum == 0)
---
> 		register int pfnum;
> 		if ((pfnum = pte->pg_pfnum) == 0)
164c165,166
< 		*(int *)io++ = pte++->pg_pfnum | temp;
---
> 		pte++;
> 		*(int *)io++ = pfnum | temp;

With these three fixes installed and the kernel recompiled, the
per-page part of raw I/O setup fell from 160us to 60us on our vax/750.
That takes about 20% off the CPU requirements of /etc/dump.  It's
probably not necessary to recompile the entire kernel;
/sys/sys/vm_mem.c, /sys/sys/vm_subr.c, and etc/dump/dumptraverse.c are
the biggest beneficiaries of the added PCC optimization.

				Don Speck
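For those without kernel source handy, the effect of the uba.c change
can be sketched in isolation.  The struct layout, the setmaps() name,
and the error path below are all illustrative stand-ins, not the real
VAX pte or ubasetup(); the point is simply that reading the bit-field
once into a register variable pays for one field extraction per loop
iteration instead of two.

```c
#include <assert.h>

/* Illustrative miniature pte: pg_pfnum is a bit-field, so every read
 * of it costs a field-extract operation on machines like the VAX. */
struct pte {
	unsigned pg_pfnum : 21;		/* page frame number */
	unsigned pg_prot  :  4;
	unsigned pg_v     :  1;
};

/* The pattern after the fix: extract pg_pfnum once, reuse the copy. */
int
setmaps(struct pte *pte, int *io, int n, int temp)
{
	while (--n >= 0) {
		register int pfnum;

		if ((pfnum = pte->pg_pfnum) == 0)	/* one extraction... */
			return (-1);			/* stand-in error path */
		pte++;
		*io++ = pfnum | temp;			/* ...reused here */
	}
	return (0);
}
```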