david@ukma.UUCP (David Herron, NPR Lover) (08/19/85)
Index: /lib/c2, BRL-Unix release 3, likely appears in other 4.?BSD.
Problem: Calls instructions have a format "calls <const-#-of-args>,<addr>".
The <const-#-of-args> argument will sometimes be "optimized" into
a register reference. This isn't necessary, and breaks the
massager which helps create the code in the kernel.
Repeat-by: Compile the following program segment:
b() {}
a()
{
	register int i;
	i = 0;
	b();
}
For which you get:
LL0:
.data
.text
.align 1
.globl _b
_b:
.word L12
jbr L14
L15:
ret
.set L12,0x0
L14:
jbr L15
.data
.text
.align 1
.globl _a
_a:
.word L17
jbr L19
L20:
clrl r11
calls $0,_b
ret
.set L17,0x800
L19:
jbr L20
After running /lib/c2 -i:
.data
LL0:.data
.text
.align 1
.globl _b
.set L12,0x0
.data
.text
_b:.word L12
ret
.align 1
.globl _a
.set L17,0x800
.data
.text
_a:.word L17
clrl r11
calls r11,_b
ret
Notice that the $0 has been changed to r11.
Fix: I was going to provide a fix, but I took one look at the code for
c2 and decided I wasn't that brave :-). I did notice a few
places in c21.c which looked to be likely candidates for fixing.
What I suggest is that c2 be taught not to optimize calls instructions.
Work-around: Add rules to asm.sed or inline (as appropriate) to match
all the possibilities that calls might be optimized into.
Since this is non-trivial I would really prefer for c2 to be fixed.
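As a rough illustration of why the work-around is non-trivial, the
sketch below classifies the first operand of a "calls" line: either it
is the constant form a pattern matcher expects, or it is something else,
such as the register form c2 produced above. The function name
calls_argc and its return convention are hypothetical; this is not code
from asm.sed or inline.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: if `line` is "calls $N,target" with a decimal N,
 * return N.  Return -1 for any other line, including a calls whose
 * count operand was rewritten to a register (e.g. "calls r11,_b").
 */
int calls_argc(const char *line)
{
    char *end;
    long n;

    while (isspace((unsigned char)*line))
        line++;
    if (strncmp(line, "calls", 5) != 0 || !isspace((unsigned char)line[5]))
        return -1;              /* not a calls instruction */
    line += 5;
    while (isspace((unsigned char)*line))
        line++;
    if (*line != '$' || !isdigit((unsigned char)line[1]))
        return -1;              /* count is not an immediate constant */
    n = strtol(line + 1, &end, 10);
    if (*end != ',')
        return -1;              /* malformed operand list */
    return (int)n;
}
```

A matcher built this way accepts "calls $0,_b" but rejects
"calls r11,_b", so every rewritten form needs its own rule.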
Could someone who still runs 4.2BSD (and 4.3BSD too) check to see if the
bug is that widely spread? And System V while yer at it..... :-)
--
--- David Herron
--- ARPA-> ukma!david@ANL-MCS.ARPA
--- UUCP-> {ucbvax,unmvax,boulder,oddjob}!anlams!ukma!david
--- {ihnp4,decvax,ucbvax}!cbosgd!ukma!david
Hackin's in me blood. My mother was known as Miss Hacker before she married!
chris@umcp-cs.UUCP (Chris Torek) (08/22/85)
>Index: /lib/c2, BRL-Unix release 3, likely appears in other 4.?BSD.
>Problem: Calls instructions have a format "calls <const-#-of-args>,<addr>".
>    The <const-#-of-args> argument will sometimes be "optimized" into
>    a register reference. This isn't necessary, and breaks the
>    massager which helps create the code in the kernel.

It saves a little teensy bit of time. Anyway, here's how to work
around it. I added a new flag to /lib/c2 (-c) but it should be obvious
how to make it always do what -c does.

First, change /usr/src/lib/c2/c20.c to handle "-c". (Beware, these
are to what amounts to the 4.3BSD c2; your code may vary slightly.)

RCS file: RCS/c20.c,v
retrieving revision 1.1
diff -c2 -r1.1 c20.c
*** /tmp/,RCSt1015521 Thu Aug 22 01:45:07 1985
--- c20.c Thu Aug 22 01:44:13 1985
***************
*** 15,18
  caddr_t sbrk();
  int ioflag;
  int fflag;
  long isn = 2000000;
--- 15,19 -----
  caddr_t sbrk();
  int ioflag;
+ int cflag;
  int fflag;
  long isn = 2000000;
***************
*** 57,60
                if ((*argv)[1]=='i') ioflag++;
                else if ((*argv)[1]=='f') fflag++;
                else nflag++;
            } else if (infound==0) {
--- 58,62 -----
                if ((*argv)[1]=='i') ioflag++;
                else if ((*argv)[1]=='f') fflag++;
+               else if ((*argv)[1]=='c') cflag++;
                else nflag++;
            } else if (infound==0) {

Second, modify the optimizer code in /usr/src/lib/c2/c21.c to check
cflag before optimizing "calls" constants.
This also includes a fix from Donn Seeley for a longstanding
fencepost error:

RCS file: RCS/c21.c,v
retrieving revision 1.1
diff -c2 -r1.1 c21.c
*** /tmp/,RCSt1015526 Thu Aug 22 01:45:24 1985
--- c21.c Thu Aug 22 01:44:49 1985
***************
*** 14,17
  #define NUSE 6
  int ioflag;
  int biti[NUSE] = {1,2,4,8,16,32};
  int bitsize[] = { /* index by type codes */
--- 14,18 -----
  #define NUSE 6
  int ioflag;
+ int cflag;
  int biti[NUSE] = {1,2,4,8,16,32};
  int bitsize[] = { /* index by type codes */
***************
*** 509,512
        break;

    /* .rx,.rx,.rx */
    case PROBER:
--- 510,521 -----
        break;

+   /* .rx,.rx */
+   case CALLS:
+       if (cflag)  /* don't optimize constants in calls */
+           break;
+       /* fall through */
+   case MTPR:
+   case CMP:
+   case BIT:
    /* .rx,.rx,.rx */
    case PROBER:
***************
*** 514,522
    case CASE:
    case MOVC3:
-   /* .rx,.rx */
-   case MTPR:
-   case CALLS:
-   case CMP:
-   case BIT:
        splitrand(p);
        /* fool repladdr into doing right number of operands */
--- 523,526 -----
    case CASE:
    case MOVC3:
        splitrand(p);
        /* fool repladdr into doing right number of operands */
***************
*** 724,728
        if (!equstr(regs[RT3],"-(sp)")) p->combop=T(MOVA,BYTE);
        else {p->combop=T(PUSHA,BYTE); *cp2=0;}
!       if (uses[r]==0) {uses[r]=p; regs[r][0]=OPX<<4;}
        p->pop=0;
    }
--- 728,735 -----
        if (!equstr(regs[RT3],"-(sp)")) p->combop=T(MOVA,BYTE);
        else {p->combop=T(PUSHA,BYTE); *cp2=0;}
!       if (r < NUSE && uses[r] == 0) {
!           uses[r]=p;
!           regs[r][0]=OPX<<4;
!       }
        p->pop=0;
    }
--
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 4251)
UUCP: seismo!umcp-cs!chris
CSNet: chris@umcp-cs ARPA: chris@maryland
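The last hunk of the c21.c patch tests the index before uses[r] is
read. A minimal stand-alone sketch of that pattern follows. NUSE is
taken from the patch; the note_use function and the int table are
hypothetical stand-ins for c2's real data structures, and the r >= 0
test is an extra precaution that is not in the patch.

```c
#define NUSE 6

/* Hypothetical stand-in for c2's table of per-register uses. */
static int uses[NUSE];

/*
 * Record `value` for register r only when r indexes the table and the
 * slot is still empty.  Returns 1 if the slot was filled, else 0.
 * Without the range test, r >= NUSE would read past the array's end.
 */
int note_use(int r, int value)
{
    if (r >= 0 && r < NUSE && uses[r] == 0) {
        uses[r] = value;
        return 1;
    }
    return 0;
}
```

The range test comes first in the condition, so the && short-circuit
keeps uses[r] from being evaluated at all for an out-of-range r.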
ado@elsie.UUCP (Arthur David Olson) (08/22/85)
> >Index: /lib/c2, BRL-Unix release 3, likely appears in other 4.?BSD.
> >Problem: Calls instructions have a format "calls <const-#-of-args>,<addr>".
> >    The <const-#-of-args> argument will sometimes be "optimized" into
> >    a register reference. This isn't necessary. . .
>
> It saves a little teensy bit of time. . .

Hmmm. . .let's check this out with the following program on a
VAX 11/750 (4.1bsd, cp_urev = 94, cp_hrev = 72):

main(argc, argv)
int argc;
char * argv[];
{
	register int i;

	i = atoi(argv[1]);
	do {
		one();
		two();
	} while (--i > 0);
}

dummy1() { dummy1(); dummy1(); dummy1(); dummy1(); dummy1(); }

one()
{
	register int i, j;

	i = 1000;
	do {
		j = 0;
		subr();
	} while (--i > 0);
}

dummy2() { dummy2(); dummy2(); dummy2(); dummy2(); dummy2(); }

two()
{
	register int i, j;

	i = 1000;
	do {
		subr();
		j = 0;
	} while (--i > 0);
}

dummy3() { dummy3(); dummy3(); dummy3(); dummy3(); dummy3(); }

subr() {}

If I name the above "try.c" and

	cc -c -O try.c ; cc -p -O try.o ; a.out 10000 ; prof

I get this output:

	 %time  cumsecs  #call  ms/call  name
	  60.8   325.10                  _subr
	  20.2   433.28                  _one
	  18.9   534.22                  _two
	   0.1   534.73                  _main

which says that the function "one", with its

	L33:clrl r10
	calls r10,_subr
	sobgtr r11,L33

takes more time than function "two", with its

	L46:calls $0,_subr
	clrl r10
	sobgtr r11,L46

loop. Comments?
--
Bugs is a Warner Brothers trademark.
UNIX is an AT&T Bell Laboratories trademark.
BRL is a Bethesda Research Laboratories trademark.
--
UUCP: ..decvax!seismo!elsie!ado ARPA: elsie!ado@seismo.ARPA
DEC, VAX and Elsie are Digital Equipment and Borden trademarks
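For scale, the profile figures in the post above can be turned into a
per-iteration cost. The sketch assumes the cumsecs column is cumulative,
so each function's own time is the difference between adjacent rows. It
also assumes each loop body ran 10000 * 1000 times, as the command line
and source suggest. The helper name is hypothetical.

```c
/*
 * Hypothetical helper: microseconds per loop iteration, given a
 * function's own seconds and the number of iterations it executed.
 */
double usec_per_iter(double seconds, double iterations)
{
    return seconds * 1e6 / iterations;
}

/* Own time for each function, from adjacent cumsecs rows. */
static const double one_secs = 433.28 - 325.10;   /* about 108.18 */
static const double two_secs = 534.22 - 433.28;   /* about 100.94 */
static const double iters = 10000.0 * 1000.0;     /* outer * inner */
```

Under those assumptions, usec_per_iter(one_secs - two_secs, iters)
comes to roughly 0.7 microseconds per iteration in favour of "two".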
hans@log-hb.UUCP (Hans Albertsson) (08/24/85)
In article <5207@elsie.UUCP> ado@elsie.UUCP (Arthur David Olson) writes:
> .....
>which says that the function "one", with its
>	L33:clrl r10
>	calls r10,_subr
>	sobgtr r11,L33
>takes more time than function "two", with its
>	L46:calls $0,_subr
>	clrl r10
>	sobgtr r11,L46
>loop.
>
>Comments?

You have changed the alignment: in one, the "_subr" arg to calls is at
a non-optimum address, such as an odd word address or maybe even worse,
assuming it's on an optimum address in two. The VAX architecture allows
any alignment for any part of any instruction, but the bus is fixed in
both width and memory access alignment, even for cache accesses. This
may sometimes severely penalise seemingly optimum programs.

The difference 20 to 18 is furthermore very small in comparison with
some such effects I have seen. I remember once when removing completely
an inactive ( always false ) IF statement ( some 8 out of 30 lines, I
seem to remember ) DOUBLED both user and system times..... That felt
VERY humiliating, I can assure you. The code became smaller, consistent
with the reduction of the source program size, but... We could find NO
other explanation. ( The language was a very early stage of the
TeleSOFT ADA. )

This would also seem to completely invalidate stuff like Dhrystone
benchmarks...
----------
This is the FIRST time ever I feel the need to point out that any
opinions expressed above are my own, and do not represent an official
TeleLOGIC opinion. In fact, some people here are likely to disagree
violently. With anything I say, on any subject...
--
Hans Albertsson,
USENET/uucp: {decvax,philabs}!mcvax!enea!log-hb!hans
Real World: TeleLOGIC AB, Box 1001, S-14901 Nynashamn,SWEDEN
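The alignment argument in the post above can be made concrete by
counting how many aligned memory fetches an instruction's bytes span.
The sketch assumes, purely for illustration, a fetch unit of `width`
bytes aligned on multiples of `width`. The actual fetch width and
instruction-buffer behaviour of the 11/750 are not stated in the thread,
and the function name is hypothetical.

```c
/*
 * Hypothetical helper: number of aligned `width`-byte units touched
 * by `len` bytes starting at byte address `addr` (len, width > 0).
 */
unsigned long units_touched(unsigned long addr, unsigned long len,
                            unsigned long width)
{
    unsigned long first = addr / width;             /* unit of first byte */
    unsigned long last = (addr + len - 1) / width;  /* unit of last byte */

    return last - first + 1;
}
```

With a 4-byte unit, a 7-byte instruction starting at address 0 touches
two units, while the same instruction starting at address 2 touches
three, so moving code by a couple of bytes can change the fetch count
without changing the instructions themselves.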