eliot%CS.QMC.AC.UK@MITVMA.MIT.EDU (Eliot Miranda) (02/20/89)
There is a bug with global registers in gcc Version 1.33 running on a SUN 3. The bug is the omission of an assignment to a global register variable. See the lines from asm("HERE:"); to asm("ENDHERE:"); in the C source and the assembler. instrPointer is a global register variable:

    register unsigned char *instrPointer asm("a4");

The assignment is the sole use of instrPointer within the function. The bug report follows some suggestions for Gnu & Gnu CC.

I am using GCC to compile my Smalltalk-80 interpreter. Apart from this bug, the compiler looks good. On the largest function (the interpreter) GCC produces 10% less code than SUN's pcc.

May I suggest an optimization to C switch statements? If the switch reads:

    register unsigned char *instrPointer asm("a4");
    ...
    switch (*instrPointer++) {
    case 0: ...
    ...
    case 255: ...
    }

and all the case indices are used, then the bounds check generated at the start of the switch is unnecessary, since an 8-bit unsigned char cannot index outside the jump table. (Of course I realise it's easy to tweak the assembler output, but GCC seems well on the way to being an excellent interpreter compiler, and this would be yet more icing on the cake!)

My interpreter (BrouHaHa, a portable Smalltalk interpreter, OOPSLA '87) currently runs at 50% of the performance of ParcPlace (ex Xerox) Smalltalk-80 on a SUN 3/60. This places it at around the same speed as the Tektronix implementation for 68000s written in assembler. Apart from global registers (obtained with sed scripts on the assembler using pcc) it also does bitblt (rasterop) using on-the-fly compilation, and supports 1 bit deep and 8 bit deep bitmaps. At the language level, non-reentrant block contexts have been replaced by fully reentrant closures. The system is a 32 bit implementation and can support up to 2 M objects which occupy a 32 bit address space. The implementation is mature, having been used intensively on a research project within QMC for the past two years. The interpreter itself is about 4 years old.
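
To illustrate the full-switch suggestion made earlier in this message, here is a minimal sketch of dispatching on an unsigned char through a 256-entry table, where every possible byte value has an entry and so no range check is needed. It is not taken from BrouHaHa: the handler names (op_nop, op_inc, op_dbl), the accumulator and interpret() are all invented for the example, and it assumes unsigned char is exactly 8 bits wide.

```c
#include <stddef.h>

/* Hypothetical handler type: each opcode transforms an accumulator. */
typedef int (*handler_fn)(int acc);

static int op_nop(int acc) { return acc; }
static int op_inc(int acc) { return acc + 1; }
static int op_dbl(int acc) { return acc * 2; }

/* One entry per possible byte value (assumes 8-bit unsigned char),
   so indexing with an unsigned char can never fall outside the table. */
static handler_fn handlers[256];

static void init_handlers(void)
{
    size_t i;
    for (i = 0; i < 256; i++)
        handlers[i] = op_nop;   /* default: every index is populated */
    handlers[1] = op_inc;
    handlers[2] = op_dbl;
}

/* Run n bytecodes starting at ip; note there is no bounds test on *ip. */
int interpret(const unsigned char *ip, size_t n, int acc)
{
    init_handlers();
    while (n--)
        acc = handlers[*ip++](acc);
    return acc;
}
```

This is a hand-written table, which is a different thing from the compiler dropping the check on a switch, but it shows why the check is redundant when all 256 cases exist.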
The interpreter currently runs on SUN 3s, IBM PS/2 Model 80 (under Xenix, sans global registers because Microsoft C is awful!), on the Whitechapel MG1 (ns32016 based machine) & Acorn Archimedes ARM. Apart from bitblt, the interpreter is highly portable.

I would be happy to put the interpreter in the public domain under the aegis of GNU. However, I use a virtual image derived from Xerox Smalltalk-80 Version 2.0 or ParcPlace Smalltalk-80 Version 2.3, neither of which is in the public domain :-) ( :-( ). If you can produce a public domain image, I can produce the interpreter.

On the subject of further enhancements to Gnu CC, my on-the-fly compiled bitblt works as follows: on each call to bitblt the arguments are analyzed and the appropriate fragments of machine code are concatenated to form a piece of code that performs the actual data manipulation. Since the code contains no tests (apart from the inner and outer loop counts), and since the analysis is performed once, the resulting implementation is very fast.

I generate the fragments of machine code ('funclets') thus: each funclet is written as a C function that declares all variables used in all funclets, in the same order as all other funclets. Subsequently, the compiler-generated, but unoptimized, assembler is edited to remove all prologs & epilogs from the funclets. The machine code is concatenated by a C function that, for each funclet, takes the address of the funclet & the address of the following funclet, casts them appropriately, and copies the machine code either into a dummy function or onto the stack, where it can be executed. The resulting code is executed using the stack frame of the compiling function. It is sort of portable & currently runs on the 68020 & the 80386, although I have major problems on the 386 through using the Microsoft C compiler!
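
The "analyze once, then run a loop with no tests" idea behind the bitblt above can be sketched in portable C. This sketch deliberately swaps the technique: it selects a function pointer once per call instead of copying machine code, so it pays a call per word and is only an analogy for the funclet scheme. The rule names, blt_rule, select_rule() and blt_words() are all hypothetical, and it handles only whole, aligned 32-bit words.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical combination rules, one "fragment" per rule. */
typedef uint32_t (*combine_fn)(uint32_t src, uint32_t dst);

static uint32_t rule_copy(uint32_t s, uint32_t d) { (void)d; return s; }
static uint32_t rule_and(uint32_t s, uint32_t d)  { return s & d; }
static uint32_t rule_or(uint32_t s, uint32_t d)   { return s | d; }
static uint32_t rule_xor(uint32_t s, uint32_t d)  { return s ^ d; }

enum blt_rule { BLT_COPY, BLT_AND, BLT_OR, BLT_XOR };

/* "Compile" step: analyze the arguments once and pick the fragment. */
static combine_fn select_rule(enum blt_rule rule)
{
    static const combine_fn table[] = {
        rule_copy, rule_and, rule_or, rule_xor
    };
    return table[rule];
}

/* "Execute" step: the inner loop tests only its loop count,
   never the combination rule. */
void blt_words(enum blt_rule rule, const uint32_t *src,
               uint32_t *dst, size_t nwords)
{
    combine_fn combine = select_rule(rule);
    size_t i;
    for (i = 0; i < nwords; i++)
        dst[i] = combine(src[i], dst[i]);
}
```

A real implementation along the lines described would concatenate the bodies of the selected fragments so that no call remains in the loop; that step depends on the machine and the compiler and is not attempted here.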
Given that Gnu CC already supports global register variables & pointer arithmetic on pointers to functions, it would be nice to specify that for specific functions the prolog & epilog should not be generated. This would improve portability immensely! It would also help if sizeof(*(void(*)())) was 1 on the 386, 2 on the 68020, 4 on the ARM etc.... Also, at some point, all registers used in the compilation process should be saved & restored, so one must be able to specify that a particular function saves all appropriate registers.

I am working on a faster interpreter that uses dynamic compilation to threaded code, the rationale being that direct threaded code is faster than bytecode decoding, that the same bytecode can be compiled to different TCODEs in different contexts, and most importantly, that message sends can be 'linked' in the threaded code and can require significantly less, and in some cases no, checking when compared to traditional method lookup cache techniques. In some ways the threaded code scheme is similar to the bitblt hack described above. A global register

    typedef void (*TCODE)();        /* a TCODE is a pointer to a function */
    register TCODE *tcip asm("a4"); /* the tcode ip points at TCODEs */

points at the next pointer to a function (I apologize now for teaching you to suck eggs!). On the 68020 each threaded code function is written thus:

    ...
    register OOP *stackPointer asm("a5"); /* Smalltalk stack pointer */
    ...
    void pushLit()
    {
        *++stackPointer = (OOP)(*tcip++);
        asm("movl a4@+,a0");
        asm("jmp a0@");
    }

Each threaded code operation is coded as a TCODE followed by an argument:

              |---------------|
              | &pushLit      |
              |---------------|
    tcip->    | pointer to obj|
              |---------------|
              ...

On entry to a TCODE, tcip points at the following argument. Each threaded code function has its prolog & epilog removed. All TCODE routines run in the same stack frame, that of the function that called the first tcode:

    ...
    (*(*tcip++))();
    ..
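
For readers without a 68020 to hand, the TCODE-plus-argument layout can be tried out in portable C. The sketch below is an assumption-laden stand-in, not the interpreter described here: it uses ordinary globals instead of global registers, a union CELL so that a slot can hold either a function pointer or a literal, a long in place of an OOP, and a central dispatch loop instead of the jmp at the end of each routine (so it is call threading, not true direct threading). addTop, halt and run_threaded are invented names.

```c
/* A threaded-code slot: either an operation or its inline argument. */
typedef union cell {
    void (*op)(void);
    long arg;            /* stand-in for an OOP */
} CELL;

static CELL *tcip;           /* threaded-code instruction pointer */
static long stack[16];       /* stack[0] is an unused sentinel slot */
static long *stackPointer;
static int running;

/* On entry tcip points at the argument that follows the operation. */
static void pushLit(void) { *++stackPointer = (tcip++)->arg; }

/* Pop the top of stack and add it to the new top. */
static void addTop(void)
{
    long rhs = *stackPointer--;
    *stackPointer += rhs;
}

static void halt(void) { running = 0; }

/* Fetch the next operation, advance tcip past it, then call it.
   The increment is complete before the callee starts running. */
long run_threaded(CELL *code)
{
    tcip = code;
    stackPointer = stack;
    running = 1;
    while (running)
        (*(tcip++)->op)();
    return *stackPointer;
}
```

A program is then an array of CELLs such as { pushLit, 2, pushLit, 40, addTop, halt }, filled in member by member. The dispatch loop is exactly the overhead that the prolog-free, jump-terminated routines above are meant to remove.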
With the ability to omit prolog & epilog this becomes much easier to do portably. But one needs to ensure that the function that called the first TCODE saves all registers except the current global registers, and that it provides sufficient stack space for any non-register variables used by any of the TCODEs (presumably I can use alloca, but if the space is not used, the optimizer may remove the call?).

In summary, I would like

1. full switches to have no bounds check
2. some way of specifying that a function should be generated without a prolog or an epilog
3. some way of specifying that a function should
   a) save ALL registers visible in C and
   b) save all bar the currently defined global registers and
   c) allocate space in a stack frame
4. some way of finding out the size of machine instructions, perhaps sizeof(*(void (*)()))

OK, now for the bug report! Look for the following assembler:

    #APP
    HERE:
    #NO_APP
            movel a3@(4),a0
            movel a0@(4),a0
            movel a2@(4),d0
            asll #1,d0
            asrl #1,d0
            addql #4,d0
            addl a0@(4),d0
            subql #1,d0
            # <-- where has the movl d0,a4 gone?
    #APP
    ENDHERE
    #NO_APP

The compiler was invoked as follows:

    gcc -v -O -S -fomit-frame-pointer -m68881 recache.c
    gcc version 1.33
     /usr/local/lib/gcc-cpp -v -undef -D__GNUC__ -Dmc68000 -Dsun -Dunix -D__mc68000__ -D__sun__ -D__unix__ -D__OPTIMIZE__ -D__HAVE_68881__ -Dmc68020 recache.c /tmp/cca03079.cpp
    GNU CPP version 1.33
     /usr/local/lib/gcc-cc1 /tmp/cca03079.cpp -quiet -dumpbase recache.c -m68881 -fomit-frame-pointer -O -version -o recache.s
    GNU C version 1.33 (68k, MIT syntax) compiled by GNU C version 1.33.

recache.c:

    typedef unsigned long LONG;
    typedef unsigned short WORD;
    typedef unsigned char BYTE;
    typedef enum { f = 0, t = 1 } BOOLEAN;
    typedef int BOOL;

    register char *GlobRegDummy1 asm("a5");
    register char *GlobRegDummy2 asm("a4");
    register char *GlobRegDummy3 asm("a3");

    typedef struct _OTE *OOP;
    typedef long SIOP;

    typedef struct {
        unsigned z : 1;
        unsigned d : 1;
        unsigned u : 1;
        unsigned m : 1;
        unsigned p : 1;
        unsigned o : 3;
        unsigned size : 24;
    } HEADER;

    typedef struct {
        HEADER h;
        union {
            BYTE b[(sizeof(LONG)/sizeof(BYTE)) ];
            WORD w[(sizeof(LONG)/sizeof(WORD)) ];
            OOP o[1];
        } b;
    } OBJBODY;

    typedef struct _OTE {
        unsigned rc : 8;
        unsigned class : 24;
        OBJBODY *a;
    } OTE;

    extern void _reallyAddToZCT(OOP );

    register OOP *stackPointer asm("a5");
    register BYTE *instrPointer asm("a4");

    typedef struct _f {
        OOP *tp;
        OOP *mp;
        OOP *sp;
        BYTE *ip;
        OOP thisContext;
        OOP method;
        OOP sender;
        OOP closure;
        struct {
            OOP intlip;
            OOP home;
            OOP nest;
            OOP *basesp;
        } block;
        struct _f *nextFrame;
        struct _f *prevFrame;
        OOP *stackCheck;
    } FRAME;

    register FRAME *frame asm("a3");

    void recache(context)
    OOP context;
    {
        register OOP *ptr;
        register long length;
        register long stackSize;

        ptr = (&(((context)->a)->b.o[0]));
        frame->thisContext = context;
        stackPointer = frame->tp;
        if (((SIOP) (((ptr)[4])) < 0)) {
            frame->block.nest = ptr[3];
            frame->block.intlip = ptr[4];
            frame->block.home = ptr[5];
            if ((((OOP) ((frame->block.home)->class)) == (((OOP) 0x70000) + 1))) {
                register FRAME *homeFrame = ((FRAME *) (((frame->block.home)->a)->b.o[0]));
                frame->method = homeFrame->method;
                frame->mp = homeFrame->mp;
                frame->tp = homeFrame->tp;
            } else {
                frame->method = (((ptr[5])->a)->b.o[3]);
                frame->mp = (&(((frame->method)->a)->b.o[0]));
                frame->tp = &(((ptr[5])->a)->b.o[5]);
            }
            frame->block.basesp = stackPointer;
            if (1) {
                if (!((*((BYTE *) (ptr[5]))) -= 2))
                    if (1) {
                        if (!((((char *) ((ptr[5])->a))[0]) < 0)) {
                            ((ptr[5])->a->h.z) = (BOOL) t;
                            _reallyAddToZCT(ptr[5]);
                        }
                    } else;
            } else;
            if (1) {
                if (!((*((BYTE *) (ptr[3]))) -= 2))
                    if (1) {
                        if (!((((char *) ((ptr[3])->a))[0]) < 0)) {
                            ((ptr[3])->a->h.z) = (BOOL) t;
                            _reallyAddToZCT(ptr[3]);
                        }
                    } else;
            } else;
        } else {
            frame->method = ptr[3];
            frame->mp = (&(((frame->method)->a)->b.o[0]));
            frame->tp = stackPointer;
            if (1) {
                if (!((*((BYTE *) (ptr[3]))) -= 2))
                    if (1) {
                        if (!((((char *) ((ptr[3])->a))[0]) < 0)) {
                            ((ptr[3])->a->h.z) = (BOOL) t;
                            _reallyAddToZCT(ptr[3]);
                        }
                    } else;
            } else;
        }
        frame->sender = ptr[0];
        asm("HERE:");
        instrPointer = ((BYTE *) (&(((((frame->mp)[1]))->a)->b.o[0]))) + (((long) (ptr[1]) << 1) >> 1) - 1;
        asm("ENDHERE");
        length = ((((OBJBODY *) ((HEADER *) (ptr) - 1))->h.size) - 1) - 6;
        (((OBJBODY *) ((HEADER *) (ptr) - 1))->h.p) = (BOOL) f;
        (((OBJBODY *) ((HEADER *) (ptr) - 1))->h.o) = 2;
        ((context)->class = (unsigned long) ((((OOP) 0x70000) + 1)));
        *ptr = (OOP) frame;
        length -= (stackSize = (((long) (ptr[2]) << 1) >> 1));
        if (frame->block.home)
            ptr += 6;
        else {
            stackSize += 1;
            ptr += 5;
        }
        while (--stackSize >= 0) {
            if (1) {
                if (!(((SIOP) (*ptr) <= 0) || ((*((BYTE *) (*ptr))) -= 2)))
                    if (1) {
                        if (!((((char *) ((*ptr)->a))[0]) < 0)) {
                            ((*ptr)->a->h.z) = (BOOL) t;
                            _reallyAddToZCT(*ptr);
                        }
                    } else;
            } else;
            *stackPointer++ = *ptr++;
        }
        while (--length >= 0) {
            if ((SIOP) * ptr > (SIOP) (((OOP) 0x70000) + 28))
                if (1) {
                    if (!((*((BYTE *) (*ptr))) -= 2))
                        if (1) {
                            if (!((((char *) ((*ptr)->a))[0]) < 0)) {
                                ((*ptr)->a->h.z) = (BOOL) t;
                                _reallyAddToZCT(*ptr);
                            }
                        } else;
                } else;
            ptr++;
        }
        stackPointer--;
    }

Eliot Miranda                   email: eliot@cs.qmc.ac.uk
Dept of Computer Science        Tel: 01 975 5220
Queen Mary College              International: +44 1 975 5220
Mile End Road
LONDON E1 4NS