eliot%CS.QMC.AC.UK@MITVMA.MIT.EDU (Eliot Miranda) (02/20/89)
There is a bug with global registers in gcc Version 1.33 running on a SUN 3.
The bug is the omission of an assignment to a global register variable.
See the lines
asm("HERE:");
to
asm("ENDHERE:");
in the C source and the assembler. instrPointer is a global register variable
register unsigned char *instrPointer asm("a4");
The assignment is the sole use of instrPointer within the function.
The bug report follows some suggestions for Gnu & Gnu CC.
I am using GCC to compile my Smalltalk-80 interpreter. Apart from this bug,
the compiler looks good. On the largest function (the interpreter) GCC
produces 10% less code than SUN's pcc. May I suggest an optimization to C
switch statements?
If the switch reads:
register unsigned char *instrPointer asm("a4");
...
switch(*instrPointer++) {
case 0: ...
...
case 255: ...
}
and all the case indices are used, then the bounds check generated at the
start of the switch is unnecessary. (Of course I realise it's easy to tweak the
assembler output, but GCC seems well on the way to being an excellent
interpreter compiler; this is yet more icing on the cake!)
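To illustrate why the check is redundant: assuming 8 bit chars, an unsigned char operand can only hold 0 to 255, so a dispatch with 256 entries can never see an out-of-range index. The sketch below uses a 256-entry table of function pointers rather than a switch, but the same argument applies to the jump table a compiler builds for a full switch. The handler names, the accumulator and the three-opcode instruction set are hypothetical, chosen only for illustration.

```c
/* Hypothetical handlers; a real interpreter would have one per bytecode. */
static long acc;

static void op_nop(void) { }
static void op_inc(void) { acc += 1; }
static void op_dbl(void) { acc *= 2; }

typedef void (*Handler)(void);

/* One entry for every possible value of an 8 bit unsigned char. */
static Handler table[256];

static void init_table(void)
{
    int i;

    for (i = 0; i < 256; i++)
        table[i] = op_nop;      /* unused opcodes do nothing */
    table[1] = op_inc;
    table[2] = op_dbl;
}

/* Executes n bytecodes starting at code and returns the accumulator. */
static long run(const unsigned char *code, int n)
{
    const unsigned char *ip = code;

    acc = 0;
    while (n-- > 0)
        table[*ip++]();         /* index is always 0..255: no range check */
    return acc;
}
```

For example, after init_table(), running the three bytes 1, 1, 2 leaves 4 in the accumulator.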
My interpreter (BrouHaHa, a portable Smalltalk interpreter, OOPSLA '87) is
currently 50% of the performance of ParcPlace (ex Xerox) Smalltalk-80 on
a SUN 3/60. This places it at around the same speed as the Tektronix
implementation for 68000s written in assembler. Apart from global registers
(obtained with sed scripts on the assembler using pcc) it also does bitblt
(rasterop) using on-the-fly compilation, and supports 1 bit deep and 8 bit deep
bitmaps.
At the language level, non-reentrant block contexts have been replaced by
fully reentrant closures. The system is a 32 bit implementation and can
support up to 2 M objects which occupy a 32 bit address space.
The implementation is mature, having been used intensively on a research
project within QMC for the past two years. The interpreter itself is about
4 years old.
The interpreter currently runs on SUN 3s, IBM PS/2 Model 80 (under Xenix, sans
global registers because Microsoft C is awful!), on the Whitechapel MG1 (ns32016
based machine) & Acorn Archimedes ARM. Apart from bitblt, the interpreter is
highly portable. I would be happy to put the interpreter in the public domain
under the aegis of GNU.
However, I use a virtual image derived from Xerox Smalltalk-80 Version 2.0 or
ParcPlace Smalltalk-80 Version 2.3, neither of which is in the public domain :-)
( :-( ). If you can produce a public domain image, I can produce the
interpreter.
On the subject of further enhancements to Gnu CC, my on-the-fly compiled bitblt
works as follows:
on each call to bitblt the arguments are analyzed and the appropriate fragments
of machine code are concatenated to form a piece of code that performs the
actual data manipulation. Since the code contains no tests (apart from the
inner and outer loop counts), and since the analysis is performed once, the
resulting implementation is very fast.
I generate the fragments of machine code ('funclets') thus:
Each funclet is written as a C function that declares all variables used in all
funclets, and in the same order as all other funclets. Subsequently,
the compiler generated, but unoptimized, assembler is edited to remove all
prologs & epilogs from the funclets. The machine code is concatenated
by a C function that, for each funclet, takes the address of the funclet &
the address of the following funclet, casts them appropriately, and copies
the machine code either into a dummy function or onto the stack, where it can
be executed. The resulting code is executed using the stack frame of the
compiling function.
The resulting code is sort of portable & currently runs on the 68020 & the
80386, although I have major problems on the 386 through using the Microsoft C
compiler!
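The "analyze once, then run without tests" idea can be sketched in portable C. Note that this is not the funclet scheme described above: instead of concatenating machine code, it selects one of several precompiled row loops once per call, so the inner loop contains nothing but the loop count. All names here (Pixel, RowOp, the RULE_ constants, blt) are hypothetical, and one byte per pixel is assumed.

```c
#include <stddef.h>

typedef unsigned char Pixel;    /* assumption: 8 bit deep bitmap */
typedef void (*RowOp)(Pixel *dst, const Pixel *src, size_t n);

/* Generates one specialised row loop per combination rule. */
#define DEFINE_ROW(name, expr)                                   \
    static void name(Pixel *dst, const Pixel *src, size_t n)     \
    {                                                            \
        while (n--) { *dst = (Pixel)(expr); dst++; src++; }      \
    }

DEFINE_ROW(row_copy, *src)
DEFINE_ROW(row_and,  *src & *dst)
DEFINE_ROW(row_or,   *src | *dst)
DEFINE_ROW(row_xor,  *src ^ *dst)

enum Rule { RULE_COPY, RULE_AND, RULE_OR, RULE_XOR };

/* The analysis step: performed once per call, outside all loops. */
static RowOp select_row(enum Rule rule)
{
    switch (rule) {
    case RULE_AND: return row_and;
    case RULE_OR:  return row_or;
    case RULE_XOR: return row_xor;
    default:       return row_copy;
    }
}

/* Combines a width x height block of src into dst; stride is in pixels. */
static void blt(Pixel *dst, const Pixel *src, size_t width, size_t height,
                size_t stride, enum Rule rule)
{
    RowOp op = select_row(rule);
    size_t y;

    for (y = 0; y < height; y++)
        op(dst + y * stride, src + y * stride, width);
}
```

This still pays one indirect call per row, which the machine-code concatenation avoids, so it shows the structure of the technique rather than its full speed.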
Given that Gnu CC already supports global register variables & pointer
arithmetic on pointers to functions, it would be nice to specify that for
specific functions the prolog & epilog should not be generated. This would
improve portability immensely! It would also help if sizeof(*(void(*)())) was
1 on the 386, 2 on the 68020, 4 on the ARM etc.... Also, at some point, all
registers used in the compilation process should be saved & restored, so one
must be able to specify that a particular function saves all appropriate
registers.
I am working on a faster interpreter that uses dynamic compilation to
threaded code, the rationale being that direct threaded code is faster than
bytecode decoding, that the same bytecode can be compiled to different
TCODEs in different contexts, and most importantly, that message sends can
be 'linked' in the threaded code and can require significantly less,
and in some cases no, checking when compared to traditional method lookup
cache techniques.
In some ways the threaded code scheme is similar to the bitblt hack described
above. A global register
typedef void (*TCODE)(); /* a TCODE is a pointer to a function */
register TCODE *tcip asm("a4"); /* the tcode ip points at TCODEs */
points at the next pointer to a function (I apologize now for teaching you to
suck eggs!). On the 68020 each threaded code function is written thus:
...
register OOP *stackPointer asm("a5"); /* Smalltalk stack pointer */
...
void pushLit()
{ *++stackPointer = (OOP)(*tcip++);
asm("movl a4@+,a0");
asm("jmp a0@");
}
Each threaded code operation is coded as a TCODE followed by an argument:
        |---------------|
        | &pushLit      |
        |---------------|
tcip->  | pointer to obj|
        |---------------|
...
On entry to a TCODE, tcip points at the following argument.
Each threaded code function has its prolog & epilog removed.
All TCODE routines run in the same stack frame, that of the function that
called the first tcode:
...
(*(*tcip++))();
..
With the ability to omit prolog & epilog this becomes much easier to do
portably.
But one needs to ensure that the function that called the first TCODE saves
all registers except the current global registers, and that it provides
sufficient stack space for any non-register variables used by any of the
TCODEs (presumably I can use alloca, but if the space is not used, the
optimizer may remove the call?).
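For comparison, here is a minimal portable sketch of the TCODE layout, assuming nothing beyond standard C. It is not the scheme above: each routine returns to a central loop instead of jumping straight to the next routine, and tcip is an ordinary global rather than a register, so it keeps the data layout but not the speed. The names Cell, tc_add, tc_halt and run are hypothetical, and OOP is a plain long standing in for the real object pointer type.

```c
typedef long OOP;               /* stand-in for the real object pointer */
typedef void (*TCODE)(void);

/* A cell of threaded code holds either a routine or its argument. */
typedef union Cell {
    TCODE op;
    OOP arg;
} Cell;

static Cell *tcip;              /* the tcode ip */
static OOP stack[16];           /* stack[0] is unused: pushes pre-increment */
static OOP *stackPointer = stack;
static int running;

/* On entry tcip points at the literal that follows &pushLit. */
static void pushLit(void) { *++stackPointer = (tcip++)->arg; }

/* Pops two values and pushes their sum. */
static void tc_add(void)
{
    OOP b = *stackPointer--;
    *stackPointer += b;
}

static void tc_halt(void) { running = 0; }

/* Runs threaded code until tc_halt and returns the top of stack. */
static OOP run(Cell *code)
{
    stackPointer = stack;
    tcip = code;
    running = 1;
    while (running)
        (tcip++)->op();         /* tcip is advanced before the routine runs */
    return *stackPointer;
}
```

A six-cell program holding pushLit, 2, pushLit, 3, tc_add, tc_halt leaves 5 on top of the stack.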
In summary, I would like
1. full switches to have no bounds check
2. some way of specifying that a function should be generated
without a prolog or an epilog
3. some way of specifying that a function should
a) save ALL registers visible in C
and
b) save all bar the currently defined global registers
and
c) allocate space in a stack frame
4. some way of finding out the size of machine instructions, perhaps
sizeof(*(void (*)()))
OK, now for the bug report!
Look for the following assembler:
#APP
HERE:
#NO_APP
movel a3@(4),a0
movel a0@(4),a0
movel a2@(4),d0
asll #1,d0
asrl #1,d0
addql #4,d0
addl a0@(4),d0
subql #1,d0
# <-- where has the movl d0,a4 gone?
#APP
ENDHERE
#NO_APP
gcc -v -O -S -fomit-frame-pointer -m68881 recache.c
gcc version 1.33
/usr/local/lib/gcc-cpp -v -undef -D__GNUC__ -Dmc68000 -Dsun -Dunix
-D__mc68000__ -D__sun__ -D__unix__ -D__OPTIMIZE__ -D__HAVE_68881__ -Dmc68020
recache.c /tmp/cca03079.cpp
GNU CPP version 1.33
/usr/local/lib/gcc-cc1 /tmp/cca03079.cpp -quiet -dumpbase recache.c -m68881
-fomit-frame-pointer -O -version -o recache.s
GNU C version 1.33 (68k, MIT syntax) compiled by GNU C version 1.33.
recache.c:
typedef unsigned long LONG;
typedef unsigned short WORD;
typedef unsigned char BYTE;
typedef enum { f = 0, t = 1 } BOOLEAN;
typedef int BOOL;
register char *GlobRegDummy1 asm("a5");
register char *GlobRegDummy2 asm("a4");
register char *GlobRegDummy3 asm("a3");
typedef struct _OTE *OOP;
typedef long SIOP;
typedef struct { unsigned z : 1;
    unsigned d : 1;
    unsigned u : 1;
    unsigned m : 1;
    unsigned p : 1;
    unsigned o : 3;
    unsigned size : 24;
} HEADER;
typedef struct { HEADER h;
    union { BYTE b[(sizeof(LONG)/sizeof(BYTE)) ];
        WORD w[(sizeof(LONG)/sizeof(WORD)) ];
        OOP o[1];
    } b;
} OBJBODY;
typedef struct _OTE {
    unsigned rc : 8;
    unsigned class : 24;
    OBJBODY *a;
} OTE;
extern void _reallyAddToZCT(OOP );
register OOP *stackPointer asm("a5");
register BYTE *instrPointer asm("a4");
typedef struct _f {
    OOP *tp;
    OOP *mp;
    OOP *sp;
    BYTE *ip;
    OOP thisContext;
    OOP method;
    OOP sender;
    OOP closure;
    struct {
        OOP intlip;
        OOP home;
        OOP nest;
        OOP *basesp;
    } block;
    struct _f *nextFrame;
    struct _f *prevFrame;
    OOP *stackCheck;
} FRAME;
register FRAME *frame asm("a3");
void recache(context)
OOP context;
{
    register OOP *ptr;
    register long length;
    register long stackSize;
    ptr = (&(((context)->a)->b.o[0]));
    frame->thisContext = context;
    stackPointer = frame->tp;
    if (((SIOP) (((ptr)[4])) < 0)) {
        frame->block.nest = ptr[3];
        frame->block.intlip = ptr[4];
        frame->block.home = ptr[5];
        if ((((OOP) ((frame->block.home)->class)) == (((OOP) 0x70000) +
                1))) {
            register FRAME *homeFrame = ((FRAME *)
                (((frame->block.home)->a)->b.o[0]));
            frame->method = homeFrame->method;
            frame->mp = homeFrame->mp;
            frame->tp = homeFrame->tp;
        } else {
            frame->method = (((ptr[5])->a)->b.o[3]);
            frame->mp = (&(((frame->method)->a)->b.o[0]));
            frame->tp = &(((ptr[5])->a)->b.o[5]);
        }
        frame->block.basesp = stackPointer;
        if (1) {
            if (!((*((BYTE *) (ptr[5]))) -= 2))
                if (1) {
                    if (!((((char *) ((ptr[5])->a))[0]) <
                            0)) {
                        ((ptr[5])->a->h.z) = (BOOL) t;
                        _reallyAddToZCT(ptr[5]);
                    }
                } else;
        } else;
        if (1) {
            if (!((*((BYTE *) (ptr[3]))) -= 2))
                if (1) {
                    if (!((((char *) ((ptr[3])->a))[0]) <
                            0)) {
                        ((ptr[3])->a->h.z) = (BOOL) t;
                        _reallyAddToZCT(ptr[3]);
                    }
                } else;
        } else;
    } else {
        frame->method = ptr[3];
        frame->mp = (&(((frame->method)->a)->b.o[0]));
        frame->tp = stackPointer;
        if (1) {
            if (!((*((BYTE *) (ptr[3]))) -= 2))
                if (1) {
                    if (!((((char *) ((ptr[3])->a))[0]) <
                            0)) {
                        ((ptr[3])->a->h.z) = (BOOL) t;
                        _reallyAddToZCT(ptr[3]);
                    }
                } else;
        } else;
    }
    frame->sender = ptr[0];
    asm("HERE:");
    instrPointer = ((BYTE *) (&(((((frame->mp)[1]))->a)->b.o[0]))) +
        (((long) (ptr[1]) << 1) >> 1) - 1;
    asm("ENDHERE");
    length = ((((OBJBODY *) ((HEADER *) (ptr) - 1))->h.size) - 1) - 6;
    (((OBJBODY *) ((HEADER *) (ptr) - 1))->h.p) = (BOOL) f;
    (((OBJBODY *) ((HEADER *) (ptr) - 1))->h.o) = 2;
    ((context)->class = (unsigned long) ((((OOP) 0x70000) + 1)));
    *ptr = (OOP) frame;
    length -= (stackSize = (((long) (ptr[2]) << 1) >> 1));
    if (frame->block.home)
        ptr += 6;
    else {
        stackSize += 1;
        ptr += 5;
    }
    while (--stackSize >= 0) {
        if (1) {
            if (!(((SIOP) (*ptr) <= 0) || ((*((BYTE *) (*ptr))) -=
                    2)))
                if (1) {
                    if (!((((char *) ((*ptr)->a))[0]) < 0))
                    {
                        ((*ptr)->a->h.z) = (BOOL) t;
                        _reallyAddToZCT(*ptr);
                    }
                } else;
        } else;
        *stackPointer++ = *ptr++;
    }
    while (--length >= 0) {
        if ((SIOP) * ptr > (SIOP) (((OOP) 0x70000) + 28))
            if (1) {
                if (!((*((BYTE *) (*ptr))) -= 2))
                    if (1) {
                        if (!((((char *)
                                ((*ptr)->a))[0]) < 0)) {
                            ((*ptr)->a->h.z) =
                                (BOOL) t;
                            _reallyAddToZCT(*ptr);
                        }
                    } else;
            } else;
        ptr++;
    }
    stackPointer--;
}
Eliot Miranda email: eliot@cs.qmc.ac.uk
Dept of Computer Science Tel: 01 975 5220
Queen Mary College International: +44 1 975 5220
Mile End Road
LONDON E1 4NS