[gnu.gcc.bug] Bug in GCC 1.33 + Smalltalk

eliot%CS.QMC.AC.UK@MITVMA.MIT.EDU (Eliot Miranda) (02/20/89)

There is a bug with global registers in gcc Version 1.33 running on a SUN 3.
The bug is the omission of an assignment to a global register variable.
See the lines
        asm("HERE:");
to
        asm("ENDHERE:");
in the C source and the assembler.  instrPointer is a global register variable:

        register unsigned char *instrPointer asm("a4");

The assignment is the sole use of instrPointer within the function.
The bug report follows some suggestions for Gnu & Gnu CC.

I am using GCC to compile my Smalltalk-80 interpreter.  Apart from this bug,
the compiler looks good.  On the largest function (the interpreter) GCC
produces 10% less code than SUN's pcc.  May I suggest an optimization to C
switch statements?
If the switch reads:

        register unsigned char *instrPointer asm("a4");

        ...

        switch(*instrPointer++) {
        case 0: ...

        ...

        case 255: ...
        }

and all 256 case indices are used, then the bounds check generated at the
start of the switch is unnecessary, since an 8 bit unsigned char cannot fall
outside 0..255.  (Of course I realise it's easy to tweak the assembler output,
but GCC seems well on the way to being an excellent interpreter compiler; this
would be yet more icing on the cake!)
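
To illustrate why the check is redundant, here is a minimal sketch in portable
C.  It uses a table of function pointers to stand in for the jump table a
compiler would build for the switch; it is not taken from BrouHaHa.  The names
handlers, op_nop, op_inc, op_dbl, init_handlers and run_bytecodes are
hypothetical.  Because the index is an unsigned char and the table has
UCHAR_MAX + 1 entries, every possible index is valid and no range test is
written.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical handler type: each bytecode updates an accumulator. */
typedef int (*handler)(int acc);

int op_nop(int acc) { return acc; }
int op_inc(int acc) { return acc + 1; }
int op_dbl(int acc) { return acc * 2; }

/* One slot for every value an unsigned char can hold. */
handler handlers[UCHAR_MAX + 1];

void init_handlers(void)
{
    int i;

    for (i = 0; i <= UCHAR_MAX; i++)
        handlers[i] = op_nop;           /* every index is populated */
    handlers[1] = op_inc;
    handlers[2] = op_dbl;
}

/* Run n bytecodes.  The index is an unsigned char, so it cannot fall
   outside the table and no bounds check is needed. */
int run_bytecodes(const unsigned char *ip, size_t n)
{
    int acc = 0;

    assert(ip != NULL);
    while (n-- > 0)
        acc = handlers[*ip++](acc);
    return acc;
}
```

After init_handlers(), running the five bytecodes 1, 1, 2, 255, 1 from a zero
accumulator gives 5: two increments, a doubling, a no-op and a final increment.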

My interpreter (BrouHaHa, a portable Smalltalk interpreter, OOPSLA '87)
currently runs at 50% of the performance of ParcPlace (ex Xerox) Smalltalk-80 on
a SUN 3/60.  This places it at around the same speed as the Tektronix
implementation for 68000s written in assembler.  Apart from global registers
(obtained with sed scripts on the assembler using pcc) it also does bitblt
(rasterop) using on-the-fly compilation, and supports 1 bit deep and 8 bit deep
bitmaps.

At the language level, non-reentrant block contexts have been replaced by
fully reentrant closures.  The system is a 32 bit implementation and can
support up to 2 M objects which occupy a 32 bit address space.

The implementation is mature, having been used intensively on a research
project within QMC for the past two years.  The interpreter itself is about
4 years old.

The interpreter currently runs on SUN 3s, IBM PS/2 Model 80 (under Xenix, sans
global registers because Microsoft C is awful!), on the Whitechapel MG1 (ns32016
based machine) & Acorn Archimedes ARM.  Apart from bitblt, the interpreter is
highly portable. I would be happy to put the interpreter in the public domain
under the aegis of GNU.
However, I use a virtual image derived from Xerox Smalltalk-80 Version 2.0 or
ParcPlace Smalltalk-80 Version 2.3, neither of which is in the public domain :-)
( :-( ). If you can produce a public domain image, I can produce the
interpreter.


On the subject of further enhancements to Gnu CC, my on-the-fly compiled bitblt
works as follows:

On each call to bitblt the arguments are analyzed and the appropriate fragments
of machine code are concatenated to form a piece of code that performs the
actual data manipulation.  Since the code contains no tests (apart from the
inner and outer loop counts), and since the analysis is performed once, the
resulting implementation is very fast.
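
The idea of analysing the arguments once and then running a test-free inner
loop can be sketched in portable C.  Note that this is not the machine code
concatenation described above: it swaps in a pre-selected function pointer for
the concatenated fragments, so each word still costs an indirect call.  The
names WORD32, blt_rule, select_rule and blt_row, and the rule numbers 0..3,
are assumptions made for the illustration.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned long WORD32;   /* at least 32 bits; hypothetical name */

/* A combination rule merges one source word into one destination word. */
typedef WORD32 (*blt_rule)(WORD32 src, WORD32 dst);

WORD32 rule_copy(WORD32 s, WORD32 d) { (void)d; return s; }
WORD32 rule_and (WORD32 s, WORD32 d) { return s & d; }
WORD32 rule_or  (WORD32 s, WORD32 d) { return s | d; }
WORD32 rule_xor (WORD32 s, WORD32 d) { return s ^ d; }

/* The analysis step: done once per call, outside the loop.
   The rule numbers are invented for this sketch. */
blt_rule select_rule(int rule)
{
    switch (rule) {
    case 0:  return rule_copy;
    case 1:  return rule_and;
    case 2:  return rule_or;
    default: return rule_xor;
    }
}

/* The data loop: no tests apart from the loop count. */
void blt_row(WORD32 *dst, const WORD32 *src, size_t nwords, int rule)
{
    blt_rule combine = select_rule(rule);
    size_t i;

    assert(dst != NULL && src != NULL);
    for (i = 0; i < nwords; i++)
        dst[i] = combine(src[i], dst[i]);
}
```

The rule is chosen before the loop starts, so the loop body contains no
decision about which operation to apply.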

I generate the fragments of machine code ('funclets') thus:
Each funclet is written as a C function that declares all variables used in all
funclets, and in the same order as all other funclets.  Subsequently,
the compiler-generated, but unoptimized, assembler is edited to remove all
prologs & epilogs from the funclets.  The machine code is concatenated
by a C function that, for each funclet, takes the address of the funclet &
the address of the following funclet, casts them appropriately, and copies
the machine code either into a dummy function or onto the stack, where it can
be executed.  The resulting code is executed using the stack frame of the
compiling function.

The resulting code is sort of portable & currently runs on the 68020 & the
80386.  Although I have major problems on the 386 through using the Microsoft C
compiler!

Given that Gnu CC already supports global register variables & pointer
arithmetic on pointers to functions, it would be nice to specify that for
specific functions the prolog & epilog should not be generated.  This would
improve portability immensely!  It would also help if sizeof(*(void(*)())) was
1 on the 386, 2 on the 68020, 4 on the ARM etc....  Also, at some point, all
registers used in the compilation process should be saved & restored, so one
must be able to specify that a particular function saves all appropriate
registers.


I am working on a faster interpreter that uses dynamic compilation to
threaded code, the rationale being that direct threaded code is faster than
bytecode decoding, that the same bytecode can be compiled to different
TCODEs in different contexts, and most importantly, that message sends can
be 'linked' in the threaded code and can require significantly less,
and in some cases no, checking when compared to traditional method lookup
cache techniques.

In some ways the threaded code scheme is similar to the bitblt hack described
above.  A global register

typedef  void   (*TCODE)();             /* a TCODE is a pointer to a function */
register TCODE  *tcip asm("a4");        /* the tcode ip points at TCODEs */

points at the next pointer to a function (I apologize now for teaching you to
suck eggs!).  On the 68020 each threaded code function is written thus:

...

register OOP    *stackPointer asm("a5");        /* Smalltalk stack pointer */

...


void    pushLit()

{       *++stackPointer = (OOP)(*tcip++);
        asm("movl       a4@+,a0");
        asm("jmp        a0@");
}

Each threaded code operation is coded as a TCODE followed by an argument:

                |---------------|
                | &pushLit      |
                |---------------|
        tcip->  | pointer to obj|
                |---------------|
            ...

On entry to a TCODE, tcip points at the following argument.
Each threaded code function has its prolog & epilog removed.
All TCODE routines run in the same stack frame, that of the function that
called the first TCODE:

        ...

        (*(*tcip++))();

        ...

With the ability to omit prolog & epilog this becomes much easier to do
portably.  But one needs to ensure that the function that called the first
TCODE saves all registers except the current global registers, and that it
provides sufficient stack space for any non-register variables used by any of
the TCODEs (presumably I can use alloca, but if the space is not used, the
optimizer may remove the call?).
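
The TCODE-plus-argument layout described above can be sketched in portable C.
This is a minimal sketch under stated assumptions, not the implementation
itself: instead of jumping directly from one routine to the next with prolog
and epilog removed, each routine returns to a small dispatch loop (call
threading), and ordinary globals stand in for the global register variables.
The CELL union, the add and halt routines, run_tcode, and the use of long
values in place of OOPs are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>

typedef void (*TCODE)(void);    /* a TCODE is a pointer to a function */

/* One cell of threaded code: either a routine or its inline argument. */
typedef union cell {
    TCODE op;
    long  arg;
} CELL;

CELL *tcip;                     /* the tcode ip; a plain global here */
long  stack[16];
long *stackPointer = stack;     /* points at the top element */
int   running;

/* On entry tcip points at the argument that follows the TCODE. */
void pushLit(void) { *++stackPointer = (tcip++)->arg; }

/* Pop the top element and add it to the one below. */
void add(void)
{
    long rhs = *stackPointer--;

    *stackPointer += rhs;
}

void halt(void) { running = 0; }

/* Dispatch loop: fetch the next TCODE, step past it, call it. */
long run_tcode(CELL *code)
{
    assert(code != NULL);
    tcip = code;
    stackPointer = stack;
    running = 1;
    while (running)
        (*(tcip++)->op)();
    return *stackPointer;
}
```

A program laid out as pushLit, 40, pushLit, 2, add, halt leaves 42 on top of
the stack.  The cost that the prolog/epilog request is meant to remove shows
up here as the call and return around every routine.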


In summary, I would like

        1. full switches to have no bounds check

        2. some way of specifying that a function should be generated
           without a prolog or an epilog

        3. some way of specifying that a function should
                a) save ALL registers visible in C
           and
                b) save all bar the currently defined global registers
           and
                c) allocate space in a stack frame

        4. some way of finding out the size of machine instructions, perhaps
                sizeof(*(void (*)()))


OK, now for the bug report!
Look for the following assembler:

#APP
        HERE:
#NO_APP
        movel a3@(4),a0
        movel a0@(4),a0
        movel a2@(4),d0
        asll #1,d0
        asrl #1,d0
        addql #4,d0
        addl a0@(4),d0
        subql #1,d0
                        # <-- where has the movl d0,a4 gone?
#APP
        ENDHERE
#NO_APP

gcc -v -O -S -fomit-frame-pointer -m68881 recache.c
gcc version 1.33
 /usr/local/lib/gcc-cpp -v -undef -D__GNUC__ -Dmc68000 -Dsun -Dunix
 -D__mc68000__ -D__sun__ -D__unix__ -D__OPTIMIZE__ -D__HAVE_68881__ -Dmc68020
 recache.c /tmp/cca03079.cpp
GNU CPP version 1.33
 /usr/local/lib/gcc-cc1 /tmp/cca03079.cpp -quiet -dumpbase recache.c -m68881
 -fomit-frame-pointer -O -version -o recache.s
GNU C version 1.33 (68k, MIT syntax) compiled by GNU C version 1.33.

recache.c:

typedef unsigned long   LONG;
typedef unsigned short  WORD;
typedef unsigned char   BYTE;

typedef enum    { f = 0, t = 1 } BOOLEAN;

typedef int     BOOL;

register char   *GlobRegDummy1 asm("a5");
register char   *GlobRegDummy2 asm("a4");
register char   *GlobRegDummy3 asm("a3");

typedef struct  _OTE   *OOP;
typedef          long   SIOP;

typedef struct { unsigned z     :  1;
                 unsigned d     :  1;
                 unsigned u     :  1;
                 unsigned m     :  1;
                 unsigned p     :  1;
                 unsigned o     :  3;
                 unsigned size  : 24;
               } HEADER;

typedef struct { HEADER   h;
                 union  { BYTE   b[(sizeof(LONG)/sizeof(BYTE)) ];
                          WORD   w[(sizeof(LONG)/sizeof(WORD)) ];
                          OOP    o[1];
                 } b;
        } OBJBODY;

typedef struct _OTE {
                 unsigned       rc    :  8;
                 unsigned       class : 24;
                 OBJBODY       *a;
        } OTE;



extern  void    _reallyAddToZCT(OOP );

register OOP    *stackPointer asm("a5");
register BYTE   *instrPointer asm("a4");

typedef struct _f {
                OOP             *tp;
                OOP             *mp;
                OOP             *sp;
                BYTE            *ip;
                OOP             thisContext;
                OOP             method;
                OOP             sender;
                OOP             closure;
                struct {

                        OOP     intlip;
                        OOP     home;
                        OOP     nest;
                        OOP     *basesp;
                } block;
                struct  _f      *nextFrame;
                struct  _f      *prevFrame;
                OOP             *stackCheck;
        } FRAME;

register FRAME  *frame asm("a3");


void    recache(context)
        OOP             context;
{
        register OOP   *ptr;
        register long   length;
        register long   stackSize;

        ptr = (&(((context)->a)->b.o[0]));
        frame->thisContext = context;

        stackPointer = frame->tp;
        if (((SIOP) (((ptr)[4])) < 0)) {
                frame->block.nest = ptr[3];
                frame->block.intlip = ptr[4];
                frame->block.home = ptr[5];
                if ((((OOP) ((frame->block.home)->class)) == (((OOP) 0x70000) + 1))) {
                        register FRAME *homeFrame = ((FRAME *) (((frame->block.home)->a)->b.o[0]));

                        frame->method = homeFrame->method;
                        frame->mp = homeFrame->mp;
                        frame->tp = homeFrame->tp;
                }
                else {
                        frame->method = (((ptr[5])->a)->b.o[3]);
                        frame->mp = (&(((frame->method)->a)->b.o[0]));
                        frame->tp = &(((ptr[5])->a)->b.o[5]);
                }
                frame->block.basesp = stackPointer;
                if (1) {
                        if (!((*((BYTE *) (ptr[5]))) -= 2))
                                if (1) {
                                        if (!((((char *) ((ptr[5])->a))[0]) < 0)) {
                                                ((ptr[5])->a->h.z) = (BOOL) t;
                                                _reallyAddToZCT(ptr[5]);
                                        }
                                } else;
                } else;
                if (1) {
                        if (!((*((BYTE *) (ptr[3]))) -= 2))
                                if (1) {
                                        if (!((((char *) ((ptr[3])->a))[0]) < 0)) {
                                                ((ptr[3])->a->h.z) = (BOOL) t;
                                                _reallyAddToZCT(ptr[3]);
                                        }
                                } else;
                } else;
        }
        else {
                frame->method = ptr[3];
                frame->mp = (&(((frame->method)->a)->b.o[0]));
                frame->tp = stackPointer;
                if (1) {
                        if (!((*((BYTE *) (ptr[3]))) -= 2))
                                if (1) {
                                        if (!((((char *) ((ptr[3])->a))[0]) < 0)) {
                                                ((ptr[3])->a->h.z) = (BOOL) t;
                                                _reallyAddToZCT(ptr[3]);
                                        }
                                } else;
                } else;
        }
        frame->sender = ptr[0];
        asm("HERE:");
        instrPointer = ((BYTE *) (&(((((frame->mp)[1]))->a)->b.o[0]))) +
                (((long) (ptr[1]) << 1) >> 1) - 1;
        asm("ENDHERE");

        length = ((((OBJBODY *) ((HEADER *) (ptr) - 1))->h.size) - 1) - 6;

        (((OBJBODY *) ((HEADER *) (ptr) - 1))->h.p) = (BOOL) f;
        (((OBJBODY *) ((HEADER *) (ptr) - 1))->h.o) = 2;
        ((context)->class = (unsigned long) ((((OOP) 0x70000) + 1)));
        *ptr = (OOP) frame;


        length -= (stackSize = (((long) (ptr[2]) << 1) >> 1));
        if (frame->block.home)
                ptr += 6;
        else {
                stackSize += 1;
                ptr += 5;
        }

        while (--stackSize >= 0) {
                if (1) {
                        if (!(((SIOP) (*ptr) <= 0) || ((*((BYTE *) (*ptr))) -= 2)))
                                if (1) {
                                        if (!((((char *) ((*ptr)->a))[0]) < 0)) {
                                                ((*ptr)->a->h.z) = (BOOL) t;
                                                _reallyAddToZCT(*ptr);
                                        }
                                } else;
                } else;
                *stackPointer++ = *ptr++;
        }

        while (--length >= 0) {
                if ((SIOP) * ptr > (SIOP) (((OOP) 0x70000) + 28))
                        if (1) {
                                if (!((*((BYTE *) (*ptr))) -= 2))
                                        if (1) {
                                                if (!((((char *) ((*ptr)->a))[0]) < 0)) {
                                                        ((*ptr)->a->h.z) = (BOOL) t;
                                                        _reallyAddToZCT(*ptr);
                                                }
                                        } else;
                        } else;
                ptr++;
        }

        stackPointer--;
}

Eliot Miranda                email:        eliot@cs.qmc.ac.uk
Dept of Computer Science        Tel:        01 975 5220
Queen Mary College            International:    +44 1 975 5220
Mile End Road
LONDON E1 4NS