[comp.sys.transputer] Transputer C compiler efficiency - results

jeremy@cs.adelaide.edu.au (Jeremy Webber) (12/20/90)

As promised, here is my summary on C compiler performance for the Transputer.
Many thanks to those who replied to my original article, and particular thanks
to those who supplied code.  The test program is reproduced at the end of the
article, with comments indicating how each compiler went.

These are my comparison criteria.  Not all compilers were compared on all
criteria.  There may be other criteria which would be relevant to you.  This is
not meant to be a definitive report.  I do not recommend one compiler over
another.

I was interested in:
  Compiled code efficiency
  Support for Transputer facilities such as message passing
  Memory management - on-chip RAM, etc
  User-friendliness
  Support for such things as ROM-based network loading

I have tried 3L Parallel C and YARC C.  At time of writing I have obtained 3L's
Tbug package for evaluation, but have not tried it yet.  I have also obtained
Inmos C for evaluation, but have not had time to look at it yet (my results are
based on a report sent to me).  I'll post followups early in 1991 to those who
request them.

My thanks to those compiler suppliers who have provided me with evaluation
copies of their compilers.

General
-------

All compilers I tried were tested in PC development environment.  I am
interested in compiled code efficiency because our application is extremely
time and space critical.  Plus, I wish to port another system from Occam to C,
and will look pretty stupid if the result runs half as fast.

The compiled code quality of all compilers was very poor.  Most of the
compilers didn't even attempt any semantic analysis, or global or local
optimization.  Meiko C did, but with mixed results.  Coming from the Unix
world, where all compilers I have looked at produce at least intermediate
quality optimizations, this was a real disappointment.  See the results for
MIPS C in the code below.  All the Transputer compiler suppliers mentioned in
this article need to pull their socks up in this regard.  These compilers are
not cheap, especially if you use a Unix host!  No wonder people occasionally
post articles along the lines of "My transputer network doesn't go any faster
than my Brand X Unix box - how come?"

Another general observation is that only Meiko C appears to have provision
to compile in run-time stack bounds checking.  I would have thought in a
real-memory computer like a Transputer that this would have been essential for
debugging programs reliably.  I testing I had a number of problems with
programs which just hung, due to stack overflow.  Note that this checking
should be a compiler option so you can leave it out if speed is critical.

Another useful facility would be a compiler tool to do flow analysis of the
program to try to predict maximum stack size.  OK, this can't be done reliably
for recursive programs, but even a first guess would be useful.

Here are some notes on individual compilers, in note form.

3L Parallel C
-------------

Compiled code very poor.
Message passing, processes via procedure calls (inefficient)
Uses Inmos-style network loader which should be ROMable (haven't tried it)
Provides good configurer.
No tools for analyzing code or data sizes - it seems to be "by guess and by
gosh". Can't even find out module sizes after linking.
No facility to detect stack and heap overflows.
[I believe the TBug package provides debugging of stack and heap overflow]
Has configuration-time (coarse) support of using on-chip RAM.
Compiler user interface good, but doesn't provide a Make facility.
Doesn't compile for T2.
Documentation easy to read, but not complete at the low level.

YARC C
------

Performed some _very_ basic optimisations.  Seems to have a better idea of
individual transputer types for code generation.
Tptr facilities via pseudo function calls - compiler "recognises" them and
generates in-line code - this is documented well and can be disabled.
Linker generated map of code and static data.  No facilities for sizing stack.
No stack overflow detection.  Does have heap overflow detection.  Does support
explicit use of on-chip RAM in code.
Supports T212, M212, T414, T425, T800 and T805, and similar, Transputers.
User interface to compiler tedious - necessary to separately invoke all
compilation phases.  Fortunately, the Make facility makes it possible to
ingore most of the fiddly details.
Network configuration is very primitive, and as supplied could not boot a
network independently of the PC host.
Documentation is very detailed, making customization and modification possible.
Source is available for the compiler system (Hip Hip Hooray!)
I received one strong commendation of this product from a user.

Inmos C
-------

Marginally better than 3L C at optimizing, but char arithmetic is a real lose.
Other information not available at time of writing.

Meiko C
-------

Attempts some intermediate level optimizations, with varying success.  Misses
some obvious optimizations.
Supports run-time stack checking.
No other information available at time of writing.


Test Program with comments
--------------------------

/* Compiler Optimization Checkout Code
 *
 * These short sequences don't pretend to fully check out compiler
 * optimizer capability, but do determine whether a compiler performs
 * certain basic optimizations.
 * If a compiler fails on these, it doesn't stand a chance on real code...
 *
 * Compilers tested:
 *   MIPS C as supplied by Silicon Graphics with Irix 3.2 (as a control)
 *     Examined assembler output from compiler, which is after compiler
 *     optimisations, but before assembler peephole optimisations
 *     The MIPS compiler is generally regarded as very good at optimising.
 *     The MIPS processor is a conventional register achitecture, so expression
 *     evaluation is not directly comparable to the way transputers do it.
 *     Expression elimination and simplification, however, is.
 *   3L Parallel C Version 2.1
 *     Examined output from the decode utility, which I think is after all
 *     optimisations.
 *     Additional comment: All Transputer channel operations are performed
 *     by procedure calls.
 *   Inmos C, D7214 (I think).
 *     The assembly code I obtained had debugging statements in it.  It is
 *     possible that the compiler may generate better code if there is an
 *     optimize switch (like -O on Unix compilers).
 *   YARC C, TSC V89.1, from Logical Systems Inc.
 *     The generated assembly language from the compiler.  Unfortunately this
 *     assembly language did not contain references to source code lines,
 *     making it hard to analyze in places.
 *   Meiko C, version 6.06, compiled with "mcc -O -S"
 *     The generated assembly language from the compiler.  Also didn't contain
 *     source code lines.  This compiler included checks for stack overflow
 *     with every function.  Can this be switched off?
 */

#include <math.h>

int deterministic_conditionals(void)
/* These frequently occur in code which has various compile-time options,
 * though because of conditional compilation they are not as common as in,
 * say, Pascal.
 *
 * MIPS C optimises refs to 'a' out of existance, and the while to a simple
 *        loop.  It also warns about the infinite loop.
 * 3L C doesn't simpify the while loop conditional or anything else.
 * Inmos C simplifies the while loop conditional, but not refs to 'a'
 * YARC C simplifies the while loop conditional, but not refs to 'a' 
 * Meiko C evaluates the loop conditional, like 3L.
 */
{
  int a;

  while (1) {
   a = a + 1;
  }
}

int common_sub_expression(void)
/* Also loop invariant code
 * Elimination of these can result in large performance improvements.  In a
 * language such as C large sub-expressions and loop invariants can be
 * created without the programmer being aware of, or able to do anything about
 * it, when referencing data structures.  Elimination of these is somthing
 * every compiler _should_ do.  Forcing the programmer to use explicit temp
 * vars reduces code readability.
 *
 * MIPS C allocates space for all variables, but doesn't refer to the unused
 *        ones.  It moves the loop invariant code, eliminates the common
 *        subexpression, and treats the "a+2" expression as a constant expr.
 *        It also optimises order of execution of the while conditional.
 * 3L C doesn't optimize anything in this
 * Inmos C doesn't optimize anything in this
 * YARC C also doesn't optimize this
 * Meiko C eliminates the common subexpression, but doesn't move the loop
 *        invariant code.
 */
{
  int a, b, c;
  char s[100];

  a = 3;
  b = 0;
  while (b++ < 100) {
    s[a+2] = s[a+2] + 10;
  }
}

int byte_arithmetic(void)
/* Byte arithemetic is expensive on a Transputer, so where possible a compiler
 * should use int arithemetic and references.
 *
 * MIPS C compiles this to 1 load immediate instruction (the first c=a+b)
 * 3L C represents the char variables in integers to economise on byte refs
 * Inmos C does likewise.  It inexplicably zeros the variables at the start,
 *      even though they are immediately reassigned.  This costs 6 cycles.
 * YARC C doesn't optimize any of this at all (uses byte refs to memory)
 * Meiko C represents char variables as integers, and uses temp. vars to
 *      reduce byte references.
 */
{
  char a, b, c;
  int i;

  a = 10;
  b = 20;

  c = a + b;
  i = a + 2;
  c = a + b;
}

int byte_arithemic2(void)
/* Also fixed struct addressing.
 * This example is identical to the last, but uses a struct to stop the
 * compiler from representing the chars as ints (though of course a compiler
 * is still at liberty to do so if it wishes - none of them do though).
 * I wanted to see if the compilers created temporary int variables to store
 * the terms which were referred to more than once.
 *
 * MIPS C compiles this exactly as above.
 * 3L C does optimise the struct element refs.  It doesn't use temp. vars
 *      to overcome multiple refs to byte elements.
 * Inmos C does likewise.
 * YARC C does likewise.
 * Meiko C attempts to use a temp to overcome multiple refs to "x.a", but ends
 *      up producing code which is just as inefficient as if it hadn't!
 */
{
  struct {
    char a, b, c;
    int i;
  } x;

  x.a = 20;
  x.b = 10;
  x.c = x.a + x.b;
  x.i = x.a + 2;
  x.c = x.a + x.b;
}

int unreachable_code(void)
/* ... and unused variables
 * This really is the opposite case of deterministic_conditionals above.
 * All Transputer compilers allocated space for all variables, even though
 * one of them isn't referred to.
 *
 * MIPS C optimises this out of existence (really - just to a return instr.)
 * 3L C doesn't simplify any of this
 * Inmos C doesn't simplify any of this
 * YARC C eliminates the unreachable code
 * Meiko C doesn't simplify any of this
 */
{
  int a, b, c;

  c = 200;
  while (0) {
    b = c - 10;
  }
}

int complex_expression(void)
/* Just to see how well things get re-ordered for the Transputer stack.
 * The Transputer compilers, of necessity, generate lots of references to
 * memory, whereas the MIPS compiler generates none.  I suspect that Transputers
 * are highly memory bound in their performance.
 *
 * MIPS does entirely register arithmetic.  It eliminates the last statement.
 * 3L C rearranges expressions for the tptr stack well for the first 2 exprs.
 *      It creates an unnecessary temp. for the third expression.
 * Inmos C rearranges all expressions well
 * YARC C rearranges the expressions correctly, and does a small peephole
 *      optimization between the 2nd and 2rd statements.
 * Meiko C eliminates the common subexpression (d-e), but uses an unnecessary
 *      temp. in the third expression.
 */
{
  int a, b, c, d, e;

  a = b + c * (d - e);
  a = (d - e) - (a + a * b);
  a = b * ((c - d * (a + b)) * (a - b * (a + c)) + 2);
}

int constant_expressions(void)
/* A compiler which failed this test would be a real lose!
 * Fortunately none of them did, though none of them recognised the call
 * to "abs" either.
 *
 * MIPS C eliminates the first, does a function call for the second
 * 3L C evaluates the first at compile time, the second at run time, using
 *      a function call
 * Inmos C does likewise.
 * YARC C also does likewise.
 * Meiko C does the same.
 */
{
  int a, b;

  a = 4 + 7 * 3;
  b = abs(-2);
}

int floating_point_instr(void)
/* Check shedululing of floating point on a T8 - taken from
 * "The Transputer Instruction Set - A Compiler Writer's Guide" p75.
 *
 * MIPS C uses FP and integer regs (comparison to transputer not relevant)
 * 3L C generates "naive" code (as defined in the book) for this.
 * Inmos C also generates "naive" code.
 * Meiko C also generates "naive" code, but it does eliminate some subscript
 *    subexpressions.
 */
{
  double a[20][20], b[20][20];
  double c, d;
  int i, j;

  a[i][j] = b[i][j]*c + d;
}
--
--
Jeremy Webber			   ACSnet: jeremy@chook.ua.oz
Digital Arts Film and Television,  Internet: jeremy@chook.ua.oz.au
3 Milner St, Hindmarsh, SA 5007,   Voicenet: +61 8 346 4534
Australia			   Papernet: +61 8 346 4537 (FAX)

bailey@jacobs.CS.ORST.EDU (Kirk Bailey) (12/21/90)

Dear Net Community, 
 
	I'm posting in response to the recent (excellent), "C" compiler review
by Jeremy Webber.  I wanted to mention a technique which may be used with the
current (V89.1), Logical Systems "C" compiler to implement stack bounds
checking (and the like):
 
Write a small "C" function such as:
 
void
stack_check(int dummy)
	{
	if(((char *) ((&dummy) + 3)) < BOTTOM_OF_STACK)
		{
		printf("Stack Underflow\n");
		exit(1);
		}
	}
 
Where BOTTOM_OF_STACK is the desired stack minimum address.
 
	After compiling this routine it may be linked in with the application
code and calls to it inserted automatically by using the "-ti" TCX command
flag (see TCX manual for detailed information).  The same technique may be used
to implement a function trace-in/trace-out facility:
 
void
trace_in(int ws_size)
	{
	printf("Trace entry of function starting near %p\n",
		*((&ws_size) - 1));
	printf("\tFunction needs %u word(s) of workspace (current WS = %p)\n",
		ws_size,(&ws_size) + 3);
	}
 
void
trace_out(int dummy)
	{
	printf("Trace exit of function ending near %p\n",*((&dummy) - 1));
	}
 
	Note that calls to "trace_in" would again be inserted with the TCX
"-ti" option while calls to "trace_out" would require the "-to" option.
 
	In the "REAL SOON NOW" category: the upcoming version of the Logical
Systems "C" package has a more automatic way to accomplish stack underflow
checking (plus addressing many of the other concerns the review mentioned).
As I'm sure is the case with the other vendors, we welcome feedback and ideas
of what should be changed in the future, whether in the form of a comprehensive
review or just a quick EMAIL note.  Keep the cards and bricks coming...
 
	Hope everyone has a excellent new year!
 
		Cheers,
			Kirk Bailey
			Logical Systems