[comp.sys.transputer] C compiler efficency - final post

jeremy@cs.adelaide.edu.au (Jeremy Webber) (01/15/91)

I had enough requests for this update to a previous article I posted to warrant
posting it.  This will be the last posting of this article.  Hit 'n' (or
whatever) now if you're not interested...

Since I last posted this article I have received information on two extra
compilers, and tried Inmos C.

Many thanks to those who replied to my original article, and particular thanks
to those who supplied code.  The test program is reproduced at the end of the
article, with comments indicating how each compiler went.

These are my comparison criteria.  Not all compilers were compared on all
criteria.  There may be other criteria which would be relevant to you.  This is
not meant to be a definitive report.  I do not recommend one compiler over
another.

I was interested in:
  Compiled code efficiency
  Support for Transputer facilities such as message passing
  Memory management - on-chip RAM, etc
  User-friendliness
  Support for such things as ROM-based network loading

I have tried 3L Parallel C, including Tbug, Inmos C and YARC C.

My thanks to those compiler suppliers who have provided me with evaluation
copies of their compilers.

General
-------

All compilers I tried were tested in PC development environment.  I am
interested in compiled code efficiency because our application is extremely
time and space critical.  Plus, I wish to port another system from Occam to C,
and will look pretty stupid if the result only runs half as fast.

The compiled code quality of all execpt one compiler (PACT) was very poor.
Most of the compilers didn't even attempt any semantic analysis, or global or
local optimization.  Meiko C did, but with mixed results.  Coming from the Unix
world, where all compilers I have looked at produce at least intermediate
quality optimizations, this was a real disappointment.  See the results for
MIPS C in the code below.  Most of the Transputer compiler suppliers mentioned
in this article need to pull their socks up in this regard.  These compilers
are not cheap, especially if you use a Unix host!  No wonder people
occasionally post articles along the lines of "My transputer network doesn't go
any faster than my Brand X Unix box - how come?"

Another general observation is that some C compilers have no provision
to compile in run-time stack bounds checking.  I would have thought in a
real-memory computer like a Transputer that this would have been essential for
debugging programs reliably.  In testing I had a number of problems with
programs which just hung, due to stack overflow.  Note that this checking
should be a compiler option so you can leave it out if speed is critical.

Another useful facility would be a compiler tool to do flow analysis of the
program to try to predict maximum stack size.  OK, this can't be done reliably
for recursive programs, but even a first guess would be useful.

Here are some notes on individual compilers, in note form.

3L Parallel C
-------------

Compiled code very poor.
Message passing, processes via procedure calls (inefficient)
Uses Inmos-style network loader which should be ROMable (haven't tried it)
Provides good configurer.
No tools for analyzing code or data sizes - it seems to be "by guess and by
gosh". Can't even find out module sizes after linking.
No facility to detect stack and heap overflows.
The TBug package provides debugging of stack and heap overflow, as well as
out-of-bounds memory references.  Only debugs on a single Transputer.  No
provision that I could find for post-mortem debugging.  Excellent user
interface (given that it's on a PC :-)
Has configuration-time (coarse) support of using on-chip RAM.
Compiler user interface good, but doesn't provide a Make facility.
Doesn't compile for T2.  Doesn't support extra instructions in the T425, T801
and T805.
Documentation easy to read, but not complete at the low level.

YARC C
------

Performed some _very_ basic optimizations.  Seems to have a better idea of
individual transputer types for code generation.
Tptr facilities via pseudo function calls - compiler "recognises" them and
generates in-line code - this is documented well and can be disabled if
desired. 
Linker generated map of code and static data.  No facilities for sizing stack.
No stack overflow detection, but that can be worked around.
Does have heap overflow detection.  Does support explicit use of on-chip RAM
in code.
Supports all current Transputer types.
User interface to compiler tedious - necessary to separately invoke all
compilation phases.  Fortunately, the Make facility makes it possible to
ingore most of the fiddly details.
Network configuration is very primitive, and as supplied could not boot a
network independently of the PC host.
Documentation is very detailed, making customization and modification possible.
Source is available for the compiler system (Hip Hip Hooray!)
I received one strong commendation of this product from a user.
I did not have a debugger to try, I believe one is available which does
post-mortem debugging.

Inmos C
-------

Marginally better than 3L C at optimizing, but char arithmetic is a real lose.
Message passing and process control via procedure calls - externally looks very
similar to YARC C (I think they both came from a common source), but it doesn't
in-line the code.
Has primitive linker map showing module positions.  Some control over positions
possible.  No ability to obtain symbol values from linker.
Optional stack and heap overflow detection (doesn't work for parallel progs)
Supports all current transputer types, and mixed-transputer compatability.
Doesn't come with a Make facility, but has a utility to help building
makefiles. 
Network configuration was good, and easy to use.
Has tools explicitly designed to help generate ROMable code.
Has a debugger which can both interactively and post-mortem debug
multi-transputer programs.  For interactive debugging, you need a spare
transputer on the network, plus at least 12K spare on each transputer.
Documentation boring but reasonably complete, and in the usual Inmos style.
Refers to a program called "ispy" which isn't supplied with the toolkit :-(

Meiko C
-------

Attempts some intermediate level optimizations, with varying success.  Misses
some obvious optimizations.
Supports run-time stack checking.
No other information available at time of writing.

PACT Parallel C
---------------

Performs optimization on basic block boundaries.  This results in 'intermediate
level' optimization.  Character arithemetic is less efficient than the others
though, because of their choice to use signed char as the default instead of
unsigned char.  There is a compiler option to make unsigned chars the default.
This product is currently "futureware", at beta test stage.  One of the
developers has indicated that they intend to add global and peephole
optimization, and possibly a tool to help size stacks, but none of these things
exist yet.
The beta version of this compiler produced better code than any of the others I
had information on.
I have not been able to try this product.

Helios C
--------

Runs under the operating system Helios, a multi-tasking distributed Transputer
OS.
Parallelism and message passing via Helios system calls, or via standard C I/O.
On-chip ram available via a variant of Malloc.  Cannot allocate code to fast
RAM.  Memory management done by OS.  Optional stack checking.
Has a "comfortable source code debugger"
The Helios kernel and system tasks have an overhead of about 200K per
processor.  This is worth it for the convenience if you can afford the memory.
Helios is not for embedded applications, but it can host the Inmos tools.
No support for T2.
"unix-like" user interface.  Reasonable docco, including on-line manual from
version 1.2.

Test Program with comments
--------------------------

/* Compiler Optimization Checkout Code
 *
 * These short sequences don't pretend to fully check out compiler
 * optimiser capability, but do determine whether a compiler performs
 * certain basic optimisations.
 * If a compiler fails on these, it doesn't stand a chance on real code...
 *
 * Compilers tested:
 *   MIPS C as supplied by Silicon Graphics with Irix 3.2 (as a control)
 *     Examined assembler output from compiler, which is after compiler
 *     optimisations, but before assembler peephole optimisations
 *     The MIPS compiler is generally regarded as very good at optimising.
 *     The MIPS processor is a conventional register achitecture, so expression
 *     evaluation is not directly comparable to the way transputers do it.
 *     Expression elimination and simplification, however, is.
 *   3L Parallel C Version 2.1
 *     Examined output from the decode utility, which I think is after all
 *     optimisations.
 *     Additional comment: All Transputer channel operations are performed
 *     by procedure calls.
 *   Inmos C, D7214 (I think).
 *     The assembly code I obtained had debugging statements in it.  It is
 *     possible that the compiler may generate better code if there is an
 *     optimize switch (like -O on Unix compilers).
 *   YARC C, TSC V89.1, from Logical Systems Inc.
 *     The generated assembly language from the compiler.  Unfortunately this
 *     assembly language did not contain references to source code lines,
 *     making it hard to analyze in places.
 *   Meiko C, version 6.06, compiled with "mcc -O -S"
 *     The generated assembly language from the compiler.  Source code
 *     is optionally included in the assembly language.
 *     This compiler included checks for stack overflow
 *     with every function.  This can be switched off.
 *   PACT Parallel C beta-version
 *     Examined assembler output after all optimizations.
 *     Additional comment: There are language extensions to support channels
 *     and processes similar to OCCAM, and code for all this is generated
 *     inline. It's ANSI conforming.
 *   Helios C (version?)
 *     Requires Helios operating system.  Optional stack checking.
 */

#include <math.h>

int deterministic_conditionals(void)
/* These frequently occur in code which has various compile-time options,
 * though because of conditional compilation they are not as common as in,
 * say, Pascal.
 *
 * MIPS C optimises refs to 'a' out of existance, and the while to a simple
 *        loop.  It also warns about the infinite loop.
 * 3L C doesn't simpify the while loop conditional or anything else.
 * Inmos C simplifies the while loop conditional, but not refs to 'a'
 * YARC C simplifies the while loop conditional, but not refs to 'a' 
 * Meiko C evaluates the loop conditional, like 3L.
 * PACT Parallel C warns about constant in conditional expression,
 *        and generates code for an unconditional loop.
 * Helios C simplifies the while loop conditional.
 */
{
  int a;

  while (1) {
   a = a + 1;
  }
}

int common_sub_expression(void)
/* Also loop invariant code
 * Elimination of these can result in large performance improvements.  In a
 * language such as C large sub-expressions and loop invariants can be
 * created without the programmer being aware of, or able to do anything about
 * it, when referencing data structures.  Elimination of these is somthing
 * every compiler _should_ do.  Forcing the programmer to use explicit temp
 * vars reduces code readability.
 *
 * MIPS C allocates space for all variables, but doesn't refer to the unused
 *        ones.  It moves the loop invariant code, eliminates the common
 *        subexpression, and treats the "a+2" expression as a constant expr.
 *        It also optimises order of execution of the while conditional.
 * 3L C doesn't optimize anything in this
 * Inmos C doesn't optimize anything in this
 * YARC C also doesn't optimize this
 * Meiko C eliminates the common subexpression, but doesn't move the loop
 *        invariant code.
 * PACT Parallel C warns about variable 'c' unused (but does
 *        allocate space for it). It uses a temporary to store &s[a+2] so
 *        it isn't evaluated twice. It doesn't recognize the a+2 constant
 *        expression because it doesn't do constant folding acress basic
 *        block boundaries.  Uses signed char == char.
 * Helios C doesn't eliminate the common subexpression.  It stores s in
 *        non-workspace memory, and stores its address in the workspace which
 *        saves some instructions.
 */
{
  int a, b, c;
  char s[100];

  a = 3;
  b = 0;
  while (b++ < 100) {
    s[a+2] = s[a+2] + 10;
  }
}

int byte_arithmetic(void)
/* Byte arithemetic is expensive on a Transputer, so where possible a compiler
 * should use int arithemetic and references.
 *
 * MIPS C compiles this to 1 load immediate instruction (the first c=a+b)
 * 3L C represents the char variables in integers to economise on byte refs
 * Inmos C does likewise.  It inexplicably zeros the variables at the start,
 *      even though they are immediately reassigned.  This costs 6 cycles.
 * YARC C doesn't optimize any of this at all (uses byte refs to memory)
 * Meiko C represents char variables as integers, and uses temp. vars to
 *      reduce byte references.
 * PACT Parallel C assigns constants 10 to a, 20 to b, 12 to i and 30 to c.
 * Helios C stores the char variables in whole words.  It loads bytes, but
 *      stores words after AND'ing them with 255.
 */
{
  char a, b, c;
  int i;

  a = 10;
  b = 20;

  c = a + b;
  i = a + 2;
  c = a + b;
}

int byte_arithemic2(void)
/* Also fixed struct addressing.
 * This example is identical to the last, but uses a struct to stop the
 * compiler from representing the chars as ints (though of course a compiler
 * is still at liberty to do so if it wishes - none of them do though).
 * I wanted to see if the compilers created temporary int variables to store
 * the terms which were referred to more than once.
 *
 * MIPS C compiles this exactly as above.
 * 3L C does optimize the struct element refs.  It doesn't use temp. vars
 *      to overcome multiple refs to byte elements.
 * Inmos C does likewise.
 * YARC C does likewise.
 * Meiko C attempts to use a temp to overcome multiple refs to "x.a", but ends
 *      up producing code which is just as inefficient as if it hadn't!
 * PACT Parallel C doesn't fold the constants now, but simply generates
 *      code to load and store the bytes (packed) and int.
 * Helios C optimizes the struct element refs, but doesn't use temp. vars
 *      to overcome multiple refs to byte elements.  It unnecessarily AND's
 *      char expressesions with 255 before storing them with an sb instr.
 */
{
  struct {
    char a, b, c;
    int i;
  } x;

  x.a = 20;
  x.b = 10;
  x.c = x.a + x.b;
  x.i = x.a + 2;
  x.c = x.a + x.b;
}

int unreachable_code(void)
/* ... and unused variables
 * This really is the opposite case of deterministic_conditionals above.
 * All Transputer compilers allocated space for all variables, even though
 * one of them isn't referred to.
 *
 * MIPS C optimises this out of existence (really - just to a return instr.)
 * 3L C doesn't simplify any of this
 * Inmos C doesn't simplify any of this
 * YARC C eliminates the unreachable code
 * Meiko C doesn't simplify any of this
 * PACT Parallel C assignes 200 to c and warns about a and b being unused
 *     and about the constant in the conditional.  Eliminates the loop.
 * Helios C eliminates the unreachable code.
 */
{
  int a, b, c;

  c = 200;
  while (0) {
    b = c - 10;
  }
}

int complex_expression(void)
/* Just to see how well things get re-ordered for the Transputer stack.
 * The Transputer compilers, of necessity, generate lots of references to
 * memory, whereas the MIPS compiler generates none.  I suspect that Transputers
 * are highly memory bound in their performance.
 *
 * MIPS does entirely register arithmetic.  It eliminates the last statement.
 * 3L C rearranges expressions for the tptr stack well for the first 2 exprs.
 *      It creates an unnecessary temp. for the third expression.
 * Inmos C rearranges all expressions well
 * YARC C rearranges the expressions correctly, and does a small peephole
 *      optimization between the 2nd and 2rd statements.
 * Meiko C eliminates the common subexpression (d-e), but uses an unnecessary
 *      temp. in the third expression.
 * PACT Parallel C folds all 3 expressions, locates common subexpressions
 *      which are evaluated and stored in temps.  Apart from these temps
 *	it can evaluate the entire basic block (i.e. all 3 expressions)
 *	using the stack.
 * Helios C rearranges the expressions well, but doesn't eliminate common
 *      subexpressions.
 */
{
  int a, b, c, d, e;

  a = b + c * (d - e);
  a = (d - e) - (a + a * b);
  a = b * ((c - d * (a + b)) * (a - b * (a + c)) + 2);
}

int constant_expressions(void)
/* A compiler which failed this test would be a real lose!
 * Fortunately none of them did, though none of them recognised the call
 * to "abs" either.
 *
 * MIPS C eliminates the first, does a function call for the second
 * 3L C evaluates the first at compile time, the second at run time, using
 *      a function call
 * Inmos C does likewise.
 * YARC C also does likewise.
 * Meiko C does the same.
 * PACT Parallel C evaluates the first at compile time and the second at
 *	run time with a fn call.
 * Helios C eliminates the first, does a function call for the second.
 */
{
  int a, b;

  a = 4 + 7 * 3;
  b = abs(-2);
}

int floating_point_instr(void)
/* Check sheduling of floating point on a T8 - taken from
 * "The Transputer Instruction Set - A Compiler Writer's Guide" p75.
 *
 * MIPS C uses FP and integer regs (comparison to transputer not relevant)
 * 3L C generates "naive" code (as defined in the book) for this.
 * Inmos C also generates "naive" code.
 * Meiko C also generates "naive" code, but it does eliminate some subscript
 *    subexpressions.
 * PACT Parallel C uses temps to store i*8 and j*8 (implicit due to array
 *	reference), and signals (!) a 15 cycle stall after the FP multiply
 *	- no rescheduling is currently done.
 * Helios C stores the arrays in non-local memory, and precomputes sizes and
 *      addresses.  It generates "naive" code for the FP expression.
 */
{
  double a[20][20], b[20][20];
  double c, d;
  int i, j;

  a[i][j] = b[i][j]*c + d;
}
--
--
Jeremy Webber			   ACSnet: jeremy@chook.ua.oz
Digital Arts Film and Television,  Internet: jeremy@chook.ua.oz.au
3 Milner St, Hindmarsh, SA 5007,   Voicenet: +61 8 346 4534
Australia			   Papernet: +61 8 346 4537 (FAX)