[comp.compilers] C Compilers which use full 486 capabilities

mcarrick@gara.une.oz.au (Mick Carrick) (03/14/91)

I am forwarding this request for a University organization which does
not have access to the net.

They have an urgent need for a C compiler which uses all of the
speed features of the '486 chip for a vision processing application.

Do you know of any? Are any at the beta test stage? Where can they
be accessed?

If this is not interesting to the net, please reply directly.

Thanks

Mick Carrick
Department of Animal Science
UNE, Armidale, NSW, 2351
Australia

mcarrick@gara.une.oz.au
[It is my impression that optimizations for the 486 are in most cases the
same ones that you'd do for the 386.  There are a few places on the 486
where you can avoid pipeline stalls by reordering instructions, and a few
instructions that are relatively faster or slower, other than that anything
you'd do for a 386 applies equally. -John]
-- 
Send compilers articles to compilers@iecc.cambridge.ma.us or
{ima | spdcc | world}!iecc!compilers.  Meta-mail to compilers-request.

mason+@transarc.com (Tony Mason) (03/15/91)

Excerpts from netnews.comp.compilers: 13-Mar-91 C Compilers which use
full .. Mick Carrick@gara.une.oz (962)

> [It is my impression that optimizations for the 486 are in most cases the
> same ones that you'd do for the 386.  There are a few places on the 486
> where you can avoid pipeline stalls by reordering instructions, and a few
> instructions that are relatively faster or slower, other than that anything
> you'd do for a 386 applies equally. -John]

Actually, the March 1991 issue of Dr. Dobbs Journal of Software Tools
discusses 486 optimization.  The author asserts (p. 18):

"The 486 represents a fundamental break with 8088-style optimization. 
Virtually all the old rules fail on the 486, where, incredibly, a move
to or from memory often takes just one cycle, but exchanging two
registers takes three cycles."

From that, I'd conclude the optimization isn't the same.  Perhaps the
most current Microsoft C compiler does the job?

Tony Mason
> mason+@transarc.com
[The 486 Programmer's Reference has quite a lot to say about 486 optimization,
and most of what it says also applies to the 386.  The MOV vs. XCHG
example is an extreme one, as XCHG is a well-known slow instruction, both
because it has some bus-locking side effects, and because it's not in the
set of frequently used instructions that they optimized to one cycle.  There
are a few peculiarities that are different, but in general it's really true
that you make mostly the same decisions on the 486 as on the 386.  The
Programmer's Reference manual is quite informative and is well worth the $25
that it costs. -John]
-- 
Send compilers articles to compilers@iecc.cambridge.ma.us or
{ima | spdcc | world}!iecc!compilers.  Meta-mail to compilers-request.

preston@ariel.rice.edu (Preston Briggs) (03/16/91)

mason+@transarc.com (Tony Mason) writes:
>[quoting from Dr. Dobbs]
>"The 486 represents a fundamental break with 8088-style optimization. 
>Virtually all the old rules fail on the 486, where, incredibly, a move
>to or from memory often takes just one cycle, but exchanging two
>registers takes three cycles."

Does anyone's compiler generate an exchange instruction?  I can visualize
uses, but I'm not sure how I'd recognize it.  I can imagine peephole
optimizers that could catch it, but generally, I dislike instructions with
more than 1 result.

How would it integrate with register allocation?

Preston Briggs
[As far as I can tell, most actual XCHG instructions are really test-and-sets
that set locks between processes or processors. -John]
-- 
Send compilers articles to compilers@iecc.cambridge.ma.us or
{ima | spdcc | world}!iecc!compilers.  Meta-mail to compilers-request.

meissner@osf.org (03/17/91)

| [WRT the rather slow 486 XCHG instruction]
| Does anyone's compiler generate an exchange instruction?  I can visualize
| uses, but I'm not sure how I'd recognize it.  I can imagine peephole
| optimizers that could catch it, but generally, I dislike instructions with
| more than 1 result.
| 
| How would it integrate with register allocation?

When I started working on the 88k code generator for the GNU C
compiler at Data General, it supported the xmem and xmem.bu
instructions through peepholes.  The peephole checked for:

	reg1 <- memory1
	reg2 <- memory2
	memory1 <- reg2
	memory2 <- reg1

and converted it into the appropriate xmem instruction.  The code
wasn't safe, in that it should have made sure reg1 and reg2 were dead
after the peephole, but it could have been made safe with an
appropriate check.  This peephole did optimize the sort subtest in the
Stanford benchmarks.  I eventually removed it, when we ran the
compiler on an actual 88k, and compared the times to another compiler.
The sort benchmark was an anomaly, in that ran much slower for GCC
compared to the other compiler.  The reason of course is that xmem on
an 88k has to synchonize the machine before continuing.
-- 
Send compilers articles to compilers@iecc.cambridge.ma.us or
{ima | spdcc | world}!iecc!compilers.  Meta-mail to compilers-request.

rex@aussie.COM (Rex Jaeschke) (03/17/91)

> They have an urgent need for a C compiler which uses all of the
> speed features of the '486 chip for a vision processing application.
> Do you know of any? Are any at the beta test stage? Where can they
> be accessed?

MetaWare's High C has 486-specific code generation. They have a very good 
reputation in the compiler business. For info, contact:

	uunet!metaware.com!steven

Rex

----------------------------------------------------------------------------
Rex Jaeschke     |  Journal of C Language Translation  | C Users Journal
(703) 860-0091   |        2051 Swans Neck Way          | DEC PROFESSIONAL
rex@aussie.COM   |     Reston, Virginia 22091, USA     | Programmers Journal
-- 
Send compilers articles to compilers@iecc.cambridge.ma.us or
{ima | spdcc | world}!iecc!compilers.  Meta-mail to compilers-request.

pardo@cs.washington.edu (David Keppel) (03/19/91)

>mason+@transarc.com (Tony Mason) writes:
>>[quoting from Dr. Dobbs]
>>[Old optimization rules fail on the 486, where, a move to or from memory
>> takes one cycle, but exchanging two registers takes three cycles.]

preston@ariel.rice.edu (Preston Briggs) writes:
>Does anyone's compiler generate an exchange instruction?
>I can visualize uses, but I'm not sure how I'd recognize it.
>How would it integrate with register allocation?

John Levine (compilers-request) writes:
>[Most uses I have seen are for atomic operations.]

It is certainly the case that exchange is used for atomic operations.
But that is because there are no `ordinary' instructions that will do
the trick.  The other half of the question is ``what non-atomic
operations can make use of the exchange instruction?''  I have an
answer in two parts: First, example code that uses an abstract
exchange operation.  Second, details of the the iX86 `xchg' operation.

* Example Code

Here is code that the compiler could recognize as using exchange.  I
saw this code yesterday in a data compression algorithm:

	t := prev_val;
	prev_val := a[i];
	a[i] := t;

A reasonable implementation could make use of exchange, and a simple
peephole optimizer could replace instances of

	mov rx, ry
	ld rx, addr
	st ry addr

with

	xchg rx, addr

(as long as `ry' is not used subsequently).  I can imagine a register
allocator that instead of doing spills and restores with

	mov r0, -46[fp]
	mov -50[fp], r0

figures out the lifetimes of the variables, figures out that they're
disjoint (after all, the stack slots got generated in the first place
becase there weren't enough empty registers...), creates one merged
stack slot, and replaces the above with

	xchg  r0, -44[fp]

This has advantages of (a) smaller code size and (b) better stack
locality (decreasing the stack size and increasing the utilization of
particular stack slots).  Note that no analysis of user code is needed
for this latter operation, the spills and restores are
compiler-generated.

* iX86 `XCHG'

The iX86 has a `lock' prefix that can be applied to several
instructions.  When those instructions are `lock'-prefixed, they are
guaranteed to behave atomically wrt memory.  On a multiprocessor, if
two processors simultaneously execute

	lock incl 4[r0]

then the value at 4[r0] is guaranteed to be incremented by 2 (one for
each processor).  The `xchg' instruction can have the `lock' prefix
applied to it but -- for reasons I know not what -- omitting the
`lock' prefix has exactly the same effect, namely the exchange
operation is atomic whether or not you use the `lock' prefix.

Thus, on the i486 reads and writes that hit in the cache never go
off-chip.  Except that all `xchg' instructions must assert a line off
chip saying ``don't touch (this) memory right now''.  The 3-cycle
estimate is optmistic -- on a multiprocessor with a heavily-loaded
bus, atomic bus operations take substantially longer.

To summarize: it isn't hard to think of cases when exchange could be
useful, but you don't want to do it on the iX86 family because the
exchange instruction is defined as having atomic semantics that must
be enforced at any cost.

		;-D on  ( Look, ma, no peephole! )  Pardo
-- 
Send compilers articles to compilers@iecc.cambridge.ma.us or
{ima | spdcc | world}!iecc!compilers.  Meta-mail to compilers-request.

preston@ariel.rice.edu (Preston Briggs) (03/20/91)

pardo@cs.washington.edu (David Keppel) writes:

about exchange instructions (finding uses, not really defending to the death)

>Here is code that the compiler could recognize as using exchange.  I
>saw this code yesterday in a data compression algorithm:
>
>	t := prev_val;
>	prev_val := a[i];
>	a[i] := t;
>
>A reasonable implementation could make use of exchange, and a simple
>peephole optimizer could replace instances of
>
>	mov rx, ry
>	ld rx, addr
>	st ry addr
>with
>	xchg rx, addr
>
>(as long as `ry' is not used subsequently).

I wouldn't have constrained prev_val to be in rx before and after.
Instead, realize that prev_val actually denotes 2 disjoint live
ranges.  Therefore

	-- prev_val in rx
	ld ry, addr
	st rx, addr
	-- prev_val in ry

with both registers surviving.
However, the code might contained in a loop and the two appearances of
prev_val actually part of one live range.  In this case, I'd need
the extra move at some point.

>  I can imagine a register
>allocator that instead of doing spills and restores with
>
>	mov r0, -46[fp]
>	mov -50[fp], r0
>
>figures out the lifetimes of the variables, figures out that they're
>disjoint (after all, the stack slots got generated in the first place
>becase there weren't enough empty registers...), creates one merged
>stack slot, and replaces the above with
>
>	xchg  r0, -44[fp]
>
>This has advantages of (a) smaller code size and (b) better stack
>locality (decreasing the stack size and increasing the utilization of
>particular stack slots).  Note that no analysis of user code is needed
>for this latter operation, the spills and restores are compiler-generated.

Please don't make me do this!

The problem of merging spill locations is just graph coloring all over again
(the figure out lifetimes and note when they're disjoint).  Then you require
the register colors to match and the stack locations to match.

Besides being difficult, the 1st version (separate load and store) has the
advantage that we can attempt to schedule the load in advance of its use
(especially if it turns out we've overspilled slightly, which happens).

I can imagine a 3rd use (which I also don't expect to generate).  We might
have parameters passed in registers, with the 1st parameter in r1, the 2nd
in r2, and so forth.  Given the following source:

	void zip(i, j)
	    int i, j;
	{
	    ...
	    zap(j, i);
	    ...
	}

using moves alone requires going through memory or another register.

I guess I believe that exchange is just a little too CISCish for my taste;
tiny payoffs and hard to use.  Of course, the synchronization case
is very important, but I'd expect that to be hidden in a system routine
and code in assembler.

Preston Briggs
-- 
Send compilers articles to compilers@iecc.cambridge.ma.us or
{ima | spdcc | world}!iecc!compilers.  Meta-mail to compilers-request.