mcarrick@gara.une.oz.au (Mick Carrick) (03/14/91)
I am forwarding this request for a University organization which does not
have access to the net.  They have an urgent need for a C compiler which
uses all of the speed features of the '486 chip for a vision processing
application.  Do you know of any?  Are any at the beta test stage?  Where
can they be accessed?  If this is not interesting to the net, please reply
directly.

Thanks,

Mick Carrick
Department of Animal Science
UNE, Armidale, NSW, 2351, Australia
mcarrick@gara.une.oz.au

[It is my impression that optimizations for the 486 are in most cases the
same ones that you'd do for the 386.  There are a few places on the 486
where you can avoid pipeline stalls by reordering instructions, and a few
instructions that are relatively faster or slower; other than that,
anything you'd do for a 386 applies equally. -John]
--
Send compilers articles to compilers@iecc.cambridge.ma.us or
{ima | spdcc | world}!iecc!compilers.  Meta-mail to compilers-request.
mason+@transarc.com (Tony Mason) (03/15/91)
Excerpts from netnews.comp.compilers: 13-Mar-91 C Compilers which use
full.. Mick Carrick@gara.une.oz (962)

> [It is my impression that optimizations for the 486 are in most cases the
> same ones that you'd do for the 386.  There are a few places on the 486
> where you can avoid pipeline stalls by reordering instructions, and a few
> instructions that are relatively faster or slower, other than that anything
> you'd do for a 386 applies equally. -John]

Actually, the March 1991 issue of Dr. Dobb's Journal of Software Tools
discusses 486 optimization.  The author asserts (p. 18):

"The 486 represents a fundamental break with 8088-style optimization.
Virtually all the old rules fail on the 486, where, incredibly, a move to
or from memory often takes just one cycle, but exchanging two registers
takes three cycles."

From that, I'd conclude the optimization isn't the same.  Perhaps the most
current Microsoft C compiler does the job?

Tony Mason
mason+@transarc.com

[The 486 Programmer's Reference has quite a lot to say about 486
optimization, and most of what it says also applies to the 386.  The MOV
vs. XCHG example is an extreme one, as XCHG is a well-known slow
instruction, both because it has some bus-locking side effects, and
because it's not in the set of frequently used instructions that they
optimized to one cycle.  There are a few peculiarities that are different,
but in general it's really true that you make mostly the same decisions on
the 486 as on the 386.  The Programmer's Reference manual is quite
informative and is well worth the $25 that it costs. -John]
preston@ariel.rice.edu (Preston Briggs) (03/16/91)
mason+@transarc.com (Tony Mason) writes:
>[quoting from Dr. Dobb's]
>"The 486 represents a fundamental break with 8088-style optimization.
>Virtually all the old rules fail on the 486, where, incredibly, a move
>to or from memory often takes just one cycle, but exchanging two
>registers takes three cycles."

Does anyone's compiler generate an exchange instruction?  I can visualize
uses, but I'm not sure how I'd recognize it.  I can imagine peephole
optimizers that could catch it, but generally, I dislike instructions with
more than one result.

How would it integrate with register allocation?

Preston Briggs

[As far as I can tell, most actual XCHG instructions are really
test-and-sets that set locks between processes or processors. -John]
meissner@osf.org (03/17/91)
| [WRT the rather slow 486 XCHG instruction]
| Does anyone's compiler generate an exchange instruction?  I can visualize
| uses, but I'm not sure how I'd recognize it.  I can imagine peephole
| optimizers that could catch it, but generally, I dislike instructions with
| more than one result.
|
| How would it integrate with register allocation?

When I started working on the 88k code generator for the GNU C compiler at
Data General, it supported the xmem and xmem.bu instructions through
peepholes.  The peephole checked for:

	reg1 <- memory1
	reg2 <- memory2
	memory1 <- reg2
	memory2 <- reg1

and converted it into the appropriate xmem instruction.  The code wasn't
safe, in that it should have made sure reg1 and reg2 were dead after the
peephole, but it could have been made safe with an appropriate check.
This peephole did optimize the sort subtest in the Stanford benchmarks.

I eventually removed it when we ran the compiler on an actual 88k and
compared the times to another compiler.  The sort benchmark was an
anomaly, in that it ran much slower for GCC compared to the other
compiler.  The reason, of course, is that xmem on an 88k has to
synchronize the machine before continuing.
rex@aussie.COM (Rex Jaeschke) (03/17/91)
> They have an urgent need for a C compiler which uses all of the
> speed features of the '486 chip for a vision processing application.
> Do you know of any? Are any at the beta test stage? Where can they
> be accessed?

MetaWare's High C has 486-specific code generation.  They have a very good
reputation in the compiler business.  For info, contact:
uunet!metaware.com!steven

Rex

----------------------------------------------------------------------------
Rex Jaeschke    | Journal of C Language Translation | C Users Journal
(703) 860-0091  | 2051 Swans Neck Way               | DEC PROFESSIONAL
rex@aussie.COM  | Reston, Virginia 22091, USA       | Programmers Journal
pardo@cs.washington.edu (David Keppel) (03/19/91)
>mason+@transarc.com (Tony Mason) writes:
>>[quoting from Dr. Dobb's]
>>[Old optimization rules fail on the 486, where a move to or from memory
>> takes one cycle, but exchanging two registers takes three cycles.]

preston@ariel.rice.edu (Preston Briggs) writes:
>Does anyone's compiler generate an exchange instruction?
>I can visualize uses, but I'm not sure how I'd recognize it.
>How would it integrate with register allocation?

John Levine (compilers-request) writes:
>[Most uses I have seen are for atomic operations.]

It is certainly the case that exchange is used for atomic operations.  But
that is because there are no `ordinary' instructions that will do the
trick.  The other half of the question is ``what non-atomic operations can
make use of the exchange instruction?''  I have an answer in two parts:
first, example code that uses an abstract exchange operation; second,
details of the iX86 `xchg' operation.

* Example Code

Here is code that the compiler could recognize as using exchange.  I saw
this code yesterday in a data compression algorithm:

	t := prev_val;
	prev_val := a[i];
	a[i] := t;

A reasonable implementation could make use of exchange, and a simple
peephole optimizer could replace instances of

	mov rx, ry
	ld rx, addr
	st ry, addr

with

	xchg rx, addr

(as long as `ry' is not used subsequently).

I can imagine a register allocator that, instead of doing spills and
restores with

	mov r0, -46[fp]
	mov -50[fp], r0

figures out the lifetimes of the variables, figures out that they're
disjoint (after all, the stack slots got generated in the first place
because there weren't enough empty registers...), creates one merged stack
slot, and replaces the above with

	xchg r0, -44[fp]

This has the advantages of (a) smaller code size and (b) better stack
locality (decreasing the stack size and increasing the utilization of
particular stack slots).  Note that no analysis of user code is needed for
this latter operation; the spills and restores are compiler-generated.
* iX86 `XCHG'

The iX86 has a `lock' prefix that can be applied to several instructions.
When those instructions are `lock'-prefixed, they are guaranteed to behave
atomically wrt memory.  On a multiprocessor, if two processors
simultaneously execute

	lock incl 4[r0]

then the value at 4[r0] is guaranteed to be incremented by 2 (one for each
processor).

The `xchg' instruction can have the `lock' prefix applied to it, but --
for reasons I know not what -- omitting the `lock' prefix has exactly the
same effect, namely the exchange operation is atomic whether or not you
use the `lock' prefix.  Thus, on the i486, reads and writes that hit in
the cache never go off-chip -- except that all `xchg' instructions must
assert a line off-chip saying ``don't touch (this) memory right now''.
The 3-cycle estimate is optimistic -- on a multiprocessor with a
heavily-loaded bus, atomic bus operations take substantially longer.

To summarize: it isn't hard to think of cases where exchange could be
useful, but you don't want to do it on the iX86 family, because the
exchange instruction is defined as having atomic semantics that must be
enforced at any cost.

;-D on  ( Look, ma, no peephole! )  Pardo
preston@ariel.rice.edu (Preston Briggs) (03/20/91)
pardo@cs.washington.edu (David Keppel) writes, about exchange instructions
(finding uses, not really defending to the death):

>Here is code that the compiler could recognize as using exchange.  I
>saw this code yesterday in a data compression algorithm:
>
>	t := prev_val;
>	prev_val := a[i];
>	a[i] := t;
>
>A reasonable implementation could make use of exchange, and a simple
>peephole optimizer could replace instances of
>
>	mov rx, ry
>	ld rx, addr
>	st ry, addr
>with
>	xchg rx, addr
>
>(as long as `ry' is not used subsequently).

I wouldn't have constrained prev_val to be in rx before and after.
Instead, realize that prev_val actually denotes 2 disjoint live ranges.
Therefore:

	-- prev_val in rx
	ld ry, addr
	st rx, addr
	-- prev_val in ry

with both registers surviving.  However, the code might be contained in a
loop, with the two appearances of prev_val actually part of one live
range.  In this case, I'd need the extra move at some point.

>I can imagine a register allocator that instead of doing spills and
>restores with
>
>	mov r0, -46[fp]
>	mov -50[fp], r0
>
>figures out the lifetimes of the variables, figures out that they're
>disjoint (after all, the stack slots got generated in the first place
>because there weren't enough empty registers...), creates one merged
>stack slot, and replaces the above with
>
>	xchg r0, -44[fp]
>
>This has advantages of (a) smaller code size and (b) better stack
>locality (decreasing the stack size and increasing the utilization of
>particular stack slots).  Note that no analysis of user code is needed
>for this latter operation, the spills and restores are compiler-generated.

Please don't make me do this!  The problem of merging spill locations is
just graph coloring all over again (figure out lifetimes and note when
they're disjoint).  Then you require the register colors to match and the
stack locations to match.
Besides being difficult, the 1st version (separate load and store) has the
advantage that we can attempt to schedule the load in advance of its use
(especially if it turns out we've overspilled slightly, which happens).

I can imagine a 3rd use (which I also don't expect to generate).  We might
have parameters passed in registers, with the 1st parameter in r1, the 2nd
in r2, and so forth.  Given the following source:

	void zip(i, j)
	int i, j;
	{
	    ...
	    zap(j, i);
	    ...
	}

using moves alone requires going through memory or another register.

I guess I believe that exchange is just a little too CISCish for my taste;
tiny payoffs and hard to use.  Of course, the synchronization case is very
important, but I'd expect that to be hidden in a system routine and coded
in assembler.

Preston Briggs