schow@bnr-public.uucp (Stanley Chow) (07/16/89)
In article <12233@pur-ee.UUCP> hankd@pur-ee.UUCP (Hank Dietz) writes:
>
>Let's assume that the RISC and CISC machines under consideration are
>identical except for instruction encoding; the CISC being microcoded and
>each RISC instruction corresponding to one micro instruction.

This has been stated many times in the RISC vs CISC debates. Usually,
the argument is that RISC opcodes running in the I-cache are the
equivalent of microcode in a CISC machine.

This argument is wrong for the following reasons:

 1) CISC machines do not have to be microcoded. It just happens that a
    majority of CISC CPUs are microcoded. Please remember that the VAX
    is just one example of CISC. There are many other examples of CISC
    that are totally unlike the VAX. If you are arguing against the
    VAX, then say so. Arguing against CISC means you think the CISC
    concept has some *inherent* flaw that you are exposing.

 2) The typical RISC instruction is far from equal to a typical
    microcode instruction. Almost any interesting microcoded CPU will
    have a micro-word that is much wider than 32 bits, with many
    parallel operations going on. In fact, the whole premise of RISC
    guarantees that a RISC instruction will be *less* than a micro-word.

 3) Even if a sequence of RISC instructions corresponds to a sequence
    of microcode, the effects are still not the same. Usually, a
    microcoded instruction is atomic (or can easily be made so). Most
    RISC CPUs require very special fudging to achieve fine-grain
    atomicity.

>
>CISC compilers tend to pack as many operations together in single
>instructions as they possibly can, but this implies a certain amount of
>redundancy because the complex instructions represent general case sequences
>of micro instructions. Initially, there will be several RISC instructions
>for each of the CISC instructions, perhaps 3:1. However, many of them will
>be optimized away.
>

 4) RISC has more opportunity for optimization (I will leave open the
    question of whether RISC *requires* optimization). This means the
    compiler is much harder to get right. Please note that I am not
    saying the current RISC compilers don't work. I am merely saying
    that if I have an obscure bug in my program, I would be more
    inclined to suspect a RISC compiler than a dumb CISC compiler.

>This is because the RISC compiler can recognize and remove redundancies in
>the CISC micro instruction sequences. Generally, RISC implies more available
>expressions, which makes value propagation and common subexpression
>elimination more productive. Taking arbitrary basic blocks of RISC code and
>doing rather simple analysis, we have often found that only about 1/4 of the
>operations remain after optimization... this is a very unlikely scenario
>for CISC operations.

The question is, what makes it possible to throw away the 3/4? The
possible candidates:
 - Data-flow analysis
 - other optimizations
 - more registers
 - the RISC instruction set

It is not clear to me that the RISC instruction set should get all the
credit just because the compiler is a RISC compiler.

Stanley Chow                BitNet:  schow@BNR.CA
BNR                         UUCP:    ..!psuvax1!BNR.CA.bitnet!schow
(613) 763-2831                       ..!utgpu!bnr-vpa!bnr-fos!schow%bnr-public
Me? Represent other people? Don't make them laugh so hard.
hankd@pur-ee.UUCP (Hank Dietz) (07/17/89)
In article <753@bnr-fos.UUCP> schow%BNR.CA.bitnet@relay.cs.net (Stanley Chow) writes:
>In article <12233@pur-ee.UUCP> hankd@pur-ee.UUCP (Hank Dietz) writes:
>>
>>Let's assume that the RISC and CISC machines under consideration are
         ^^^^^^
>>identical except for instruction encoding; the CISC being microcoded and
>>each RISC instruction corresponding to one micro instruction.
>
>This has been stated many times in the RISC vs CISC debates. Usually,
>the argument is that RISC opcodes running in the I-cache are the
>equivalent of microcode in a CISC machine.
>
>This argument is wrong for the following reasons:
....

You missed my point. I wasn't saying that people should or do design
RISC and CISC machines this way, but rather that this is a convenient
and reasonable assumption so that a more formal analysis could be made
of the possibility of RISC:CISC instruction counts being less than 1:1.

> 4) RISC has more opportunity for optimization (I will leave open
>    the question of whether RISC *requires* optimization). This means
>    the compiler is much harder to get right. Please note that I am
>    not saying the current RISC compilers don't work. I am merely
>    saying that if I have an obscure bug in my program, I would be
>    more inclined to suspect a RISC compiler than a dumb CISC compiler.

Not me... I'd be more inclined to suspect a bug in the NEWER compiler,
be it RISC or CISC. However, if it's an optimizing CISC compiler, I'll
bet it has at least a few bugs even if it has been out for years.
Perhaps I'm too much into the field of optimizing compilers, but from
personal experience I have seen *decent* CISC compilers nearly always
be harder to build than those performing similar optimizations for
RISC... it is just that the average RISC compiler is expected to do
more optimization than the average CISC compiler.

>>elimination more productive. Taking arbitrary basic blocks of RISC code and
>>doing rather simple analysis, we have often found that only about 1/4 of the
>>operations remain after optimization... this is a very unlikely scenario
>>for CISC operations.
>
>The question is, what makes it possible to throw away the 3/4? The
>possible candidates:
> - Data-flow analysis
> - other optimizations
> - more registers
> - the RISC instruction set
>
>It is not clear to me that the RISC instruction set should get all the
>credit just because the compiler is a RISC compiler.

The RISC instruction sets tend naturally to require intermediate
results to be placed in registers, and this naturally yields more
available expressions. I'm not quoting gospel here, but the little
compilers we've been doing often manage to get rid of about 3/4 of the
ops just doing things straight from Aho & Ullman's "Principles of
Compiler Design":

[1] Value propagation (get values from where cheapest, remember what
    values appear where and which quantities are equivalent)
[2] Constant folding (also uses value prop.: A=5; B=A*A; would get
    coded as if it were A=5; B=25;)
[3] Common subexpression elimination
[4] Dead code elimination (get rid of dead stores, etc.)

By the way, my best guess is that a comparable CISC compiler would get
rid of perhaps 1/3 of the instructions. But that hardly matters. My
point relies only on the fact that if an entire CISC instruction can
be eliminated, then so can the sequence of RISC instructions which
would implement it; whereas a RISC instruction might be eliminated but
the CISC instruction which would incorporate it cannot be eliminated
(because other parts of the CISC instruction are needed).

I don't want to start another war, but we *never* run out of
registers, so that has no impact.

The prime point of my posting is that it is quite possible that, for
"equivalent" RISC and CISC machines (using the definition I gave in
the first sentence of my posting), the RISC:CISC instruction count
ratio could be 1:1. Lower ratios would indicate that the CISC compiler
isn't doing analysis which the RISC compiler is -- somebody ought to
go fix the CISC compiler.

						-hankd
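For readers who want the four optimizations made concrete, here is a
hypothetical before/after pair in C (the function names and the code
fragment are invented for illustration; this sketches the
transformations listed above, not output from any actual compiler):

```c
#include <assert.h>

/* 'unoptimized' is roughly what naive code generation might emit for
   A=5; B=A*A; C=a+A; D=B*2 (D never used); return B+C.            */
static int unoptimized(int a) {
    int A = 5;          /* constant definition                     */
    int B = A * A;      /* [2] foldable: B = 25                    */
    int C = a + A;      /* [1] value propagation: C = a + 5        */
    int D = B * 2;      /* [4] dead store: D is never used again   */
    (void)D;
    return B + C;
}

/* 'optimized' is the residue after value propagation, constant
   folding, and dead-code elimination have done their work.        */
static int optimized(int a) {
    return 25 + (a + 5);
}
```

The two functions must agree on every input; the point is only that
three of the five operations vanish without changing the result.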
henry@utzoo.uucp (Henry Spencer) (07/17/89)
In article <753@bnr-fos.UUCP> schow%BNR.CA.bitnet@relay.cs.net (Stanley Chow) writes:
> 4) RISC has more opportunity for optimization ... This means
>    the compiler is much harder to get right...

No, it means it is harder to get optimal. There's a difference between
generating suboptimal code and generating wrong code. Truly well-built
compilers (obviously the semantics of "well-built" here are somewhat a
matter of opinion) do much optimization by applying transformations that
are *hoped* to improve the code but are *known* to preserve semantics.
Such compilers may not always generate the fastest code but will seldom
generate wrong code.

Incidentally, do you really think the greater complexity of CISC
instruction sets makes CISC compilers *easier* to get right? The
compiler, if nobody else, has to know all the gory details of the
machine if it's really pushing for good code quality.
--
$10 million equals 18 PM        | Henry Spencer at U of Toronto Zoology
(Pentagon-Minutes). -Tom Neff   | uunet!attcan!utzoo!henry henry@zoo.toronto.edu
schow@bnr-public.uucp (Stanley Chow) (07/17/89)
In article <1989Jul17.035238.6497@utzoo.uucp> henry@utzoo.uucp (Henry Spencer) writes:
>In article <753@bnr-fos.UUCP> schow%BNR.CA.bitnet@relay.cs.net (Stanley Chow) writes:
>> 4) RISC has more opportunity for optimization ... This means
>>    the compiler is much harder to get right...
>
>No, it means it is harder to get optimal. There's a difference between
>generating suboptimal code and generating wrong code. Truly well-built
>compilers (obviously the semantics of "well-built" here are somewhat a
>matter of opinion) do much optimization by applying transformations that
>are *hoped* to improve the code but are *known* to preserve semantics.
>Such compilers may not always generate the fastest code but will seldom
>generate wrong code.
>
>Incidentally, do you really think the greater complexity of CISC
>instruction sets makes CISC compilers *easier* to get right? The
>compiler, if nobody else, has to know all the gory details of the
>machine if it's really pushing for good code quality.
>--

Henry (and other posters) quite correctly pointed out the mistake in my
statement. What I meant to say is:

 a) RISC "needs" more optimization in the compiler (to achieve the
    "same" degree of "optimality").

 b) Therefore, in *minimal* (meaning compilers with barely enough
    brains to get acceptable results) compilers, the RISC compiler
    will tend to have more and fancier optimizations.

 c) Therefore, in minimal compilers, the RISC compiler will tend to
    be less reliable.

Obviously, the "native" complexity of a CPU will have substantial
impact on the complexity and reliability of the compiler. As other
posters have pointed out, CISC does not *necessarily* mean a complex
CPU. CISC can be very regular and simple to use, especially if you
don't attempt to optimize in the compiler.

I am by no means saying that *all* CISC machines lend themselves to a
nice simple reliable compiler. I *am* saying that CISC is just as
likely to *allow* such a compiler. It all depends on the actual
complexity of the machine, not whether it is RISC or CISC.

Incidentally, I agree that truly optimal compilers are much harder for
complex machines. I will even agree that most of the "classical"
machines are not nice machines for optimizing compilers.

Stanley Chow                BitNet:  schow@BNR.CA
BNR                         UUCP:    ..!psuvax1!BNR.CA.bitnet!schow
(613) 763-2831                       ..!utgpu!bnr-vpa!bnr-fos!schow%bnr-public
Me? Represent other people? Don't make them laugh so hard.
bruce@heather.pooh.com (Bruce Robertson) (07/19/89)
These optimizations have been mentioned:

 [1] Value propagation (get values from where cheapest, remember what
     values appear where and which quantities are equivalent)
 [2] Constant folding (also uses value prop.: A=5; B=A*A; would get
     coded as if it were A=5; B=25;)
 [3] Common subexpression elimination
 [4] Dead code elimination (get rid of dead stores, etc.)

Shouldn't all these optimizations be done in the machine independent
portion of the compiler? The question of RISC v. CISC is raised
primarily in the code generator, which in my experience is not the
largest portion of the compiler. A well designed compiler should have
all the machine dependencies isolated in one relatively small module
(relative to the size of the entire compiler). RISCy things like delay
slot filling are easiest to handle in the assembler, or other post
processor.
--
Bruce Robertson
Hundred Acre Software, Reno, NV
Domain: bruce@pooh.com
UUCP:   ...!uunet!tahoe.unr.edu!heather!bruce
cjosta@tasu77.UUCP (Jonathan Sweedler) (07/19/89)
In article <756@bnr-fos.UUCP> schow%BNR.CA.bitnet@relay.cs.net (Stanley Chow) writes:
>In article <1989Jul17.035238.6497@utzoo.uucp> henry@utzoo.uucp (Henry Spencer) writes:
>>In article <753@bnr-fos.UUCP> schow%BNR.CA.bitnet@relay.cs.net (Stanley Chow) writes:
>>> 4) RISC has more opportunity for optimization ... This means
>>>    the compiler is much harder to get right...
>>
>>No, it means it is harder to get optimal. There's a difference between
>>generating suboptimal code and generating wrong code.
>
>Henry (and other posters) quite correctly pointed out the mistake in my
>statement. What I meant to say is:
>

Wait a second. I wouldn't give in so quickly if I were you. I'm more
of a hardware guy than a software guy, but I was always under the
impression that RISC machines need more complex compilers, not just
because of the need for more optimization, but because RISC chips are
'dumber' than CISC chips, i.e., a lot of the complexity of the chip is
moved to the software. Don't forget that MIPS, one of the early RISC
chips, stands for 'Microprocessor without Interlocked Pipe Stages'.
This means that interlocks must be handled in the compiler. So
forgetting about all optimization, just writing a compiler that WORKS
on a RISC chip is by DEFINITION harder than writing a compiler for a
CISC chip.

Correct me if I'm wrong, but I remember looking at some of the initial
specs for the Intel 860 and that thing looked like a nightmare to write
a compiler for. Forget about optimizations: just getting floating
point results out of the FP pipeline requires that the compiler have
knowledge of the internals of the hardware. How many CISC chips do you
know that make the same requirements? (I'm probably going to be sorry
I asked that :-) But I think you know the point I'm trying to make.)

Just so I don't start a new war here, I'm not saying any of this is
bad. RISC seems to be proving its point. But I think that a RISC
compiler is NECESSARILY more complex than a CISC compiler, regardless
of the level of optimization performed by the compiler.

Jonathan Sweedler  ===  National Semiconductor Israel
UUCP:    ...!{amdahl,hplabs,decwrl}!nsc!taux01!cjosta
Domain:  cjosta@taux01.nsc.com
slackey@bbn.com (Stan Lackey) (07/19/89)
In article <2203@taux01.UUCP> cjosta@tasu77.UUCP (Jonathan Sweedler) writes:
>impression that RISC machines need more complex compilers, not just
>because of the need for more optimization, but because RISC chips are
>'dumber' than CISC chips...
>This means that interlocks must be handled in the compiler. So
>forgetting about all optimization, just writing a compiler that WORKS
>on a RISC chip is by DEFINITION harder than writing a compiler for a
>CISC chip.

Not necessarily. The winner of the 'dumb compiler that works' award
goes to one that just puts NOPs anywhere that could have a problem:
after loads, in branch delay slots...

>Correct me if I'm wrong, but I remember looking at some of the initial
>specs for the Intel 860 and that thing looked like a nightmare to write
>a compiler for. Forget about optimizations, just getting floating
>point results out of the FP pipeline, requires that the compiler have
>knowledge of the internals of the hardware.

This is true, and makes me believe that IEEE FP and RISC are
incompatible. Both major implementations (860 and 88000 (I'm glad I
can admit knowledge of the 88000 now)) show that this force-fit causes
problems. By RISC arguments, let software handle the exceptions. But
IEEE FP has so many exceptions, by its over-complex nature, that [not
the compiler really, the OS] absolutely must be able to handle said
hardware internals, and must do it efficiently; underflow happens more
often than you think.
-Stan
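The 'dumb compiler that works' policy can be sketched in a few lines
of C. This is a toy illustration with invented mnemonics and an
invented hazard rule (pad every load and every branch), not any real
chip's interlock specification:

```c
#include <string.h>

/* Return nonzero if 'op' is an instruction we blindly treat as
   hazardous: loads (result not ready next cycle) and branches
   (delay slot to fill).  The rule is deliberately crude.         */
static int needs_nop(const char *op) {
    return strncmp(op, "ld", 2) == 0 || op[0] == 'b';
}

/* Copy 'in' to 'out', inserting a "nop" after every hazardous op.
   Returns the padded length; 'out' must have room for 2*n entries.
   No analysis at all, yet the schedule is always safe.           */
static int pad_with_nops(const char *in[], int n, const char *out[]) {
    int m = 0;
    for (int i = 0; i < n; i++) {
        out[m++] = in[i];
        if (needs_nop(in[i]))
            out[m++] = "nop";
    }
    return m;
}
```

A smarter pass would try to move useful instructions into those slots
instead of NOPs; this sketch shows only the correctness floor.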
henry@utzoo.uucp (Henry Spencer) (07/19/89)
In article <2203@taux01.UUCP> cjosta@tasu77.UUCP (Jonathan Sweedler) writes:
>... I was always under the
>impression that RISC machines need more complex compilers, not just
>because of the need for more optimization, but because RISC chips are
>'dumber' than CISC chips...

As has been pointed out before (sigh), this is just not true. Most
compilers can make very little use of the extra complexity of CISCs;
simple code generation for a RISC may not exploit the full power of
the machine, but the result is often reasonably fast code.

>... Don't forget that MIPS, one of the early RISC
>chips, stands for 'Microprocessor without Interlocked Pipe Stages.'
>This means that interlocks must be handled in the compiler...

Things like software pipe interlocks, delayed branches, etc. can
almost always be deferred to a cleanup pass that is very stupid and
simple. (On some RISC chips, this sort of thing is done in the
assembler, not the compiler at all.) One of the surprises that came
out of RISC work is that moving some of these things into the software
really costs very little. You may occasionally miss an optimization
opportunity by not having the earlier passes aware of the final
rearrangement, but it's not really very frequent.

>Correct me if I'm wrong, but I remember looking at some of the initial
>specs for the Intel 860 and that thing looked like a nightmare to write
>a compiler for...

Hey, but look who made it. :-) I mean, you really expected the source
of the 8086 to produce a machine that was easy to generate code for?
Not everyone has made that big a mess of their RISC designs. The other
side of the coin, of course, is that if you are willing (as John
Mashey puts it) "to commit unspeakable acts", the i860 is REALLY FAST.
You get what you pay for.
--
$10 million equals 18 PM        | Henry Spencer at U of Toronto Zoology
(Pentagon-Minutes). -Tom Neff   | uunet!attcan!utzoo!henry henry@zoo.toronto.edu
jlg@lanl.gov (Jim Giles) (07/20/89)
From article <BRUCE.89Jul18181114@heather.pooh.com>, by bruce@heather.pooh.com (Bruce Robertson):
> These optimizations have been mentioned:
> [1] Value propagation (get values from where cheapest, remember what
>     values appear where and which quantities are equivalent)
> [2] Constant folding (also uses value prop.: A=5; B=A*A; would get
>     coded as if it were A=5; B=25;)
> [3] Common subexpression elimination
> [4] Dead code elimination (get rid of dead stores, etc.)

They may have been mentioned, but I didn't mention them. All the above
are machine independent (except perhaps [1], which relates to register
allocation). The optimizations I mentioned were:

 (A) Instruction selection (if several different sequences of
     instructions can perform a given action - which is best?)
 (B) Register allocation (store intermediate results in registers if
     at all possible).
 (C) Code ordering (issue the instructions in the order which best
     uses pipelining or other parallelism in the architecture).

Any one of the above is fairly easy to do in isolation. Any two of the
above are difficult to do because they interfere with each other. Only
CISC architectures even have to consider (A) since, in a RISC, this
step is _very_ simple. I claim that CISC compilers must do all three
of the above in order to be competitive. Hence, my claim that CISC is
harder to compile for than RISC.
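As a toy illustration of (A), here is a hypothetical
instruction-selection rule in C for multiplication by a constant, the
classic case where a code generator must choose among sequences. The
sequence names and the implied cost model are invented for this
sketch, not taken from any real compiler:

```c
/* Pick a code sequence for "x * k": a single shift when k is a
   power of two, a shift-plus-add when k = 2^n + 1, and a full
   multiply instruction otherwise.  Real selectors weigh per-machine
   cycle costs; this one just encodes the usual preference order.  */
static const char *select_mul_by_const(int k) {
    if (k > 0 && (k & (k - 1)) == 0)
        return "shift";       /* k is a power of two              */
    if (k > 1 && ((k - 1) & (k - 2)) == 0)
        return "shift+add";   /* k = 2^n + 1                      */
    return "mul";             /* fall back to the multiply op     */
}
```

Even this tiny rule interacts with (B) and (C): the shift+add
sequence needs a scratch register and occupies two issue slots, which
is exactly the interference the post describes.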
jlg@lanl.gov (Jim Giles) (07/20/89)
From article <2203@taux01.UUCP>, by cjosta@tasu77.UUCP (Jonathan Sweedler):
> [...] I was always under the
> impression that RISC machines need more complex compilers, not just
> because of the need for more optimization, but because RISC chips are
> 'dumber' than CISC chips. i.e. a lot of the complexity of the chip is
> moved to the software.

You are confusing complexity with quantity. RISC compilers usually
(but not always) generate more instructions for a given task than CISC
compilers do. However, the task of generating code for RISC is _much_
easier than generating code for CISC. So, the RISC compiler is simpler
than the CISC compiler even if it does produce more output.

> [...] Don't forget that MIPS, one of the early RISC
> chips, stands for 'Microprocessor without Interlocked Pipe Stages.'
> This means that interlocks must be handled in the compiler. So
> forgetting about all optimization, just writing a compiler that WORKS
> on a RISC chip is by DEFINITION harder than writing a compiler for a
> CISC chip.

You stated that incorrectly. Forgetting about all optimization, just
writing a compiler that WORKS on a non-interlocked chip is by
DEFINITION harder than writing a compiler for an interlocked chip.
This doesn't necessarily have anything to do with RISC vs. CISC.
[...]

> Just so I don't start a new war here, I'm not saying any of this is
> bad. RISC seems to be proving its point. But I think that a RISC
> compiler is NECESSARILY more complex than a CISC compiler, regardless
> of the level of optimization performed by the compiler.

The strange fact is: for a given degree of optimization, it is harder
to compile for CISC than RISC. This is hardly surprising. Computers
are real good at simple repetitive tasks (like generating code for
RISC machines). Complex tasks (like making good use of a rich
instruction set on a CISC) require more effort on the part of the
_programmer_.
mash@mips.COM (John Mashey) (07/20/89)
In article <42942@bbn.COM> slackey@BBN.COM (Stan Lackey) writes:
....
>>Correct me if I'm wrong, but I remember looking at some of the initial
>>specs for the Intel 860 and that thing looked like a nightmare to write
>>a compiler for. Forget about optimizations, just getting floating
>>point results out of the FP pipeline, requires that the compiler have
>>knowledge of the internals of the hardware.
>This is true, and makes me believe that IEEE FP and RISC are
>incompatible. Both major implementations (860 and 88000 (I'm glad I
>can admit knowledge of the 88000 now)) show that this force-fit causes
>problems. By RISC arguments, let sw handle the exceptions. But IEEE
>FP has so many exceptions by its over-complex nature that [not the
>compiler really, the OS] absolutely must be able to handle said
>hardware internals, and must do it efficiently; underflow happens
>more often than you think.

This seems to be an over-generalization. Maybe this means that 860s
and 88000s have problems (beyond whether or not these are the 2
"major" implementations :-). I assume these refer to the issues of
dealing with imprecise exceptions, which is certainly possible, if
complicated (or extremely complicated). Anyway, please explain more
about why IEEE FP and RISC are incompatible, as the RISCs that are
already out there in systems manage to deal with it....
--
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
jac@paul.rutgers.edu (Jonathan A. Chandross) (07/20/89)
cjosta@tasu77.UUCP (Jonathan Sweedler) writes:
> Don't forget that MIPS, one of the early RISC
> chips, stands for 'Microprocessor without Interlocked Pipe Stages.'
> This means that interlocks must be handled in the compiler. So
> forgetting about all optimization, just writing a compiler that WORKS
> on a RISC chip is by DEFINITION harder than writing a compiler for a
> CISC chip.

Actually, this is usually handled by a recoding assembler. You don't
really want to handle it in your compiler unless your compiler
integrates the assembly phase and emits .o files directly. The reason
is that you have to do some analysis on the whole program when you are
finished to try to fill the delayed branch slots, and to reorder
instructions, inserting no-ops where necessary, to avoid pipeline
conflicts. So you might as well do it then.

And writing a RISC compiler isn't ``by DEFINITION'' harder: you just
pad out everything with no-ops (both delayed branch slots and the
actual instructions, to ensure that the pipeline flushes). Pretty poor
object code, but you don't have to do any analysis at all.

Writing a compiler that makes use of the non-trivial features of a
machine, RISC or CISC, is very tricky. For instance, I still haven't
found out how I can generate code to use all those nice BCD
instructions in my 68k.....

Jonathan A. Chandross
Internet: jac@paul.rutgers.edu
UUCP: rutgers!paul.rutgers.edu!jac
mash@mips.COM (John Mashey) (07/20/89)
In article <2203@taux01.UUCP> cjosta@tasu77.UUCP (Jonathan Sweedler) writes:
.....
>Correct me if I'm wrong, but I remember looking at some of the initial
>specs for the Intel 860 and that thing looked like a nightmare to write
>a compiler for. Forget about optimizations, just getting floating
>point results out of the FP pipeline, requires that the compiler have
>knowledge of the internals of the hardware. How many CISC chips do
>you know that make the same requirements? (I'm probably going to be
>sorry I asked that :-) But I think you know the point I'm trying to make).
>
>Just so I don't start a new war here, I'm not saying any of this is
>bad. RISC seems to be proving its point. But I think that a RISC
>compiler is NECESSARILY more complex than a CISC compiler, regardless
>of the level of optimization performed by the compiler.

This topic has been argued frequently in this newsgroup. Whether or
not an i860 is hard to compile for proves zero about whether any given
other RISC is hard to compile for. It is hard to get data upon this
topic, but there is a little.

First, perhaps it would be good to hear from people who have WRITTEN
compilers for different architectures and see what THEY think
[please].

Second, I'll cite one study, by an unbiased source, especially since
folks have been asking about it, since I used a few of the foils at
Hot Chips as an answer to a question (not part of the talk, and if
A.K. would send me his address, I'll send him a copy):

  Funk, Bud K, "RISC and CISC benchmarks and insights", UNISYS World,
  January 1989, 11-14.

They evaluated 8 chips {80386, 32532, 68020, 68030, SPARC, 88000,
29000, R3000}, running the same {small} integer benchmarks written in
both {C, assembly language}, using the same external memory, to keep
things as fair as possible. [This is hard work, by the way, which is
why you don't see more of this kind of study.]

1) First, they handcoded the small benchmarks in assembler for all
machines, and called the 68020 == 1 (bigger is better). The CISC
performance ranged from about .75 to 1.1; the 4 RISCs were about
2-2.5X faster.

2) They then compiled the C code. The CISCs dropped about 50-60% in
performance; the RISCs dropped 20-40% in performance, so the resulting
ratio was that the RISCs ended up with a 4-5.5X performance advantage.

3) OF COURSE THESE ARE SMALL BENCHMARKS, AND SO ALL THE USUAL CAVEATS
APPLY, and I don't know (except in one case :-) exactly whose
compilers were being used, etc, etc. However, the group doing these
evaluations is NOT a purveyor of RISC chips or compilers, and is
trying to do a straight evaluation.

A quote from the article: "The RISC compilers are relatively new and
still significantly better than their CISC counterparts." Now, if that
statement is true, I observe that it is NOT from lack of competent
people, working for years, to build good compilers for the CISCs.......
--
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD: 	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
schow@bnr-public.uucp (Stanley Chow) (07/20/89)
In article <23663@winchester.mips.COM> mash@mips.COM (John Mashey) writes:
>
>They evaluated 8 chips {80386, 32532, 68020, 68030, SPARC, 88000, 29000,
>R3000}, ....

Just a reminder that the CISCs here are by no means representative of
CISC architecture. [Who cares about the actual number sold.]
Conclusions reached on a small sample of architectures cannot be
generalized to *all* CISC or RISC machines.

Stanley Chow                BitNet:  schow@BNR.CA
BNR                         UUCP:    ..!psuvax1!BNR.CA.bitnet!schow
(613) 763-2831                       ..!utgpu!bnr-vpa!bnr-fos!schow%bnr-public
Me? Represent other people? Don't make them laugh so hard.
slackey@bbn.com (Stan Lackey) (07/20/89)
In article <23659@winchester.mips.COM> mash@mips.COM (John Mashey) writes:
>In article <42942@bbn.COM> slackey@BBN.COM (Stan Lackey) writes:
>>This is true, and makes me believe that IEEE FP and RISC are
>>incompatible.
>
>This seems to be an over-generalization.
>Anyway, please explain
>more why IEEE FP and RISC are incompatible, ...

Sure. As I understand it, the major basis of RISC is:

  Simplify the hardware. Include only those features that:
    Provably improve performance;
      Their presence speeds up certain operation(s)
      Their presence does not slow down other operation(s)
    Are straightforward to implement;
    Are usable by straightforward compiler design;
  Anything else, emulate in software.

[I'm going to let it stand here. Probing more deeply into any of the
above is a separate discussion.]

IEEE floating point goes contrary to each of those issues.

1. Gradual underflow is not straightforward to implement. It requires
more cases to be handled in virtually every FP operation, for example
pre-alignment before multiplication. Now you can leave time before
every multiplication for this prealignment (Cydrome), take a special
control flow to handle the case (68881 etc.), just zero the
operands/results and ignore it (essentially what DEC, Cray, IBM, etc.
do), or trap to the OS (i860). BTW, [at least] DEC has
trap-on-underflow. Guess what happens real quick? Users shut it off.
[Was that necessary? Bad boy. -Ed.]

Gradual underflow does not improve performance. It [inaccurately]
extends the low end of the range a little. It adds significantly to
design and verification time. The only case I ever heard of it
mattering is a drummed-up case Kahan wrote.

This does not affect the compiler, other than requiring the runtime
libraries to be able to convert yet another format on output. In the
case of an i860 software supplier, if he wants stuff like
transcendentals to run fast, he must tune them so that they don't
cause underflow traps. Rather than just use 0, which is what he wants
anyway.

2. Rounding modes. No compiler on Earth uses them, other than to get
around instruction set deficiencies (rounding when converting from FP
to integer). This particular problem is more efficiently solved by
adding a second version of the convert-to-integer instruction. Their
use has been intended for range arithmetic. Now, when I asked, lots of
people knew what they were for, but no one [on comp.arch] had ever
heard of them having ever been used. Interesting how much hardware out
there can do these operations that are so rarely used. Kind of like
the decimal string instruction set on the VAX, except that the VAX
decimal strings are occasionally used. [There you go again! -Ed.]
Their implementation requires not much logic (maybe 50-100 gates), but
logic that affects critical path timing. It requires yet more
verification (test vectors).
--
My belief is that if IEEE FP had been done a few years after it
actually was, after RISC became the trend, it would have come out much
different. In fact, I think the suppliers should get an official
subset recognized by the IEEE that just uses 0 for underflowed
operands/results, and just implements round-to-nearest. I would be
very surprised if it ever came up in a sales situation anyway; in fact
all I care about is how fast it runs SPICE.

I'm done. [Finally! -Ed.]
-Stan
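For concreteness, the gradual-underflow behavior being argued over can
be observed directly in C on any machine with IEEE-754 doubles
(assuming the usual default, i.e. that flush-to-zero has not been
enabled):

```c
#include <assert.h>
#include <float.h>

/* With IEEE-754 gradual underflow, halving the smallest normalized
   double yields a nonzero subnormal rather than flushing to zero,
   and the halving is exact, so doubling recovers DBL_MIN exactly.
   Under the "just use 0" (abrupt underflow) subset proposed above,
   x would be 0.0 and the last check would fail.                  */
static void gradual_underflow_demo(void) {
    volatile double x = DBL_MIN / 2.0;  /* subnormal under IEEE 754 */
    assert(x > 0.0);                    /* not flushed to zero      */
    assert(x < DBL_MIN);                /* below the normal range   */
    assert(x * 2.0 == DBL_MIN);         /* the halving was exact    */
}
```

This exactness near zero is what the pro-GU side of the thread is
defending; the anti-GU side is objecting to the hardware cases needed
to make it work.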
khb@gammara.Sun.COM (gammara) (07/20/89)
In article <43004@bbn.COM> slackey@BBN.COM (Stan Lackey) writes: >Sure. As I understand it, the major basis of RISC is: > >Simplify the hardware. Include only those features that: > Provably improve performance; > Their presence speeds up certain operation(s) > Their presence does not slow down other operation(s) > Are straightforward to implement; > Are usable by straightforward compiler design; > Anything else emulate in software. Getting the _right_ answer should be on the list. > >IEEE floating point goes contrary to each of those issues. > >1. Gradual underflow is not straightforward to implement. > >Gradual underflow does not improve performance. It [inaccurately] >extends the low end of the range a little. It adds significantly to >design and verification time. The only case I ever heard of it >mattering is a drummed-up case Kahan wrote. GU is one of the MOST useful parts of the standard. It, in fact, allows many algorithms to run in 32-bits that would have required 64-bits (or more) with, say, IBMath. Kahan is one of the most skillful folk at finding SIMPLE examples ... real codes are much too hard to fully analyze ... and that is exactly why GU _should_ be employed for all 32-bit machines. While purists will deep fry me in oil (or should), I can see the great utility of a 64-bit non-compliant mode ... there are many codes which are run in 64-bits for hysterical reasons (broke on a 360 therefore it must always be DP ...) which get along just fine with abrupt underflow. AU is sometimes desired in the 32-bit realm, but there one really should analyze the whole program (or run just selected parts in AU). This can be accomplished on SunOS 4.x with a runtime call. > >In the case of an i860 software supplier, if he wants stuff like >transcendentals to run fast, he must tune them so that they don't >cause underflow traps. Rather that just use 0 which is what he wants >anyway. Depends on the algorithm. If 0 is what they want, one can toggle back and forth. > >2. 
>2. Rounding modes. No compiler on Earth uses them, other than to get
>around instruction set deficiencies (rounding when converting from
>FP to integer). This particular problem is more efficiently solved by
>adding a second version of the convert-to-integer instruction.

The folks who do interval arithmetic would strongly disagree here. For
better or worse, most of those folks seem to live in Europe. I have
seen versions of Pascal and Fortran extended to use 'em.

>--
>My belief is that if IEEE FP had been done a few years after it
>actually was, after RISC became the trend, it would have come out much
>different. In fact, I think the suppliers should get an official

Getting the right answer is important. When dealing with FP problems
one has at LEAST the following issues to sweat:
 1) is the algorithm stable?
 2) is it robust? Requires a very hard backwards analysis to prove.
 3) is the implementation stable?
 4) is it robust? This requires massive testing.
 5) is the problem ill conditioned? The symptoms are EXACTLY those of
    1-4 being broken.
 6) is my computer arithmetic doing something funny? The symptoms are
    EXACTLY those of 1-5 being broken.

Using IEEE math removes 6 from those things which one must sweat
bullets over when your 100+K line application is acting wonky. The
cost to implementors is non-trivial, but the savings to application
specialists is huge ... and they need not even understand that this is
happening on their behalf ... they just get fewer strange edge
conditions.

>subset recognized by the IEEE that just uses 0 for underflowed
>operands/results, and just implements round-to-nearest. I would be
>very surprised if it ever came up in a sales situation anyway; in fact
>all I care about is how fast it runs SPICE.

You should also care that the circuit is analyzed correctly. The sad
fact is that this is exceedingly hard to benchmark. That is why we
need the standard.
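The interval-arithmetic use of directed rounding is easy to sketch.
Python cannot portably switch the hardware FP rounding mode, so this
illustration uses the decimal module, whose contexts do support
directed rounding; the `Interval` class is hypothetical, not from any
library:

```python
# Interval addition: round the lower bound toward -infinity and the
# upper bound toward +infinity, so the exact result is guaranteed to
# lie inside the computed interval.
from decimal import Decimal, Context, ROUND_FLOOR, ROUND_CEILING

DOWN = Context(prec=4, rounding=ROUND_FLOOR)    # round toward -infinity
UP   = Context(prec=4, rounding=ROUND_CEILING)  # round toward +infinity

class Interval:
    """[lo, hi], guaranteed to contain the exact value."""
    def __init__(self, lo, hi):
        self.lo, self.hi = Decimal(lo), Decimal(hi)

    def __add__(self, other):
        return Interval(DOWN.add(self.lo, other.lo),
                        UP.add(self.hi, other.hi))

one   = Interval("1.0", "1.0")
third = Interval("0.3333", "0.3334")   # encloses 1/3 at four digits
s = one + third                        # [1.333, 1.334], encloses 4/3
```

The same idea with IEEE round-toward-minus/plus-infinity modes is what
the European Pascal and Fortran extensions mentioned above provide.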
Kahan has a paper or two on presubstitution, a technique which allows
one to avoid a lot of the expensive precise-exception machinery for a
huge number of applications.

Keith H. Bierman    |*My thoughts are my own. Only my work belongs to Sun*
It's Not My Fault   | Marketing Technical Specialist ! kbierman@sun.com
I Voted for Bill &  | Languages and Performance Tools.
Opus                (* strange as it may seem, I do more engineering now *)
bcase@cup.portal.com (Brian bcase Case) (07/21/89)
>>Correct me if I'm wrong, but I remember looking at some of the initial
>>specs for the Intel 860 and that thing looked like a nightmare to write
>>a compiler for. Forget about optimizations, just getting floating
>>point results out of the FP pipeline, requires that the compiler have
>>knowledge of the internals of the hardware.
>
>This is true,

This is not true; the i860 has non-pipelined FP instructions that don't
require any knowledge of the pipeline. Of course, if you use them, you
don't get the performance....
seanf@sco.COM (Sean Fagan) (07/21/89)
In article <BRUCE.89Jul18181114@heather.pooh.com> bruce@heather.pooh.com (Bruce Robertson) writes:
>Shouldn't all these optimizations be done in the machine independent
>portion of the compiler?

Yes, all of the optimizations you mentioned (which I deleted) *should*
be done in the mi part of the compiler. For compilers which are
intended to support multiple architectures (remember, not all are), it
usually is.

>A well designed compiler should have
>all the machine dependencies isolated in one relatively small module
>(relative to the size of the entire compiler). RISCy things like delay
>slot filling are easiest to handle in the assembler, or other post
>processor.

Except that delay-slot filling is not the only type of optimization
that exists. On some machines (such as a CDC Cyber), issuing certain
instructions right after one another will slow things down, while on
other machines the same pairing can speed things up. Specifically:
memory accesses and various register combinations. Some
microprocessors can be optimized to do certain kinds of memory fetches
faster, while the Cyber does sequential accesses best. Also, something
like:

	FX6	X5*X3
	FX7	X6/X1

is bad practice on a Cyber (put other instructions in between). On
some other machines (specifically VLIWs and their ilk), that could be
reduced to a pseudo-one instruction which did:

	X6 <- X5*X3, X7 <- <multiply unit> / X1

which would execute in parallel as much as possible (I know, I know,
that's really a bad example. I had some better code than that lying
around, but I forget where I put it).

Then there's the choice of instruction selection. Sometimes, which
instructions you choose can change some of your "machine independent"
optimizations. Also, you have to choose registers properly (is it
better to put array addresses into registers, or would it be better to
put other variables in registers and let the hardware compute the
array address? For something like a '486, this could be a tough
choice...).
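The "put other instructions in between" advice is exactly what a list
scheduler automates. A minimal Python sketch, with made-up latencies
and a hypothetical instruction tuple format, shows the scheduler
pulling the independent add in between the dependent multiply/divide
pair:

```python
def list_schedule(prog):
    """prog: list of (name, dest, srcs, latency); each dest written once.
    Greedy list scheduling: each cycle, issue the first unissued op whose
    source operands are available; if none is, stall one cycle.
    Returns the issue order and the cycle after the last issue."""
    produced = {dest: i for i, (_, dest, _, _) in enumerate(prog)}
    avail = {}                       # reg -> cycle its value becomes usable
    pending = list(range(len(prog)))
    order, cycle = [], 0

    def ready(i):
        for r in prog[i][2]:
            p = produced.get(r)      # None means a live-in, always ready
            if p is not None and (p in pending or avail[r] > cycle):
                return False         # producer not issued, or still in flight
        return True

    while pending:
        for i in pending:
            if ready(i):
                pending.remove(i)
                order.append(prog[i][0])
                avail[prog[i][1]] = cycle + prog[i][3]
                break
        else:
            cycle += 1               # nothing ready: the pipeline stalls
            continue
        cycle += 1                   # one instruction issues per cycle
    return order, cycle

prog = [
    ("mul", "X6", ("X5", "X3"), 3),  # FX6 X5*X3
    ("div", "X7", ("X6", "X1"), 8),  # FX7 X6/X1, needs the mul result
    ("add", "X2", ("X0", "X4"), 1),  # independent work
]
order, t = list_schedule(prog)       # add lands between mul and div
```

In program order the divide would stall two cycles behind the multiply;
the scheduler hides one of them with the independent add.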
In other words, life isn't as simple as we would like it to be...
-- 
Sean Eric Fagan  | "Uhm, excuse me..."
seanf@sco.UUCP   |   -- James T. Kirk (William Shatner), ST V: TFF
(408) 458-1422   | Any opinions expressed are my own, not my employers'.
stevev@tekchips.LABS.TEK.COM (Steve Vegdahl) (07/21/89)
In article <1989Jul19.165456.20210@utzoo.uucp>, henry@utzoo.uucp (Henry Spencer) writes:
> Things like software pipe interlocks, delayed branches, etc. can almost
> always be deferred to a cleanup pass that is very stupid and simple. (On
> some RISC chips, this sort of thing is done in the assembler, not the
> compiler at all.) One of the surprises that came out of RISC work is
> that moving some of these things into the software really costs very
> little. You may occasionally miss an optimization opportunity by not
> having the earlier passes aware of the final rearrangement, but it's
> not really very frequent.

The early RISC compiler work (e.g., Gross) seems to have suggested
this, but later work (e.g., Bird's thesis from U. Michigan) has shown
that doing register allocation before instruction scheduling will
often cause code quality to be much poorer. The register allocator
generally wants to do all of one computation, then all of the next,
etc., to minimize the number of concurrently live variables. The
instruction scheduler generally wants to interleave "unrelated"
computations to minimize the effects of memory latency and such. This
issue came up when we were considering instruction scheduling for our
88000 Scheme compiler. (See our paper in the ASPLOS-III proceedings.)

What happens in practice when register allocation is done first is
that the register assignments introduce a lot of artificial
constraints on the scheduler, causing it to do a much poorer job. In
other words, the assignment of common registers to computations that
could otherwise be interleaved makes such interleaving impossible.

Steve Vegdahl
Computer Research Lab
Tektronix Labs
Beaverton, Oregon
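The artificial constraints are easy to count. A small sketch, assuming
a toy (dest, srcs) instruction format: reusing one register for two
otherwise-independent computations adds anti- and output-dependence
edges that a scheduler without register renaming must honor.

```python
def dependence_edges(prog):
    """prog: list of (dest, srcs).  Return the set of (i, j) ordering
    edges (i must stay before j): true (read-after-write), anti
    (write-after-read), and output (write-after-write) dependences.
    A conservative pairwise check, good enough for this sketch."""
    edges = set()
    for j, (dj, sj) in enumerate(prog):
        for i, (di, si) in enumerate(prog[:j]):
            if di in sj or dj in si or dj == di:
                edges.add((i, j))
    return edges

# Two independent computations; the allocator reused r1 for both:
reused = [("r1", ("a",)), ("x", ("r1",)),
          ("r1", ("b",)), ("y", ("r1",))]
# The same computations with distinct registers:
distinct = [("r1", ("a",)), ("x", ("r1",)),
            ("r2", ("b",)), ("y", ("r2",))]
```

With the shared register the four instructions are chained into a
single serial order (5 edges); with distinct registers only the two
true dependences remain (2 edges), so the pairs can be interleaved.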
stevev@tekchips.LABS.TEK.COM (Steve Vegdahl) (07/21/89)
In article <13995@lanl.gov>, jlg@lanl.gov (Jim Giles) writes:
> (A) Instruction selection ...
> (B) Register allocation ...
> (C) Code ordering ...
> Any one of the above is fairly easy to do in isolation. Any two of the
> above are difficult to do because they interfere with each other. Only
> CISC archetectures even have to consider (A) since, in a RISC, this
> step is _very_ simple. I claim that CISC compilers must do all three
> of the above in order to be competitive. Hence, my claim that CISC is
> harder to compile for than RISC.

On the other hand, (C) is generally much more difficult for a RISC
architecture, due to architectural features such as delayed branches
and exposed memory latency. These features tend not to pose difficult
problems when treated in isolation---as was done in the early RISC
compiler work---but cause additional complications when combined with
other phases.

I think that the RISC vs. CISC compiler issue is more or less a
toss-up. With either type of architecture, you can write a *simple*
compiler by compiling:
  * to a RISC subset
  * without code reordering
For higher quality code, CISC compilers are complicated by the need to
consider complex addressing modes, while RISC compilers are
complicated by the need to consider delayed branches and memory
latency. Both kinds of complications have interdependencies with other
aspects of optimization.

Steve Vegdahl
Computer Research Lab
Tektronix Labs
Beaverton, Oregon
gideony@microsoft.UUCP (Gideon Yuvall) (07/22/89)
In article <43004@bbn.COM> slackey@BBN.COM (Stan Lackey) writes:
>2. Rounding modes. No compiler on Earth uses them, other than to get
>around instruction set deficiencies (rounding when converting from
>FP to integer). This particular problem is more efficiently solved by
>adding a second version of the convert-to-integer instruction.
>
>Their use has been intended for range arithmetic. Now, when I asked,
>lots of people knew what they were for, but no one [on comp.arch] had
>ever heard of them having ever been used. Interesting how much ...

... K.C.Ng's square-root code (part of 4.3BSD; look under the "#if 0"
conditional) uses rounding modes to compute correct SQRTs.
-- 
Gideon Yuval, gideony@microsoft.UUCP, 206-882-8080 (fax:206-883-8101;TWX:160520)
mash@mips.COM (John Mashey) (07/22/89)
In article <4356@tekcrl.LABS.TEK.COM> stevev@tekchips.LABS.TEK.COM (Steve Vegdahl) writes:
>In article <13995@lanl.gov>, jlg@lanl.gov (Jim Giles) writes:
> ...
>> (C) Code ordering
> ...
>> Any one of the above is fairly easy to do in isolation. Any two of the
>> above are difficult to do because they interfere with each other. Only
>> CISC archetectures even have to consider (A) since, in a RISC, this
>> step is _very_ simple. I claim that CISC compilers must do all three
>> of the above in order to be competitive. Hence, my claim that CISC is
>> harder to compile for than RISC.
>
>On the other hand, (C) is generally much more difficult for a RISC
>architecture, due to architectural features such as delayed branches and
>exposed memory latency. These features tend not to pose difficult
>problems when treated in isolation---as was done in the early RISC
>compiler work---but cause additional complications when combined with
>other phases.

1) As I recall, our first reorganizer was written and debugged by one
person (Larry Weber) in a week. Admittedly he'd written several
before. It certainly has been improved and rewritten since, but it
lasted a good while.

2) It is clear that code reordering varies in usefulness from
architecture to architecture, BUT the boundary is not RISC-vs-CISC; it
is determined by the presence or absence of various features and the
ways compilers use them. In general, any machine with nontrivial
pipelines and/or separate functional units may well benefit. Note that
CDCs did all this years ago. Also, I heard long ago that people were
able to get >5% improvement on S/370-architecture systems by
code-reordering. Maybe somebody has a reference. Also, maybe somebody
from Amdahl or IBM would care to offer some information on what the
pipelines and functional units look like on their current products,
because I'd suspect there are places where code rearrangement is
usable.
I'd strongly suspect that most machines that could execute:

	load	reg1,a
	load	reg2,b
	add	reg1,20
	sub	reg2,30

are better off doing it that way than the following:

	load	reg1,a
	add	reg1,20
	load	reg2,b
	sub	reg2,30

because most machines will stall on the add/sub more in the second case.
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
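The stall difference can be checked with a toy single-issue pipeline
model in Python. The latencies are made up for illustration (loads
usable two cycles after issue, ALU results after one), not those of any
particular machine:

```python
LOAD_LAT, ALU_LAT = 2, 1

def issue_cycles(prog):
    """prog: list of (dest, srcs, latency).  One instruction issues per
    cycle, but only once all its sources are available; return the
    cycle count after the last issue (stalls included)."""
    avail, cycle = {}, 0
    for dest, srcs, lat in prog:
        cycle = max([cycle] + [avail.get(r, 0) for r in srcs])  # stall
        avail[dest] = cycle + lat    # result usable lat cycles after issue
        cycle += 1
    return cycle

# load reg1,a / load reg2,b / add reg1,20 / sub reg2,30
interleaved = [("r1", (), LOAD_LAT), ("r2", (), LOAD_LAT),
               ("r1", ("r1",), ALU_LAT), ("r2", ("r2",), ALU_LAT)]

# load reg1,a / add reg1,20 / load reg2,b / sub reg2,30
paired = [("r1", (), LOAD_LAT), ("r1", ("r1",), ALU_LAT),
          ("r2", (), LOAD_LAT), ("r2", ("r2",), ALU_LAT)]
```

Under this model the interleaved order issues in 4 cycles with no
stalls, while the paired order takes 6 because each ALU op waits on the
load immediately before it.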
brooks@vette.llnl.gov (Eugene Brooks) (07/23/89)
In article <4355@tekcrl.LABS.TEK.COM> stevev@tekchips.LABS.TEK.COM (Steve Vegdahl) writes:
>The register allocator generally wants to do all of one computation, then
>all of the next, etc. to minimize the number of concurrently live variables.
>The instruction scheduler generally wants to interleave "unrelated"
>computations to minimize the effects of memory latency and such. This issue
>came up when we were considering instruction scheduling for our 88000 Scheme
>compiler. (See our paper in the ASPLOS-III proceedings).
>
>What happens in practice when register allocation is done first, is that the
>register assignments introduce a lot of artificial constraints on the
>scheduler, causing it to do a much poorer job. In other words, the
>assignment of common registers to computations that could otherwise be
>interleaved makes such interleaving impossible.

We dealt with this problem in the Cerberus multiprocessor simulator by
simply having the register allocator get registers from the scratch
pile in a round-robin manner, instead of starting upwards from R0 in
the search for a free one each time. This minimized the number of
"false dependencies" and allowed a postprocessing scheduler to work
quite well. In the limit of large register counts it is of course
perfect (256 registers is effectively an infinite number), but in
practice it worked quite well at 16 or 32 registers. If one is
constrained to do the scheduling in a machine-dependent postprocessor,
with the register allocation done in a PCC- or GCC-like portable
compiler, this technique is quite useful and produces well scheduled
code.

brooks@maddog.llnl.gov, brooks@maddog.uucp
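The round-robin trick fits in a few lines. A minimal sketch (the class
is hypothetical, not Cerberus code): the allocator resumes its scan
just past the last register handed out, so a just-freed register is
reused as late as possible.

```python
class RoundRobin:
    """Hand out scratch registers cyclically rather than always taking
    the lowest-numbered free one, so a freshly released register is not
    immediately redefined (which would create a false dependence for a
    postpass scheduler)."""
    def __init__(self, n):
        self.n, self.free, self.next = n, set(range(n)), 0

    def alloc(self):
        for k in range(self.n):
            r = (self.next + k) % self.n   # scan forward from last grant
            if r in self.free:
                self.free.discard(r)
                self.next = (r + 1) % self.n
                return r
        raise RuntimeError("no free register: would need to spill")

    def release(self, r):
        self.free.add(r)

rr = RoundRobin(4)
a = rr.alloc()       # R0
rr.release(a)        # value in R0 is now dead
b = rr.alloc()       # R1: the just-freed R0 is skipped
# A lowest-numbered-first allocator would hand out R0 again here,
# serializing the new computation behind the old one.
```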
sbf10@uts.amdahl.com (Samuel Fuller) (07/26/89)
In article <23802@winchester.mips.COM>, mash@mips.COM (John Mashey) writes:
> 2) It is clear that code reordering varies in usefulness from architecture
> to architecture, BUT the boundary is not RISC-vs-CISC, but determined
> by presence or absence of various features and the ways compilers use them.
> In general, any machine with nontrivial pipelines and/or separate
> functional units may well benefit. Note that CDCs did all this years ago.
> Also, I heard long ago that people were able to get >5% improvement
> on S/370-architecture systems by code-reordering. Maybe somebody has
> a reference. Also, maybe somebody from Amdahl or IBM would care to
> offer some information on what the pipelines and functional units
> look like on their current products, because I'd suspect there are places
> where code rearrangement is usable.
>
> I'd strongly suspect that most machines that could execute:
> 	load	reg1,a
> 	load	reg2,b
> 	add	reg1,20
> 	sub	reg2,30
> are better off doing it that way than the following:
> 	load	reg1,a
> 	add	reg1,20
> 	load	reg2,b
> 	sub	reg2,30
> because most machines will stall on the add/sub more in the second case.

Some experiments have been conducted here at Amdahl wrt code
reordering. Fairly simple code rearrangements have produced
approximately 5% improvement in execution speed.

Current Amdahl machines have 5- or 6-stage pipelines depending on the
model. Although the machines are microcoded, most instructions take
just one flow down the pipeline. The microcoded approach does not
impact the cycle time (10ns) because the cycle time is determined
mostly by the cache access latency. If data is in the cache, execution
proceeds without delay. Access to cache and to registers takes the
same amount of time, although a register-register architecture would
probably allow a shorter pipe.
-- 
---------------------------------------------------------------------------
Sam Fuller / Amdahl System Performance Architecture
I speak for myself, from the brown hills of San Jose.
UUCP: {ames,decwrl,uunet}!amdahl!sbf10 | USPS: 1250 E. Arques Ave (M/S 139)
INTERNET: sbf10@amdahl.com             |       P.O. Box 3470
PHONE: (408) 746-8927                  |       Sunnyvale, CA 94088-3470
---------------------------------------------------------------------------