aglew@ccvaxa.UUCP (09/23/87)
| cik@l.cc.purdue.edu (Herman Rubin)
| There are many instructions which are easy to implement in hardware, but
| for which software implementation may even be so costly that a procedure
| using the instruction may be worthless. Some of these instructions have
| been implemented in the past and have died because the ill-designed
| languages do not even recognize their existence. Others have not been
| included due to the non-recognition of them by the so-called experts and
| by the stupid attitude that something should not be implemented unless
| 99.99% of the users of the machine should be able to want the instruction
| _now_. As you can tell from this article, I consider the present CISC
| computers to be RISCy.

Well, I'm definitely a member of the RISC camp, but I do think that Herman
has raised two good points:

(1) there are instructions that are not in the 99% usage group that can
    still be more efficiently implemented in hardware than software.

(2) languages should provide better (more efficient) ways of interfacing
    to machine features, without the expense of wrapping them in procedure
    call overhead.

(1) Remember, frequency of use is not the fundamental criterion for
inclusion of an instruction. It's more like

    (number of times the operation is required)
      * (speed of software implementation) / (speed of hardware implementation)
      / (slowdown of other instructions...)

with (customers who simply want a benchmark that can make good use of that
particular instruction) factored in. Of course, these are highly nonlinear
functions, and frequency of use is often a good approximation, other things
being equal. But other things are not always equal. "Unusual" instructions
do not always mean microcode; if they can be implemented combinatorially,
especially without state machines, then there may be no slowdown apart from
wiring effects. And they may be extremely slow to do in software.
One of my favorite examples of this is the bit-reversed indexing that is so
convenient for FFT applications. Bit-reversing an 8 or 16 bit address isn't
too bad, but some applications are getting into the 32 bit range, and may
be passing that soon. Even with table lookup in decent-sized tables, bit
reversal is expensive in software - and yet it can be done easily in
hardware. If FFTs are a primary application for your system, it may be
worthwhile to look at bit reversal or carry-reversed addition. Even if they
aren't, the instruction may be so cheap to implement that it is worth
including - because it may turn out to be important in the future. Of
course, there is a designer's trap here: each additional instruction may
appear cheap, but the cost of a horde of extra instructions may exceed the
sum of their individual costs, because you have exceeded a hard limit of
your implementation, like chip size.

I try to avoid instructions that require sequencing, but occasionally amuse
myself by thinking of purely combinatorial operations that might be useful,
and cheap to implement. Like bit reversal, or counting the number of set
bits in a word, or branching (ooops) based on the uppermost three bits.
Unfortunately, there aren't too many.

(2) The second point, about interfacing to high-level languages, is more
important. Say that I do have a code that needs to count the number of set
bits in a bitstring, and it already uses a POP function. Well, I can simply
replace the POP function by my POP instruction, can't I? Unfortunately,
most $%^^#@!!! languages do not let you; you either have to wrap the POP
instruction in a function call, which loses most of its benefit, or you
have to use asms after putting stuff into register variables that you
*KNOW* the compiler maps to R7 and R6... Not good either way.
It would be nice if the compiler could be made to know about every
instruction in the machine, even though it didn't generate code for them;
it would be nice if you could say

    count = asm_pop(bitreg)

and the compiler could say "Oh, he wants to use the pop instruction that
this strange machine provides. I don't know how to use it myself, but I
know how to arrange for the programmer to use it. Let's see, it takes an
input in a register and puts an output into a register. He wants pop of
bitreg - well, that's in memory, so it needs to be fetched into a register.
There's no register free, so I'll have to save R7. Now, he wants the
result in count. Oh, that's already in a register, R3. Well, I just have
to unsave R7, and now I've got

    save R7
    R7 = bitreg
    pop R7 -> count
    unsave R7

Well, that's a nice bit of code that I wouldn't have been able to produce
automatically, but at least I was able to arrange the register use for my
programmer."

I've just received a UNIX PC in the fire sale, and I'm told that the
inline assembler functions can be made to do something like this, so maybe
the world is slowly getting better.

Andy "Krazy" Glew. Gould CSD-Urbana.    USEnet:  ihnp4!uiucdcs!ccvaxa!aglew
1101 E. University, Urbana, IL 61801    ARPAnet: aglew@gswd-vms.arpa

I always felt that disclaimers were silly and affected, but there are
people who let themselves be affected by silly things, so: my opinions are
my own, and not the opinions of my employer, or any other organisation
with which I am affiliated. I indicate my employer only so that other
people may account for any possible bias I may have towards my employer's
products or systems.
lamaster@pioneer.arpa (Hugh LaMaster) (09/29/87)
In article <28200048@ccvaxa> aglew@ccvaxa.UUCP writes:
>
> important. Say that I do have a code that needs to count the number
> of set bits in a bitstring, and it already uses a POP function.
> Well, I can simply replace the POP function by my POP instruction,
> can't I? Unfortunately, most $%^^#@!!! languages do not let you;
> you either have to wrap the POP instruction in a function call,
> which loses most of its benefit, or you have to use asms after
> putting stuff into register variables that you *KNOW* the compiler
> maps to R7 and R6... Not good either way.
>
> It would be nice if the compiler could be made to know about every
> instruction in the machine, even though it didn't generate code
> for them; it would be nice if you could say

As described, it is semi-machine-independent. This is hard to do. But
there have been compilers that let you put instructions in in a
machine-dependent way. This is a nice compromise with assembly language
code, because most of the time you can let the compiler do the work.

The Cyber 205 has a set of "Q8" calls, so called because they look like
Q8MERGE, say, to use a "MERGE" instruction. The compiler uses program
names to generate the addresses, e.g.

    CALL MERGE(A,B,C)

so you don't have to do anything weird to get the address. It is a nice
feature, as long as it doesn't encourage the compiler writer to get lazy.
("Well, I won't bother to recognize that special case and generate the
MERGE instruction, because if someone really needs it they can call it
directly.") I wish more compilers had this feature; it eliminates the
need for a lot of assembly language coding. (Or speeds up the process,
depending on how you look at it.)

  Hugh LaMaster, m/s 233-9,    UUCP {topaz,lll-crg,ucbvax}!
  NASA Ames Research Center           ames!pioneer!lamaster
  Moffett Field, CA 94035      ARPA lamaster@ames-pioneer.arpa
  Phone: (415)694-6117         ARPA lamaster@pioneer.arc.nasa.gov

(Disclaimer: "All opinions solely the author's responsibility")
tim@amdcad.AMD.COM (Tim Olson) (10/02/87)
In article <340@oracle.UUCP> bradbury@oracle.UUCP (Robert Bradbury) writes:
| So from my experience the cost of mapping CISC functions into CISC
| instructions can be quite a large part of the code generator of a
| compiler. Do the RISC people have any measures of how much work goes
| into a RISC code generator for things like DIV/MUL, STRCPY/MEMCPY or
| BRANCH scheduling? (Some of the code published for the AMD 29000
| indicates these aren't afternoon efforts :-).)

In our "development" C compiler, div/mul and strcpy/memcpy are simply
calls to the runtime routines that perform these functions, so there was
no cost in the code generator for them. I didn't write the code
generator, but the delayed-branch scheduling code in the optimizer is
very small.

| Have we gotten to the point where we can estimate the hardware
| development costs of branch destination caching vs. the software
| development costs of branch scheduling and trade them off against
| each other?

The two aren't mutually exclusive (the Am29000 implements both). Delayed
branches allow execution of instructions following the branch which are
already in the pipeline, while the Branch Target Cache reduces or
eliminates the latency involved in starting a new instruction stream.
Perhaps you mean the tradeoff between delayed branches and branch
prediction?

	-- Tim Olson
	Advanced Micro Devices
	(tim@amdcad.amd.com)
turner@uicsrd.csrd.uiuc.edu (10/02/87)
> Written 9:22 pm Sep 28, 1987 by stachour@umn-cs.UUCP in comp.arch
>
> I've NEVER seen anyone design compilers for a machine that is only
> being simulated, and choose the architecture of the hardware based
> on measurement, and build the machine later. (Well, one exception,
> Multics many years ago, but that design set goals seldom met now.)
>
> Paul Stachour
> Honeywell SCTC (Stachour@HI-Multics)
> UMinn. Computer Science (stachour at umn-cs.edu)

Here at CSRD we have been designing compilers for Cedar since BEFORE the
machine was ever simulated. I think that compiler design is obviously
important enough to be done concurrently with architecture evaluation -
does anyone feel differently?

---------------------------------------------------------------------------
Steve Turner (on the Si prairie - UIUC CSRD)

UUCP:    {ihnp4,seismo,pur-ee,convex}!uiucdcs!uicsrd!turner
ARPANET: turner%uicsrd@a.cs.uiuc.edu         *-)) Mutants for
CSNET:   turner%uicsrd@uiuc.csnet            Nuclear Power (-%
BITNET:  turner@uicsrd.csrd.uiuc.edu
aglew@ccvaxa.UUCP (10/07/87)
> I believe the current AT&T C compilers (Release 3) allow "Enhanced
> Assembly Language Escapes for C" which include a fairly sophisticated
> mechanism for defining asm() pseudo-macros which look like C functions.
> Does anyone know how much this "cost" in compiler effort?
>
> Robert Bradbury

I'd heard about this, and had been hoping to play with it when I received
my AT&T 3B1 UNIX PC, running 3.5.1, which I thought was up to R3. But if
this compiler has these pseudo-functions, they're well hidden. Does
anyone know if they exist on the 3B1?

Andy "Krazy" Glew. Gould CSD-Urbana.    USEnet:  ihnp4!uiucdcs!ccvaxa!aglew
1101 E. University, Urbana, IL 61801    ARPAnet: aglew@gswd-vms.arpa

I always felt that disclaimers were silly and affected, but there are
people who let themselves be affected by silly things, so: my opinions are
my own, and not the opinions of my employer, or any other organisation
with which I am affiliated. I indicate my employer only so that other
people may account for any possible bias I may have towards my employer's
products or systems.
howard@cpocd2.UUCP (Howard A. Landman) (10/16/87)
>> Written 9:22 pm Sep 28, 1987 by stachour@umn-cs.UUCP in comp.arch
>> I've NEVER seen anyone design compilers for a machine that is only
>> being simulated, and choose the architecture of the hardware based
>> on measurement, and build the machine later. (Well, one exception,
>> Multics many years ago, but that design set goals seldom met now.)

In article <43700027@uicsrd> turner@uicsrd.csrd.uiuc.edu writes:
> Here at CSRD we have been designing compilers for Cedar since BEFORE
> the machine was ever simulated. I think that compiler design is
> obviously important enough to be done concurrently with architecture
> evaluation, anyone else feel differently??

No, I agree. When we were designing the RISC I at Berkeley a while back,
we had a compiler and an instruction-level simulator long before the
design was finalized. We could also run lower-level simulations
(eventually, right down to switch simulation of the transistor netlist
extracted from the chip layout) at the same time and compare them with
the high-level simulation. This was used to test the design as it
progressed. Finally, the coding for the control PLA of the chip was
generated automatically from the instruction-set-level description used
for high-level simulation.

Thus it was not only possible, but routine, for us to change the
instruction set, "push a button", and have a new control PLA generated,
dropped into the chip layout, the new chip layout extracted, and three
levels of simulation run and compared, all in under 24 hours with no
human intervention. This was done about 30 times in the last 45 days of
the design process. The final "tweak" to the architecture was made one
day before we sent out the database for maskmaking! The switch-level
simulator was never run for more than about 100 instruction cycles,
because of CPU limitations, but the high-level simulator had many
complete programs run on it.
Only one minor functional bug, affecting the condition codes after a
certain instruction, got past all this into the final chip. Rather than
declare it a "feature", we modified the assembler to insert an additional
instruction where needed to ensure that the condition codes were correct.
This had an insignificant effect on the performance of the machine.

While the design wasn't based on measurements taken from the compiler
before design began, it *was* based on many, many measurements of the
performance of VAXes, and was modified as compiler results became
available.

	Howard A. Landman
	{oliveb,hplabs}!intelca!mipos3!cpocd2!howard	<- works
	howard%cpocd2%sc.intel.com@RELAY.CS.NET		<- recently flaky
	"Unpick a ninny - recall Mecham"