schow@bnr-public.uucp (Stanley Chow) (04/20/89)
In article <38853@bbn.COM> schooler@oak.bbn.com (Richard Schooler) writes:
>  I'm not sure memory bandwidth has anything to do with RISC vs. CISC.
>Remember that there are (at least) two kinds of bandwidth: instructions,
>and data.  I guess I'll concede that RISC forces instruction bandwidth
>up, or requires somewhat larger instruction caches.  However data
>bandwidth is a much more severe limitation on certain programs.  I have
>in mind numerical or scientific codes, which spend most of their time in
>small loops (instructions all in cache) sweeping through large arrays
>(which may well not fit in cache).  The average scientific code appears
>to do roughly 1.5 memory references per floating-point operation.  Can
>your 10-Megaflop (64-bit precision) micro-processor move 120 Megabytes
>per second?  RISC vs. CISC seems largely irrelevant in this domain.
>
>	-- Richard
>	   schooler@bbn.com

You are in effect saying the CPU architecture is not related to the bandwidth
requirement.  I'd like to point out some ways in which they do interact.

First of all, there are non-numeric programs; in fact, I would guess that
number crunching is no longer the major user of computing power.  Some
programs have very poor hit-rates in any cache.  But, IMO, even in the
number-crunching area, RISC is still sub-optimal.

Second of all, I agree with you that data is a much harder problem.  It is
here that I have the most trouble with RISC.  It appears to me that to solve
the data bandwidth problem, one must give more information to the CPU.  In
particular, a well designed architecture should work to minimize the impact
of data latency.  The basic premise of RISC is to not tell the CPU anything
until the last moment.  This strikes me as a funny way of optimizing
throughput.

To execute your 1.0 FLOP, the typical RISC will do about 1.5 memory access
instructions, 1.5 address adjusting instructions, say 0.5 instructions for
boundary condition checking and 0.5 jump instructions.  This adds up to 5
instructions to do 1 FLOP.  Many CISC machines can do a FLOP with only 2
instructions.

I can hear it now, everyone is jumping up and down saying, "what a fool,
doesn't he know that all those cycles are free?", "Hasn't he heard of
pipelining and register scoreboarding?", "but the CISC instructions are
slower so the RISC will still run faster."

In response, I can only say, work through some real examples and see how
many cycles are wasted.  Alternatively, see how many stages of pipelining
are needed to have no wasted cycles.  A suitable CISC will find out earlier
that it will be doing another memory reference and can prepare accordingly.
It is even possible to have scatter/gather type hardware to offload the CPU
while maximizing data throughput.

"Compilers can do optimizations", I hear the yelling.  This is another
interesting phenomenon - reduce the complexity in the CPU so that the
compiler must do all these other optimizations.  I have also not seen any
indication that a compiler can do anywhere close to an optimal job on
scheduling code or pipelining.  Even discounting the NP-completeness of
just about everything, theoretical indications point the other way,
especially when the compiler has to juggle so many conflicting constraints.

It would be interesting to speculate on total system complexity: is it
higher for CISC or for RISC (with its attendant memory and compiler
requirements)?
Stanley Chow ..!utgpu!bnr-vpa!bnr-fos!schow%bnr-public As soon as flames start to show up, I will probably disown these opinions to save my skin, at which point, these opinions will no longer represent anyone at all. Anyone wishing to be represented by these opinions need only say so.
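
A rough sketch in C of the accounting above (a hypothetical kernel; the
per-iteration counts are the estimates from the post, not measurements):

    /* y[i] += a * x[i]: the sort of inner loop the 1.5-references-per-FLOP
       figure describes.  The comments show roughly what a late-1980s RISC
       compiler emits per iteration; exact counts vary by machine. */
    void axpy(double *y, const double *x, double a, int n)
    {
        for (int i = 0; i < n; i++) {
            y[i] += a * x[i];
            /* per iteration, roughly:
                  load x[i], load y[i], store y[i]   -- 3 memory accesses
                  multiply + add                     -- 2 FLOPs
                  increment index/pointers           -- address adjusting
                  compare and branch                 -- loop control
               i.e. about 1.5 memory references and several bookkeeping
               instructions per FLOP, where a CISC auto-increment
               memory-operand form folds the loads and address updates
               into the arithmetic instructions themselves. */
        }
    }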
bcase@cup.portal.com (Brian bcase Case) (04/21/89)
>Second of all, I agree with you that data is a much harder problem.  It is
>here that I have the most trouble with RISC.  It appears to me that to solve
>the data bandwidth problem, one must give more information to the CPU.  In
>particular, a well designed architecture should work to minimize the impact
>of data latency.  The basic premise of RISC is to not tell the CPU anything
>until the last moment.  This strikes me as a funny way of optimizing
>throughput.

This strikes me as a funny way of interpreting RISC!!!  :-)  There are
several "basic premises" of RISC, and as far as I know, none of them is "to
not tell the CPU anything until the last moment."  Conversely, as far as I
know, one of the basic premises of CISC is not "to tell the CPU everything
as early as possible."  RISC emphasizes exposing hardware to the software
(and vice versa, I guess) so that as much work as possible is avoided.  CISC
emphasizes binding operations together.  This has the effect of doing more
work than is necessary, and therefore taking more cycles.

>To execute your 1.0 FLOP, the typical RISC will do about 1.5 memory access
>instructions, 1.5 address adjusting instructions, say 0.5 instructions for
>boundary condition checking and 0.5 jump instructions.  This adds up to 5
>instructions to do 1 FLOP.  Many CISC machines can do a FLOP with only 2
>instructions.

Who cares about instructions?  How much of the work of the RISC instructions
can be reused (e.g., addressing calculations)?  How much of the CISC work is
wasted?  Try an optimizing compiler....

>"Compilers can do optimizations", I hear the yelling.  This is another
>interesting phenomenon - reduce the complexity in the CPU so that the
>compiler must do all these other optimizations.  I have also not seen any
>indication that a compiler can do anywhere close to an optimal job on
>scheduling code or pipelining.  Even discounting the NP-completeness of
>just about everything, theoretical indications point the other way,
>especially when the compiler has to juggle so many conflicting constraints.

Well, I wasn't yelling, so much as muttering.  So, since the compiler can't
do an optimal job, it might as well not do anything.  Why bother?  I can
tell you that a compiler can do a whole lot better job than some microcode!
At least the compiler has a view of the whole program!  (Or at least a whole
procedure.)

>It would be interesting to speculate on total system complexity: is it
>higher for CISC or for RISC (with its attendant memory and compiler
>requirements)?

Why speculate?  Design a CISC and then a RISC, of equal performance.  Or
look at existing implementations.  And don't confuse on-chip resources with
real complexity (the kind that makes it hard) like the editor of UNIX
<somethingorother> did in his comments on the i860.  He claimed it is the
CISCiest processor yet!

>As soon as flames start to show up, I will probably disown these
>opinions to save my skin, at which point, these opinions will no
>longer represent anyone at all.  Anyone wishing to be represented
>by these opinions need only say so.

You are entitled to your opinion, of course.  This is not intended to be a
flame.  I'm just putting in my $0.02 too.
alan@rnms1.paradyne.com (Alan Lovejoy) (04/21/89)
In article <423@bnr-fos.UUCP> schow@bnr-public.UUCP (Stanley Chow) writes:
>"Compilers can do optimizations", I hear the yelling.  This is another
>interesting phenomenon - reduce the complexity in the CPU so that the
>compiler must do all these other optimizations.  I have also not seen any
>indication that a compiler can do anywhere close to an optimal job on
>scheduling code or pipelining.  Even discounting the NP-completeness of
>just about everything, theoretical indications point the other way,
>especially when the compiler has to juggle so many conflicting constraints.

If optimization is too difficult for compilers, how in the *&^%$#@! is the
hardware going to be able to do it??????!!!!  The compiler knows a LOT more
about the instruction stream--and the intent of the instruction stream--than
the hardware does, with one big exception: the hardware knows the dynamic
instruction sequence; the compiler does not.  Unfortunately, even the
hardware doesn't have all that much advance notice of dynamically variable
parameters.

The reason for increasing the primitiveness (a better characterization than
"reducing the complexity") of machine instructions is so that the compiler
CAN "do all these optimizations," not so that it "must."  RISC does not
force optimization, it permits it.

"Complex" or "high-level" instructions necessarily become too
application-specific.  The greater the semantic content of an instruction,
the less its generality.  The more primitive the instruction semantics, the
greater the probability that the instruction only does what you need, and
not what you don't.  Have you ever tried emulating unsigned
multiplication/division on a system that only provides signed integer
arithmetic and comparisons?

Because primitive instructions do less work than "complex" instructions,
they can execute in less TIME.  This means fewer and/or SHORTER clock
cycles.  At first blush, it would seem that complex instructions can
compensate for this by means of parallelism (e.g., pipelining, multiple
identical parallel functional units).  But in practice, there is no such
thing as a free lunch.  The steps in the hardware "algorithm" that
implements a complex instruction are usually inherently sequential (fetch
data from memory, do operation, store data to memory).  The parallelizable
parts tend to be the same functions that are parallelizable for primitive
instructions.  So complex instructions usually gain nothing from this,
relative to primitive instructions.

Alan Lovejoy; alan@pdn; 813-530-2211; AT&T Paradyne: 8550 Ulmerton, Largo, FL.
Disclaimer: I do not speak for AT&T Paradyne.  They do not speak for me.
_________________________Design Flaws Travel In Herds_________________________
Motto: If nanomachines will be able to reconstruct you, YOU AREN'T DEAD YET.
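
For what it's worth, the emulation in the last question above really is
painful; here is a minimal sketch in modern C of how unsigned comparison and
division might be faked on a machine with only signed integer compares (the
function names and the bias trick are illustrative, not from the post):

    #include <stdint.h>

    /* a >= b as unsigned values, using only a signed comparison:
       biasing both operands by 2^31 maps unsigned order onto signed order. */
    static int uge(uint32_t a, uint32_t b)
    {
        return (int32_t)(a ^ 0x80000000u) >= (int32_t)(b ^ 0x80000000u);
    }

    /* Restoring shift-and-subtract division built on the signed-only
       compare.  d must be nonzero. */
    static uint32_t udiv32(uint32_t n, uint32_t d)
    {
        uint32_t q = 0, r = 0;
        for (int i = 31; i >= 0; i--) {
            r = (r << 1) | ((n >> i) & 1u);   /* bring down the next bit */
            if (uge(r, d)) {
                r -= d;
                q |= 1u << i;
            }
        }
        return q;                             /* remainder is left in r */
    }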
slackey@bbn.com (Stan Lackey) (04/22/89)
In article <17417@cup.portal.com> bcase@cup.portal.com (Brian bcase Case) writes:
>>of data latency.  The basic premise of RISC is to not tell the CPU anything
>>until the last moment.  This strikes me as a funny way of optimizing throughput.
>
>This strikes me as a funny way of interpreting RISC!!! :-)  There are
>several "basic premises" of RISC, and as far as I know, none of them is "to
>not tell the CPU anything until the last moment."  Conversely, as far as I

Stuff like vector instructions, the VAX character string instructions,
VAX CALL/RET, the 680x0 MOVEM, etc. give the CPU a real strong hint
as to what near-future memory accesses will be.  As memory access times
become even longer [relative to cycle time], this becomes more important.
And will begin to widen the performance gap, if implemented properly.
RISC architectures don't have the ability to communicate this class
of information, and if it is added, they won't be RISC's anymore (unless
Marketing SAYS they are, I guess...)

>>To execute your 1.0 FLOP, the typical RISC will do about 1.5 memory access
>>instructions, 1.5 address adjusting instructions, say 0.5 instructions for
>>boundary condition checking and 0.5 jump instructions.  This adds up to 5
>>instructions to do 1 FLOP.  Many CISC machines can do a FLOP with only 2
>>instructions.
>
>Who cares about instructions?

If each instruction consumes one cycle to issue, I sure do!  (BTW, the
Alliant takes one cycle to do MULF (a2)+,fp0; don't tell me it's impossible.)

>look at existing implementations.  And don't confuse on-chip resources with
>real complexity (the kind that makes it hard) like the editor of UNIX
><somethingorother> did in his comments on the i860.  He claimed it is the
>CISCiest processor yet!

I have a real problem with anything that includes IEEE floating point
AND calls itself a RISC.  IEEE FP violates every rule of RISC; it has
features that compilers will never use (rounding modes), features that
are rarely needed that slow things down (denormalized operands), and
features that make things complex that nobody needs (round-to-even).

I'd really like to see someone stand up and say, "Boy, the IEEE
round-to-even is much more accurate than DEC's round .5 up.  I have an
application right here that proves it."  Or, "Gradual underflow is
much better.  I have an application that can be run in single precision
that would need to be run double precision without it."
:-)  Stan
ingoldsb@ctycal.COM (Terry Ingoldsby) (04/22/89)
In article <423@bnr-fos.UUCP>, schow@bnr-public.uucp (Stanley Chow) writes:
> "Compilers can do optimizations", I hear the yelling.  This is another
> interesting phenomenon - reduce the complexity in the CPU so that the
> compiler must do all these other optimizations.  I have also not seen any
> indication that a compiler can do anywhere close to an optimal job on
> scheduling code or pipelining.  Even discounting the NP-completeness of
> just about everything, theoretical indications point the other way,
> especially when the compiler has to juggle so many conflicting constraints.

I am quite naive on this subject (but that won't stop me from throwing in my
two cents worth :^)), but it seems to me that if we still programmed mostly
in assembler, then CISC would beat RISC.  I did a lot of programming of the
8086 using assembler language, and I became (painfully) aware of some of the
unusual instructions, and the difference that the choice of a particular
register, or way of doing something, would make on overall performance.  By
skillfully picking my instructions, I could improve performance
*significantly*.  On the other hand, I can't imagine any compiler (apologies
to the compiler writers) smart enough to have figured out what I wanted to
do, and to choose the optimal instructions if I had coded the algorithm in a
high level language.

In fact, I suspect that a lot of the weird instructions were never used by
compilers at all.  This means that compilers often generate RISC code for
CISC machines (i.e. they use the simplest instructions they can).

On the other hand, I can see that while a RISC processor programmed in
assembler might not be quite as quick as an expertly assembled CISC program,
the compiler has a reasonable chance of generating a good sequence of
instructions.  At least it doesn't have to ask questions like:
  1) If I use register A for this operation, the next 10 instructions will
     be quick, but
  2) would I be better off not using A, using B (slow) and waiting until a
     really critical set of instructions comes up to use A?
Even if your compiler is brainy enough to figure that out, there is almost
no way it can recognize that the algorithm I'm performing is a Fast Fourier
Transform.  It will generate the code to perform it instead of using a
(hypothetical) FFT CISC instruction.

My point is that since almost everything is written in high level languages
today, they are better suited for RISC.  For applications that still use
assembler (e.g. control systems) CISC makes sense.

But what do I know??

Terry Ingoldsby
Land Information Related Systems
The City of Calgary
ctycal!ingoldsb@calgary.UUCP
or ...{alberta,ubc-cs,utai}!calgary!ctycal!ingoldsb
mbkennel@phoenix.Princeton.EDU (Matthew B. Kennel) (04/22/89)
In article <38971@bbn.COM> slackey@BBN.COM (Stan Lackey) writes:
>In article <17417@cup.portal.com> bcase@cup.portal.com (Brian bcase Case) writes:
>>>of data latency.  The basic premise of RISC is to not tell the CPU anything
>>>until the last moment.  This strikes me as a funny way of optimizing throughput.
>>
>>This strikes me as a funny way of interpreting RISC!!! :-)  There are
>>several "basic premises" of RISC, and as far as I know, none of them is "to
>>not tell the CPU anything until the last moment."  Conversely, as far as I
>
>Stuff like vector instructions, the VAX character string instructions,
>VAX CALL/RET, the 680x0 MOVEM, etc. give the CPU a real strong hint
>as to what near-future memory accesses will be.  As memory access times
>become even longer [relative to cycle time], this becomes more important.
>And will begin to widen the performance gap, if implemented properly.
>RISC architectures don't have the ability to communicate this class
>of information, and if it is added, they won't be RISC's anymore (unless
>Marketing SAYS they are, I guess...)

I thought that many RISC chips have this property already: load delays.
You tell it to load some register or something or other, but it won't be
valid until n cycles later.  In the meantime, though, you can have it run
the exact instructions that YOU want it to do for your program, and not
what the microcode programmer thought would be a commonly used bundle.
It's the same effect--just more general purpose.

>I have a real problem with anything that includes IEEE floating point
>AND calls itself a RISC.  IEEE FP violates every rule of RISC; it has
>features that compilers will never use (rounding modes), features that
>are rarely needed that slow things down (denormalized operands), and
>features that make things complex that nobody needs (round-to-even).
>I'd really like to see someone stand up and say, "Boy, the IEEE
>round-to-even is much more accurate than DEC's round .5 up.  I have an
>application right here that proves it."  Or, "Gradual underflow is
>much better.  I have an application that can be run in single precision
>that would need to be run double precision without it."
>:-)  Stan

What do you think would be better?

Matt Kennel
mbkennel@phoenix.princeton.edu
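
A small sketch of that point in C (hypothetical loop; normally the compiler
or the assembly programmer does this scheduling): the next element's load is
issued early, so the load-delay cycles are filled with work you actually
want done.

    double sum(const double *a, int n)
    {
        if (n <= 0)
            return 0.0;
        double s = 0.0;
        double cur = a[0];
        for (int i = 0; i < n - 1; i++) {
            double next = a[i + 1];  /* start the next load early ...      */
            s += cur;                /* ... and do useful work during what  */
            cur = next;              /*     would otherwise be delay slots  */
        }
        return s + cur;
    }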
dgh%dgh@Sun.COM (David Hough) (04/22/89)
In article <38971@bbn.COM>, slackey@bbn.com (Stan Lackey) writes:
> I have a real problem with anything that includes IEEE floating point
> AND calls itself a RISC.  IEEE FP violates every rule of RISC; it has
> features that compilers will never use (rounding modes), features that
> are rarely needed that slow things down (denormalized operands), and
> features that make things complex that nobody needs (round-to-even).
> I'd really like to see someone stand up and say, "Boy, the IEEE
> round-to-even is much more accurate than DEC's round .5 up.  I have an
> application right here that proves it."  Or, "Gradual underflow is
> much better.  I have an application that can be run in single precision
> that would need to be run double precision without it."

This is certainly the position that DEC took through the IEEE 754 and 854
meetings.  For better or worse, however, all RISC chips that I'm aware of
that have hardware floating point support implement IEEE arithmetic more or
less fully.

The anomaly here, of course, is that common scientific applications that, by
dint of great effort, have been debugged to the point of running efficiently
unchanged on IBM 370, VAX, and Cray, run about as well but not much better
on IEEE systems, since they don't exploit any specific feature of any
particular arithmetic system.  Sometimes they run slower, if they underflow
a lot in situations that don't matter AND the hardware doesn't support
subnormal operands and results efficiently.  This is properly viewed as a
shortcoming of the hardware/software system that purportedly implements IEEE
arithmetic: even on synchronous systems you have to be able to hold the FPU
for cache misses and page faults, so similarly you should be able to hold
the CPU for exception misses in the FPU that take a little longer to
compute.  On asynchronous CISC systems like the 68881 or 80387 this isn't a
problem, but they are slower in the non-exceptional case, which is why RISC
systems are mostly synchronous.

Conversely, however, programs that take advantage of IEEE arithmetic,
usually unknowingly, don't work nearly as well on 370, VAX, or Cray, where
simple assumptions like

	if (x != y)  /* then it's safe to divide by (x-y) */

no longer hold.

> an application that can be run in single precision
> that would need to be run double precision without [gradual underflow].

There will never be such an example that satisfies everyone, since you never
"need" any particular precision.  After all, any integer or floating-point
computation is fabricated out of one-bit integer operations.  It's just a
matter of dividing up the cleverness between the hardware and the software.

What you CAN readily demonstrate are programs (written entirely in one
precision) that are no worse affected by underflow than by normal roundoff,
PROVIDED that underflow be gradual.  Demmel and Linnainmaa contributed many
pages of such analyses to the IEEE deliberations and to subsequent
proceedings of the Symposia on Computer Arithmetic published by IEEE-CS.
Of course if you are sufficiently clever you can use higher precision,
explicitly if provided by the compiler or implicitly otherwise, to produce
robust code in the face of abrupt underflow or even Cray arithmetic.  Many
mathematical software experts are good at this, but most regard it as an
evil only made necessary by hardware, system, and language designs that
through ignorance or carelessness become part of the problem rather than
part of the solution.

Not all code is compiled.
For instance, there is a great body of theory and practice in obtaining
computational error bounds in computations based on interval arithmetic.
Interval arithmetic is efficient to implement with the directed rounding
modes required by IEEE arithmetic, but you can't write the implementation in
standard C or Fortran.  In integer arithmetic, the double-precise product of
two single-precise operands, and the single-precise quotient and remainder
of a double-precise dividend and single-precise divisor, are important in a
number of applications such as base conversion and random number generation,
but there is no way to express the required computations in standard
higher-level languages.

As to rounding halfway cases to even, the advantage over biased rounding is
perhaps most simply understood by the observation that 1+(eps/2) rounds to 1
rather than 1+eps.  The "even" result is more likely to be the one you
wanted if you had a preference.  Such rounding is no more expensive than
biased rounding on a system that is required to provide directed rounding
modes as well.  It's not the bottleneck on any hardware IEEE implementation
of which I'm aware.  I have heard that adder carry propagate time and
multiplier array size are the key constraints with a floating-point chip;
hardware experts will correct me if I'm wrong.  Memory bandwidth tends to be
the key constraint on overall system performance unless floating-point
division and sqrt dominate.  The last describes a minority of programs, but
they are quite important in some influential circles.

David Hough

dhough@sun.com
na.hough@na-net.stanford.edu
{ucbvax,decvax,decwrl,seismo}!sun!dhough
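
The "if (x != y)" assumption above is easy to see in a few lines of C
(assuming IEEE single precision with gradual underflow; on a machine that
flushes denormals to zero the division blows up):

    #include <stdio.h>
    #include <float.h>

    int main(void)
    {
        /* Two distinct normal floats whose difference is subnormal. */
        float x = 2.0f * FLT_MIN;
        float y = 1.5f * FLT_MIN;

        if (x != y) {
            float d = x - y;       /* 0.5 * FLT_MIN: nonzero only because
                                      underflow is gradual                 */
            float inv = 1.0f / d;  /* finite under IEEE 754; divide-by-zero
                                      on a flush-to-zero machine           */
            printf("d = %g, 1/d = %g\n", (double)d, (double)inv);
        }
        return 0;
    }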
yuval@taux01.UUCP (Gideon Yuval) (04/22/89)
>round-to-even is much more accurate than DEC's round .5 up. I have an >application right here that proves it." Or, "Gradual underflow is >much better. I have an application that can be run in single precision >that would need to be run double precision without it." The video-tapes of Kahan's "floating-point indoctrination" course (Sun, May-Jul/88) have "somebody" (i.e. W.Kahan) standing up & saying precisely that. Sneak a view if you can. -- Gideon Yuval, yuval@taux01.nsc.com, +972-2-690992 (home) ,-52-522255(work) Paper-mail: National Semiconductor, 6 Maskit St., Herzliyah, Israel TWX: 33691, fax: +972-52-558322
yuval@taux01.UUCP (Gideon Yuval) (04/22/89)
My previous posting got garbled. Here's an ungarbled version. Stan Lackey, in his message <38971@bbn.com>, says: >I'd really like to see someone stand up and say, "Boy, the IEEE >round-to-even is much more accurate than DEC's round .5 up. I have an >application right here that proves it." Or, "Gradual underflow is >much better. I have an application that can be run in single precision >that would need to be run double precision without it." The video-tapes of Kahan's "floating-point indoctrination" course (Sun, May-Jul/88) have "somebody" (i.e. W.Kahan) standing up & saying precisely that. Sneak a view if you can. -- Gideon Yuval, yuval@taux01.nsc.com, +972-2-690992 (home) ,-52-522255(work) Paper-mail: National Semiconductor, 6 Maskit St., Herzliyah, Israel TWX: 33691, fax: +972-52-558322
slackey@bbn.com (Stan Lackey) (04/25/89)
In article <100524@sun.Eng.Sun.COM> dgh%dgh@Sun.COM (David Hough) writes: >In article <38971@bbn.COM>, slackey@bbn.com (Stan Lackey) writes: >> >> I have a real problem with anything that includes IEEE floating point >> AND calls itself a RISC. IEEE FP violates every rule of RISC; it has >> features that compilers will never use (rounding modes), features that >> are rarely needed that slow things down (denormalized operands), and >> features that make things complex that nobody needs (round-to-even). >> (emotional stuff deleted) >Not all code is compiled. I agree - I was just quoting the RISC guys. >Interval arithmetic is >efficient to implement with the directed rounding modes required by IEEE >arithmetic, but you can't write the implementation in standard C or... >Such rounding is no more expensive than biased rounding >on a system that is required to provide directed rounding modes as well. >It's not the bottleneck on any hardware IEEE implementation of which I'm aware. Having to detect EXACTLY .5 is a bottleneck in terms of transistor count, design time, and diagnostics. The extra execution time may not affect overall cycle time, but the RISC guys say that any added hardware increases cycle time (they usually use it in the context of instruction decode). >I have heard that adder carry propagate time and multiplier array size >are the key constraints with a floating-point chip; hardware experts >will correct me if I'm wrong. These are probably the largest single elements in most implementations. But, as the hardware guys will tell you, it's the exceptions that get you. Note: It's prealigning a denormalized operand before a multiplication that REALLY hurts. >Memory bandwidth tends to be the key constraint >on overall system performance unless floating-point division and sqrt >dominate. Absolutely true, but not very relevant. >David Hough Lots of valid uses of IEEE features listed. I didn't mean that IEEE was bad or useless, it's just that it was architected when CISC was the trend, and it shows. Especially after my own efforts in an IEEE implementation, I am glad to see from this posting and others that at least a few users can make use of the features. I think the RISC implementers should have a RISC-style floating point standard, though.
dik@cwi.nl (Dik T. Winter) (04/25/89)
In article <39049@bbn.COM> slackey@BBN.COM (Stan Lackey) writes:
> Lots of valid uses of IEEE features listed.  I didn't mean that IEEE
> was bad or useless, it's just that it was architected when CISC was
> the trend, and it shows.  Especially after my own efforts in an IEEE
> implementation, I am glad to see from this posting and others that at
> least a few users can make use of the features.  I think the RISC
> implementers should have a RISC-style floating point standard, though.

Oh, go ahead, but make sure you have some numerical analysts around to help
you, unless you are willing to repeat the mistakes that numerous designers
before you have made.

To some of your points: Round to even gives a better overall round-off error
than truncate to zero (i.e. better in larger expressions).  Gradual
underflow is, as far as I see, not really needed, but the alternative is
trap on underflow, and allow the program to recover.  This would be just as
hard, if not harder, in my opinion.

David Hough remarked that many applications are written to work properly on
a lot of machines and that they would not benefit very much from IEEE
arithmetic.  I might say that for a number of those applications this was
achieved with much trouble.  The original design would, in a lot of cases,
have benefitted if *only* IEEE arithmetic had to be considered.
--
dik t. winter, cwi, amsterdam, nederland
INTERNET   : dik@cwi.nl
BITNET/EARN: dik@mcvax
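
The bias argument for round-to-even can be seen in a one-line C experiment
(assuming IEEE double precision and the default rounding mode):

    #include <stdio.h>
    #include <float.h>

    int main(void)
    {
        /* 1 + eps/2 lies exactly halfway between the representable
           neighbours 1 and 1 + eps.  Round-to-nearest-even picks 1 (even
           last bit); a "round .5 up" rule would pick 1 + eps, giving long
           sums of such ties a systematic upward drift. */
        double halfway = 1.0 + DBL_EPSILON / 2.0;
        printf("%s\n", halfway == 1.0 ? "ties round to even (1.0)"
                                      : "ties round up (1.0 + eps)");
        return 0;
    }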
dgh%dgh@Sun.COM (David Hough) (04/25/89)
In article <39049@bbn.COM>, slackey@bbn.com (Stan Lackey) writes: > In article <100524@sun.Eng.Sun.COM> dgh%dgh@Sun.COM (David Hough) writes: > >In article <38971@bbn.COM>, slackey@bbn.com (Stan Lackey) writes: > >Such rounding is no more expensive than biased rounding > >on a system that is required to provide directed rounding modes as well. > Having to detect EXACTLY .5 is a bottleneck in terms of transistor > count, design time, and diagnostics. The extra execution time may not > affect overall cycle time, but the RISC guys say that any added > hardware increases cycle time (they usually use it in the context of > instruction decode). EXACTLY .5 is no harder than correct directed rounding. You have to (in principle) develop all the digits, propagate carries, and remember whether any shifted off were non-zero. Division and sqrt are simplified by the fact that EXACTLY .5 can't happen. > Note: It's prealigning a denormalized operand before a multiplication > that REALLY hurts. This event is rare enough that it needn't be as fast as a normal multiplication, so it's OK to slow down somewhat by holding the CPU, but not so rare that you want to punt to software. By throwing enough hardware at the problem you can make it as fast as the normal case. I don't advocate that but that's my understanding of what the Cydra-5 did. Interestingly enough, the early drafts of 754 specified that default handling of subnormal numbers be in a "warning mode" and that the more expensive "normalizing mode" be an option. This was with highly-pipelined implementations very much in mind. However a gang of early implementers from Apple managed to talk a majority of the committee into making the normalizing mode the default. The normalizing mode is easier to understand and easier to implement in software. Warning mode is a lot cheaper to pipeline, however. I was part of the gang but I've since had opportunity to repent at leisure. > Lots of valid uses of IEEE features listed. I didn't mean that IEEE > was bad or useless, it's just that it was architected when CISC was > the trend, and it shows. Especially after my own efforts in an IEEE > implementation, I am glad to see from this posting and others that at > least a few users can make use of the features. Remember IEEE 754 and 854 are standards for a programming environment. How much of that is to be provided by hardware and how much by software is up to the implementer; in contrast RISC is a hardware design philosophy. The MC68881 is probably the best-known attempt to put practically everything in the hardware so the software wouldn't screw it up as usual. The Weitek 1032/3 and their descendants and competitors are examples of minimal hardware implementations that support complete IEEE implementations once appropriate software is added. Evidently the first generations of such chips were too minimal; for instance nowadays everybody has correctly-rounded division and sqrt in hardware, rather than software, on chips intended for general-purpose computation. > I think the RISC > implementers should have a RISC-style floating point standard, though. There's a very minimalist floating-point standard, that of S. Cray, which is very cheap to implement entirely in hardware (compared to other standards at similar performance levels). The only hard part is writing the software that uses it. So far no other hardware manufacturers have seen fit to adopt Cray arithmetic. 
IBM 370 architecture has been more widely imitated but not because of any inherent wonderfulness for mathematical software. DEC VAX floating-point architecture is well defined and a number of non-DEC implementations are available. But divide and sqrt are no easier than IEEE, and IEEE double precision addition and multiplication are available now in one or two cycles on some implementations. Does anybody still think there would be an advantage to VAX, 370, or Cray floating-point architecture for a PC or workstation? David Hough dhough@sun.com na.hough@na-net.stanford.edu {ucbvax,decvax,decwrl,seismo}!sun!dhough
slackey@bbn.com (Stan Lackey) (04/25/89)
In article <100891@sun.Eng.Sun.COM> dgh%dgh@Sun.COM (David Hough) writes: >EXACTLY .5 is no harder than correct directed rounding. You have to >(in principle) develop all the digits, propagate carries, and remember >whether any shifted off were non-zero. Division and sqrt are simplified >by the fact that EXACTLY .5 can't happen. OK, it's only a problem in multiplication. >> Note: It's prealigning a denormalized operand before a multiplication >> that REALLY hurts. >This event is rare enough that it needn't be as fast as a normal >multiplication, so it's OK to slow down somewhat by holding the CPU, >but not so rare that you want to punt to software. By throwing enough >hardware at the problem you can make it as fast as the normal case. >I don't advocate that but that's my understanding of what the Cydra-5 did. Ever design a pipelined machine? It was probably easier in the Cydra to make everything assume the worst case, than to deal with the pipeline getting messed up. The new micros (at least the i860) trap and expect software to fix things up, which includes parsing the instructions in the pipe, and fixing up the saved version of the internal data pipeline. I've seen statements in this newsgroup like "not usable in a general purpose environment" when referring to the i860. Talk about debug time! In the Alliant we wanted to get the design done, and fit it on one board, so we shut denorms off. (It sets the exception bits, though.) After shipping for 4 years, there have still been no complaints. >for instance nowadays >everybody has correctly-rounded division and sqrt in hardware, except Intel Re: one or two-cycle DP IEEE mul/add exist: Alliant is the only one I know of, but it's because 1) the cycle time is abnormally long and 2) denorms are not supported. I think it's valid to say that if floating point (esp DP) ops take one cycle, your cycle time is too long. >> I think the RISC >> implementers should have a RISC-style floating point standard, though. >DEC VAX floating-point architecture >is well defined and a number of non-DEC implementations are available. Sounds like a good idea to me! The IBM one is not useful, and (so it is said) the Cray one is difficult to use. The VAX one is accurate enough and has enough range for normal use, and if F or G aren't enough, there's always H :-) -Stan
dgh%dgh@Sun.COM (David Hough) (04/26/89)
In article <39095@bbn.COM>, slackey@bbn.com (Stan Lackey) writes:
> The new micros (at least the i860) trap
> and expect software to fix things up, which includes parsing the
> instructions in the pipe, and fixing up the saved version of the
> internal data pipeline.  I've seen statements in this newsgroup like
> "not usable in a general purpose environment" when referring to the
> i860.

I agree.  The i860 appears never to have been intended to support an
efficient implementation of IEEE 754.

> In the Alliant we wanted to get the
> design done, and fit it on one board, so we shut denorms off. ...
> Re: one or two-cycle DP IEEE mul/add exist: Alliant is the only one I

Regardless of how fast you do the arithmetic, if (x != y) and (x-y) != 0 are
not equivalent for finite x, you don't conform to IEEE 754 or 854.
Subnormal numbers permit this equivalence.  754 committee members were irked
in advance, so to speak, by the prospect that some vendors would claim
conformance for such implementations.

> said) the Cray one is difficult to use.  The VAX one is accurate enough
> and has enough range for normal use, and if F or G aren't enough, there's
> always H :-)

The VAX standard is D format double precision, not G.  Many people consider
it inadequate because, unlike a pocket calculator, it won't accommodate
exponents of 10^(+-99).

David Hough

dhough@sun.com
na.hough@na-net.stanford.edu
{ucbvax,decvax,decwrl,seismo}!sun!dhough
cik@l.cc.purdue.edu (Herman Rubin) (04/26/89)
In article <288@ctycal.UUCP>, ingoldsb@ctycal.COM (Terry Ingoldsby) writes:
> In article <423@bnr-fos.UUCP>, schow@bnr-public.uucp (Stanley Chow) writes:
> > "Compilers can do optimizations", I hear the yelling.  This is another
> > interesting phenomenon - reduce the complexity in the CPU so that the
> > compiler must do all these other optimizations.  I have also not seen any
> > indication that a compiler can do anywhere close to an optimal job on
> > scheduling code or pipelining.  Even discounting the NP-completeness of
> > just about everything, theoretical indications point the other way,
> > especially when the compiler has to juggle so many conflicting constraints.
> I am quite naive on this subject (but that won't stop me from throwing in my
> two cents worth :^)), but it seems to me that if we still programmed mostly
> in assembler, then CISC would beat RISC.  I did a lot of programming of the
> 8086 using assembler language, and I became (painfully) aware of some of the
> unusual instructions, and the difference that the choice of a particular
> register, or way of doing something, would make on overall performance.  By
> skillfully picking my instructions, I could improve performance
> *significantly*.  On the other hand, I can't imagine any compiler (apologies
> to the compiler writers) smart enough to have figured out what I wanted to
> do, and to choose the optimal instructions if I had coded the algorithm in a
> high level language.

No apologies are due to the compiler writers.  Rather, criticism is due to
them for the arrogance they took in leaving out the possibility for the
programmer to do something intelligent.  The HLLs are woefully inadequate,
and I would not be surprised if the ability to do intelligent coding using
the machine capabilities were destroyed by learning such restrictive coding
procedures first.  You are far less naive than those gurus who think they
know all the answers.

> In fact, I suspect that a lot of the weird instructions were never used by
> compilers at all.  This means that compilers often generate RISC code for
> CISC machines (i.e. they use the simplest instructions they can).

You are so right.  I have no trouble using these not very weird instructions
to do what the compiler writer did not anticipate.  Nobody can anticipate
all my needs, but nobody should say that he has given me all the tools,
either.  As far as the difficulty of using machine language goes, I know of
no machine as complicated as a HLL, although some may be getting a little
close.  The present assembler languages are another matter, though, and
unnecessarily so.

> On the other hand, I can see that while a RISC processor programmed in
> assembler might not be quite as quick as an expertly assembled CISC
> program, the compiler has a reasonable chance of generating a good
> sequence of instructions.  At least it doesn't have to ask questions like:
>   1) If I use register A for this operation, the next 10 instructions will
>      be quick, but
>   2) would I be better off not using A, using B (slow) and waiting until a
>      really critical set of instructions comes up to use A?
> Even if your compiler is brainy enough to figure that out, there is almost
> no way it can recognize that the algorithm I'm performing is a Fast Fourier
> Transform.  It will generate the code to perform it instead of using a
> (hypothetical) FFT CISC instruction.

I have run into situations where the number of possible programs is well
into the thousands, at least.
In many cases I can see that using some operations cannot pay unless those
operations are in hardware.  How is the programmer going to write the
program?  Those who want it to be machine independent make things difficult.

A good example of CISC versus RISC is division, with a quotient and
remainder.  Some RISC machines do not even have this.  Now I suggested to
this group that the instruction be modified to allow the programmer to
specify which quotient and remainder are to be used as a function of the
signs.  This is trivial in hardware, and probably would not extend the time
by more than a small fraction of a cycle, although three conditional
transfers are involved.  The point is that they can be made while the
lengthy division is taking place in the division unit.

Another example is floating point arithmetic.  The RISCy CRAY, on problems
with rigid vectors, will run rings around the CYBER 205 in single precision
floating point (around 14 digits).  If we now change to double precision,
we now get a time factor of about 15 in favor of the CYBER.  Many problems
in which non-rigid vectors are appropriate also favor the CYBER.

Considering the cost of the CPU relative to the rest of the computer, I
would suggest VRISC as the profitable way to go.  But we need input from
people like Terry and me about the apparently crazy instructions which will
speed up throughput.

> My point is that since almost everything is written in high level languages
> today, they are better suited for RISC.  For applications that still use
> assembler (e.g. control systems) CISC makes sense.

Must we only use tools that would appal an artist?  Programming is an art;
artists do not learn by filling in the squares with numbered colors.

> But what do I know??

More than the HLL gurus.
--
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907
Phone: (317)494-6054
hrubin@l.cc.purdue.edu (Internet, bitnet, UUCP)
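
The sign-convention question about quotient and remainder is concrete enough
to sketch.  The C below (function names invented; truncating division
assumed, as most hardware and C99 provide) shows the three conventions
usually wanted, all derived from the one divide the machine gives you:

    #include <stdio.h>

    typedef struct { long q, r; } qr;

    /* Native convention: truncate toward zero, remainder takes the sign
       of the dividend. */
    static qr div_trunc(long a, long b)
    {
        qr v = { a / b, a % b };
        return v;
    }

    /* Floored division: remainder takes the sign of the divisor
       (Fortran MODULO, Python %). */
    static qr div_floor(long a, long b)
    {
        qr v = div_trunc(a, b);
        if (v.r != 0 && ((v.r < 0) != (b < 0))) { v.q -= 1; v.r += b; }
        return v;
    }

    /* Euclidean division: remainder always non-negative. */
    static qr div_euclid(long a, long b)
    {
        qr v = div_trunc(a, b);
        if (v.r < 0) {
            if (b > 0) { v.q -= 1; v.r += b; }
            else       { v.q += 1; v.r -= b; }
        }
        return v;
    }

    int main(void)
    {
        qr t = div_trunc(-7, 2), f = div_floor(-7, 2), e = div_euclid(-7, 2);
        printf("trunc: %ld r %ld   floor: %ld r %ld   euclid: %ld r %ld\n",
               t.q, t.r, f.q, f.r, e.q, e.r);   /* -3 r -1, -4 r 1, -4 r 1 */
        return 0;
    }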
peter@ficc.uu.net (Peter da Silva) (04/27/89)
Herman. Please describe a language, higher level than forth, that will provide all the tools you feel you need in a HLL. I am about convinced that such a language is impossible. Thanks, your partner in monomania, Peter da Silva. -- Peter da Silva, Xenix Support, Ferranti International Controls Corporation. Business: uunet.uu.net!ficc!peter, peter@ficc.uu.net, +1 713 274 5180. Personal: ...!texbell!sugar!peter, peter@sugar.hackercorp.com.
chuck@melmac.harris-atd.com (Chuck Musciano) (04/27/89)
Oh well, into the breach... In article <1262@l.cc.purdue.edu> cik@l.cc.purdue.edu (Herman Rubin) writes: >No apologies are due to the compiler writers. Rather, criticism is due to >them for the arrogance they took in leaving out the possibility for the >programmmer to do something intelligent. The HLLs are woefully inadequate, >and I would not be surprised that the ability to do intelligent coding using >the machine capabilities may be destroyed by learning such restrictive coding >procedures first. Some misdirected flames here. If you want to blame anyone, blame the language designer, not the compiler writer. Us poor compiler writers just sit around, horrified, as the hardware guys take more and more of the hard part and give it to us. :-) If you dislike the current crop of HLLs so much, you are free to design your own. As one who has designed and implemented several languages on a variety of systems, I know how easy it is to take potshots at the language implementors. Go through the loop yourself, and then complain. Designing anything which will please some segment of the world is very difficult. Designing a language which is elegant, orthogonal, easy to learn and use, easy to implement on a variety of machines and that will appeal to a large number of users is almost impossible. Chuck Musciano ARPA : chuck@trantor.harris-atd.com Harris Corporation Usenet: ...!uunet!x102a!trantor!chuck PO Box 37, MS 3A/1912 AT&T : (407) 727-6131 Melbourne, FL 32902 FAX : (407) 727-{5118,5227,4004}
mccalpin@loligo.cc.fsu.edu (John McCalpin) (04/28/89)
In article <1262@l.cc.purdue.edu> cik@l.cc.purdue.edu (Herman Rubin) writes:
>Another example is floating point arithmetic.  The RISCy CRAY, on problems
>with rigid vectors, will run rings around the CYBER 205 in single precision
>floating point (around 14 digits).  If we now change to double precision,
>we now get a time factor of about 15 in favor of the CYBER.  Many problems
>in which non-rigid vectors are appropriate also favor the CYBER.
>Herman Rubin, Dept. of Statistics, hrubin@l.cc.purdue.edu

(1) What is a "rigid vector"?

(2) On 64-bit vector operations with long vectors, the Crays do not "run
    rings around" the Cyber 205.  The asymptotic speeds (MFLOPS) are:

	Cray-1     Cyber 205     Cray X/MP
	 160          200           235

(3) Both the X/MP and 205 perform "double precision" (128-bit) arithmetic
    in software, and experience a slow-down of close to a factor of 100
    relative to 64-bit vector operations.
--
---------------------- John D. McCalpin ------------------------
Dept of Oceanography & Supercomputer Computations Research Institute
mccalpin@masig1.ocean.fsu.edu    mccalpin@nu.cs.fsu.edu
---------------------------------------------------------------------
dave@celerity.uucp (Dave Smith) (04/29/89)
In article <1262@l.cc.purdue.edu> cik@l.cc.purdue.edu (Herman Rubin) writes:
>In article <288@ctycal.UUCP>, ingoldsb@ctycal.COM (Terry Ingoldsby) writes:
>> My point is that since almost everything is written in high level languages
>> today, they are better suited for RISC.  For applications that still use
>> assembler (e.g. control systems) CISC makes sense.
>
>Must we only use tools that would appal an artist?  Programming is an art;
>artists do not learn by filling in the squares with numbered colors.

A good artist can create fine art with crayons or oil paints (or whatever).
Assembly languages are definitely the crayons of the computer world.  RISCs
are kind of like the little box of Crayolas with 16 colors; a CISC, like the
VAX, is like the big box with 64.  Ever notice how the black, red and blue
crayons always ended up the smallest in that big box, while the mauve crayon
looked brand new?

The problem I have with RISC designs is that they use up too much memory
bandwidth.  What I think would be better is something that gives you the
flexibility of a RISC (not being tied to the designer's particular idea of
how a string instruction should be implemented, for example) but with the
memory bandwidth efficiency of a CISC.  I'm not familiar enough with VLIW to
make any good judgements on it, but it seems as though it's a reasonable way
to go.

David L. Smith
FPS Computing, San Diego
ucsd!celerity!dave
"Repent, Harlequin!," said the TickTock Man
cik@l.cc.purdue.edu (Herman Rubin) (04/29/89)
In article <1984@trantor.harris-atd.com>, chuck@melmac.harris-atd.com (Chuck Musciano) writes:
> Oh well, into the breach...
>
> In article <1262@l.cc.purdue.edu> cik@l.cc.purdue.edu (Herman Rubin) writes:
> >No apologies are due to the compiler writers.  Rather, criticism is due to
> >them for the arrogance they took in leaving out the possibility for the
> >programmer to do something intelligent.  The HLLs are woefully inadequate,
> >and I would not be surprised if the ability to do intelligent coding using
> >the machine capabilities were destroyed by learning such restrictive coding
> >procedures first.
>
>      Some misdirected flames here.  If you want to blame anyone, blame the
> language designer, not the compiler writer.  Us poor compiler writers just
> sit around, horrified, as the hardware guys take more and more of the hard
> part and give it to us. :-)

Not all blame goes to the language designer.  The implementation of asm in C
could be done so as to give much more benefit to the programmer.  The
compiler knows where the variable xyz is and can substitute it into the
assembler instruction.  This has nothing to do with the language.

And this atrocious use of underscores!  Why should C prefix an underscore to
all its names for system purposes?  I know the reason, but I cannot respect
the intelligence of those who did not take the other action.  If we have
programs created by several languages, we should have no more than a calling
sequence problem, not a name problem.  If I call a buffer refill program
with the calling sequence I advocate, passing the pointer descriptor, it
should make no difference whether the subroutine is written in Fortran, C,
Pascal, APL, or anything else, and it should make no difference which of
those languages it is called from.  If they all prepended an underscore, and
each could use the others' names, it would not be too bad.  But some
Fortrans leave the name alone, some prepend and postpend, Pascal does not
allow underscores in the middle of a name, etc.

>      If you dislike the current crop of HLLs so much, you are free to design
> your own.  As one who has designed and implemented several languages on a
> variety of systems, I know how easy it is to take potshots at the language
> implementors.  Go through the loop yourself, and then complain.  Designing
> anything which will please some segment of the world is very difficult.
> Designing a language which is elegant, orthogonal, easy to learn and use,
> easy to implement on a variety of machines and that will appeal to a large
> number of users is almost impossible.

I am asking for one thing you have left out: it should be possible to write
efficient code.  I will throw out elegant and orthogonal completely.  If the
machine is not orthogonally designed, why should the language be?  And most
machines are not.

The first thing needed is a macro assembler in which the "macro name" can be
a pattern.  For example, I would like x = y - z to be the =- macro.  Allow
the user to use these ad lib.  This way the language can be extended.  The
#defines, and even the user-overloaded operators in C++, do not achieve
this.  If a machine instruction is complicated, it may be necessary to have
a complicated design, as well as type overrides, etc.  An example from the
CYBER 205 is the following large class of vector instructions, where A or B
can be either vector or scalar.  There are many options in this, and I point
this out.
	C'W  =t  -|  A'X  opmod  |  B'Y  /\  ~  W
	11   2   34  55   666    7  88   99  a  9

Notice that I have 10 options for each opcode (someone might say that option
6 is part of the opcode, but there are natural defaults).  Since this is a
single machine instruction on this machine, I MUST be able to use it, and I
do not wish to have to use the clumsy notation provided by the "CALL8"
procedures.  I do not claim that this is optimal notation, and I would
expect you to write it somewhat differently.  For example, you could always
leave out the /\; I put it in merely for clarity.

The best semi-portable procedures for generating such random variables as
normal are easily described.  Coding them on a CYBER 205 efficiently is
trivial.  Coding a slightly different version of them on an IBM 3090 is not
difficult.  Coding them (any version) on a CRAY X-MP is an interesting
challenge.  Coding them on a CRAY 1 is a major headache, and not
vectorizable for much of the procedure.  These vector machines are that
different.  So should the code be easy to implement on a large variety of
machines?

But the programmer who does not understand the machine cannot code well on
that machine.  Some C-like language with macro augmentation is probably the
answer.  But types and operation symbols should be introduced by the user at
will.  And all manner of arrays should be included, and whatever else the
user can come up with.
--
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907
Phone: (317)494-6054
hrubin@l.cc.purdue.edu (Internet, bitnet, UUCP)
cik@l.cc.purdue.edu (Herman Rubin) (04/29/89)
In article <632@loligo.cc.fsu.edu>, mccalpin@loligo.cc.fsu.edu (John McCalpin) writes:
> In article <1262@l.cc.purdue.edu> cik@l.cc.purdue.edu (Herman Rubin) writes:
>
> >Another example is floating point arithmetic.  The RISCy CRAY, on problems
> >with rigid vectors, will run rings around the CYBER 205 in single precision
> >floating point (around 14 digits).  If we now change to double precision,
> >we now get a time factor of about 15 in favor of the CYBER.  Many problems
> >in which non-rigid vectors are appropriate also favor the CYBER.
>
> (1) What is a "rigid vector"?

Rigid vector operations are those in which the position of an element is
essentially unchanged, except by scalar shifts.  Examples of non-rigid
vector operations are removing the elements of a vector corresponding to 0's
in a bit vector with subsequent shrinking of the length of the vector,
inserting the first elements of vector a in locations in vector b selected
by a bit vector, merging under control of a bit vector, etc.

> (2) On 64-bit vector operations with long vectors, the Crays do not
>     "run rings around" the Cyber 205.  The asymptotic speeds (MFLOPS) are:
>	Cray-1     Cyber 205     Cray X/MP
>	 160          200           235

Asymptotic speeds are much less often approximated on the CYBER,
unfortunately.  The CYBER also can only do one vector operation at a time,
but there is, in general, no interference on the CYBER between vector and
scalar operations.  I prefer the CYBER myself, and I guess I took the most
pessimistic view.  The actual ratios depend on a lot of things.

> (3) Both the X/MP and 205 perform "double precision" (128-bit) arithmetic
>     in software, and experience a slow-down of close to a factor of 100
>     relative to 64-bit vector operations.

Double precision has 96 bits in the mantissa on the X/MP and 94 on the
CYBER.*  If one is willing to lose 1-2 bits of accuracy on the CYBER, the
slow-down factor can be reduced to around 5.  The CYBER has the direct
capability of getting both the most and least significant parts of the sum
or product, with two instruction calls but no additional overhead, whereas
the CRAYs only get the most significant part; this is the biggest problem,
and requires that half-precision be used to get double precision.

*No flames, please.  There is disagreement on how the number of bits is to
be counted.  This is the number of significant bits in a sign-magnitude
representation.
--
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907
Phone: (317)494-6054
hrubin@l.cc.purdue.edu (Internet, bitnet, UUCP)
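
For the record, the software workaround alluded to above -- recovering the
low-order half of a sum or product so that "double-double" arithmetic can be
built on a machine whose multiply returns only the high part -- looks
roughly like this in modern C.  The fma call is C99; this is the shape of
the trick, not CYBER or CRAY code.

    #include <math.h>

    /* Exact product: hi is the rounded product, lo the rounding error,
       so hi + lo == a * b exactly.  Trivial with a fused multiply-add. */
    static void two_product(double a, double b, double *hi, double *lo)
    {
        *hi = a * b;
        *lo = fma(a, b, -*hi);
    }

    /* Exact sum (Knuth's TwoSum): hi + lo == a + b exactly. */
    static void two_sum(double a, double b, double *hi, double *lo)
    {
        double s  = a + b;
        double bp = s - a;
        *lo = (a - (s - bp)) + (b - bp);
        *hi = s;
    }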
aglew@mcdurb.Urbana.Gould.COM (04/30/89)
>Herman. > >Please describe a language, higher level than forth, that will provide >all the tools you feel you need in a HLL. I am about convinced that such >a language is impossible. > >Thanks, > your partner in monomania, > Peter da Silva. >-- >Peter da Silva, Xenix Support, Ferranti International Controls Corporation. If I remember correctly, POP-2 had some nice mechanisms for accessing the underlying machine. As did Algol-68. Myself, I'm just about happy with GNU CC style assembly function inlining, and C++ function overloading and typing. Although I haven't used G++ yet, to see what they would feel like together.
khb@fatcity.Sun.COM (Keith Bierman Sun Tactical Engineering) (05/01/89)
In article <423@bnr-fos.UUCP> schow@bnr-public.UUCP (Stanley Chow) writes:
> ... cogent argument deleted..
>
>I can hear it now, everyone is jumping up and down saying, "what a fool,
>doesn't he know that all those cycles are free?", "Hasn't he heard of
>pipelining and register scoreboarding?", "but the CISC instructions are
>slower so the RISC will still run faster."
> ... and more
>"Compilers can do optimizations", I hear the yelling.  This is another
>interesting phenomenon - reduce the complexity in the CPU so that the
>compiler must do all these other optimizations.  I have also not seen any
>indication that a compiler can do anywhere close to an optimal job on
>scheduling code or pipelining.  Even discounting the NP-completeness of
>just about everything, theoretical indications point the other way,
>especially when the compiler has to juggle so many conflicting constraints.

Cydrome and Multiflow have both demonstrated that it is possible to move
much of the analysis to the compiler (with an increase in compile times :>).
The original paper on the Bulldog compiler by Ellis (well, it's a book :>)
describes how the memory bandwidth problem can be dealt with, in many cases
quite well.  The Cydra 5 compiler could, in interesting programs (but by no
means all), generate optimal code for key loops (as long as the vectors were
long; but this was a hardware constraint, not a compiler issue).

It should be noted that both Cydrome and Multiflow chose to have an almost
fully exposed pipeline, and no scoreboarding or other nastiness.

Memory bandwidth is key to delivering high performance, but the RISCiness or
CISCiness of the processor (only impacting the instruction side of things)
would seem to be a non-issue.

Keith H. Bierman    |*My thoughts are my own. Only my work belongs to Sun*
It's Not My Fault   | Marketing Technical Specialist
I Voted for Bill &  | Languages and Performance Tools.
Opus                | (* strange as it may seem, I do more engineering now *)
khb@fatcity.Sun.COM (Keith Bierman Sun Tactical Engineering) (05/01/89)
In article <100524@sun.Eng.Sun.COM> dgh%dgh@Sun.COM (David Hough) writes: > ... much good reading deleted.. >Not all code is compiled. For instance, there is a great body of theory >and practice in obtaining computational error bounds in computations >based on interval arithmetic. Interval arithmetic is >efficient to implement with the directed rounding modes required by IEEE >arithmetic, but you can't write the implementation in standard C or >Fortran. In integer arithmetic, the double-precise product of >two single-precise operands, >and the single-precise quotient and remainder of a double-precise >dividend and single-precise divisor, are important in a number of >applications such as base conversion and random number generation, >but there is no way to express the required computations in standard >higher-level languages. > I believe that the next rev of Fortran (new socially approved spelling) will allow us to write this sort of code. I also think that it can be done in Ada. Keith H. Bierman |*My thoughts are my own. Only my work belongs to Sun* It's Not My Fault | Marketing Technical Specialist I Voted for Bill & | Languages and Performance Tools. Opus (* strange as it may seem, I do more engineering now *)
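
In present-day C terms the integer operations in question look like the
sketch below; the 64-bit types are exactly what the standard languages of
the time lacked (function names invented for illustration):

    #include <stdint.h>

    /* Double-width product of two single-width operands. */
    static uint64_t wide_mul(uint32_t a, uint32_t b)
    {
        return (uint64_t)a * b;          /* 32 x 32 -> 64 bit product */
    }

    /* Single-width quotient and remainder of a double-width dividend and
       a single-width divisor (assumes the quotient fits in 32 bits). */
    static void narrow_divmod(uint64_t n, uint32_t d, uint32_t *q, uint32_t *r)
    {
        *q = (uint32_t)(n / d);
        *r = (uint32_t)(n % d);
    }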
khb@fatcity.Sun.COM (Keith Bierman Sun Tactical Engineering) (05/01/89)
In article <39049@bbn.COM> slackey@BBN.COM (Stan Lackey) writes: >In article <100524@sun.Eng.Sun.COM> dgh%dgh@Sun.COM (David Hough) writes: >>In article <38971@bbn.COM>, slackey@bbn.com (Stan Lackey) writes: >>> >>> I have a real problem with anything that includes IEEE floating point >>> AND calls itself a RISC. IEEE FP violates every rule of RISC; it has ............. deleted >I agree - I was just quoting the RISC guys. Which RISC guys ? Note dgh's (and my) corporate affiliation ... :>:> Keith H. Bierman |*My thoughts are my own. Only my work belongs to Sun* It's Not My Fault | Marketing Technical Specialist I Voted for Bill & | Languages and Performance Tools. Opus (* strange as it may seem, I do more engineering now *)
khb@fatcity.Sun.COM (Keith Bierman Sun Tactical Engineering) (05/01/89)
In article <100891@sun.Eng.Sun.COM> dgh%dgh@Sun.COM (David Hough) writes: > >This event is rare enough that it needn't be as fast as a normal >multiplication, so it's OK to slow down somewhat by holding the CPU, >but not so rare that you want to punt to software. By throwing enough >hardware at the problem you can make it as fast as the normal case. >I don't advocate that but that's my understanding of what the Cydra-5 did. Well, the details are probably the only things of value for the Cydrome investors to fence (I mean sell :>) so we won't go into the details. The Cydra 5 was constructed so that fp mults could be _issued_ EVERY cycle, and took a fixed number of cycles to complete. There was not enough real estate (it was only a 17-board ECL numeric engine) to do the same for divide... This was something of a performance bottleneck (divide may be rare, but when it happens, it happens often! :>) Keith H. Bierman |*My thoughts are my own. Only my work belongs to Sun* It's Not My Fault | Marketing Technical Specialist I Voted for Bill & | Languages and Performance Tools. Opus (* strange as it may seem, I do more engineering now *)
yair@tybalt.caltech.edu (Yair Zadik) (05/01/89)
In article <231@celit.UUCP> dave@celerity.UUCP (Dave Smith) writes: > The problem I have with RISC designs are that they use up too much >memory bandwidth. What I think would be better was something that gave you >the flexibility of a RISC (not being tied into the designer's particular >idea of how a string instruction should be implemented, for example) but >with the memory bandwidth efficiency of a CISC. I'm not familiar enough >with VLIW to make any good judgements on it, but it seems as though it's >a reasonable way to go. > >David L. Smith >FPS Computing, San Diego >ucsd!celerity!dave >"Repent, Harlequin!," said the TickTock Man A couple of years ago there was an article in Byte about a proposed design which they called WISC for Writeable Instruction Set Computer. The idea was to do a RISC or microcoded processor which had an on-board memory containing macros which behaved like normal instructions (I guess it was on EEPROM-like memory). That way, each compiler could optimize the instruction set for its language. The end result (theoretically) is that you get the efficiency of RISC with the memory bandwidth of CISC. I haven't heard anything else about it. Is anyone out there working on such a processor or is it just a bad idea? Yair Zadik yair@tybalt.caltech.edu
stuart@bms-at.UUCP (Stuart Gathman) (05/01/89)
The GCC asm() facility gives an excellent interface to special CISC instructions. One can code any arbitrary assembler code with register and address substitutions for C variables, specify input, output, and scratch registers, and put it in a macro to disguise it as a function call (or put it in an inline function). Portability is maintained by proper design of the function interface; machines that don't have a similar instruction can use a real function. Turbo C has an inline assembler capability with similar features. It is geared specifically to '86 code, however. Instead of specifying the register environment in the asm, the compiler knows which instructions affect which registers. Automatic register & address substitution for C variables is available here also. With this capability, the only difference between custom inline CISC instructions and standard operators is syntactic. Using C++ can help that also. -- Stuart D. Gathman <stuart@bms-at.uucp> <..!{vrdxhq|daitc}!bms-at!stuart>
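As a concrete sketch of that wrapper style, assuming a GNU CC target where the 386 bit-scan instruction is available: the extended-asm constraints follow ordinary GCC usage, but the function name and details are illustrative, not a tested implementation. Machines without the instruction fall through to plain C behind the same interface.

    /* One CISC-ish instruction (the 386 bit-scan-forward) reached
     * through GCC extended asm, with an ordinary C fallback behind the
     * same interface for machines that lack it.
     */
    static inline int find_first_set(unsigned int x)  /* index of lowest 1 bit, -1 if none */
    {
        if (x == 0)
            return -1;
    #if defined(__GNUC__) && defined(__i386__)
        {
            int bit;
            __asm__("bsfl %1, %0" : "=r" (bit) : "rm" (x));
            return bit;
        }
    #else
        {
            int bit = 0;                /* portable "real function" body */
            while ((x & 1u) == 0) {
                x >>= 1;
                bit++;
            }
            return bit;
        }
    #endif
    }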
bullerj@handel.colostate.edu (Jon Buller) (05/03/89)
In article <10544@cit-vax.Caltech.Edu> yair@tybalt.caltech.edu.UUCP (Yair Zadik) writes: >A couple of years ago there was an article in Byte about a proposed design >which they called WISC for Writeable Instruction Set Computer. The idea >was to do a RISC or microcoded processor which had an on board memory >containing macros which behaved like normal instructions (I guess it was >on EEPROM like memory). That way, each compiler could optimize the >instruction set for its language. The end result (theoreticly) is that >you get the efficiency of RISC with the memory bandwith of CISC. I haven't >heard else about it. Is anyone out there working on such a processor or is >it just a bad idea? > >Yair Zadik >yair@tybalt.caltech.edu The only problem with this is that doing a context switch is nearly impossible. Imagine not only saving registers but having to swap out microcode and instructions too. Not to mention that porting a compiler to do this might be a lot harder. Portability would be sure to take an incredible hit: one machine's microcode can do x and y in parallel, machine z has hardware to do operation w... I think it would be good for a controller, or some compute server that does one thing, but that is probably better done with a custom chip or a coprocessor with that particular microcode built in from the start. Doing something like this would then lead to virtual microcode, which I heard something about once, but I don't think I'd ever want to see it in use. I think about the only use for something like this would be a lab machine to test out different machine styles (i.e., what would happen if a 68000 had instructions to... or can we do the same thing without...) Well, that's my $0.02 worth, and probably wrong too. I'd like to hear better/other ideas, but finals are in 5 days, and then this account goes away permanently... ------------------------------------------------------------------------------- Jon Buller FROM fortune IMPORT quote; ..!ccncsu!handel!bullerj FROM lawyers IMPORT disclaimer;
tim@crackle.amd.com (Tim Olson) (05/03/89)
In article <10544@cit-vax.Caltech.Edu> yair@tybalt.caltech.edu.UUCP (Yair Zadik) writes: | A couple of years ago there was an article in Byte about a proposed design | which they called WISC for Writeable Instruction Set Computer. The idea | was to do a RISC or microcoded processor which had an on board memory | containing macros which behaved like normal instructions (I guess it was | on EEPROM like memory). That way, each compiler could optimize the | instruction set for its language. The end result (theoreticly) is that | you get the efficiency of RISC with the memory bandwith of CISC. I haven't | heard else about it. Is anyone out there working on such a processor or is | it just a bad idea? "WISC" is just a new term for how most people build microcoded machines (SRAMs are faster than EPROMS/ROMS). I don't see how you can get "the efficiency of RISC with the memory bandwidth of CISC" using such a design. The way CISCs attempt to reduce memory bandwidth is to make an instruction do as much as possible, so fewer are needed to perform an operation. This is the antithesis of RISC, which, by using simple "building-block" instructions, allows the compiler to perform many more optimizations. The way to reduce memory bandwidth while maintaining performance is to change the Writeable Control Store into an instruction cache. -- Tim Olson Advanced Micro Devices (tim@amd.com)
ted@nmsu.edu (Ted Dunning) (05/03/89)
In article <1827@ccncsu.ColoState.EDU> bullerj@handel.colostate.edu (Jon Buller) writes: In article <10544@cit-vax.Caltech.Edu> yair@tybalt.caltech.edu.UUCP (Yair Zadik) writes: >A couple of years ago there was an article in Byte about a >proposed design >which they called WISC for Writeable Instruction >Set Computer. The idea ... The only problem with this is that doing a context switch is nearly impossible. Imagine not only saving registers but having to swap out microcode and instructions too. ... GOLLY, do you think that maybe we could build some cool hardware that would keep track and only swap out the parts of the microcode that were different, or maybe even only swap in the parts that were new, and then why stop there? I mean, like, let's swap parts of the user program and data into and out of this fast control store. And let's make the backing store be main memory so it is easier to get to... isn't this leading right back to a normal risc with a cache that allows programs to share executable segments?
khb%chiba@Sun.COM (Keith Bierman - SPD Languages Marketing -- MTS) (05/04/89)
In article <10544@cit-vax.Caltech.Edu> yair@tybalt.caltech.edu.UUCP (Yair Zadik) writes: ...... >>"Repent, Harlequin!," said the TickTock Man > >A couple of years ago there was an article in Byte about a proposed design >which they called WISC for Writeable Instruction Set Computer. The idea >was to do a RISC or microcoded processor which had an on board memory >containing macros which behaved like normal instructions (I guess it was >on EEPROM like memory). That way, each compiler could optimize the >instruction set for its language. The end result (theoreticly) is that >you get the efficiency of RISC with the memory bandwith of CISC. I haven't >heard else about it. Is anyone out there working on such a processor or is >it just a bad idea? > Honeywell and Buro..whoops UNISYS had mainframes like this, well over a decade ago. Byte lives at the cutting edge.... Performance is similar to having a RISC with a separate instruction cache (of some reasonable size). This is, in fact, often better than the WISC approach because different programs (we do live in a world full of context switches) probably want different "microcode" ... cheers Keith H. Bierman |*My thoughts are my own. Only my work belongs to Sun* It's Not My Fault | Marketing Technical Specialist ! kbierman@sun.com I Voted for Bill & | Languages and Performance Tools. Opus (* strange as it may seem, I do more engineering now *)
cquenel@polyslo.CalPoly.EDU (24 more school days) (05/04/89)
In article <10544@cit-vax.Caltech.Edu> (Yair Zadik) writes: |WISC for Writeable Instruction Set Computer. The idea ... In article <1827@ccncsu.ColoState.EDU> (Jon Buller) writes: | The only problem with this is that doing a context switch is nearly | impossible. Imagine not only saving registers but having to swap | out microcode and instructions too. ... In 9690 ted@nmsu.edu (Ted Dunning) sez: |isn't this leading right back to a normal risc with a cache that |allows programs to share executable segments? Actually, no. The point is that micro-code is much more static over the life of a process. A separate cache of already-broken-down, easy-to-execute micro-code would be carrying RISC to an extreme (simple instructions), but would get around the icache/bandwidth problem inherent in conventional RISCs. -- @---@ ----------------------------------------------------------------- @---@ \. ./ | Chris (The Lab Rat) Quenelle cquenel@polyslo.calpoly.edu | \. ./ \ / | You can keep my things, they've come to take me home -- PG | \ / ==o== ----------------------------------------------------------------- ==o==
peter@ficc.uu.net (Peter da Silva) (05/04/89)
In article <10544@cit-vax.Caltech.Edu>, yair@tybalt.caltech.edu (Yair Zadik) writes: > A couple of years ago there was an article in Byte about a proposed design > which they called WISC for Writeable Instruction Set Computer. > ...each compiler could optimize the instruction set for its language. Sounds like a great idea for an embedded controller, but can you imagine what context switches would be like in a general purpose environment with multiple supported compilers...? -- Peter da Silva, Xenix Support, Ferranti International Controls Corporation. Business: uunet.uu.net!ficc!peter, peter@ficc.uu.net, +1 713 274 5180. Personal: ...!texbell!sugar!peter, peter@sugar.hackercorp.com.
cliff@ficc.uu.net (cliff click) (05/04/89)
In article <10544@cit-vax.Caltech.Edu>, yair@tybalt.caltech.edu (Yair Zadik) writes: > In article <231@celit.UUCP> dave@celerity.UUCP (Dave Smith) writes: > > The problem I have with RISC designs are that they use up too much > >memory bandwidth. > A couple of years ago there was an article in Byte about a proposed design > which they called WISC for Writeable Instruction Set Computer. > I haven't heard else about it. Is anyone out there working on such a > processor or is it just a bad idea? A couple of years ago Phil Koopman took his WISC stuff to Harris - I think they're working with it. He had a 32-bit CPU built from off-the-shelf TTL logic that plugged into an IBM PC and ran at 10 MHz. It was stack-based, with a Harvard architecture and a completely writable micro-code store. He had some amazing throughput numbers on it, and had tweaked micro-code for Prolog, C and some other stuff (Lisp?). Anyhow, Harris is supposed to be putting together a chip from it. -- Cliff Click, Software Contractor at Large Business: uunet.uu.net!ficc!cliff, cliff@ficc.uu.net, +1 713 274 5368 (w). Disclaimer: lost in the vortices of nilspace... +1 713 568 3460 (h).
henry@utzoo.uucp (Henry Spencer) (05/04/89)
In article <10544@cit-vax.Caltech.Edu> yair@tybalt.caltech.edu.UUCP (Yair Zadik) writes: >... WISC for Writeable Instruction Set Computer. The idea >was to do a RISC or microcoded processor which had an on board memory >containing macros which behaved like normal instructions (I guess it was >on EEPROM like memory). That way, each compiler could optimize the >instruction set for its language. The end result (theoreticly) is that >you get the efficiency of RISC with the memory bandwith of CISC. I haven't >heard else about it. Is anyone out there working on such a processor or is >it just a bad idea? Consider a well-built RISC, with an instruction cache, executing an interpreter that fetches bytes from memory and interprets them as if they were, say, 8086 instructions. Assuming that the interpreter fits in the I-cache, in what way does this differ from the WISC idea? Context switching between interpreters is trivial, you can write them in high-level languages, and if you really want *speed*, you can forget the interpreter and just compile real code. In short, it's an excellent idea and everyone is already doing it, but without some of the limitations that result from thinking in terms of microcode and EEPROM. -- Mars in 1980s: USSR, 2 tries, | Henry Spencer at U of Toronto Zoology 2 failures; USA, 0 tries. | uunet!attcan!utzoo!henry henry@zoo.toronto.edu
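A toy version of that interpreter, to make the point concrete: the "writeable instruction set" is just bytes in memory, decoded by a small C loop that fits easily in an instruction cache. The opcodes and the little four-register machine below are invented for illustration, not any real byte code.

    /* The "instruction set" is just bytes in memory; the interpreter is
     * a small C loop that stays resident in the I-cache.
     */
    #include <stdio.h>

    enum { OP_LOADI, OP_ADD, OP_PRINT, OP_HALT };

    static void interpret(const unsigned char *code)
    {
        long reg[4] = { 0, 0, 0, 0 };

        for (;;) {
            switch (*code++) {
            case OP_LOADI:                          /* loadi rd, imm8   */
                reg[code[0]] = code[1];
                code += 2;
                break;
            case OP_ADD:                            /* add rd, rs1, rs2 */
                reg[code[0]] = reg[code[1]] + reg[code[2]];
                code += 3;
                break;
            case OP_PRINT:                          /* print r          */
                printf("%ld\n", reg[*code++]);
                break;
            case OP_HALT:
                return;
            }
        }
    }

    int main(void)
    {
        static const unsigned char prog[] = {
            OP_LOADI, 0, 2,                         /* r0 = 2           */
            OP_LOADI, 1, 3,                         /* r1 = 3           */
            OP_ADD,   2, 0, 1,                      /* r2 = r0 + r1     */
            OP_PRINT, 2,                            /* prints 5         */
            OP_HALT
        };
        interpret(prog);
        return 0;
    }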
news@ism780c.isc.com (News system) (05/05/89)
In article <10544@cit-vax.Caltech.Edu> yair@tybalt.caltech.edu.UUCP (Yair Zadik) writes: >A couple of years ago there was an article in Byte about a proposed design >which they called WISC for Writeable Instruction Set Computer. The idea >was to do a RISC or microcoded processor which had an on board memory >containing macros which behaved like normal instructions (I guess it was >on EEPROM like memory). That way, each compiler could optimize the >instruction set for its language. The end result (theoreticly) is that >you get the efficiency of RISC with the memory bandwith of CISC. I haven't >heard else about it. Is anyone out there working on such a processor or is >it just a bad idea? Yes, it is a bad idea. In the mid-60s I was at Standard Computer (no longer in existence) and I actually built such a machine. It was called the Standard EX01 (EX01 was for experimental number 1). The user could dynamically alter the instruction set of the machine. The machine was microcoded and had a writable control store. The 'basic' instruction set provided a mechanism for writing to control storage. In practice we found that it was impossible to make the thing work because any modification to the control store could affect the 'basic' instruction behavior. As an example, one of the problems we found was that when running the 'FORTRAN' instructions, double precision floating divide produced the wrong answer if the instruction was executed at the same time as a tape unit was reading a file mark. I decided that there was no way to support a machine like that in the field, so the experiment was terminated. Marv Rubinstein
greg@cantuar.UUCP (G. Ewing) (05/09/89)
Yair Zadik (yair@tybalt.caltech.edu.UUCP) writes: >A couple of years ago there was an article in Byte about a proposed design >which they called WISC for Writeable Instruction Set Computer. Well, maybe the performance improvement would be debatable, but what the heck - I think it would be fun! In fact, I'd like to go further and make the processor sort of a big writeable PAL! Rearrange the hardware according to the task at hand. A WHISC (Writeable Hardware Interconnection Scheme Computer)? Greg Ewing, Computer Science Dept, Canterbury Univ., Christchurch, New Zealand UUCP: ...!{watmath,munnari,mcvax,vuwcomp}!cantuar!greg Internet: greg@cantuar.uucp +-------------------------------------- Spearnet: greg@nz.ac.canterbury.cantuar | A citizen of NewZealandCorp, a Telecom: +64 3 667 001 x6367 | wholly-owned subsidiary of Japan Inc.