rcd@ico.ISC.COM (Dick Dunn) (03/21/89)
In article <37196@bbn.COM>, slackey@bbn.com (Stan Lackey) writes:
> RISC is indeed a technology window, driven largely by the amount of
> stuff you can fit in a chip...

OK, fair 'nuff.  As soon as we can put an unlimited amount of stuff on a
chip (and do it without increasing delays or other limitations), we'll be
beyond that technology window, I guess...

>...Look at what is being added now that you
> can fit more than a simple CPU core in a chip:
[7 examples which are not in the least at variance with RISC approach]

> The trend in computer evolution is truly toward greater hardware
> complexity.  This has been demonstrated countless times...

Sure.  Just look how much more complex a CDC 6600 is than, say, a 7090.
(For the younger set: it's several times less complex.)  No, the trends
for faster machines have frequently involved producing *simpler* designs
because it wasn't possible to make a machine fast with all the extra
baggage.

> There is a true need for complexity...

Not demonstrated.

>...How many times when reading this
> newsgroup do you see things like, "Yes but that chip doesn't have <my
> favorite feature>"...

Rarely, if ever...but even if I did, so what?  <my favorite feature> is a
lousy criterion for what to put in hardware.  Folks, we're not discussing
machines to indulge your whims; we're talking about what it takes to get
a job done.

> Companies must make money.  They will do this by making not tiny
> low-cost RISC micros, but the most complex thing they can fit in a
> chip...

Sure.  That's why Sun introduced the very complex SPARC as a successor to
the much simpler 680x0 machines...or why intel came out with the more
complex 860 to up the ante over the RISCy 386, right???
--
Dick Dunn    UUCP: {ncar,nbires}!ico!rcd    (303)449-2870
   ...Never offend with style when you can offend with substance.
jrg@Apple.COM (John R. Galloway) (03/22/89)
In article <15702@clover.ICO.ISC.COM>, rcd@ico.ISC.COM (Dick Dunn) writes:
> In article <37196@bbn.COM>, slackey@bbn.com (Stan Lackey) writes:
> > RISC is indeed a technology window, driven largely by the amount of
> > stuff you can fit in a chip...
>
> OK, fair 'nuff.  As soon as we can put an unlimited amount of stuff on a
> chip (and do it without increasing delays or other limitations), we'll be
> beyond that technology window, I guess...

Well, actually, only while the "extra" space is less than a full CPU; as
soon as it is, we will just get multiple CPUs on a chip, and they may well
still be RISC oriented.  In fact, with the extra cost of packaging, I
could imagine that as soon as this point is approached, all extras will be
stripped off to squeeze the extra one in.

apple!jrg    John R. Galloway, Jr.    contract programmer, San Jose, Ca
These are my views, NOT Apple's, I am a GUEST here, not an employee!!
mash@mips.COM (John Mashey) (03/22/89)
In article <27681@apple.Apple.COM> jrg@Apple.COM (John R. Galloway) writes: >In article <15702@clover.ICO.ISC.COM>, rcd@ico.ISC.COM (Dick Dunn) writes: >> In article <37196@bbn.COM>, slackey@bbn.com (Stan Lackey) writes: >> > RISC is indeed a technology window, driven largely by the amount of >> > stuff you can fit in a chip... >> >> OK, fair 'nuff. As soon as we can put an unlimited amount of stuff on a >> chip (and do it without increasing delays or other limitations), we'll be >> beyond that technology window, I guess... .... >Well actually only while the "extra" space is less than a full cpu, as soon >as it is we will just get multiple cpus on a chip and they may well still be >RISC oriented. In fact with the extra cost of packaging, I could imagine >that as soon as this point is approached all extras will be stripped off to >squeeze the extra one in. Can anyone tell us where to get some of this kind of silicon? (the kind you can put unlimited stuff on :-) we want some. I'm sure Intel, Moto, Sun would like some also. Seriously, I doubt that anyone has silicon to burn. In particular, the faster the chips get, the more it hurts you to go off chip. Bigger on-chip caches [I, D, or TLB] keep you on-chip more, and are therefore good. With more hardware, you can make integer multiplies go faster, make FP go faster, and maybe put in some multiple FP units, a la CDC 6600s [and these things chew up area]. Note that Intel, with a million transistors, said the space budget didn't leave room for an IEEE divide..... (Compcon paper). I suspect it will be some time before people replicate the CPUs on a chip, just because there's nothing else to do with the silicon. a) It's hard to get enough bandwidth in and out of these chips, i.e., I/Os cost money. b) If you replicate CPUs on a chip, it would like more bandwidth. c) If you double the size of a giant-monster-chip, its yield might get a lot worse... -- -john mashey DISCLAIMER: <generic disclaimer, I speak for me only, etc> UUCP: {ames,decwrl,prls,pyramid}!mips!mash OR mash@mips.com DDD: 408-991-0253 or 408-720-1700, x253 USPS: MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
lamaster@ames.arc.nasa.gov (Hugh LaMaster) (03/23/89)
In article <15695@winchester.mips.COM> mash@mips.COM (John Mashey) writes:
>faster, make FP go faster, and maybe put in some multiple FP units,
>a la CDC 6600s [and these things chew up area].  Note that Intel, with
>a million transistors, said the space budget didn't leave room for
>an IEEE divide..... (Compcon paper).

Things are getting to a very interesting stage, however.  I guess it is
just my big-machine history showing again, but I keep wondering why, with
a decent-sized register file, it wouldn't make more sense to put all the
FP/ALU hardware on the same chip as the control unit, along with the
instruction cache, and leave the MMU and the data cache (which is almost
always going to be larger than what you can put on a chip, no matter how
large chips get) to an external implementation.  This also leaves more
flexibility in cache design/choice, which is reasonable since data cache
design is very dependent on what the chip is going to be used for anyway.
I would expect to see a second MMU/cache chip available for people who
want to use it.  It would also make some graphics/vector designs easier
to deal with (at least potentially).

Last year, the answer always was: "High speed arithmetic (integer and/or
FP) is a specialty area."  I would think that the success of this year's
crop of high-arithmetic-performance systems would have dispelled that
notion by now.

So, my question is: If you ASSUME that you have to have high speed
arithmetic, what is the best way to partition functions between chips?
I believe that the best way is Control, ALU/FPU, and instruction cache on
one chip, and data cache/MMU on another chip.  Why doesn't the market
agree with me?

  Hugh LaMaster, m/s 233-9,     UUCP ames!lamaster
  NASA Ames Research Center     ARPA lamaster@ames.arc.nasa.gov
  Moffett Field, CA 94035
  Phone:  (415)694-6117
lamaster@ames.arc.nasa.gov (Hugh LaMaster) (03/23/89)
In article <22974@ames.arc.nasa.gov> lamaster@ames.arc.nasa.gov (Hugh LaMaster) writes: > > (About some of his favorite topics.) > Of course the Motorola 88K does essentially what I asked, although the cache setup is slightly different. They don't have AS MUCH FPU hardware as was being talked about in what I was responding to, but conceptually, they are already doing it, so my point was confused. The Clipper also. My point was to explore which designs minimize the off-chip bandwidth required, and why, in the context of 1M+ transistor chips, assuming that high performance arithmetic is a given. Hugh LaMaster, m/s 233-9, UUCP ames!lamaster NASA Ames Research Center ARPA lamaster@ames.arc.nasa.gov Moffett Field, CA 94035 Phone: (415)694-6117
davidsen@steinmetz.ge.com (Wm. E. Davidsen Jr) (03/24/89)
It's getting harder to tell RISC from CISC in some cases.  If a computer
has one instruction to do something like:
	*(++a) = (*b)++
I would feel that it is CISC.  If it executes that instruction in one
cycle and doesn't use microcode, I would find it hard to argue that it is
not RISC.  Processors like the N10 are approaching 1 op/cycle, and the
i80486 is rumored to average < 2 cycles for non-F.P. operations.

  I believe that people are taking RISC as a personal issue in some
cases, rather than as a method of getting real work (i.e., that done by
the people who pay for the computer, or on their behalf) done in less
time and/or for less money.  If the processor becomes so fast that it
requires memory bandwidth which is unachievable or unaffordable, then the
true speed of the processor is reduced by the wait states introduced.  I
think that in the next five years we will see processors which outrun
off-chip memory, and certainly now there are a lot of processors running
at less than full speed due to the cost of fast memory.

  There will continue to be a demand for processors with a very high
instruction rate (call them RISC if you will), and also for processors
which will perform a given task faster with limited memory bandwidth.
Vendors will continue to pick the complexity which they feel provides the
greatest cost effectiveness for the entire product based on the CPU.
Cycles per operation will come down for all vendors, because they have
the techniques to use that approach.

  I also think that vendors who are now regarded as RISC vendors will add
complexity to their instruction sets, provided that (1) it doesn't slow
the chip on other operations, (2) it doesn't take real estate which could
be used for things which would improve overall performance more, and (3)
the benefit in code size and performance (due to fewer instructions) is
readily measurable.

________________________________________________________________
The ultimate RISC machine: a one-bit opcode;
	0 = conditional branch, 1 = nop to fill the delay slots  ;-)
--
  bill davidsen  (wedu@crd.GE.COM)
  {uunet | philabs}!steinmetz!crdos1!davidsen
"Stupidity, like virtue, is its own reward" -me
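For concreteness, here is a C rendering (a sketch only, tied to no
particular instruction set) of the simple single-cycle steps a load/store
machine would issue where the CISC above spends one instruction:

    #include <stdio.h>

    int main(void)
    {
        int src = 5, dst[2] = {0, 0};
        int *a = &dst[0], *b = &src;
        int t;

        /* CISC: one instruction does all of   *(++a) = (*b)++;
         * RISC: one primitive operation per line:              */
        a = a + 1;        /* increment address register a       */
        t = *b;           /* load through b                     */
        *a = t;           /* store old value through a          */
        t = t + 1;        /* integer add                        */
        *b = t;           /* store incremented value back       */

        printf("dst[1] = %d, src = %d\n", dst[1], src);  /* 5 and 6 */
        return 0;
    }

Whether five simple operations beat one multi-cycle microcoded
instruction is exactly the question at issue.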
w-colinp@microsoft.UUCP (Colin Plumb) (03/24/89)
lamaster@ames.arc.nasa.gov (Hugh LaMaster) wrote: > So, my question is: If you ASSUME that you have to have high speed > arithmetic, what is the best way to partition functions between chips? > I believe that the best way is Control, ALU/FPU, and instruction cache > on one chip, and data cache/MMU on another chip. Why doesn't the market > agree with me? Well, given that latency to memory is a serious problem these days, and that MMU address translation is often on the critical path, moving it off-chip doesn't sound like such a good idea. I think the MIPS approach is the best: MMU and cache *control* on chip; the actual data (which can be a trifle slower) can be put in external SRAM. SRAMs have a large market, so even ultra-fast ones are comparatively cheap. Associative memory is much more expensive. I've said it before: I'm *astounded* nobody else has used this idea. It's such a great Win. Cache control is the custom bit, so do it in custom logic. With all the rest of the custom logic: on the microprocessor. Cache RAM is very generic. So don't re-invent the wheel. Has anyone out there (other than MIPS, of course) considered this scheme and then rejected it? Is my enthusiasm blind to some Great Problem? -- -Colin (uunet!microsoft!w-colinp) "Don't listen to me. I never do." - The Doctor
marc@oahu.cs.ucla.edu (Marc Tremblay) (03/25/89)
In article <51@microsoft.UUCP> w-colinp@microsoft.uucp (Colin Plumb) writes:
>lamaster@ames.arc.nasa.gov (Hugh LaMaster) wrote:
>> So, my question is: If you ASSUME that you have to have high speed
>> arithmetic, what is the best way to partition functions between chips?
>> I believe that the best way is Control, ALU/FPU, and instruction cache
>> on one chip, and data cache/MMU on another chip.  Why doesn't the market
>> agree with me?

I also believe that putting the integer unit and the FPU on the same chip
makes sense.  These two units have to communicate quickly, possibly
sharing registers, and the FPU depends on the core section for its flow
of instructions.  I think that the trend is toward putting them on the
same chip anyway.  Floating-point coprocessors were very detached from
the processor when they first came out (although surprisingly enough the
8087 was a little closer), especially when you consider that just setting
up FPU instructions could take around 10 cycles!  The MIPS approach,
i.e., making the coprocessor (R3010) closely coupled, is a huge
improvement, especially regarding the instruction-issuing overhead.
The new trend?  Since the FPU needs the core unit, put it on-chip (both
the Motorola 88000 and the Intel i860 have the FPU on-chip).  Since you
*currently* have to go off-chip to access reasonably large caches, you
might as well put the MMU with the caches.  The idea in Hugh LaMaster's
comment above may introduce problems for accessing the instruction cache,
though, especially if it is physical.

>Well, given that latency to memory is a serious problem these days, and
>that MMU address translation is often on the critical path, moving
>it off-chip doesn't sound like such a good idea.

My reasoning is:
	access to reasonable cache           -> need to go off-chip
	MMU is used to access cache          -> need to go off-chip
	since you need to go off-chip anyway -> put MMU off-chip
	floating-point computations          -> can be done internally
	FPU *needs* the integer unit         -> put it close to the processor
	close to the processor               -> at least closely coupled,
	                                        better on-chip

>I've said it before: I'm *astounded* nobody else has used this idea.
>It's such a great Win.  Cache control is the custom bit, so do it
>in custom logic.  With all the rest of the custom logic: on the
>microprocessor.  Cache RAM is very generic.  So don't re-invent the
>wheel.

The FPU is also quite custom! :-)  --> put it on the same chip!

>Has anyone out there (other than MIPS, of course) considered this scheme
>and then rejected it?  Is my enthusiasm blind to some Great Problem?

I think that one of the reasons why some companies have rejected it is
that the size of a chip with integer + FPU is HUGE.  The R3010, a great
FPU coprocessor, with all its custom logic and its 75,000 transistors, is
quite large (about 8.4 x 8.8 mm), especially when you compare it to an
MMU.  It is easier (in terms of area) to put an MMU on-chip than an FPU,
at least for a good FPU!

				Marc Tremblay
				marc@CS.UCLA.EDU
				Computer Science Department, UCLA
alan@rnms1.uucp (0000-Alan Lovejoy(0000)) (03/25/89)
In article <13404@steinmetz.ge.com> davidsen@crdos1.UUCP (bill davidsen) writes: > If [a processor] executes [high semantic content] instruction[s] in one >cycle and doesn't use microcode, I would find it hard to argue that it >is not RISC. Processors like the N10 are approaching 1 op/cycle, and the >i80486 is rumored to have an average of < 2 for non-F.P. operations. The issue here seems to be whether RISC is defined by the physical characteristics of a CPU implementation (microcode, cycles per instruction, cacheing, pipelining, number of instructions, instruction lengths...) or by the logical characteristics of an architecture (number of registers, addressing modes, instruction semantics...). The general consensus seems to be that the primary determinant is the physical implementation, but that the logical architecture heavily influences to what extent RISCy implementation techniques can be used. Except that RISC is not just a set of implementation techniques, but a design methodology and philosophy: objectively determine what the cost/benefit ratio of each proposed feature or mechanism is, and use this ratio as the priority for deciding what to put in the architecture or in the implementation. > If the processor becomes so fast that it requires memory >bandwidth which is unachievable or unaffordable[,] then the true speed of >the processor is reduced by the wait states introduced. > There will continue to be a demand for processors with a very high >instruction rate (call them RISC if you will), and also for processors >which will perform a given task faster with limited memory bandwidth. >Vendors will continue to pick the complexity which they feel provides >the greatest cost effectiveness for the entire product based on the CPU. >Cycles per operation will come down for all vendors, because they have >the techniques to use that approach. Single-cycle instructions are not just a function of how much hardware, or how many parallel functional units, you can put on a chip. Instructions whose semantics require off-chip data accesses cannot be completed until the off-chip data is fetched, no matter how many transistors you put on the chip. What should the CPU be doing while it waits for the off-chip data? With a Harvard architecture, data and instruction fetching are independent, so the fact that you fetched one instruction to do the work of three doesn't help your off-chip data-access bandwidth at all. The biggest performance bottleneck is in data fetching, not instruction fetching. Instruction-fetching bottlenecks that do exist are much more easily addressed by caches and pipelines than data-fetching bottlenecks are. The other issue is the granularity of instruction semantics. The greater the granularity, the more opportunities there are for code optimization. It also increases the generality of the instruction set, making it more likely that all instructions will be used by (and useful to) more applications. The greater the difference in semantic level (primitiveness) between the machine instructions and high-level language statements, the easier it is to have high-quality code generation for a wide variety of languages. Alan Lovejoy; alan@pdn; 813-530-2211; AT&T Paradyne: 8550 Ulmerton, Largo, FL. Disclaimer: I do not speak for AT&T Paradyne. They do not speak for me. __American Investment Deficiency Syndrome => No resistance to foreign invasion. Motto: If nanomachines will be able to reconstruct you, YOU AREN'T DEAD YET.
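A small example of the granularity argument, in C: a loop whose body is
one coarse "add to memory" instruction, versus the same work broken into
primitive steps that a compiler is free to rearrange.  (The example is
illustrative, not from any particular machine.)

    /* Coarse-grained: each iteration is one memory-to-memory add,
     *     ADD  a[i], c     ; address computed inside the instruction
     * Fine-grained, as an optimizer sees it: */
    void add_const(int *a, int n, int c)
    {
        int i;
        for (i = 0; i < n; i++) {
            int t = a[i];    /* load  */
            t = t + c;       /* add   */
            a[i] = t;        /* store */
        }
        /* With the steps exposed, the address arithmetic can be
         * strength-reduced to a pointer increment, the load can be
         * scheduled early, and t stays in a register -- none of
         * which is visible inside one complex instruction. */
    }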
tada@athena.mit.edu (Michael Zehr) (03/25/89)
In article <15695@winchester.mips.COM> mash@mips.COM (John Mashey) writes:
>In article <27681@apple.Apple.COM> jrg@Apple.COM (John R. Galloway) writes:
>>In article <15702@clover.ICO.ISC.COM>, rcd@ico.ISC.COM (Dick Dunn) writes:
>>> In article <37196@bbn.COM>, slackey@bbn.com (Stan Lackey) writes:
>>> > RISC is indeed a technology window, driven largely by the amount of
>>> > stuff you can fit in a chip...
>>> OK, fair 'nuff.  As soon as we can put an unlimited amount of stuff on a
>>Well actually only while the "extra" space is less than a full cpu, as soon
>I suspect it will be some time before people replicate the CPUs on
>a chip, just because there's nothing else to do with the silicon.
>	a) It's hard to get enough bandwidth in and out of these chips,
>	i.e., I/Os cost money.

Professor Dally of MIT has been saying something along those lines for a
couple of years.  Instead of having a whole bunch of memory chips with
one path to a fast CPU, and a cache to prevent slow accesses, take each
of those memory chips, halve the amount of memory on them, and put a CPU
on each.  You no longer have a memory bandwidth problem, because the
memory and CPU are on the same (small, easy-to-make) chip.  Instead of
putting a cache on the chip (you don't need one), put in a communications
circuit to transfer data to the other chips.  If you're interested in
more information, he's working on something called a J-machine, which
will probably be in prototype stage sometime this summer (I think).

Which would you rather have -- one CPU that runs at 50 MIPS with 72
1-Mbit memory chips (8 megabytes * 9 chips per byte), or 72 10-MIPS
processors with 4 megabytes of memory split among them?

-michael j zehr
rodman@mfci.UUCP (Paul Rodman) (03/25/89)
In article <22974@ames.arc.nasa.gov> lamaster@ames.arc.nasa.gov (Hugh LaMaster) writes:
>
>So, my question is: If you ASSUME that you have to have high speed
>arithmetic, what is the best way to partition functions between chips?
>I believe that the best way is Control, ALU/FPU, and instruction cache
>on one chip, and data cache/MMU on another chip.  Why doesn't the market
>agree with me?
>

Personally, I think the optimal partitioning for large f.p. problems
would be to split the f.p. unit and registers onto another chip.  The
amount of communication required between the integer domain and the
floating domain is very small, and extra cycles to go from one to the
other aren't a problem (speaking from our experience with partitioning
the CPU in just this way).

I haven't thought about how to solve the problems in splitting integer
data caches and floating data caches, but I'm sure there would be an
acceptable solution.  Assuming your compiler guys are up to it, :-)

The main advantages here are:

    - You can get more pins for the f.p. chip, for more loads/stores per
      clock on the f-unit.  Also, you can get more than 16 d.p. registers
      (which isn't enough, in our experience, for two piped f-units).

    - The i-chip, which makes no use of the f-unit hardware, has more
      area for integer goodies, including a larger on-chip data cache
      for integer data.  I would rather have the MMU on this chip, to
      make sure the memory pipeline for explicit loads is one cycle
      shorter, i.e., save a chip crossing here.

Now the guys that don't use floating point can just buy the i-chip; those
that want screaming f.p. performance buy both.

I just don't see the point in doing hairy-chested cramming of f.p.
hardware onto the same chip as the integer stuff, when the two functional
units are so nicely separable, to the benefit of each.

    Paul K. Rodman
    rodman@mfci.uucp
    __... ...__   _.. .   _._ ._ .____ __.. ._
wendyt@pyrps5 (Wendy Thrash) (03/25/89)
In article <717@m3.mfci.UUCP> rodman@mfci.UUCP (Paul Rodman) writes: >Now the guys that don't use floating point can just buy the i-chip, those >that want screaming f.p. perf buy both. There's one large hidden cost here that people never seem to acknowledge: If you sell even one system without floating-point hardware, some poor programmer (or group) will spend the next twenty years (your product should last so long) supporting floating-point operations on systems without the floating-point hardware. Hardware costs don't factor in the cost of phone calls from customers complaining about slow (trapped and simulated) software floating point or slow (-fswitch) hardware floating point or outmoded (someone wrote the microcode years ago, and nobody understands it well enough to fix it now) firmware floating point, or the costs of maintaining separate versions of libraries, additional code in compilers, etc. (for compiler-generated software floating point). It's like the guy says on the commercial: You can pay me now (for extra hardware) or pay me later (for extra support). If you sell enough systems at the lower price to cover the hidden costs, then you've made a good decision, but do remember that the costs are merely hidden, not nonexistent.
carlton@betelgeuse (Mike Carlton) (03/25/89)
In article <51@microsoft.UUCP> w-colinp@microsoft.uucp (Colin Plumb) writes: ... >I think the MIPS approach is the best: MMU and cache *control* on chip; >the actual data (which can be a trifle slower) can be put in external >SRAM. SRAMs have a large market, so even ultra-fast ones are comparatively >cheap. Associative memory is much more expensive. > >I've said it before: I'm *astounded* nobody else has used this idea. >It's such a great Win. Cache control is the custom bit, so do it >in custom logic. With all the rest of the custom logic: on the >microprocessor. Cache RAM is very generic. So don't re-invent the >wheel. > >Has anyone out there (other than MIPS, of course) considered this scheme >and then rejected it? Is my enthusiasm blind to some Great Problem? >-- > -Colin (uunet!microsoft!w-colinp) > >"Don't listen to me. I never do." - The Doctor I agree that the MIPS scheme is nice, but it does have its drawbacks. In particular, they've fixed the cache control. If they got it right (for your application) then no problem. Otherwise you're out of luck. It would be possible to make some of the details configurable, but I believe that MIPS doesn't allow this. If I remember right (somebody borrowed my MIPS book so I can't verify), the MIPS cache control requires a write-through cache. Personally, I don't want a write-through cache. Another aspect is the write latency (i.e. how many cycles does your cache take to handle a write-hit?). I think the MIPS controller assumes a single cycle. This implies you've got to build a cache to handle this, and this will get trickier when you can buy a 40 or 50MHz MIPS. -- Mike Carlton, UC Berkeley Computer Science Home: carlton@ji.berkeley.edu or ...!ucbvax!ji!carlton
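For readers who haven't built one, the write-through/write-back
distinction is small in code but large in bus traffic.  A toy sketch (all
structure and names invented; replacement and coherence omitted):

    #define NLINES 256
    #define LINESZ 16

    struct line { unsigned tag; int valid, dirty; unsigned char data[LINESZ]; };
    static struct line cache[NLINES];

    /* Write-through: a write-hit updates the line AND memory, so
     * memory is always current, but every store uses bus bandwidth. */
    void wt_write(unsigned addr, unsigned char byte, unsigned char *memory)
    {
        struct line *l = &cache[(addr / LINESZ) % NLINES];
        if (l->valid && l->tag == addr / (LINESZ * NLINES))
            l->data[addr % LINESZ] = byte;
        memory[addr] = byte;
    }

    /* Write-back: a write-hit only marks the line dirty; memory is
     * not updated until the line is replaced (flush not shown). */
    void wb_write(unsigned addr, unsigned char byte, unsigned char *memory)
    {
        struct line *l = &cache[(addr / LINESZ) % NLINES];
        if (l->valid && l->tag == addr / (LINESZ * NLINES)) {
            l->data[addr % LINESZ] = byte;
            l->dirty = 1;
        } else {
            memory[addr] = byte;   /* simplification: write-around on a miss */
        }
    }

The write-back version turns many stores to the same line into one
eventual bus transaction, which is why builders of multiprocessor
second-level caches want it.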
henry@utzoo.uucp (Henry Spencer) (03/25/89)
In article <63866@pyramid.pyramid.com> wendyt@pyrps5.pyramid.com (Wendy Thrash) writes: >If you sell even one system without floating-point hardware, some poor >programmer (or group) will spend the next twenty years (your product should >last so long) supporting floating-point operations on systems without the >floating-point hardware... The Sun 2/3 line is somewhat a worst case of this, with something like four different floating-point-hardware configurations. This is obviously an enormous headache for Sun and third-party software suppliers. Some of the third-party suppliers simply support the 68881 and nothing else, so their stuff won't run any faster with an FPA and won't run at all on a 3/50 without 68881. Sun hasn't got that escape. It shows, too: the specs for the SPARC (at least, the ones I saw, some time ago) say that there is *one* floating-point architecture, which the kernel must fake if the hardware isn't there. (Sun has sort of blown it on the hardware end for the Sun 4, I gather, but architecturally it makes sense.) -- Welcome to Mars! Your | Henry Spencer at U of Toronto Zoology passport and visa, comrade? | uunet!attcan!utzoo!henry henry@zoo.toronto.edu
cik@l.cc.purdue.edu (Herman Rubin) (03/25/89)
In article <717@m3.mfci.UUCP>, rodman@mfci.UUCP (Paul Rodman) writes:
> In article <22974@ames.arc.nasa.gov> lamaster@ames.arc.nasa.gov (Hugh LaMaster) writes:
> >
> >So, my question is: If you ASSUME that you have to have high speed
> >arithmetic, what is the best way to partition functions between chips?
> >I believe that the best way is Control, ALU/FPU, and instruction cache
> >on one chip, and data cache/MMU on another chip.  Why doesn't the
> >market agree with me?
>
> Personally, I think the optimal partitioning for large f.p. problems
> would be to split the f.p. unit and registers onto another chip.  The
> amount of communication required between the integer domain and the
> floating domain is very small, and extra cycles to go from one to the
> other aren't a problem (speaking from our experience with partitioning
> the CPU in just this way).
>
> I haven't thought about how to solve the problems in splitting integer
> data caches and floating data caches, but I'm sure there would be an
> acceptable solution.  Assuming your compiler guys are up to it, :-)
>
> The main advantages here are:
>
>    - You can get more pins for the f.p. chip, for more loads/stores per
>      clock on the f-unit.  Also, you can get more than 16 d.p. registers
>      (which isn't enough, in our experience, for two piped f-units).
>
>    - The i-chip, which makes no use of the f-unit hardware, has more
>      area for integer goodies, including a larger on-chip data cache
>      for integer data.  I would rather have the MMU on this chip, to
>      make sure the memory pipeline for explicit loads is one cycle
>      shorter, i.e., save a chip crossing here.
>
> Now the guys that don't use floating point can just buy the i-chip;
> those that want screaming f.p. performance buy both.
>
> I just don't see the point in doing hairy-chested cramming of f.p.
> hardware onto the same chip as the integer stuff, when the two
> functional units are so nicely separable, to the benefit of each.

I can see the point of having separate address arithmetic and
low-precision multiplication for address purposes.  But restricting the
term "integer arithmetic" to that is destructive of computing power.  I
am not arguing one way or the other on partitioning functions among
chips.  I suspect it is a good idea, but this is not the point.

A floating point operation consists of separating the exponents from the
mantissas, differencing the exponents and shifting for addition and
subtraction, performing the fixed point operation, and performing the
necessary shifting and exponent calculation.  The cost is greatest for
multiplication and division, where the similarities between fixed and
floating point are greatest.  Indeed, many architectures with a floating
point accelerator do integer multiplication in that unit.

But suppose you want high precision arithmetic: integer, fixed point, or
floating point?  You now want a good integer arithmetic machine; if
floating point arithmetic must be used, integer arithmetic must be
emulated in it, which is quite clumsy.  The computational equipment for
high precision multiplication and division is largely the same for
integer, fixed point, and floating point.  For high-precision addition
and subtraction, the overlap is still great.  An architecture, language,
or programmer not capable of taking advantage of this must be considered
limited.
--
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907
Phone: (317)494-6054
hrubin@l.cc.purdue.edu (Internet, bitnet, UUCP)
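That decomposition maps directly onto code.  A deliberately crude
software floating-point add, using nothing but integer operations (the
format and names are invented; signs, rounding, and IEEE corner cases are
omitted), shows how much of "floating" arithmetic is integer shifts and
adds:

    /* Toy positive-only float: value = mant * 2^exp, with mant
     * normalized into [2^23, 2^24). */
    struct fp { int exp; unsigned long mant; };

    struct fp fp_add(struct fp a, struct fp b)
    {
        struct fp r;
        if (a.exp < b.exp) { struct fp t = a; a = b; b = t; }
        if (a.exp - b.exp > 24)
            b.mant = 0;                  /* b too small to matter        */
        else
            b.mant >>= (a.exp - b.exp);  /* difference exponents, shift  */
        r.exp  = a.exp;
        r.mant = a.mant + b.mant;        /* the fixed-point add itself   */
        if (r.mant >= (1UL << 24)) {     /* renormalize                  */
            r.mant >>= 1;
            r.exp  += 1;
        }
        return r;
    }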
bcase@cup.portal.com (Brian bcase Case) (03/26/89)
>The general consensus seems to be that
>the primary determinant is the physical implementation, but that the logical
>architecture heavily influences to what extent RISCy implementation techniques
>can be used.

Right, except that I don't agree that the primary determinant is the
physical implementation.  A RISC is an architecture that *permits* a
*clean*, high-performance implementation.  A CISC architecture might be
able to use some high-performance implementation tricks, but cleanliness
is next to RISCyness.  The cleanliness becomes very important when
superscalar implementations, and probably multiple-processor-per-chip
implementations, are designed.

>The other issue is the granularity of instruction semantics.  The greater
>the granularity, the more opportunities there are for code optimization.
>It also increases the generality of the instruction set, making it more
>likely that all instructions will be used by (and useful to) more applications.
>The greater the difference in semantic level (primitiveness) between
>the machine instructions and high-level language statements, the easier it is
>to have high-quality code generation for a wide variety of languages.

I've never thought to use such pro-active phrasing in explaining the
advantage to software of simplicity, but I like it.  This is an excellent
way of saying it.
mash@mips.COM (John Mashey) (03/26/89)
In article <63866@pyramid.pyramid.com> wendyt@pyrps5.pyramid.com (Wendy Thrash) writes:
>In article <717@m3.mfci.UUCP> rodman@mfci.UUCP (Paul Rodman) writes:
>There's one large hidden cost here that people never seem to acknowledge:
>If you sell even one system without floating-point hardware, some poor
>programmer (or group) will spend the next twenty years (your product should
>last so long) supporting floating-point operations on systems without the
>floating-point hardware.
>Hardware costs don't factor in the cost of phone calls from customers
>complaining about slow (trapped and simulated) software floating point or
>slow (-fswitch) hardware floating point or outmoded (someone wrote the
>microcode years ago, and nobody understands it well enough to fix it now)
>firmware floating point, or the costs of maintaining separate versions of
>libraries, additional code in compilers, etc. (for compiler-generated
>software floating point).
>It's like the guy says on the commercial: You can pay me now (for extra
>hardware) or pay me later (for extra support).  If you sell enough systems
>at the lower price to cover the hidden costs, then you've made a good
>decision, but do remember that the costs are merely hidden, not nonexistent.

These comments are well-taken, i.e., worry about the cost of the entire
system and its support.  Fortunately, this is much less of an issue than
it used to be, simply because VLSI FPUs now offer good performance at low
cost.  It has never really been much of an issue in the larger-systems
arena, in that the FP was a small part of the entire product.  It used to
be a serious issue in the small-systems arena, especially when:
	a) The integer unit was cheap.
	b) The VLSI FPU was either nonexistent [like for the 68010], or
	   "slow" (i.e., relative to the integer unit's performance).
	c) A fast FPU was a whole logic board.
In this case, the difference between b) and c) was a lot of $$$ as a
fraction of the entire product.

Both Sun and MIPS came to the same conclusion, albeit from different
directions, for their RISC products: generate exactly one form of FP
code, and if no FPU is present, emulate it in the kernel.  Given that FP
chips are reasonably inexpensive, most people just buy them, but if
somebody really wants to save money, where they're deploying a bunch of
machines in a commercial application that doesn't use FP, they can.

This leads to an interesting question: the 80387 and Weitek (1167?
whatever it is that you use to replace a 387 at higher speed) are not
directly binary-compatible.  Can anybody out there give a comprehensive
tutorial on:
	a) How do 386 UNIX systems deal with this?
	b) What compilers support both flavors?
	c) What sorts of software packages handle both?  Are there two
	   separate versions?  Or is there code that checks at run-time
	   for the presence of the FPU?  What is typically done when
	   there is no FPU at all?
--
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
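One generic answer to the run-time-check part of question c) is a
function pointer patched once at startup.  This is a hypothetical sketch,
not how any particular 386 Unix actually does it; fpu_present() and both
routine bodies are stand-ins:

    extern int fpu_present(void);   /* vendor-specific probe, assumed */

    static double fadd_hw(double a, double b)
    {
        return a + b;               /* compiled to use FPU instructions */
    }

    static double fadd_sw(double a, double b)
    {
        /* a real version would decompose a and b with integer ops */
        return a + b;
    }

    double (*fadd)(double, double);

    void fp_init(void)
    {
        fadd = fpu_present() ? fadd_hw : fadd_sw;   /* one check, at startup */
    }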
mash@mips.COM (John Mashey) (03/26/89)
In article <11402@pasteur.Berkeley.EDU> carlton@betelgeuse (Mike Carlton) writes: >I agree that the MIPS scheme is nice, but it does have its drawbacks. In >particular, they've fixed the cache control. If they got it right (for your >application) then no problem. Otherwise you're out of luck. It would be >possible to make some of the details configurable, but I believe that MIPS >doesn't allow this. Actually, R3000s allow a fair amount of flexibility. 1) I & D-caches can have different sizes. 2) The number of words refilled into the cache upon cache miss is settable from 1 to 32 words. 3) You can use instruction-streaming, or not. 4) You can cause partial-word writes to invalidate the corresponding cache word, or cause it to do a read-modify-write. >If I remember right (somebody borrowed my MIPS book so I can't verify), the >MIPS cache control requires a write-through cache. Personally, I don't want >a write-through cache. Another aspect is the write latency (i.e. how many >cycles does your cache take to handle a write-hit?). I think the MIPS >controller assumes a single cycle. This implies you've got to build a cache >to handle this, and this will get trickier when you can buy a 40 or 50MHz MIPS. The first-level cache is a write-thru cache. People often build 2nd-level caches to be write-back for multi-processors. It does expect single-cycle caches; it will get trickier. Of course, the higher clock rates, sooner or later, require everybody doing CMOS/BiCMOS micros to build integrated "superchips" anyway if they want to be competitive. -- -john mashey DISCLAIMER: <generic disclaimer, I speak for me only, etc> UUCP: {ames,decwrl,prls,pyramid}!mips!mash OR mash@mips.com DDD: 408-991-0253 or 408-720-1700, x253 USPS: MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
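Gathered in one place, the degrees of freedom listed above look roughly
like the following parameter block.  This is a notional summary for
illustration, not MIPS's actual programming interface; all field names
are invented:

    struct r3000_cache_config {
        unsigned icache_bytes;     /* I- and D-cache sizes may differ      */
        unsigned dcache_bytes;
        unsigned refill_words;     /* words fetched per miss, 1..32        */
        int      instr_streaming;  /* execute instructions as they stream  */
        int      partial_word_rmw; /* 0 = invalidate the word on a
                                      partial-word write, 1 = do a
                                      read-modify-write instead            */
    };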
henry@utzoo.uucp (Henry Spencer) (03/26/89)
In article <1188@l.cc.purdue.edu> cik@l.cc.purdue.edu (Herman Rubin) writes: >... An architecture, language, or programmer not capable of >taking advantage of this must be considered limited. All architectures, languages, and programmers are limited. The question is whether the limitations interfere with solving the problems you care about. Manufacturers, of necessity, care much more about the 5th-95th percentile requirements than about the outliers (unless the outliers are likely to buy lots of hardware). -- Welcome to Mars! Your | Henry Spencer at U of Toronto Zoology passport and visa, comrade? | uunet!attcan!utzoo!henry henry@zoo.toronto.edu
rodman@mfci.UUCP (Paul Rodman) (03/26/89)
In article <22202@shemp.CS.UCLA.EDU> marc@cs.ucla.edu (Marc Tremblay) writes: >I also believe that putting the Integer unit and the FPU on the same >chip makes sense. These two units have to communicate quickly.... ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ I don't understand why. In most f.p. codes the integer unit generates addresses and program counter values, neither of which are needed by the f.p. unit. What the f.p. unit needs, on the other hand, is a decent bandwidth to cache and/or memory. Paul K. Rodman rodman@mfci.uucp __... ...__ _.. . _._ ._ .____ __.. ._
rodman@mfci.UUCP (Paul Rodman) (03/26/89)
In article <726@m3.mfci.UUCP> rodman@mfci.UUCP (Paul Rodman) writes:
>In article <22202@shemp.CS.UCLA.EDU> marc@cs.ucla.edu (Marc Tremblay) writes:
>
>>I also believe that putting the Integer unit and the FPU on the same
>>chip makes sense.  These two units have to communicate quickly....
>                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>I don't understand why.
>
>In most f.p. codes the integer unit generates
>addresses and program counter values, neither of which are needed
>by the f.p. unit.

And before millions of people trash on me, let me clarify by saying that
obviously an f.p. unit needs a source of control.  Marc seems to value a
low-latency path between the control part of the machine and the f.p.
unit.  Given that the original request was for a partitioning that would
lead to high performance, I think that some of this latency is best
traded off for better bandwidth.

For example, one might pipeline the icache address to the f.p. unit,
causing instruction issue to be delayed by a beat for f.p. ops.  This
increases the f.p. latency with respect to the pc, but does not increase
it with respect to other flops or memory loads/stores.

    pkr
alan@rnms1.paradyne.com (0000-Alan Lovejoy(0000)) (03/27/89)
In article <16231@cup.portal.com> bcase@cup.portal.com (Brian bcase Case) writes:
>>The general consensus seems to be that
>>the primary determinant is the physical implementation, but that the logical
>>architecture heavily influences to what extent RISCy implementation techniques
>>can be used.
>Right, except that I don't agree that the primary determinant is the
>physical implementation.  A RISC is an architecture that *permits* a *clean*,
>high-performance implementation.  A CISC architecture might be able to use
>some high-performance implementation tricks, but cleanliness is next to
>RISCyness.  The cleanliness becomes very important when superscalar
>implementations, and probably multiple-processor-per-chip implementations,
>are designed.

Hmmm...  Isn't there an important connection between the chip
implementation technology (e.g., silicon transistors, quantum circuits,
photonics, vacuum tubes...) and the problem of designing an efficient
implementation of a logical architecture?  Logical architectures that
cannot be efficiently implemented in NMOS might have no such problems in
an optical computer.  If it is the logical architecture that is the
primary determinant, then whether something is a RISC depends on the
current implementation technology.

Or am I missing something?

Alan Lovejoy; alan@pdn; 813-530-2211; AT&T Paradyne: 8550 Ulmerton, Largo, FL.
Disclaimer: I do not speak for AT&T Paradyne.  They do not speak for me.
__American Investment Deficiency Syndrome => No resistance to foreign invasion.
Motto: If nanomachines will be able to reconstruct you, YOU AREN'T DEAD YET.
stevew@wyse.wyse.com (Steve Wilson xttemp dept303) (03/28/89)
In article <10078@bloom-beacon.MIT.EDU> tada@athena.mit.edu (Michael Zehr) writes: > >which would you rather have -- one CPU that runs at 50 MIPS with 72, >1Mbit memory chips (8 Megabytes * 9 chips per byte) or 72, 10 MIPS >processors and 4 Megabytes of memory split among them? > >-michael j zehr One CPU that runs at 50 mips with 72 1Mbit memory chips. I already know how to program a single CPU ;-) Steve Wilson The above is my opinion, not those of my employer.
bcase@cup.portal.com (Brian bcase Case) (03/29/89)
>Hmmm... Isn't there an important connection between the chip implementation >technology (e.g., silicon transisitors, quantum circuits, photonics, vacuum >tubes...) and the problem of designing an efficient implementation of a >logical architecture? Logical architectures that cannot be efficiently >implemented in NMOS might have no such problems in an optical computer. >If it is the logical architecture that is the primary determinant, then >whether something is a RISC depends on the current implementation technology. > >Or am I missing something? Uh, I don't know, you lost me somewhere here. All I want to say is that RISC can be defined by a set of *architectural* features. To be sure, that set was constructed with the implementation implications in mind. If a new set of implementation technologies, or techniques, comes along, then we'll have to define a new, er, "post-RISC", or whatever, set of *architectural* guidelines that leads to a set of architectures that is consistent with the new implementation technology. If optical technology calls for instructions that have, oh, I don't know, 14 source operands, then we need to change things! However, note that technology differences do not change the basic state-machine model of computation. Until we do change the basic model (to what? I don't know! I'm trying to think of it actually...), instructions can't change a whole lot, at least not qualitatively.
jesup@cbmvax.UUCP (Randell Jesup) (03/29/89)
In article <15695@winchester.mips.COM> mash@mips.COM (John Mashey) writes:
>>> In article <37196@bbn.COM>, slackey@bbn.com (Stan Lackey) writes:
>>> > RISC is indeed a technology window, driven largely by the amount of
>>> > stuff you can fit in a chip...
>....
>Seriously, I doubt that anyone has silicon to burn.  In particular,
>the faster the chips get, the more it hurts you to go off chip.
>Bigger on-chip caches [I, D, or TLB] keep you on-chip more, and are
>therefore good.  With more hardware, you can make integer multiplies go
>faster, make FP go faster, and maybe put in some multiple FP units,
>a la CDC 6600s [and these things chew up area].  Note that Intel, with
>a million transistors, said the space budget didn't leave room for
>an IEEE divide..... (Compcon paper).

	Quite true.  If CPUs continue to get faster at the process level
(smaller design rules), the relative overhead for off-chip access will
increase.  I think this will cause one of two things to happen, or maybe
a compromise between them:

1) Bigger caches, or more sophisticated caches;

2) More complex (relatively) instructions, either addressing modes or
things like multiple ALUs for address calculation, in an attempt to
reduce the number of off-chip fetches.  I think the i860 is a step down
this path.

>	c) If you double the size of a giant-monster-chip, its yield
>	might get a lot worse...

	P(good 2-cpu chip) = P(good 1-cpu chip) ** 2, or something close
to that.
--
Randell Jesup, Commodore Engineering {uunet|rutgers|allegra}!cbmvax!jesup
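Worked through with an invented single-die yield, the squaring bites
quickly:

    #include <stdio.h>

    int main(void)
    {
        double p1 = 0.60;       /* assumed yield of a 1-cpu die     */
        double p2 = p1 * p1;    /* the estimate above, for 2 cpus   */
        printf("1-cpu yield %.0f%%, 2-cpu yield %.0f%%\n",
               100.0 * p1, 100.0 * p2);    /* 60% -> 36% */
        return 0;
    }

(In practice, defect clustering and redundancy soften the exponent, but
the direction is right.)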
jesup@cbmvax.UUCP (Randell Jesup) (03/29/89)
In article <22974@ames.arc.nasa.gov> lamaster@ames.arc.nasa.gov (Hugh LaMaster) writes:
>Things are getting to a very interesting stage, however.  I guess it is
>just my big-machine history showing again, but I keep wondering why, with
>a decent-sized register file, it wouldn't make more sense to put all the
>FP/ALU hardware on the same chip as the control unit, along with the
>instruction cache, and leave the MMU and the data cache (which is almost
>always going to be larger than what you can put on a chip, no matter how
>large chips get) to an external implementation.

	The problem with this is that off-chip access is slower than
on-chip, due to signal load and pad static-protection capacitance.  If
someone can do away with some or most of the pad capacitance, then
off-chip caches become the way to go.  As it is, it's a balancing act,
and things like internal branch-target caches become very useful, with
external caches to supply straight instruction streams and data (or
internal I & D caches, if you can fit significant ones on-chip, or if the
chip is being designed for a small-chip-count application).
--
Randell Jesup, Commodore Engineering {uunet|rutgers|allegra}!cbmvax!jesup
jesup@cbmvax.UUCP (Randell Jesup) (03/29/89)
In article <13404@steinmetz.ge.com> davidsen@crdos1.UUCP (bill davidsen) writes:
>	There will continue to be a demand for processors with a very high
>instruction rate (call them RISC if you will), and also for processors
>which will perform a given task faster with limited memory bandwidth.

	Quite true.  Not everyone designs workstations bought solely by
larger corporations.  There is _significant_ demand for CPUs that are
a) fairly cheap -- say current '030/881 (16MHz) pricing -- and b) do the
most _work_ given a specific memory bandwidth, determined by the speed of
jelly-bean memory parts (say 100-120ns currently), and without expensive
external caches.

	You could sell a lot of CPUs like that.
--
Randell Jesup, Commodore Engineering {uunet|rutgers|allegra}!cbmvax!jesup
jesup@cbmvax.UUCP (Randell Jesup) (03/29/89)
In article <10078@bloom-beacon.MIT.EDU> tada@athena.mit.edu (Michael Zehr) writes: >Professor Daly (sp?) of MIT has been saying something along those lines >for a couple years. instead of having a whole bunch of memory chips >with one path to a fast CPU and have a cache to prevent slow accesses, >take each of those memory chips, halve the amoune of memory on them, and >put a CPU on it. you no longer have a memory banchwith problem, because >the memory and CPU are on the same (small, easy-to-make) chip. instead >of putting a cache on the chip (you don't need one), put a >communications circuit to transfer data to the other chips. Sounds a lot like a transputer... how is it different? (other than being implemented in a more modern process) -- Randell Jesup, Commodore Engineering {uunet|rutgers|allegra}!cbmvax!jesup
keith@mips.COM (Keith Garrett) (03/30/89)
In article <6416@cbmvax.UUCP> jesup@cbmvax.UUCP (Randell Jesup) writes:
>
>	The problem with this is that off-chip access is slower than on-chip,
>due to signal load and pad static-protection capacitance.

There are also serious speed-of-light considerations.  Chip-to-chip path
lengths are considerably longer than on-chip paths.  This effect will
become worse in the future as transistor speeds become faster and
transistor spacings become smaller.  Interconnect technologies that don't
support terminated transmission lines (can you say TTL??) have worse
problems, due to long bus settling times.
--
Keith Garrett        "This is *MY* opinion, OBVIOUSLY"
UUCP: keith@mips.com  or  {ames,decwrl,prls}!mips!keith
USPS: Mips Computer Systems, 930 Arques Ave, Sunnyvale, Ca. 94086
frazier@oahu.cs.ucla.edu (Greg Frazier) (03/30/89)
In article <6418@cbmvax.UUCP> jesup@cbmvax.UUCP (Randell Jesup) writes:
>In article <10078@bloom-beacon.MIT.EDU> tada@athena.mit.edu (Michael Zehr) writes:
>>Professor Dally of MIT has been saying something along those lines for a
>>couple of years.  Instead of having a whole bunch of memory chips with
>>one path to a fast CPU, and a cache to prevent slow accesses, take each
>>of those memory chips, halve the amount of memory on them, and put a CPU
>>on each.  You no longer have a memory bandwidth problem, because...
>	Sounds a lot like a transputer... how is it different?  (other than
>being implemented in a more modern process)

There are several dramatic differences.  The Message-Driven Processor
(MDP) directly implements `actors' (I think that's the term).  The chip
actually contains two processors -- one to handle incoming and outgoing
messages, the other to execute code.  Arriving messages directly point to
code in memory to be executed.  The machine as a whole supports a global
address space, which makes this possible.  Also, the memory can be used
as content-addressable memory, so that arriving messages can refer to
code symbolically.  The idea is that this directly supports
object-oriented programming, where each object resides on a node.  Each
node is expected to send and receive messages on the order of every 20
instructions.  Message routing and forwarding are handled by a Torus
Routing Chip-style router which also resides on the chip.

The most obvious problem with this approach is that only 16K of memory
can accompany a 4-MIPS processor (the general rule of thumb is 1M of
memory per MIPS of processor).  Dally et al. claim that this rule of
thumb does not apply because of the global nature of the memory: with 64K
nodes they have an address space of 2^30 bytes.  Of course, they also
have 256K MIPS of processor, but they didn't mention that...

So, all in all, this is a VERY different beast from the Transputer.

Greg Frazier
****************************^^^^^^^^^^^^^^^^^^^^!!!!!!!!!!!!!!!!!!!
Greg Frazier	o	Internet: frazier@CS.UCLA.EDU
CS dept., UCLA	/\	UUCP: ...!{ucbvax,rutgers}!ucla-cs!frazier
	       ----^/----
		   /
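The "messages point at code" idea can be caricatured in a few lines of C
(all names invented; the real MDP does this in hardware, with symbolic
lookup as well):

    /* Each arriving message designates the handler to run, so
     * dispatch is a direct jump, not a scheduler decision. */
    struct message { void (*handler)(struct message *); int arg; };

    void receive_loop(struct message *queue, int n)
    {
        int i;
        for (i = 0; i < n; i++)
            queue[i].handler(&queue[i]);   /* message -> code, directly */
    }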
bb@tetons.UUCP (Bob Blau) (03/31/89)
In article <16181@gumby.mips.COM>, keith@mips.COM (Keith Garrett) writes: : In article <6416@cbmvax.UUCP> jesup@cbmvax.UUCP (Randell Jesup) writes: : > : > The problem with this is that off-chip access is slower than on-chip, : >due to signal load and pad static-protection capacitance. : : there are also serious speed-of-light considerations. ... : ... this effect will become worse in : the future as transistor speeds become faster ... This may be true in the far future when everything is optically interconnected (on and off chip.) Today, at very high speeds, on-chip delays rise up to bite you too! Cray can achieve very fast cycle times with low levels of integration and clever packaging - putting everything on-chip is not a panacea. Apply: Std. disclaimer Remember - it is possible for a particle to go faster than the speed of light, just not in a vacuum! -- Bob Blau Amdahl Corporation 143 N. 2 E., Rexburg, Idaho 83440 UUCP:{ames,decwrl,sun,uunet}!amdahl!tetons!bb (208) 356-8915 INTERNET: bb@tetons.idaho.amdahl.com
levy@nsc.nsc.com (Jonathan Levy) (04/01/89)
In article <6417@cbmvax.UUCP> jesup@cbmvax.UUCP (Randell Jesup) writes:
>In article <13404@steinmetz.ge.com> davidsen@crdos1.UUCP (bill davidsen) writes:
>>	There will continue to be a demand for processors with a very high
>>instruction rate (call them RISC if you will), and also for processors
>>which will perform a given task faster with limited memory bandwidth.
>
>	Quite true.  Not everyone designs workstations bought solely by
>larger corporations.  There is _significant_ demand for CPUs that are
>a) fairly cheap -- say current '030/881 (16MHz) pricing -- and b) do the
>most _work_ given a specific memory bandwidth, determined by the speed of
>jelly-bean memory parts (say 100-120ns currently), and without expensive
>external caches.
>
>	You could sell a lot of CPUs like that.

Funny you should mention that.  National Semiconductor has just announced
a new processor for the embedded market: the NS32GX32.  It has the
NS32532 pipeline core and the same performance (18,335 Dhrystones/sec, 8
to 10 VAX MIPS average, at 30 MHz).  The processor is *cheap*, both in
actual chip cost (for pricing you will need to contact marketing...) and,
what is more important, in system cost.  System cost is minimized by
giving the system a very simple interface (non-multiplexed address and
data buses, one of each), a Harvard architecture which is *ON* chip (why
burden the designer with our problems), two on-chip caches (1K data
cache, 0.5K instruction cache), code density which is second to none,
dynamic bus sizing for peripherals and boot ROMs, long address-to-ready
timing, long address-to-data timing, etc.

All this provides for a CPU which is very insensitive to wait states.  We
have a system which was designed with a single bank of DRAM (no
interleave), giving the CPU two wait states on the first access and one
wait state on burst accesses.  The performance of this system is more
than 85% of the maximum!  One cannot do that with a generic RISC.

Jonathan
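As a back-of-the-envelope check on that 85% figure (the numbers below are
guesses for illustration, not National's):

    #include <stdio.h>

    /* Crude model: some fraction of cycles are off-chip accesses,
     * each costing extra wait cycles on top of the base CPI of 1. */
    int main(void)
    {
        double offchip_rate = 0.08;   /* assumed off-chip accesses/cycle */
        double wait         = 2.0;    /* assumed wait states per access  */
        double cpi_eff      = 1.0 + offchip_rate * wait;
        printf("relative performance = %.0f%%\n", 100.0 / cpi_eff); /* ~86 */
        return 0;
    }

With on-chip caches soaking up most references, an off-chip rate of a few
percent is plausible, and the machine indeed stays in the 85-90% range.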
mac@uvacs.cs.Virginia.EDU (Alex Colvin) (04/06/89)
> In most f.p. codes the integer unit generates > addresses and program counter values, neither of which are needed > by the f.p. unit. What the f.p. unit needs, on the other hand, is > a decent bandwidth to cache and/or memory. Ah, but the FP has to have addresses to use on its bandwidth to memory. In particular, subscripting and such address arithmetic come from the integer unit. Wasn't there a simulation done many years ago of one of the high-end 360s (the one with the multiple FP stations) which showed that the bottleneck was the integer unit's generation of addresses? Of course, the 360 required more address arithmetic than many newer machines, and FORTRAN programs have to do all their address arithmetic as subscripting, in contrast to newer languages. I wonder if anyone has done a study recently.
rodman@mfci.UUCP (Paul Rodman) (04/07/89)
In article <3070@uvacs.cs.Virginia.EDU> mac@uvacs.cs.Virginia.EDU (Alex Colvin) writes:
>> In most f.p. codes the integer unit generates
>> addresses and program counter values, neither of which are needed
>> by the f.p. unit.  What the f.p. unit needs, on the other hand, is
>> a decent bandwidth to cache and/or memory.
>
>Ah, but the FP has to have addresses to use on its bandwidth to memory.  In
>particular, subscripting and such address arithmetic come from the integer
>unit.

Oh really?  What would I want to do with those addresses in the f.p.
unit?  Convert them to floating point?  The addresses go to memory, on
behalf of the f.p. unit; they don't go *to* the f.p. unit (!?%$*).  No
connection between the two is required for this purpose!

>Wasn't there a simulation done many years ago of one of the high-end 360s
>(the one with the multiple FP stations) which showed that the bottleneck
>was the integer unit's generation of addresses?

Probably.  And once you have that, you had better have a memory system
that will accept said addresses.  I know that on the Trace 7 we have the
ability to do 4 integer ops for every 2 flops.  Two of the integer ops
can be loads/stores with a base + scaled-offset type of add.  The other
two integer ops get used for loop exit tests (you do a lot of them for an
unrolled loop), whacking induction variables, random index arithmetic,
random integer arithmetic, and anything else you need to do without
burping the memory load/store addresses being generated on the other
ialu.  It was obvious that this ialu bandwidth was required from very
early simulation results.

>Of course, the 360 required more address arithmetic than many newer
>machines, and FORTRAN programs have to do all their address arithmetic as
>subscripting, in contrast to newer languages.  I wonder if anyone has done
>a study recently.

I'm curious why you think that Fortran would require more complicated
addressing than newer languages.

Bye,
    Paul K. Rodman
    rodman@mfci.uucp
    __... ...__   _.. .   _._ ._ .____ __.. ._
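The subscripting point is easy to see in C, where both forms can be
written out explicitly (a made-up example; a good optimizer turns the
first form into the second automatically):

    /* Subscripted form: an address is recomputed each iteration. */
    double sum_subscript(double a[], int n)
    {
        double s = 0.0;
        int i;
        for (i = 0; i < n; i++)
            s += a[i];         /* address = base + i*8 every trip */
        return s;
    }

    /* Pointer form: the address arithmetic is one integer add. */
    double sum_pointer(double *a, int n)
    {
        double s = 0.0;
        double *end = a + n;
        while (a < end)
            s += *a++;         /* one pointer increment per trip  */
        return s;
    }

Either way, the integer unit is doing the address work; the question in
this thread is only how much of it there is, and which chip it lives on.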