hascall@atanasoff.cs.iastate.edu (John Hascall) (01/31/89)
In response to someone proposing adding instructions to a RISC machine, I wrote:

>> Here it is again, adding instructions to a RISC machine... won't
>> be long before we have a RISC machine with more instructions
>> than a VAX.... :-)

And was "corrected" by someone* thusly:

> And again...
> Sigh, RISC doesn't mean a small number of instructions. RISC means....

REDUCED Instruction Set Computer (i.e., a reduced number of instructions)

True, many RISC machines incorporate a number of other features which,
because they have been used by a number of RISC machines, have come to be
considered a part of RISC--but there is no reason these features could not
be part of a CISC machine (other than chip real-estate).

I think the real problem here is a poorly named acronym, but it probably
sounded "cute" (I, for one, am quite tired of papers titled "A RISCy blah
blah blah" etc).

Perhaps we could have a new buzzword contest; how about SOC (simple,
orthogonal computer)?

My $.02 (or less) worth,

John Hascall
ISU Comp Center

* My apologies for losing the attribution above, but rn barfed on the
  overly long "References:" field and I had to do this by hand.
diamond@csl.sony.JUNET (Norman Diamond) (01/31/89)
In article <747@atanasoff.cs.iastate.edu>, hascall@atanasoff.cs.iastate.edu (John Hascall) writes:
> > Sigh, RISC doesn't mean a small number of instructions. RISC means....
> REDUCED Instruction Set Computer (i.e., a reduced number of instructions)

Maybe a reduced number of KINDS of instructions.

If you have an add instruction for a byte and another for a word ... if you
have an add instruction for signed and another for unsigned ... do you
think these are ciscy?

Having an add instruction for a little-endian word and another for a
big-endian word strikes me as a little silly (maybe a big silly :-), but
still riscy.

Incidentally, wouldn't little-beginnian and big-beginnian be more accurate?
-- 
Norman Diamond, Sony Computer Science Lab (diamond%csl.sony.jp@relay.cs.net)
The above opinions are my own.  | Why are programmers criticized for
If they're also your opinions,  | re-inventing the wheel, when car
you're infringing my copyright. | manufacturers are praised for it?
colwell@mfci.UUCP (Robert Colwell) (01/31/89)
In article <747@atanasoff.cs.iastate.edu> hascall@atanasoff.cs.iastate.edu (John Hascall) writes:
>  In response to someone proposing adding instructions to a RISC
>  machine, I wrote:
>
>>> Here it is again, adding instructions to a RISC machine... won't
>>> be long before we have a RISC machine with more instructions
>>> than a VAX.... :-)
>
>  And was "corrected" by someone* thusly:
>
>> And again...
>> Sigh, RISC doesn't mean a small number of instructions. RISC means....
>
>  REDUCED Instruction Set Computer (i.e., a reduced number of instructions)

No, that's what the *acronym* stands for.  That acronym was not invented by
the folks who started the RISC effort (IBM); it came from Berkeley.  Which
doesn't by itself make it invalid (no, I really mean that :-)), but it does
cast doubt on appeals to authority in this case.  And early on, it fit the
concept a lot better than it does now.

I gave a talk at MIT a couple of months ago arguing, I hope successfully,
that our VLIW embodies almost all of the major RISC concepts.  The one
thing it does not have is "few" instructions: 2**1024 is a big number.
But, as Brian said, the intrinsic complexity of an instruction isn't the
source of evil; it's the cost of implementing it vs. the performance cost
of not having it vs. the compiler cost of each.  We implement that large
number of instructions by having many functional units, each with its own
fully decoded instruction residing in its own ICache slice, so what you may
have thought would be the large runtime cost of decoding an instruction
word that big is actually negligible.

Besides, "reduced" can be taken to mean "reduced in number" or "reduced in
complexity".  I think it's high time to leave the obsolete notion of
numbers of instructions as a measure of "RISCness" and move to the more
appropriate "function/implementation level".  If you move a lot of
functionality into runtime hardware, you are following in the tradition of
CISC machines.  If you are maximizing performance by moving as much as you
can into compile time, thus minimizing the machine's cycle time, you are
probably building a RISC.  The days of counting instructions and drawing
some legitimate conclusion are long gone.

Bob Colwell               ..!uunet!mfci!colwell
Multiflow Computer     or colwell@multiflow.com
175 N. Main St.
Branford, CT 06405     203-488-6090
rodman@mfci.UUCP (Paul Rodman) (02/01/89)
In article <747@atanasoff.cs.iastate.edu> hascall@atanasoff.cs.iastate.edu (John Hascall) writes:
>  In response to someone proposing adding instructions to a RISC
>  machine, I wrote:
>
>>> Here it is again, adding instructions to a RISC machine... won't
>>> be long before we have a RISC machine with more instructions
>>> than a VAX.... :-)
>
>  And was "corrected" by someone* thusly:
>
>> And again...
>> Sigh, RISC doesn't mean a small number of instructions. RISC means....
>
>  REDUCED Instruction Set Computer (i.e., a reduced number of instructions)

At the risk of starting more pointless RISC/CISC flameage, let me add my 2
cents worth here (I know many of you out there won't agree....:-):

The term RISC has been terribly misused, but my personal definition has
been widened to include machines that don't have a "small" number of
instructions.  E.g. the Multiflow Trace (which I am using to compose this
mail) has a VERY large space of possible instructions.  I would still term
this machine RISCy, as each functional unit is controlled directly by the
instruction word, and is decoupled from instruction packets that are wired
to other functional units.  Hence the original purpose of the RISC idea is
served.

Conventional RISCs are designed to approach 1 "op" per cycle.  We designed
a multiple-functional-unit machine that executes >1 ops/cycle.  The VLIW
compiler is considerably "smarter" than a typical RISC compiler, and the
compiler <-> hardware fusion is even more important than for a simple RISC,
but the basic mind set is still the same.

Paul K. Rodman
rodman@mfci.uucp
aglew@mcdurb.Urbana.Gould.COM (02/01/89)
> I think the real problem here is a poorly named acronym, but it
> probably sounded "cute" (I, for one, am quite tired of papers
> titled "A RISCy blah blah blah" etc).
>
> Perhaps we could have a new buzzword contest, how about SOC (simple,
> orthogonal computer)?
>
> ISU Comp Center

RAMM  = Reduced Addressing Mode Machine
SEISM = Small, Efficient, Instruction Set Machine
alan@pdn.nm.paradyne.com (Alan Lovejoy) (02/02/89)
In article <10030@diamond.csl.sony.JUNET> diamond@csl.sony.JUNET (Norman Diamond) writes:
>In article <747@atanasoff.cs.iastate.edu>, hascall@atanasoff.cs.iastate.edu (John Hascall) writes:
>> > Sigh, RISC doesn't mean a small number of instructions. RISC means....
>Maybe reduced number of KINDS of instructions.
>
>Having an add instruction for a little-endian word and another for a
>big-endian word strikes me as a little silly (maybe a big silly :-),
>but still riscy.

RISC should stand for "Reduced Instruction Set Complexity".  This means
minimizing the number of different instruction formats, having only one
instruction size (e.g., 32 bits), eliminating instructions that can't be
performed in one cycle with a reasonable amount of hardware or without
making the cycle time considerably longer than is needed for most
instructions, and a host of other things less well characterized by the
nominal semantics of R.I.S.C. (caches, pipelines, parallelism...).

Whether an architecture is little-endian or big-endian depends upon how it
maps the byte addresses (well, on a byte-addressed machine, anyway) of the
bytes in a word or longword to the arithmetic significance of those bytes.
Once a word or longword has been fetched from memory into a register, the
bytes of that word or longword no longer have addresses, so the endianness
becomes undefined.  Since traditional RISC architectures only allow load
and store operations to reference memory, all other operations must be
carried out on values in registers.  So there would be no need to have
big-endian and little-endian versions of "ADD" or any other operation
besides LOAD and STORE.

-- 
Alan Lovejoy; alan@pdn; 813-530-2211; ATT-Paradyne: 8550 Ulmerton, Largo, FL.
Disclaimer: I do not speak for ATT-Paradyne.  They do not speak for me.
This Month's Slogan: Reach out and BUY someone (tm).
Motto: If nanomachines will be able to reconstruct you, YOU AREN'T DEAD YET.
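[Moderator's aside: Alan's point -- that endianness only exists at the LOAD,
and evaporates once the value is in a register -- can be sketched in a few
lines.  The 4-byte word size and the function names here are illustrative
assumptions, not taken from any machine discussed above.]

```python
def load_word_le(mem, addr):
    """Little-endian load: the byte at the lowest address is least significant."""
    return sum(mem[addr + i] << (8 * i) for i in range(4))

def load_word_be(mem, addr):
    """Big-endian load: the byte at the lowest address is most significant."""
    return sum(mem[addr + i] << (8 * (3 - i)) for i in range(4))

mem = [0x12, 0x34, 0x56, 0x78]
le = load_word_le(mem, 0)   # 0x78563412
be = load_word_be(mem, 0)   # 0x12345678

# Once the value is "in a register", ADD neither knows nor cares which
# load produced it -- so only LOAD and STORE need endian variants:
assert le + 1 == 0x78563413
assert be + 1 == 0x12345679
```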
pauls@Apple.COM (Paul Sweazey) (02/03/89)
How about a bit in the page tables that indicates the endian-ness of data accesses through that map? Seems like the right place to me (and I didn't think it up either). Paul Sweazey pauls@apple.com
keithe@tekgvs.LABS.TEK.COM (Keith Ericson) (02/04/89)
> > Sigh, RISC doesn't mean a small number of instructions. RISC means..

Seems to me that the "reduced" is a totally incorrect moiniker (sp?): the
truly salient point is that all the instructions are equal length, to
reduce problems maintaining the instruction pipeline.

>Incidentally, wouldn't little-beginnian and big-beginnian be more
>accurate?

Yeah, you're right.  Guess it depends on which eye you look out of. :-)

kEITH
rpw3@amdcad.AMD.COM (Rob Warnock) (02/07/89)
+---------------
| > Incidentally, wouldn't little-beginnian and big-beginnian be more accurate?
| Yeah, you're right. Guess it depends on which eye you look out of. :-)
+---------------

Well, the names "Big-Endian" and "Little-Endian" were borrowed by Danny
Cohen (in his classic paper "On Holy Wars And A Plea For Peace") from
peoples of the same names in one of Jonathan Swift's stories.  Seems there
was this "holy war" between folks who liked to eat their soft-boiled eggs
from the "big end" and those who preferred to eat from the "little end"...

So "accuracy" here should probably yield to classic usage, eh?  ;-}

Rob Warnock
Systems Architecture Consultant
UUCP:     {amdcad,fortune,sun}!redwood!rpw3
ATTmail:  !rpw3
DDD:      (415)572-2607
USPS:     627 26th Ave, San Mateo, CA 94403
jk3k+@andrew.cmu.edu (Joe Keane) (02/08/89)
Keith Ericson writes:
> Seems to me that the "reduced" is a totally incorrect moiniker (sp?): the
> truly salient point is that all the instructions are equal length, to reduce
> problems maintaining the instruction pipeline.

As much as i dislike VAX instruction encoding, i can't agree with this.
Single-size instructions are nice, but you'll pay a price in code density.
The RT has two instruction sizes, and i think it was the right choice.
dennis@gpu.utcs.toronto.edu (Dennis Ferguson) (02/08/89)
In article <cXvqKRy00Wo=0TV282@andrew.cmu.edu> jk3k+@andrew.cmu.edu (Joe Keane) writes:
>Single-size instructions are nice, but you'll pay a price in code density. The
>RT has two instruction sizes, and i think it was the right choice.

Except that, because they had to encode both the opcode and a couple of
registers into 16-bit instructions, the RT ended up with only 16 registers.
There just isn't enough room for more registers if you have to accommodate
the entire instruction set in 16 bits.

Personally, I think that was the wrong choice.  I think I'd rather have
longer instructions and more registers, thanks.

Dennis Ferguson
University of Toronto
firth@sei.cmu.edu (Robert Firth) (02/08/89)
Keith Ericson writes:
> Seems to me that the "reduced" is a totally incorrect moiniker (sp?): the
> truly salient point is that all the instructions are equal length, to reduce
> problems maintaining the instruction pipeline.

In article <cXvqKRy00Wo=0TV282@andrew.cmu.edu> jk3k+@andrew.cmu.edu (Joe Keane) writes:
>As much as i dislike VAX instruction encoding, i can't agree with this.
>Single-size instructions are nice, but you'll pay a price in code density. The
>RT has two instruction sizes, and i think it was the right choice.

This issue has been argued quite vigorously in the DoD RISC program, and
I'd like to offer an unobjective opinion.

It seems pretty clear that instruction density can be improved by having
more than one length.  The GE design used two lengths (16 and 32) and
claimed a 15% improvement in instruction density as a result.  This seems
reasonable to me.  However, set against that the following:

. more complicated instruction decode
. more complicated pipeline management
. more complicated Icache design
. loss of one bit in span of relative branch or call

If the goal is pure speed, the question I ask is: are you better off with
15% more bytes of instructions and a bigger Icache?  If you can raise the
hit rate from 93% to 94%, you have offset the difference in instruction
size (you fetch 6% rather than 7%, for a reduction of 14% in instructions
fetched).  Moreover, very little new logic is involved, just more of the
same.

My view (for what it's worth) is that with CURRENT technology it is better
to have all instructions the same length.  However, you do need a big
Icache (as I think the evolution of the Mips Inc machines demonstrates).
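[Moderator's aside: Robert's 93%-vs-94% arithmetic checks out under a toy
model.  The assumption -- that refill traffic is proportional to code
footprint times miss rate -- is his; the figures below are just his numbers
plugged in.]

```python
def miss_traffic(code_bytes, hit_rate):
    """Relative bytes refilled from memory, in a crude model where refill
    traffic is proportional to code footprint times the Icache miss rate."""
    return code_bytes * (1.0 - hit_rate)

dense = miss_traffic(1.00, 0.93)   # variable-length: smaller code, 93% hits
fixed = miss_traffic(1.15, 0.94)   # fixed-length: 15% more bytes, 94% hits

# Cutting misses from 7% to 6% (a 14% reduction) just offsets the 15%
# size penalty, so the fixed-length machine comes out very slightly ahead:
assert fixed < dense
```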
rodman@mfci.UUCP (Paul Rodman) (02/09/89)
In article <cXvqKRy00Wo=0TV282@andrew.cmu.edu> jk3k+@andrew.cmu.edu (Joe Keane) writes:
>Keith Ericson writes:
>> Seems to me that the "reduced" is a totally incorrect moiniker (sp?): the
>> truly salient point is that all the instructions are equal length, to reduce
>> problems maintaining the instruction pipeline.
>
>As much as i dislike VAX instruction encoding, i can't agree with this.
>Single-size instructions are nice, but you'll pay a price in code density. The
>RT has two instruction sizes, and i think it was the right choice.

Well, maybe it was the right choice.  Code density is one of the LEAST
important aspects of instruction set design if performance is one of your
goals.

When I was a grad student at CMU in 1978 we actually studied several
existing architectures with an ISP simulator.  One of the metrics used to
"rate" the architectures was the static code size.  A better one was the
"dynamic" code size, which would correspond to instruction cache fetches.
None of these metrics really tried to quantify the things that make
pipelining an instruction set difficult.

Trading static code size for speed in execution is a tradeoff that most
folks would love.  Who cares what the size of the text is when the data is
100 Mb, anyway?  The important thing is getting control bits to the
functional units.

Some machines try to get the best of both worlds.  The Multiflow Trace, for
example, uses a very long instruction word to potentially execute many
operations per cycle.  However, we don't want to drag around 1024 bits for
EVERY instruction word, or static code size would be a problem.  So we have
a "mask" word in memory with the instruction "packets" that tells us which
packets were unused, and these packets are loaded to a zero (nop) at cmiss
time.  By processing the mask word at cmiss time we take something that
would normally be a first-order effect (i.e. cycle time) and move it to
second order (cmiss time).
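[Moderator's aside: the mask-word trick can be sketched as below.  The
8-slot width, the mask encoding, and the zero-means-nop convention are
guesses for illustration, not the Trace's actual format.]

```python
NOP = 0   # assume an all-zero packet decodes as a no-op

def expand_line(mask, packed, slots=8):
    """Rebuild one full-width instruction line at cache-miss time: bit i of
    the mask says slot i has a real packet in the compressed memory image;
    clear bits get a NOP, so NOPs cost no space in main memory."""
    src = iter(packed)
    return [next(src) if (mask >> i) & 1 else NOP for i in range(slots)]

# The compressed image holds only the two live packets;
# the full 8-slot line is reconstituted on refill:
line = expand_line(0b00000101, ["iadd", "fmul"])
assert line == ["iadd", NOP, "fmul", NOP, NOP, NOP, NOP, NOP]
```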
You could even do tricks like this at page-fault time (3rd order).  I
believe a company called Computer Consoles built a vax clone that
translated the instructions at cmiss time.

Personally, I think the vax has about the worst possible architecture one
could come up with (assuming you aren't PURPOSEFULLY trying to create a bad
machine :-).  It was great in the days when control store proms were 50ns
and main memory was 1us.  Today, cache rams are the same speed as the
control store rams (and the control store better be rams, because the
microcode is so complex that you have to be able to fix it).  Byte-aligned
instructions make the hardware more difficult and buy you nothing.  So many
instructions that serve no purpose, etc, etc.  Blech, what a mess.  And so
many minds worked *hard* to create it!  Ha!

Paul Rodman
rodman@mfci.uucp

Who says engineering is a science?  Engineering is an Art, and don't ever
forget it!
bla@hpcupt1.HP.COM (Brad Ahlf) (02/09/89)
>> Sigh, RISC doesn't mean a small number of instructions. RISC means..
Instead of RISC, how about RCC:
Reduced Complexity Computers
I have seen this discussed and coined in some of the analyst papers like
Gartner Group's (I think) discussion of current 'RISC' computers. I like
this term and think it is both easier to grasp and a better expression of
the current successful implementations.
jk3k+@andrew.cmu.edu (Joe Keane) (02/09/89)
Dennis Ferguson writes:
> Except that, because they had to encode both the OP code and a couple of
> registers into 16 bit instructions, the RT ended up with only 16 registers.
> There just isn't enough room for more registers if you have to accomodate the
> entire instruction set in 16 bits.

I agree register sets are limited by instruction coding, but i don't think
that different-sized instructions is part of the problem.  Most of the RT's
32-bit instructions consist of an 8-bit opcode, two 4-bit register
specifiers, and a 16-bit immediate field.  You can't add more registers
even if you ignore 16-bit instructions.

The PDP-11 has single-size instructions (plus immediate words, does this
count?) but only 3-bit register specifiers.  Can't fix this either.  The
R2000 has big single-size instructions, but still only a 5-bit register
specifier.  32 registers is nothing to get excited about.

What can we do about this?  More on this later...
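[Moderator's aside: the bit-budget arithmetic behind these register-count
limits, using the field widths quoted in the thread:]

```python
# RT 32-bit form: 8-bit opcode + two 4-bit register fields + 16-bit immediate
assert 8 + 2 * 4 + 16 == 32    # the budget is exactly spent -- no spare bits

# An n-bit register specifier names at most 2**n registers:
assert 2 ** 4 == 16            # RT's limit
assert 2 ** 3 == 8             # PDP-11
assert 2 ** 5 == 32            # R2000
```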
lamaster@ames.arc.nasa.gov (Hugh LaMaster) (02/10/89)
In article <6310013@hpcupt1.HP.COM> bla@hpcupt1.HP.COM (Brad Ahlf) writes:
>>> Sigh, RISC doesn't mean a small number of instructions. RISC means..
>
>Instead of RISC, how about RCC:
>
>Reduced Complexity Computers

How about FIFC?

Fixed Instruction Format Computer

"To avoid these [described previously] problems two criteria were used in
the design of the VAX-11 instruction format: (1) all instructions should
have the 'natural' number of operands, and (2) all operands should have the
same generality in specification.  These criteria led to a highly variable
instruction format. ..."

	- W. D. Strecker, "VAX 11/780" (reprinted in Siewiorek, Bell, and
	  Newell).

Now, as history has shown, it was exactly that highly variable format, with
a variable number of operands, and addressing-mode encoding requiring
separate decoding, which has made such instruction sets hard to pipeline,
and thus harder to build high-performance machines with.  (And thus the
cause of more problems than those solved.)  The secret to success for all
the RISC machines that I know of, from the CDC 6600 on, has been the choice
of simple, fixed, easy-to-decode instruction formats.

As has been pointed out repeatedly on this group, the NUMBER of
instructions is really a second-order determinant of performance.  (On the
other hand, including a lot of special hardware for functions which are
never exercised certainly is a first-order determinant.)  So, while we are
proposing, I propose turning Strecker's phrase around: the opposite of
"highly variable" is "fixed", and it also happens to be one of the few
common denominators of all "RISC" machines.

As an aside, I note, as previously, that the VAX was very successful at
meeting another mid-70's problem: keeping object code size small.  I notice
that, recently, some results from the GNU compiler on some RISCs have been
a challenge to the object code size of VAX code.  Obviously, GNU is doing
something interesting there.  Does anyone know why the code from the GNU C
compiler appears to be significantly more compact than other compilers on
the RISC machines?

Hugh LaMaster, m/s 233-9,  UUCP ames!lamaster
NASA Ames Research Center  ARPA lamaster@ames.arc.nasa.gov
Moffett Field, CA 94035    Phone: (415)694-6117
lexw@idca.tds.PHILIPS.nl (A.H.L. Wassenberg) (02/10/89)
In article <21606@ames.arc.nasa.gov> lamaster@ames.arc.nasa.gov (Hugh LaMaster) writes:
>
> How about FIFC?
>
> Fixed Instruction Format Computer
>

It feemf to me af if you had a fpeech-defect!  :-)

Lex Wassenberg
Philips Telecommunication & Data Systems B.V.
Apeldoorn, The Netherlands
Internet: lexw@idca.tds.philips.nl
UUCP: ..!mcvax!philapd!lexw
jesup@cbmvax.UUCP (Randell Jesup) (02/14/89)
In article <8476@aw.sei.cmu.edu> firth@bd.sei.cmu.edu (Robert Firth) writes:
>In article <cXvqKRy00Wo=0TV282@andrew.cmu.edu> jk3k+@andrew.cmu.edu (Joe Keane) writes:
>>As much as i dislike VAX instruction encoding, i can't agree with this.
>>Single-size instructions are nice, but you'll pay a price in code density. The
>>RT has two instruction sizes, and i think it was the right choice.
>
>This issue has been argued quite vigorously in the DoD RISC program,
>and I'd like to offer an unobjective opinion.
>
>It seems pretty clear that instruction density can be improved by having
>more than one length.  The GE design used two lengths (16 and 32) and
>claimed a 15% improvement in instruction density as a result.  This seems
>reasonable to me.

Not quite.  The GE RPM-40 had one instruction size: 16 bits.  One of the
instructions was 'prefix' (PFX), which supplied 12 bits of immediate for
use in the next instruction (most instructions using immediates could use 4
bits of immediate directly, except things like branch, which used 12).  You
could put multiple PFX instructions before a regular one to build up a
32-bit immediate.  Most constants (I think the number was 90%) fall in 4
bits, and close to 99% fall in 16 (1 PFX instruction).  The other
disadvantage of 16-bit instructions is two-operand instructions vs
3-operand (though many times three operands aren't needed).  The advantage
of 16-bit instructions is memory bandwidth.

>However, set against that the following
>
>. more complicated instruction decode

Very slightly: since PFX is an instruction, it just routes the result to
the immediate-value register.  The most complex part of this (not very) is
shifting the value in the IVR over when a PFX is executed (easy because
it's a fixed shift).

>. more complicated pipeline management

All the reorganizer has to do is keep PFXs with their associated
instructions.  It does add a small amount of complexity, but not much.
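[Moderator's aside: the PFX mechanism Randell describes can be sketched as
below.  The exact shift widths and the final-masking behavior are one
reading of his post, not RPM-40 documentation.]

```python
IMM_BITS = 4    # immediate bits carried directly by an ordinary instruction
PFX_BITS = 12   # immediate bits supplied by each PFX instruction

def pfx(ivr, value):
    """Each PFX shifts the immediate-value register and ORs in 12 more bits."""
    return ((ivr << PFX_BITS) | (value & 0xFFF)) & 0xFFFFFFFF

def consume(ivr, inst_imm):
    """The consuming instruction contributes its own 4 low bits; the IVR
    is then conceptually cleared for the next instruction."""
    return ((ivr << IMM_BITS) | (inst_imm & 0xF)) & 0xFFFFFFFF

# Build the constant 0x1234567 with two PFXs plus the instruction itself:
ivr = pfx(0, 0x123)
ivr = pfx(ivr, 0x456)
assert consume(ivr, 0x7) == 0x1234567

# A 4-bit constant -- the ~90% common case -- needs no PFX at all:
assert consume(0, 0xA) == 0xA
```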
>. more complicated Icache design

On this you're wrong, since PFX is just another instruction.

>. loss of one bit in span of relative branch or call

Also wrong, since a branch doesn't need to indicate whether a following
word is part of the instruction: it just takes the IVR and masks in the
rest of the immediate from the branch.  For one PFX, you are limited to 24
bits of relative addressing.  That's usually enough.  If it isn't, use two
PFXs for 32 bits relative.

>If the goal is pure speed, the question I ask is: are you better off
>with 15% more bytes of instructions and a bigger Icache?  If you can
>raise the hit rate from 93% to 94% you have offset the difference in
>instruction size (you fetch 6% rather than 7% for a reduction of 14%
>in instructions fetched).  Moreover, very little new logic is involved,
>just more of the same.

I say there is no difference in Icache complexity due to 16-bit
instructions.  Therefore, you should get twice as many instructions into
it, and if the 15% figure is true, then you should get effectively 15% more
done with what's in the Icache.  (This assumes a loop the size of the
Icache.  For other conditions, it may change the hit rate instead.)

>My view (for what it's worth) is that with CURRENT technology it is
>better to have all instructions the same length.  However, you do
>need a big Icache (as I think the evolution of the Mips Inc machines
>demonstrates).

Icache is what makes the world go around.  The next stepping stone: making
dcaches more efficient (their current hit rates are lousy unless they're
ridiculously large).  This may require yet more integration of back-end
software and silicon, or even front-end software.

Randell Jesup, Commodore Engineering
{uunet|rutgers|allegra}!cbmvax!jesup
rodman@mfci.UUCP (Paul Rodman) (02/15/89)
In article <5964@cbmvax.UUCP> jesup@cbmvax.UUCP (Randell Jesup) writes:
>	Very slightly, since PFX is an instruction, it just routes the result
>to the immediate value register.  The most complex part of this (not very)
>is shifting the value in the IVR over when a PFX is executed (easy because
>it's a fixed shift).

You mean it takes me *cycles* to build a >4-bit constant???  *Gasp, choke.*
I guess that's fine for some machines, but if you're reading 4 x 64-bit
words from a large array every beat from a common block, you may need lots
of constants without such a penalty.

I'd much rather have the instruction bits to do *exactly* what I (i.e. the
compiler) want to do, than have a program that is statically smaller.
Cache, main memory, and disk are cheap and getting cheaper.  Anytime I can
trade them for speed it's a win.

>	I say there is no difference in Icache complexity due to 16-bit
>instructions.

There would be on a machine that was trying to do more than one lousy
operation per cycle.

>Therefor, you should get twice as many instructions into
>it, and if the 15% figure is true, then you should get effectively 15%
>more done with what's in the icache.  (This assumes a loop the size of the
>icache.  For other conditions, it may change the hit-rate instead.)

Icaches can be made huge.  Who cares?

>	Icache is what makes the world go around.  The next stepping stone:
>making dcaches more efficient (their current hit-rates are lousy unless
>they're ridiculously large).

Why not have a *real* pipelined memory system and a compiler that can
handle more than one miserable outstanding load in flight?  Or have both.

Paul Rodman
rodman@mfci.uucp
aglew@mcdurb.Urbana.Gould.COM (02/16/89)
>Why not have a *real* pipelined memory system and a compiler than can
>handle more than one miserable outstanding load in flight? Or have both.
>
>	Paul Rodman

Denelcor's HEP.

We discussed barrel processors (processors that execute a different process
at every pipe stage, hence no dependencies) a while back; someone (I think
it was John Mashey) posted a rather good argument against them.  Does
anyone have that around?

(BTW, is comp.arch archived anywhere?)
w-colinp@microsoft.UUCP (Colin Plumb) (02/16/89)
rodman@mfci.UUCP (Paul Rodman) wrote:
> In article <5964@cbmvax.UUCP> jesup@cbmvax.UUCP (Randell Jesup) writes:
>> Very slightly, since PFX is an instruction, it just routes the result
>> to the immediate value register.  The most complex part of this (not very)
>> is shifting the value in the IVR over when a PFX is executed (easy because
>> it's a fixed shift).
>
> You mean it takes me *cycles* to build a >4 bit constant??? *Gasp,choke.*
> I guess thats fine for some machines, but if you're reading 4 x 64 bit words
> from a large array every beat from a common block, you may need lots of
> constants without such a penalty.

You'd *hate* the transputer.  It takes one prefix instruction per 4 bits.
Currently, that's one cycle per 4 bits (past the first).  There are some
noises about speeding up this particular part of the decode process.

> Icaches can be made huge. Who cares?

Anyone who wants 15% more Icache than the competitor.  Assume we both have
the same Icache technology.  If my instructions are denser, I get more of
my inner loop in the Icache, and run faster.  If the denser instructions
don't cost me in decode as much as they get me in hit rate, I win.

Although, as you point out, Icache capacity is a rapidly moving target.
Still, it can be a valid tradeoff.

-Colin (uunet!microsoft!w-colinp)
"Don't listen to me.  I never do."
rodman@mfci.UUCP (Paul Rodman) (02/17/89)
In article <28200275@mcdurb> aglew@mcdurb.Urbana.Gould.COM writes:
>>Why not have a *real* pipelined memory system and a compiler than can
>>handle more than one miserable outstanding load in flight? Or have both.
>>
>>	Paul Rodman
>
>Denelcore's HEP.

Yuk.  Please don't bring up that pile of junk.  I was referring more to our
machine, where the compiler sees an exposed pipeline several beats long to
main memory.  In some cases a data cache is useful for reducing latency, if
the hit rate will be high enough.  Of course, your compiler must do a good
job of deciding when to go for the cache.

...I'm disgusted at these piddly RISC chips that brag about register
scoreboarding of one lousy outstanding load at a time.  Great for use with
data caches, not much use for high-performance computation on large data
sets (where I'd much rather not use the cache, but want to get decent
bandwidth).

If you want to do a daxpy, for example, you need two loads and a store per
2 flops.  What good are floating-point chips if you can't feed 'em?  The
architecture is broken if I can't at least do 1 load/store (from memory,
please) per flop.  :-)

Paul Rodman
rodman@mfci.uucp
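[Moderator's aside: the daxpy bookkeeping Paul cites -- per element, two
loads (x[i] and y[i]), one store (y[i]), and two flops (a multiply and an
add) -- in a quick sketch:]

```python
def daxpy(a, x, y):
    """y := a*x + y, the kernel Paul cites: 2 loads + 1 store per 2 flops."""
    for i in range(len(x)):
        y[i] = a * x[i] + y[i]   # loads x[i], y[i]; one mul, one add; stores y[i]
    return y

assert daxpy(2.0, [1.0, 2.0], [3.0, 4.0]) == [5.0, 8.0]

loads, stores, flops = 2, 1, 2   # per vector element
# why "1 load/store per flop" is the minimum a sane architecture must sustain:
assert (loads + stores) / flops == 1.5
```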
rcd@ico.ISC.COM (Dick Dunn) (02/17/89)
In article <649@m3.mfci.UUCP>, rodman@mfci.UUCP (Paul Rodman) writes:
...
>>Denelcore's HEP.
...
> Yuk. Please don't bring up that pile of junk.

This is rude, out of place in the newsgroup, and unfounded anyway.  The
ideas that Denelcor (mostly Burton Smith) attempted to develop in the HEP
were reasonably well-founded; pieces of them can be found in more
contemporary designs.  Remember that the HEP design was really mid-'70's,
and also keep in mind that corporate vicissitudes (financing, contracts,
and all that piddly money stuff) do not determine technical merit.

> I was referring more to our machine...

How nice.  Slam machine X, then boost your own.

> ...In some
> cases a data cache is useful for reducing latency if the hit rate will be
> high enough. Of course, your compiler must do a good job
> about deciding when to go for the cache...

Remarkable as it may seem, there are data caches with very high hit rates,
and the compilers don't even have to worry about their existence.

Dick Dunn    UUCP: {ncar,nbires}!ico!rcd    (303)449-2870
   ...Just say no to mindless dogma.
aglew@mcdurb.Urbana.Gould.COM (02/17/89)
>Andy also mentioned at one point that he thought string ops weren't
>very necessary in a machine that had word ops with masks (or somesuch;
>I don't recall exactly). I wanted to hear your reasons for that, Andy.
>I sent you email, but it was bounced. Would you please post a msg with
>more details?
>
>	-Olin

My reasoning: the most commonly used string ops are moves.  There's only
one way to make moves faster - move more data per cycle => larger busses
[*].  Most string moves are small, so would, e.g., be able to fit into a
128-bit, 16-byte wide bus.

Given that you can move wide words, how do you handle misaligneds?  By a
decomposition of the word into power-of-two-sized transactions - works, but
gets more difficult as word size increases, and, in the dynamic case,
requires decisions.  Most byte-addressable architectures already have
signals similar to "store the data off the bus in this word only".  Provide
explicit control of these.  Similar arguments apply for string lengths.

I don't have enough real data behind these statements, yet; but I've been
flogging them for a few years, and have finally got support to examine them
in detail.  Since I need to invent the tools to do the study first, you
could still probably beat me to publication - I wouldn't mind, just tell me
so that I can do something more interesting.

Now, I admit that John Mashey's statements about the inefficacy of
optimizing strings tend to imply that this is not a very rewarding area of
research, but I point out that the same things also apply to block moves -
and, running profiling on my system, I regularly see block moves occupying
up to 10% of system time.  Why is a discussion for an OS group.

[*] Well, remapping and parallel moves are possibilities, appropriate for
large moves, but probably not for the most frequent case.
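[Moderator's aside: the power-of-two decomposition of a misaligned move can
be sketched with a greedy widest-aligned-first split.  The 16-byte bus
width is just the example figure from the post; the splitting policy is an
illustration, not a description of any shipping memory controller.]

```python
def decompose(addr, length, bus_bytes=16):
    """Split a byte move into naturally aligned power-of-two transactions,
    taking the widest transaction that is aligned and fits at each step."""
    parts = []
    while length > 0:
        w = bus_bytes
        while w > 1 and (addr % w != 0 or w > length):
            w //= 2
        parts.append((addr, w))
        addr += w
        length -= w
    return parts

# A 13-byte move starting at address 3 takes three transactions --
# this is the "requires decisions" dynamic case:
assert decompose(3, 13) == [(3, 1), (4, 4), (8, 8)]

# An aligned bus-width move is a single transaction, the cheap common case:
assert decompose(32, 16) == [(32, 16)]
```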
aglew@mcdurb.Urbana.Gould.COM (02/18/89)
>The architecture is broken if I can't at least do 1 load/store
>(from memory,please) per flop. :-)
>
>	Paul Rodman
>	rodman@mfci.uucp

Actually, given some of the instruction frequency statistics I've been
seeing recently, you need about 3 load/stores per arithmetic operation.
From cache, admittedly.
jesup@cbmvax.UUCP (Randell Jesup) (02/25/89)
In article <644@m3.mfci.UUCP> rodman@mfci.UUCP (Paul Rodman) writes:
>In article <5964@cbmvax.UUCP> jesup@cbmvax.UUCP (Randell Jesup) writes:
>>	Very slightly, since PFX is an instruction, it just routes the result
>>to the immediate value register.  The most complex part of this (not very)
>>is shifting the value in the IVR over when a PFX is executed (easy because
>>it's a fixed shift).
>
>You mean it takes me *cycles* to build a >4 bit constant??? *Gasp,choke.*
>I guess thats fine for some machines, but if you're reading 4 x 64 bit words
>from a large array every beat from a common block, you may need lots of
>constants without such a penalty.

Excuse me, but what does loading from common blocks have to do with
constant sizes?  As I said, if you look at statistics on usage of
constants, a very large percentage will fit in 4 bits.

This design was not a total pedal-to-the-metal design, but one that
balanced memory speed, size, and bandwidth against processor speed.
16-bit instructions allow us to use much denser, slower, and cheaper
instruction memories while still running at 40MHz, and much faster than we
would have with 32-bit instructions at 20MHz (given constant I-Mem speed).
Remember, there are other uses of RISC chips than in Unix workstations.
Embedded controllers, for one.

>>	I say there is no difference in Icache complexity due to 16-bit
>>instructions.
>
>There would be on a machine that was trying to do more than one lousy
>operation per cycle.

Why the flame?  I was talking about the design decision between 32-bit
instructions and 16-bit ones on the RPM-40.  Not many 'RISC' chips execute
more than one instruction per cycle, certainly not the RPM-40.  It pushed
the design rules a fair ways, and had several close-to-critical paths (in
other words, without process/design-rule changes it would be very hard to
add a lot to it).

>>Therefor, you should get twice as many instructions into
>>it, and if the 15% figure is true, then you should get effectively 15%
>>more done with what's in the icache.  (This assumes a loop the size of the
>>icache.  For other conditions, it may change the hit-rate instead.)
>
>Icaches can be made huge. Who cares?

People who can't afford huge external Icaches.  Also, at the speeds we're
talking about, there are loading limitations to how much ram you can attach
to the processor.  The Icache I was talking about was the on-chip Icache.

>>	Icache is what makes the world go around.  The next stepping stone:
>>making dcaches more efficient (their current hit-rates are lousy unless
>>they're ridiculously large).
>
>Why not have a *real* pipelined memory system and a compiler than can
>handle more than one miserable outstanding load in flight?  Or have both.

We do.  We can do one load per cycle, and the memory systems are pipelined.
(The latency is more than one cycle, of course.)

>	Paul Rodman
>	rodman@mfci.uucp

Calm down.

Randell Jesup, Commodore Engineering
{uunet|rutgers|allegra}!cbmvax!jesup