rajiv@im4u.UUCP (Rajiv N. Patel) (02/26/88)
> All you 32-bit instruction advocates : how many of your 32-bits of
> instruction are usually wasted ( like by leading zeroes or ones, or
> unused register specifications ) ? If it sounds like I'd welcome a
> debate on the merits of 16 vs 32 bit instructions : sure. Isn't that
> what comp.arch is for ? And I said a DEBATE, not a fire-fight :-)
>
> --
> Dennis O'Connor      oconnor@sunset.steinmetz.UUCP ??
> ARPA: OCONNORDM@ge-crd.arpa
> "Nuclear War is NOT the worst thing people can do to this planet."

I agree with Dennis, this is a great topic of DEBATE on comp.arch.  I
am not sure whether it has been discussed already, but it's worthwhile
starting again.

In recent years many new(?) RISC architectures have been proposed, and
each of them seems to have differences in its instruction set.  Some
have only 32 bit instructions, whereas others have both 16 and 32 bit
instructions.  Then there are some, like the Bell CRISP, which has 16,
48 and 80 (believe it or not!) bit instructions.  With these variations
it seems only fair to ask how the variable length instruction
architectures can compete with the so-called fixed length (faster
decode) architectures.

I have done some assembly coding for an ongoing RISCy project here
where we have 16 and 32 bit instructions available.  In my experience,
most of the programs I have coded (<2K instructions) have about 70-90%
of their instructions from the 16 bit category.  This tells me that 16
bit instructions are indeed very useful, and that the additional time
one may incur decoding mixed 16 and 32 bit instructions could be offset
by the time saved fetching instructions from memory/cache, assuming a
32 bit wide bus.  The fixed instruction architectures never seem to
talk about the memory traffic involved in getting all the leading zeros
and the unnecessary third register name.  Even the CRISP people, who
had the courage to use an 80 bit instruction, have 16 bit instructions
and claim a very high percentage of these.

I would like to know from other, more experienced people (compared to
me, of course) what the different reasons for using a fixed 32 bit
instruction architecture might be.

Rajiv N Patel.
(rajiv@im4u)
University of Texas at Austin.
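P.S.  To put a rough number on the traffic argument, here is a
back-of-the-envelope sketch in C.  The 80% share of 16 bit instructions
is an assumption picked from the middle of my 70-90% range, purely for
illustration:

    #include <stdio.h>

    int main(void)
    {
        double p16 = 0.80;  /* assumed fraction of 16-bit instructions */
        double avg = p16 * 2.0 + (1.0 - p16) * 4.0;  /* bytes/instr    */

        printf("average size : %.2f bytes/instruction\n", avg);
        printf("fetch traffic: %.0f%% of a fixed 32-bit encoding\n",
               100.0 * avg / 4.0);
        return 0;
    }

With that mix the average instruction is 2.4 bytes, so the mixed
encoding fetches only 60% of the bytes a fixed 32 bit encoding would.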
haynes@ucscc.UCSC.EDU (99700000) (02/26/88)
In article <2574@im4u.UUCP> rajiv@im4u.UUCP (Rajiv N. Patel) writes:
>
> I agree with Dennis, this is a great topic of DEBATE on comp.arch.  I
> am not sure whether it has been discussed already, but it's worthwhile
> starting again.

I'm glad somebody brought this up, and while we're at it let's debate
the value of including 16 bit data types in RISC machines.  A variety
of data sizes slows a machine down almost as much as a variety of
instruction sizes.  I was rather surprised to see that Sun included
16-bit data in SPARC in this day and age of cheap memory.  8-bit data
has obvious utility for character strings; but do we really need
16-bit integers anymore?

haynes@ucscc.ucsc.edu
haynes@ucscc.bitnet
..ucbvax!ucscc!haynes
bfb@mtund.ATT.COM (Barry Books) (02/27/88)
I wrote code for CRISP a while ago, including machine language ( no
tools ), and was impressed by the opcodes.  CRISP had only a few
instructions, but calling it RISC may be stretching things, since it
had three-operand instructions and no registers.  However, it had a 32
word stack cache, which seems to me better than registers ( or even
register windows, which at least allow future versions to have more
registers without recompiling to take advantage of them ).

Most of the 16 bit instructions had 5 bit offsets, so they could be
used to access the stack cache with small instructions.  This allowed
instructions working on local variables to be small.  The 32 bit
instructions I believe had 10 bit offsets, so as the stack cache is
made larger it still doesn't take a large address field.  These are
also useful for short jumps.  The 80 bit instructions could take two
addresses ( useful for memcpy, which seems to happen all too
frequently ).

The thing that impressed me the most was that it didn't take a killer
optimizing compiler ( read big, buggy and slow ) to make the machine
run fast.  It seems that for stack based languages a stack cache is
much more sensible than having a bunch of registers that the compiler
has to figure out what to do with.

Also, since the subject of jumps seems to be popular: CRISP used a
branch prediction bit which, when correct, caused jumps to take no
cycles.  Branch prediction is fairly easy for loops; the compiler I
used always set the bit so the branch was taken.  This simple approach
works for loops assuming you go through them more than once.

All in all a nice design.

Barry Books
mtund!bfb       better is bigger
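P.S.  To see why predict-taken is the right static guess for a loop,
consider an ordinary C loop ( a sketch; nothing CRISP-specific here ):

    int sum(const int *a, int n)
    {
        int i, s = 0;
        for (i = 0; i < n; i++)   /* the backward branch that closes  */
            s += a[i];            /* this loop is taken on every trip */
        return s;                 /* but the last, so a static taken  */
    }                             /* bit is right (n-1) times in n    */

For any sizable n, the one misprediction at loop exit is noise.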
elg@killer.UUCP (Eric Green) (02/27/88)
in article <2574@im4u.UUCP>, rajiv@im4u.UUCP (Rajiv N. Patel) says:
>> All you 32-bit instruction advocates : how many of your 32-bits of
>> instruction are usually wasted ( like by leading zeroes or ones, or
>> unused register specifications ) ? If it sounds like I'd welcome a
>> debate on the merits of 16 vs 32 bit instructions : sure. Isn't that
> of the programs I have coded (<2K instructions) have about 70-90% of
> their instructions from the 16 bit category.  This tells me that 16
> bit instructions are indeed very useful, and that the additional time
> one may incur decoding mixed 16 and 32 bit instructions could be
> offset by the time saved fetching instructions from memory/cache,
> assuming a 32 bit wide bus.  The fixed instruction architectures never
> seem to talk about the memory traffic involved in getting all the
> leading zeros and the unnecessary third register name.

The operative parameter here is: is the bus width some multiple n (with
n > 1) of the instruction width?  If so, then it doesn't matter how
large the instructions are -- you'll always be able to fetch multiple
instructions faster than you can execute them.  For example, what you
mentioned -- a 32 bit bus, with 16 bit instructions.  Or a 64 bit bus,
with 32 bit instructions.  Execution-time wise, it doesn't matter
either way, unless there's additional decoding overhead (such as
variable length instructions for the first, but not for the second).

And then there is the case of cache.  Whenever you fetch an opcode out
of cache memory, you have no delays anyhow.  Someone from ?Pyramid?
posted about their architecture a long time ago.  The cache is a very
important part of that machine, and is integrated with a bus that's
wider than the instruction/data width.  For example, 32-bit
instructions & data, with a 192-bit-wide memory bus.  Considering the
locality of data, that means the next 5 instructions will already be in
cache, instantly available (nearbouts).  I fail to see how this can
slow the machine down any, except for the i/o overhead of loading it
into main memory in the first place (which is not a cpu delay but,
rather, a response-time delay during which some other process is
running).

So, while 16-bit variable-length instructions with a 32-bit data bus
may be a win on a machine with slow memory access time and no cache
(i.e. your typical microcomputer, at the moment), 32-bit fixed length
instructions with a 64-bit instruction-fetch memory interface will blow
it into the weeds come ultimate-performance time, because of the lack
of instruction decode overhead.  All a matter of keeping the memory
interface bigger than the instruction size....

I also have some papers here from the original RISC guys at
UC-Berkeley, and AMD's design team, which discuss the issue at great
length.  Their basic conclusion is that the locality of reference of
large caches means that 32-bit fixed length instructions are a big win
even in the absence of a big memory interface.  Of course, their basic
problem was justifying the larger flow of instructions in a RISC
machine as vs. a CISC machine, rather than specifically addressing
variable-length vs. fixed-length instructions, but their conclusions
still apply.  At least, if you accept the basic premise of RISC, that
is.
--
Eric Lee Green  elg@usl.CSNET        Asimov Cocktail, n., A verbal bomb
{cbosgd,ihnp4}!killer!elg            detonated by the mention of any
Snail Mail P.O. Box 92191            subject, resulting in an explosion
Lafayette, LA 70509                  of at least 5,000 words.
mash@mips.COM (John Mashey) (02/28/88)
In article <2116@saturn.ucsc.edu> haynes@ucscc.UCSC.EDU (Jim Haynes) writes:
>In article <2574@im4u.UUCP> rajiv@im4u.UUCP (Rajiv N. Patel) writes:
>I'm glad somebody brought this up, and while we're at it let's debate
>the value of including 16 bit data types in RISC machines.  A variety
>of data sizes slows a machine down almost as much as a variety of
>instruction sizes.  I was rather surprised to see that Sun included
>16-bit data in SPARC in this day and age of cheap memory.  8-bit data
>has obvious utility for character strings; but do we really need
>16-bit integers anymore?

I encourage all potentially competitive RISC machines to avoid 16-bit
integer support. :-)

More seriously:
-UNIX user-level programs don't use 16-bit quantities very much.
-A few applications use halfwords a lot: Berkeley timberwolf, for
example, has 11% of its instructions (on a MIPS R2000) as load/store
halfword.
-The UNIX kernel uses halfwords frequently, because it has large
arrays of structs whose data must be tightly packed.  You might argue
that some of these could be expanded (and they can), but others are
painful.
-If you want to do data communications, and write device drivers, both
cases where data structuring is beyond your control, you will be very
sorry if you don't have halfwords: both of these applications use them
frequently.
--
-john mashey    DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP:   {ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:    408-991-0253 or 408-720-1700, x253
USPS:   MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
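P.S.  A made-up example of the kind of kernel struct meant above ( the
fields are invented for illustration, not taken from any real kernel ):

    /* One entry in a large, tightly packed kernel table; each field
     * fits in 16 bits by definition of the protocol it describes.
     */
    struct conn {
        unsigned short src_port;
        unsigned short dst_port;
        unsigned short window;
        unsigned short flags;
    };  /* 8 bytes as halfwords; widened to 32-bit ints it becomes 16
           bytes, and an array of a few thousand doubles with it */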
frazier@oahu.cs.ucla.edu (Greg Frazier) (02/28/88)
In article <3508@killer.UUCP> elg@killer.UUCP (Eric Green) writes:
>in article <2574@im4u.UUCP>, rajiv@im4u.UUCP (Rajiv N. Patel) says:
>> bit instructions are indeed very useful, and that the additional time
>> one may incur decoding mixed 16 and 32 bit instructions could be
>> offset by the time saved fetching instructions from memory/cache,
>> assuming a 32 bit wide bus.  The fixed instruction architectures never
>> seem to talk about the memory traffic involved in getting all the
>> leading zeros and the unnecessary third register name.
>
>The operative parameter here is: is the bus width some multiple n (with
>n > 1) of the instruction width?  If so, then it doesn't matter how
>large the instructions are -- you'll always be able to fetch multiple
>instructions faster than you can execute them.  For example, what you
>mentioned -- a 32 bit
>
>And then there is the case of cache.  Whenever you fetch an opcode out
>of cache memory, you have no delays anyhow.  Someone from ?Pyramid?
[cut explanation]
>
>So, while 16-bit variable-length instructions with a 32-bit data bus
>may be a win on a machine with slow memory access time and no cache
>(i.e. your typical microcomputer, at the moment), 32-bit fixed length
>instructions with a 64-bit instruction-fetch memory interface will blow
>it into the weeds come
[cut some more]

Eric has a good point about the cache.  However, if one is using both
16 and 32 bit instructions, and 70-90% of the instructions are the
shorter ones, then you are going to be able to fit almost twice as many
instructions into the instruction cache.  Indeed, when the RISC group
went on to implement a RISC instruction cache ("Architecture of a VLSI
Instruction Cache for a RISC", Patterson et al., 10th Annual Symposium
on Comp. Arch.), they went to the mixed-instruction-length format in
order to improve its performance.  They did expand the instructions on
the back end of the cache, partly in order to keep the same CPU
hardware.

If one states the RISC philosophy as optimizing those aspects of the
CPU/instruction set which are used most often, then it very much makes
sense to make the instructions executed 70-90% of the time 16 bits.

In any case, if one is going to have a dedicated, off-chip instruction
cache, then the benefit of having 16-bit instructions is rather
minimal.  However, if one is going to have a combined instruction/data
cache, or an on-chip instruction cache, then the amount of memory
required to achieve a good instruction hit ratio is very important, and
having 16-bit instructions should be a real win.

Greg Frazier        o     Internet: frazier@CS.UCLA.EDU
CS dept., UCLA     /\     UUCP: ...!{ihnp4,ucbvax,sdcrdcf,trwspp,randvax,ism780}
               ----^/----       !ucla-cs!frazier
                   /
radford@calgary.UUCP (Radford Neal) (02/28/88)
In article <2116@saturn.ucsc.edu>, haynes@ucscc.UCSC.EDU (99700000) writes:
> ... while we're at it let's debate
> the value of including 16 bit data types in RISC machines.  A variety
> of data sizes slows a machine down almost as much as a variety of
> instruction sizes.  I was rather surprised to see that Sun included
> 16-bit data in SPARC in this day and age of cheap memory.  8-bit data
> has obvious utility for character strings; but do we really need
> 16-bit integers anymore?

I don't understand this.  Given that you support 8-bit bytes, and thus
need the extra two bits in addresses, plus the complications of more
than one data size on the bus, why does supporting 16-bit data cost a
lot?  As far as instructions go, it looks like somewhere between two
and four extra instructions for a load/store architecture.

I agree that 16-bit data will be used less and less.  It's not that
there isn't a lot of data that would fit in 16 bits, and it's not that
this wouldn't sometimes be a significant savings; it's because of the
popularity of C.  If Pascal (which of course had other faults) had
become dominant, lots of people would be declaring variables like
"var i: 1..Max_widgets", and if Max_widgets were 1000 this would be
held in 16 bits.  As it is, declaring a C variable to be "short" is an
invitation to future problems.

   Radford Neal
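P.S.  A small illustration of the C problem ( the names are invented;
this is a sketch, not anybody's real code ):

    /* In Pascal the programmer states the range and the compiler picks
     * a size:
     *     var i : 1..Max_widgets;
     * In C the programmer picks the size, and the choice can rot:
     */
    short next_widget(short n_widgets)
    {
        return n_widgets + 1;   /* promoted to int, then quietly
                                   truncated back to 16 bits on return:
                                   fine until the widget count outgrows
                                   32767, with no warning when it does */
    }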
root@mfci.UUCP (SuperUser) (02/29/88)
....
>frequently ).  The thing that impressed me the most was that it didn't
>take a killer optimizing compiler ( read big, buggy and slow ) to make
                                    ^^^^^^^^^^^^^^^^^^^^^^^^^^
>the machine run fast.  It seems that for stack based languages a stack
>cache is
>
>Barry Books
>mtund!bfb      better is bigger

I disagree with this attitude.  If you want to run fast, you pay the
price.  Whatever performance you got out of whatever compiler you had,
if it wasn't doing a good job of optimizing, then you could have gotten
more.  If you let that extra performance dribble away, you may have
your reasons, but it's likely that your competitors won't, so you may
not be able to keep doing it for long.

And anyway, "killer optimizing compilers" don't have to be buggy.  Did
you have some example in mind?

Bob Colwell            mfci!colwell@uunet.uucp
Multiflow Computer
175 N. Main St.
Branford, CT 06405     203-488-6090
ohbuchi@unc.cs.unc.edu (Ryutarou Ohbuchi) (02/29/88)
In article <2116@saturn.ucsc.edu> haynes@ucscc.UCSC.EDU (Jim Haynes) writes:
>In article <2574@im4u.UUCP> rajiv@im4u.UUCP (Rajiv N. Patel) writes:
>I'm glad somebody brought this up, and while we're at it let's debate
>the value of including 16 bit data types in RISC machines.  A variety
>of data sizes slows a machine down almost as much as a variety of
>instruction sizes.  I was rather surprised to see that Sun included
>16-bit data in SPARC in this day and age of cheap memory.  8-bit data
>has obvious utility for character strings; but do we really need
>16-bit integers anymore?

To us Japanese, the 16-bit unsigned integer also has obvious utility
for character strings.  We use a 16 bit/char code for Kanji + Kanas +
alphanumeric characters.  There are a few variations for encoding Kanji
(unbounded, but about 7k-8k for business use), Kanas (2 different sets
of about 60 each), and alphanumerics, intermixed.  Scanning these mixed
8 bit (ASCII) / 16 bit (JIS; Japanese Industrial Standard) strings is a
pain (a simple FSM, sometimes).  You need a new set of C string
libraries.  Also, in these codes bit 7 (the 8th bit) is used, which is
messy with some UNIX utilities.

Want to try Boyer-Moore string matching with an alphabet size of 10K?
It will be an interesting exercise.

Chinese and several other languages need larger character sets, too.
If you want to export computers and operating systems to these
countries, you had better not forget this (along with I/O devices, of
course -- who would buy a business computer which prints bills in
unreadable characters?).
==============================================================================
Any opinion expressed here is my own.
------------------------------------------------------------------------------
Ryutarou Ohbuchi        "Life's rock."  "Climb now, work later."  and, now,
                        "Life's snow."  "Ski now, work later."
ohbuchi@cs.unc.edu      <on csnet>
Department of Computer Science, University of North Carolina at Chapel Hill
==============================================================================
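P.S.  For the curious, the shape of the FSM, as a minimal sketch in C.
The assumption that any byte with its high bit set begins a two-byte
character is purely illustrative; real JIS-style encodings use shift
sequences and particular lead-byte ranges, so a production scanner is
messier than this:

    /* Count characters in a mixed 8-bit/16-bit string. */
    int count_chars(const unsigned char *s)
    {
        int n = 0;
        while (*s) {
            s += (*s & 0x80) ? 2 : 1;   /* 2-byte char or plain ASCII */
            n++;
        }
        return n;
    }

Every string primitive -- length, copy, compare, match -- needs this
kind of treatment, which is why a whole new set of C string libraries
is needed.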
brooks@lll-crg.llnl.gov (Eugene D. Brooks III) (02/29/88)
In article <2116@saturn.ucsc.edu> haynes@ucscc.UCSC.EDU (Jim Haynes) writes:
>instruction sizes.  I was rather surprised to see that Sun included
>16-bit data in SPARC in this day and age of cheap memory.  8-bit data
>has obvious utility for character strings; but do we really need
>16-bit integers anymore?

Yes, because no matter how cheap them bits are, using twice as many as
you need will always cost twice as much!
jesup@pawl18.pawl.rpi.edu (Randell E. Jesup) (02/29/88)
In article <3508@killer.UUCP> elg@killer.UUCP (Eric Green) writes:
>The operative parameter here is: is the bus width some multiple n (with
>n > 1) of the instruction width?  If so, then it doesn't matter how
>large the instructions are -- you'll always be able to fetch multiple
>instructions faster than you can execute them.  For example, what you
>mentioned -- a 32 bit bus, with 16 bit instructions.  Or a 64 bit bus,
>with 32 bit instructions.

Who ever said you had to run the bus at the same speed as the CPU?
The RPM-40 uses fixed-length 16-bit instructions, and fetches 2 every
other cycle.  This has several advantages: cheaper memory, more time
for external caches to respond, and improved cache efficiency, since
the same amount of memory holds twice the number of instructions.

I do agree that variable length instructions ARE a big drag on
performance in most cases.  There are ways to get some of the effects
of variable length instructions, such as the RPM-40 PREFIX instruction,
in a fixed-length architecture.

     // Randell Jesup                     Lunge Software Development
    //  Dedicated Amiga Programmer        13 Frear Ave, Troy, NY 12180
 \\//   beowulf!lunge!jesup@steinmetz.UUCP    (518) 272-2942
  \/    (uunet!steinmetz!beowulf!lunge!jesup) BIX: rjesup
(-: The Few, The Proud, The Architects of the RPM40 40MIPS CMOS Micro :-)
jesup@pawl19.pawl.rpi.edu (Randell E. Jesup) (03/01/88)
In article <2116@saturn.ucsc.edu> haynes@ucscc.UCSC.EDU (Jim Haynes) writes:
>I'm glad somebody brought this up, and while we're at it let's debate
>the value of including 16 bit data types in RISC machines.  A variety
>of data sizes slows a machine down almost as much as a variety of
>instruction sizes.  I was rather surprised to see that Sun included
>16-bit data in SPARC in this day and age of cheap memory.  8-bit data
>has obvious utility for character strings; but do we really need
>16-bit integers anymore?

One reason should be enough: compatibility.  If you want another: RAM
space.  RAM, especially for RISC computers, is less than cheap.  Priced
any 20-ns SRAMs lately?  Also note that DRAM prices are going UP
lately, in contradiction to all established precedent.

And no, being able to load and store halfwords doesn't really slow a
machine down.  All internal operations can be done in the natural word
size.

     // Randell Jesup                     Lunge Software Development
    //  Dedicated Amiga Programmer        13 Frear Ave, Troy, NY 12180
 \\//   beowulf!lunge!jesup@steinmetz.UUCP    (518) 272-2942
  \/    (uunet!steinmetz!beowulf!lunge!jesup) BIX: rjesup
(-: The Few, The Proud, The Architects of the RPM40 40MIPS CMOS Micro :-)
oconnor@sungoddess.steinmetz (Dennis M. O'Connor) (03/02/88)
An article by elg@killer.UUCP (Eric Green) says:
] in article <2574@im4u.UUCP>, rajiv@im4u.UUCP (Rajiv N. Patel) says:
] >>All you 32-bit instruction advocates : how many of your 32-bits of
] >>instruction are usually wasted ( like by leading zeroes or ones, or
] >>unused register specifications ) ? If it sounds like I'd welcome a
] >>debate on the merits of 16 vs 32 bit instructions : sure. Isn't that
]
] The operative parameter here is: is the bus width some multiple n
] (with n > 1) of the instruction width?  If so, then it doesn't matter
] how large the instructions are -- you'll always be able to fetch
] multiple instructions faster than you can execute them.  For example,
] what you mentioned -- a 32 bit bus, with 16 bit instructions.  Or a 64
] bit bus, with 32 bit instructions.  Execution-time wise, it doesn't
] matter either way, unless there's additional decoding overhead (such
] as variable length instructions for the first, but not for the
] second).

The real question isn't bus WIDTH, but rather bus BANDWIDTH, usually
measured in megabytes per second.  This is NOT an infinitely available
resource for single-chip CMOS processors.  The package limits you to a
finite number of pins, and CMOS can only drive those pins at a certain
(technology-dependent) rate.  The comparison, then, is between your
INTERNAL execution rate, measured in MIPS, and your available
instruction-fetch bandwidth, measured in MB/sec.  The ratio of (MB/s)
over MIPS yields the number of BYTES/INSTRUCTION you can afford.
First order, of course.

] And then there is the case of cache.  Whenever you fetch an opcode
] out of cache memory, you have no delays anyhow.  Someone from
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^
uh, really ?  Don't you mean LESS delay ?

] ?Pyramid? posted about their architecture a long time ago.  The cache
] is a very important part of that machine, and is integrated with a
] bus that's wider than the instruction/data width.  For example, 32-bit
] instructions & data, with a 192-bit-wide memory bus.  Considering the
] locality of data, that means the next 5 instructions will already be
] in cache, instantly available (nearbouts).  I fail to see how this can
] slow the machine down any, except for the i/o overhead of loading it
] into main memory in the first place (which is not a cpu delay but,
] rather, a response-time delay during which some other process is
] running).

I'm just guessing, but it sounds like the Pyramid machine is NOT a
single-chip microprocessor.  Different horses for different courses,
and all that.  Using 192 pins on a micro for the memory bus would be
pushing package technology, I think, once you added the other signals,
address bus, power and ground.

] So, while 16-bit variable-length instructions with a 32-bit data bus
] may be a win on a machine with slow memory access time and no cache
] (i.e. your typical microcomputer, at the moment), 32-bit fixed length
] instructions with a 64-bit instruction-fetch memory interface will
] blow it into the weeds come ultimate-performance time, because of the
] lack of instruction decode overhead.

There is NO intrinsic reason 16-bit instructions would decode slower
than 32-bit instructions.  In fact, they can ultimately decode FASTER:
the fewer bits your decoder has to look at, the faster it can be.
Barring other complications, of course.  I think the assumption you're
making is that a smaller instruction set has to be more complex to get
the job done.  There are plenty of examples of this, but it's NOT an
immutable law.

] All a matter of keeping the memory interface bigger than the
] instruction size....

Dedicating 64 pins purely to instruction fetch (assuming a Harvard
architecture) is quite a lot of a rather scarce resource.  Sure you
wanna do this on a micro ?

] I also have some papers here from the original RISC guys at
] UC-Berkeley, and AMD's design team, which discuss the issue at great
] length.  Their basic conclusion is that the locality of reference of
] large caches means that 32-bit fixed length instructions are a big
] win even in the absence of a big memory interface.  Of course, their
] basic problem was justifying the larger flow of instructions in a
] RISC machine as vs. a CISC machine, rather than specifically
] addressing variable-length vs. fixed-length instructions, but their
] conclusions still apply.  At least, if you accept the basic premise
] of RISC, that is.

The appropriate measure of cache size, IMHO, is in INSTRUCTIONS.
Given that you have some limited number of transistors to put into a
cache, the smaller your instructions are, the "bigger" your cache will
be.

Also, instruction size affects several "second-order" performance
factors, like how quickly a program loads from a "low-speed" (like
disk) I/O device and how often you page-fault.  This effect is of
course due to the fact that programs written in a 32-bit RISC
instruction set will be (according to our data) 65% larger than the
same program in a 16-bit RISC instruction set.  Sorry, we haven't
published our data yet.  It's just an analysis (using information
theory) of existing data, anyway.

] Eric Lee Green  elg@usl.CSNET       Asimov Cocktail, n., A verbal bomb
--
    Dennis O'Connor      UUNET!steinmetz!sunset!oconnor
                         ARPA: OCONNORDM@ge-crd.arpa
(-: The Few, The Proud, The Architects of the RPM40 40MIPS CMOS Micro :-)
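P.S.  To make the ratio concrete, here is the first-order arithmetic
( round numbers chosen for illustration, not measurements of any
particular part ):

    #include <stdio.h>

    int main(void)
    {
        double mips      = 40.0;   /* internal execution rate, MIPS */
        double bus_mhz   = 20.0;   /* instruction-bus cycle rate    */
        double bus_bytes = 4.0;    /* instruction-bus width, bytes  */

        double mb_s  = bus_mhz * bus_bytes;   /* fetch bandwidth   */
        double b_per = mb_s / mips;           /* bytes/instruction */

        printf("%.0f MB/s over %.0f MIPS = %.1f bytes/instruction\n",
               mb_s, mips, b_per);
        return 0;
    }

At 80 MB/s and 40 MIPS you can afford 2.0 bytes per instruction --
enough for a 16-bit encoding, but not for a 32-bit one.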
bcase@Apple.COM (Brian Case) (03/02/88)
In article <449@imagine.PAWL.RPI.EDU> beowulf!lunge!jesup@steinmetz.UUCP writes:
>In article <2116@saturn.ucsc.edu> haynes@ucscc.UCSC.EDU (Jim Haynes) writes:
>>I'm glad somebody brought this up, and while we're at it let's debate
>>the value of including 16 bit data types in RISC machines.
> One reason should be enough: compatibility.  If you want another: RAM
>space.  RAM, especially for RISC computers, is less than cheap.  Priced
>any 20-ns SRAMs lately?  Also note that DRAM prices are going UP
>lately, in contradiction to all established precedent.
>
> And no, being able to load and store halfwords doesn't really slow a
>machine down.  All internal operations can be done in the natural word
>size.

Not all RISC machines require 20ns SRAMs in order to have good
performance.  One thing is for sure, though: code size is almost always
much less important than data size.  DRAM prices are going up instead
of down because of increased demand (and it is starting to get a little
tight everywhere, so I hear).  Have you tried to get megabit DRAMs
lately?  Yikes!

"Being able to load and store halfwords doesn't really slow a machine
down" is a statement that does not take into consideration certain
reasonable implementations.  The necessary alignment network can indeed
slow a machine's cycle.  Doing internal operations in the natural word
size does not solve the whole (main?) problem.
lm@arizona.edu (Larry McVoy) (03/02/88)
In article <1697@winchester.mips.COM> mash@winchester.UUCP (John Mashey) writes:
>-The UNIX kernel uses halfwords frequently, because it has large
>arrays of structs whose data must be tightly packed.  You might argue
>that some of these could be expanded (and they can), but others are
>painful.

You bet it's painful.  I recently worked on the Unix port to the
ETA-10.  That machine does not support 16 bit ints.  Can you say
OUCH?!?!  I knew you could.
--
Larry McVoy     lm@arizona.edu or ...!{uwvax,sun}!arizona.edu!lm
bcase@Apple.COM (Brian Case) (03/03/88)
In article <9740@steinmetz.steinmetz.UUCP> sungoddess!oconnor@steinmetz.UUCP writes:
>There is NO intrinsic reason 16-bit instructions would decode slower
>than 32-bit instructions.  In fact, they can ultimately decode FASTER:
>the fewer bits your decoder has to look at, the faster it can be.
>Barring other complications, of course.

The first statement is true.  However, 32-bit instructions typically
have more bits dedicated to the opcode than 16-bit instructions.  This
allows 32-bit instructions to have less-dense encodings and therefore
faster decoding.  For a real lesson in this aspect, see the Stanford
MIPS-X and original Berkeley RISC instruction encodings.  Yeow!  The
instruction decode logic is almost impossible to see on the MIPS-X.

>Dedicating 64 pins purely to instruction fetch (assuming a Harvard
>architecture) is quite a lot of a rather scarce resource.  Sure you
>wanna do this on a micro ?

Sounds like it might be a good idea.  Note that instruction-only-bus
pins are INPUT-only; thus their corresponding pads are much "faster"
and simpler than bidirectional bus pads.

>The appropriate measure of cache size, IMHO, is in INSTRUCTIONS.
>Given that you have some limited number of transistors to put into a
>cache, the smaller your instructions are, the "bigger" your cache will
>be.

This is quite true.  However, see the following.

>Also, instruction size affects several "second-order" performance
>factors, like how quickly a program loads from a "low-speed" (like
>disk) I/O device and how often you page-fault.  This effect is of
>course due to the fact that programs written in a 32-bit RISC
>instruction set will be (according to our data) 65% larger than the
>same program in a 16-bit RISC instruction set.

Yes, the second-order effects (not affects) can be very important in
certain environments.  As to code size differences: I have a very large
program (30K lines of C code) that compiles to about 300K bytes of
Am29000 code.  On the VAX, it compiles to about 225K bytes (yes, that's
with PCC and -O turned on, and the 29K compiler is a wonderful thing
from MetaWare).  That's roughly 33%.  That seems typical, although code
size ratios can be anywhere from 1.1 to over 2.0.  However, these kinds
of percentages are not terribly important unless they get much closer
to 2 to 3 times as big.  The question is "what is the cache miss ratio
in real life?"  This is NOT necessarily directly related to general
code size!

One kind of optimization that will be very important (and I hope
commonplace) is loop unrolling.  Yes, the code size ratios will still
roughly scale, but the point is that we are talking about, to some
degree, space/time tradeoffs.  If you want a little less time, you can
usually get it by giving up a little space.  This works to a point.
Clearly, you wouldn't want your instruction format to have one bit per
register.

>Sorry, we haven't published our data yet.  It's just an analysis
>(using information theory) of existing data, anyway.

I do hope you guys publish such data.  The more information the better!
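For anyone who hasn't seen it, loop unrolling is simply this ( a
sketch; the factor of four is arbitrary, and assuming n % 4 == 0 keeps
the example short ):

    int sum_unrolled(const int *a, int n)   /* assumes n % 4 == 0 */
    {
        int i, s = 0;
        for (i = 0; i < n; i += 4) {  /* 1/4 the loop branches, and   */
            s += a[i];                /* more straight-line work for  */
            s += a[i+1];              /* the scheduler, at roughly 4x */
            s += a[i+2];              /* the code size of the rolled  */
            s += a[i+3];              /* loop                         */
        }
        return s;
    }

Less time, more space: exactly the trade described above.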
hansen@mips.COM (Craig Hansen) (03/03/88)
In article <9740@steinmetz.steinmetz.UUCP> sungoddess!oconnor@steinmetz.UUCP writes:
>There is NO intrinsic reason 16-bit instructions would decode slower
>than 32-bit instructions.  In fact, they can ultimately decode FASTER:
>the fewer bits your decoder has to look at, the faster it can be.
>Barring other complications, of course.

>Dedicating 64 pins purely to instruction fetch (assuming a Harvard
>architecture) is quite a lot of a rather scarce resource.  Sure you
>wanna do this on a micro ?

I guess I've been kicking around the RISC world too long, because I've
heard these arguments all before, when people were saying RISC machines
weren't going to be any faster than CISC machines, and they were just
as wrong then as they are now.  Instruction bandwidth is important, but
not so important that you should go back to compacted instructions.
32-bit instructions aren't much larger than 16-bit instructions,
particularly when a register-allocating compiler is used, and the
benefit of permitting parallel decoding of instructions with register
fetching is a tremendous win.

Generally, optimized MIPS code is about 10% to 50% larger than
"optimized" VAX code, as generated by 4.3 UNIX, and is often equal or
smaller in size than optimized 68k code, as generated by Sun compilers.
Here's some data for text size for these machines, noting that the same
source code is used for all machines.  However, since the libraries may
be different, and the MIPS machine has a large page size, the data on
small benchmarks is a little distorted:

                 mips    sun3    vax
    awk           61k    106k     34k
    ccom         193k    156k    120k
    compress      29k     25k     14k
    diff          45k     33k     25k
    hspice       860k    729k    635k
    wolf         266k    303k    212k

...those big 32-bit instructions don't look so bad next to the machines
designed for compact encodings...
--
Craig Hansen
Manager, Architecture Development
MIPS Computer Systems, Inc.
...{ames,decwrl,prls}!mips!hansen or hansen@mips.com
408-991-0234
zs01+@andrew.cmu.edu (Zalman Stern) (03/03/88)
I believe that 16 bit instructions can be a win.  Good code density
saves disk space as well as memory space.  Besides, the more
instructions you get in N bits, the more performance your cache and bus
bandwidth buy you.  The trick is to keep from designing a CISC in the
process.  Mainly, multiple instruction lengths and hairy bit encodings
should be avoided like the plague.

I actually tried to design an architecture using 16 bit instruction
words.  Being a complete amateur, it turned out a complete mess.  I did
learn the following, though:

    You lose on 32 bit immediate operands.  This can be a performance
    loss.

    You do not have room for multiple forms of load (i.e.
    signed/unsigned halfword and byte).

    More than 16 registers is hard.

    Addressing modes don't fit.  (As the RISC advocates say, "So
    what?")

The RPM40 gets around these by using prefix instructions to extend
immediates, a separate size register to specify non-word formats, and
an asymmetric register set to give you 21 registers.  (Only 16 of the
registers are accessible from every register operand slot.)  Addressing
modes are unnecessary, since you can use some of the instruction bits
you save for address calculation instructions.  Basically, slick
solutions to the above problems.

A good argument for 32 bit instructions is the MIPS R2000.  Given a
really tense compiler, having a full 32-entry register file may be
worth the extra bits.  Also, I recall seeing some statistics that a
significant number of their load instructions (measured dynamically)
use 16 bit offsets.  (Notably to index off the global variable
pointer.)  There are also some special privileged instructions on the
R2000 that make things like software page fault resolution go like
hell.  I don't see much room in a 16 bit instruction for stuff like
this.  (Although you could steal some bits from the coprocessor
instruction on the RPM40.)  In any event, the R2000 seems geared for
large, reasonably fast memory subsystems.  In this case, speed is worth
some wasted bits.

One thing that is really useful is to be able to take a 32 bit constant
and do something with it in 64 bits worth of instruction stream.  For
example, loading a value from an absolute (32 bit) address.  The IBM
RT, the MIPS R2000, and the RPM40 can all do this.  The RT and the
R2000 use one 32 bit instruction to load the high half of a register
and a load-with-offset instruction to finish the job.  The RPM40 uses
three prefix instructions and a load.  As near as I can tell, the AMD
29000 requires 3 32-bit instructions to do this.  I find this useful
because compilers should (almost) always generate 32 bit addresses.
When we first compiled Scribe on an RT, it died during the link phase.
It turned out that there was more than a megabyte of executable code
and the compiler had chosen an instruction with a 20 bit absolute
address...  (This has long since been fixed.)

Sincerely,
Zalman Stern
Internet: zs01+@andrew.cmu.edu     Usenet: I'm soooo confused...
Information Technology Center, Carnegie Mellon, Pittsburgh, PA 15213-3890
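P.S.  For readers who haven't met the "load high half, then load with
offset" idiom, here is what the two instructions compute, written out
as C.  This is a sketch: the 16/16 split is the idea these machines
share, not any one machine's exact encoding, and the sign adjustment a
real assembler makes when bit 15 of the low half is set is glossed
over:

    /* Load a word from a 32-bit absolute address using two 16-bit
     * immediate fields.
     */
    int load_abs(unsigned long addr)
    {
        unsigned long hi = addr & 0xFFFF0000UL;  /* insn 1: load high  */
        unsigned long lo = addr & 0x0000FFFFUL;  /* insn 2: its 16-bit
                                                    offset field       */
        return *(int *)(hi + lo);                /* load with offset   */
    }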
gwu@clyde.ATT.COM (George Wu) (03/05/88)
In article <929@mtund.ATT.COM> bfb@mtund.UUCP (Barry Books) writes:
> . . . However, it had a 32 word stack cache, which seems to me better
>than registers ( or even register windows, which at least allow future
>versions to have more registers without recompiling to take advantage
>of them ).
> . . . The thing that impressed me the most was that it didn't take a
>killer optimizing compiler ( read big, buggy and slow ) to make the
>machine run fast.  It seems that for stack based languages a stack
>cache is much more sensible than having a bunch of registers that the
>compiler has to figure out what to do with. . . .

Both of these segments are based on the assumption that compilation
takes up a significant amount of time compared with other tasks.  To
developers it certainly seems that way.  However, I find it hard to
believe.  Once a program is correctly compiled, it will be run many
more times than it was compiled, and in the long run you are better off
optimizing the application code instead of the compiler.  Even in a
developer's environment, how often is your machine extended such that
code needs to be recompiled?  Remember, your compiler had to be
compiled too; I would expect a good optimizing compiler to yield better
overall system performance there too.  Overall, I'd expect that
sacrificing your compiler's ability to optimize code in return for a
smaller, faster compiler would be a loss.

>Barry Books
>mtund!bfb      better is bigger

Like you said, "better is bigger."  :-)
--
George J Wu
UUCP: {ihnp4,ulysses,cbosgd,allegra}!clyde!gwu
ARPA: gwu%clyde.att.com@rutgers.edu or gwu@faraday.ece.cmu.edu
jesup@pawl23.pawl.rpi.edu (Randell E. Jesup) (03/05/88)
In article <7538@apple.Apple.Com> bcase@apple.UUCP (Brian Case) writes:
>>Dedicating 64 pins purely to instruction fetch (assuming a Harvard
>>architecture) is quite a lot of a rather scarce resource.  Sure you
>>wanna do this on a micro ?
>Sounds like it might be a good idea.  Note that instruction-only-bus
>pins are INPUT-only; thus their corresponding pads are much "faster"
>and simpler than bidirectional bus pads.

Pads aren't the limiting factor, but PINS (and bandwidth) are.  The
RPM-40 has 144 'pins' (not PGA, but leadless chip carrier), which was
the biggest certified package we could find.  We would KILL for more
pins!  (Of course, by now there are probably larger certified
packages.)

Also, if this is NOT an embedded system, you have to worry about bus
bandwidth.  How many 40 MHz busses are there?  Not many.  By using
16-bit instructions we cut the bandwidth required for instructions (at
least at the chip edge, with corresponding drops farther down) from 160
megabytes per sec to 80 MB/sec.  That also means that on-board caches
fill twice as fast (for the same number of instructions).

     // Randell Jesup                     Lunge Software Development
    //  Dedicated Amiga Programmer        13 Frear Ave, Troy, NY 12180
 \\//   beowulf!lunge!jesup@steinmetz.UUCP    (518) 272-2942
  \/    (uunet!steinmetz!beowulf!lunge!jesup) BIX: rjesup
(-: The Few, The Proud, The Architects of the RPM40 40MIPS CMOS Micro :-)
oconnor@sungoddess.steinmetz (Dennis M. O'Connor) (03/06/88)
An article by beowulf!lunge!jesup@steinmetz.UUCP says:
] Pads aren't the limiting factor, but PINS (and bandwidth) are.  The
] RPM-40 has 144 'pins' (not PGA, but leadless chip carrier), which was
] the biggest certified package we could find.  We would KILL for more
] pins!  (Of course, by now there are probably larger certified
] packages.)
]
]      // Randell Jesup                     Lunge Software Development

Sorry, Randell, not quite right.  The RPM40 CPU and FPU are packaged in
** 132-pin ** leadless ceramic chip carriers, both for speed and so
that they could eventually be put in 132-pin surface-mount packages.
The impression of PGAs we had back at design time was that they were
slow, huge and NOT compatible with surface mounting.  You wouldn't want
to put one in your space-borne BM/CCC processor, we thought.
--
    Dennis O'Connor      oconnor%sungod@steinmetz.UUCP
                         ARPA: OCONNORDM@ge-crd.arpa
(-: The Few, The Proud, The Architects of the RPM40 40MIPS CMOS Micro :-)
bcase@Apple.COM (Brian Case) (03/07/88)
In article <479@imagine.PAWL.RPI.EDU> beowulf!lunge!jesup@steinmetz.UUCP writes:
> Also, if this is NOT an embedded system, you have to worry about bus
>bandwidth.  How many 40 MHz busses are there?  Not many.  By using

There aren't many 80 MB/sec system buses either.  Even outside an
embedded environment, you'll have to have dedicated memory paths or
caches.

>16-bit instructions we cut the bandwidth required for instructions (at
>least at the chip edge, with corresponding drops farther down) from 160
>megabytes per sec to 80 MB/sec.  That also means that on-board caches
>fill twice as fast (for the same number of instructions).

Yes, but my guess is that the number of instructions isn't the same;
that is, there are more of your shorter instructions.  You're right,
there won't be twice as many, but, as I have said before, I believe the
solution is not to cut the bandwidth requirements but to provide the
required bandwidth.  Note that since the RPM40 has a TIB, you could
have two-way interleaved your external instruction memory.  This would
have: loosened the requirement for 16-bit instructions, required twice
as many external instruction chips, increased the size of your TIB (but
not doubled it, since the tag store doesn't increase in size), and
required that the I-bus run at 40 MHz instead of 20 MHz.  I suspect
that twice the external RAMs would be the problem.
jesup@pawl22.pawl.rpi.edu (Randell E. Jesup) (03/07/88)
In article <1757@mips.mips.COM> hansen@mips.COM (Craig Hansen) writes:
>                                               Instruction bandwidth is
>important, but not so important that you should go back to compacted
>instructions.  32-bit instructions aren't much larger than 16-bit
>instructions, particularly when a register-allocating compiler is
>used, and the benefit of permitting parallel decoding of instructions
>with register fetching is a tremendous win.

Who said we were going back to compacted instructions to get to 16
bits?  What we have is a lot FEWER instructions, and a minimal set of
instruction formats.  With 32 bit instructions we could have had fewer
formats (2 or 3 instead of 5 or 6), but since we can do a decode in a
single pipe stage, what does it matter?  The decoder does not determine
the critical path and cycle time; the ALU does.  If the decoder had
slowed us up, we would have made it faster and/or reduced the number of
formats.  (Keeping the number of formats down, and the alignment of
fields, did play a role in our architecture design.)  I don't see how
having a register-allocating compiler affects instruction size.

Concerning parallel decode with register fetch, is this anything
unusual?  Our pipeline looks like this:

    <IF>  Instruction Fetch - doesn't really exist per se.
    <ID>  Instruction Decode/register fetch
    <ALU> ALU operation
    <WB>  WriteBack - write ALU result to register file

[ I'm ignoring the extra load stages here ]  I don't see what we're
losing here.

>Generally, optimized MIPS code is about 10% to 50% larger than
>"optimized" VAX code, as generated by 4.3 UNIX, and is often equal or
>smaller in size than optimized 68k code, as generated by Sun compilers.
[ many figures deleted ]
>...those big 32-bit instructions don't look so bad next to the machines
>designed for compact encodings...

68020?  Compact?  Surely you jest!  :-)  I know, it actually is fairly
compact, at least the 68000 part of it.  It just has SO many
instructions and addressing modes that it ends up larger than one would
suspect.  The proper comparison is not to CISCs, but to a 16-bit
version of the same general architecture, or at least the same class
(RISCs).

I agree that if cost is no object, a 32-bit RISC can probably run
faster (effective throughput (VIPS), not MIPS) than a 16-bit one.
However, the costs include a higher-bandwidth bus, more disk space for
code, more memory space for code, larger (expensive) caches, more power
draw (more pins being driven), etc, etc.  The typical current solution
for the bus bandwidth problem is to throw MUCH bigger caches onto the
CPU board, to try to increase hit rates and reduce the bandwidth
required of the bus.

>Craig Hansen
>Manager, Architecture Development
>MIPS Computer Systems, Inc.

Glad to see you in the conversation.  I'm interested in hearing your
opinions.

     // Randell Jesup                     Lunge Software Development
    //  Dedicated Amiga Programmer        13 Frear Ave, Troy, NY 12180
 \\//   beowulf!lunge!jesup@steinmetz.UUCP    (518) 272-2942
  \/    (uunet!steinmetz!beowulf!lunge!jesup) BIX: rjesup
(-: The Few, The Proud, The Architects of the RPM40 40MIPS CMOS Micro :-)
haynes@ucscc.UCSC.EDU (99700000) (03/08/88)
In article <22746@clyde.ATT.COM> gwu@clyde.UUCP (George Wu) writes:
>
>       However, I find it hard to believe.  Once a program is correctly
>compiled, it will be run many more times than it was compiled, and in
>the long run you are better off optimizing the application code instead
>of the compiler.

Well, now, in the educational environment a program is compiled (not
correctly, of course) more times than it is run.  I guess this could be
an excuse for having separate "checkout" and "production" compilers, or
for having optimization be optional.

haynes@ucscc.ucsc.edu
haynes@ucscc.bitnet
..ucbvax!ucscc!haynes
dave@onfcanim.UUCP (Dave Martindale) (03/13/88)
Another place where 16-bit data is important: computer graphics.

1) Although 8 bits/pixel/colour is most common for "full colour"
images, an 8-bit linear encoding just isn't good enough for critical
work.  In relatively dark portions of the image, the quantization error
due to 8 bits causes visible banding (Mach bands) or other errors.

For image storage, you can use better encoding strategies that make
better use of 8 bits, but then you have to encode and decode the pixels
every time you touch them (for compositing, etc.).  Just using 12 or 16
bits with linear coding is faster.  Pixar uses 12 bits/colour
internally.

Also, when you interface to the physical world, the ADCs and DACs are
always linear, so you are forced to work with more than 8 bits if you
care about quality.  Our digital camera and film recorder are both 12
bits per colour per pixel.

2) 16 bits is a good word width for storing screen coordinates in
display lists.  You always want to store as many of these as will fit
in the physical memory available.

Any machine that stores a C "short" in 4 bytes is going to make my
image data files twice as big, and allow only half the length of
display lists in memory.  I wouldn't buy it.

Any machine that stores a "short" in 2 bytes but requires multiple
instructions (shift/mask) to access the data is going to have a
difficult time executing the inner loops of some algorithms fast enough
to benchmark comparably with another machine of similar cost which does
support 16-bit data types.

I don't use 16-bit words for many types of data, but I use them in
quantities of millions when I do.
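To spell out the shift/mask sequence in question, here is what a
word-only machine must do in software ( a sketch in C; the packing of
two halfwords per 32-bit word, low half first, is an assumption for
illustration, and sign extension is glossed over ):

    /* Fetch the i-th 16-bit screen coordinate from a display list
     * packed into 32-bit words.
     */
    unsigned get_coord(const unsigned long *list, int i)
    {
        unsigned long word = list[i >> 1];            /* one 32-bit load  */
        if (i & 1)
            return (unsigned)(word >> 16) & 0xFFFFU;  /* high half: shift */
        else
            return (unsigned)(word & 0xFFFFUL);       /* low half: mask   */
    }

With halfword loads the body is a single instruction; without them it
is a load plus a shift or a mask (and more for stores), right in the
inner loop.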
bcase@Apple.COM (Brian Case) (03/15/88)
In article <15580@onfcanim.UUCP> dave@onfcanim.UUCP (Dave Martindale) writes:
>Any machine that stores a C "short" in 4 bytes is going to make my
>image data files twice as big, and allow only half the length of
>display lists in memory.  I wouldn't buy it.

It's the compiler making the decision, but I definitely do understand
that the machine plus compiler make a pretty tightly coupled unit.  If
only one compiler is available, any compiler decisions look like
architecture decisions.

>Any machine that stores a "short" in 2 bytes but requires multiple
>instructions (shift/mask) to access the data is going to have a
>difficult time executing the inner loops of some algorithms fast enough
>to benchmark comparably with another machine of similar cost which does
>support 16-bit data types.

Unless the second machine is slower *because* it supports 16-bit data
types directly (or slower for other reasons).  It should not be assumed
that accessing small data types (16-bit or 8-bit types) with a sequence
of instructions is always slower.  Sometimes it will be, sometimes it
won't.  I won't argue that it's faster (unless the basic cycle is
faster).
bron@olympus.SGI.COM (Bron C. Nelson) (03/15/88)
This topic has gone on for a while, but I haven't noticed (or have
managed to miss) the answer to the question I consider important: i.e.
how hard/expensive is it to decode 16 & 32 bit instructions vs 32 bit
only?

Several respondents have said "it's expensive" or "it takes more time
to decode the different formats" and even "fetching the registers in
parallel with doing the instruction decode is a big win."  (None of
these are probably exact quotes.)  I sorta wonder at this.  Is the
instruction decode/register fetch a critical path?  From what I can
gather, the register fetch probably IS.  Can more hardware be thrown at
the problem to allow multiple formats?  If not, how expensive (time
wise) is it really to provide?  It seems that if done "right" (oh no!
not that word!) it would only add 1 gate delay (see below).  This is
maybe 10%?  (What the heck IS the length of the critical path (in
gates) of your favorite cpu?)  At worst it would add another stage to
the pipeline (plus the associated hardware to support an extra stage).
How expensive is that?  Maybe 10%??  I'm just trying to get a feel for
how much cpu performance people think would have to be given up to get
the more compact encoding.
-----------------------------------------------------------------------
Bron Nelson     bron@sgi.com
Don't blame my employers for my opinions.

p.s.  If we only have 2 formats, we can specify which one by using the
first bit of the instruction (much as CRISP uses the first bit(s)).
This should (?) let us select between the 2 possible register encodings
with only a single additional gate delay (and some more silicon devoted
to doing it).  What we buy is a 25%+ reduction in program code size.
This seems like a good tradeoff to me, since most programs I run take
longer to load off disk than they do to execute.  I admit my experience
may not be typical, and taking a performance hit may not be a smart
marketing decision for a cpu house, but it seems like a good system
tradeoff.
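p.p.s.  In software terms, the select-by-first-bit decode amounts to
the following ( field positions invented purely to show the shape of
the mux, not any real encoding ):

    struct decoded { unsigned opcode, rd, rs; };

    /* Two-format decode keyed on the top bit of the fetched word.  In
     * hardware the if/else is just a 2:1 mux on the register-field
     * wires -- the single extra gate delay claimed above.
     */
    struct decoded decode(unsigned long insn)
    {
        struct decoded d;
        if ((insn >> 31) & 1) {                 /* 1 = 32-bit format */
            d.opcode = (insn >> 24) & 0x7F;
            d.rd     = (insn >> 19) & 0x1F;
            d.rs     = (insn >> 14) & 0x1F;
        } else {                                /* 0 = 16-bit format */
            d.opcode = (insn >> 26) & 0x1F;
            d.rd     = (insn >> 21) & 0x1F;
            d.rs     = (insn >> 16) & 0x1F;
        }
        return d;
    }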
marc@oahu.cs.ucla.edu (Marc Tremblay) (03/16/88)
In article <12705@sgi.SGI.COM> bron@olympus.SGI.COM (Bron C. Nelson) writes:
>
>Several respondents have said "it's expensive" or "it takes more time
>to decode the different formats" and even "fetching the registers in
>parallel with doing the instruction decode is a big win."  (None of
>these are probably exact quotes.)  I sorta wonder at this.  Is the
>instruction decode/register fetch a critical path?  From what I can
>gather, the register fetch probably IS.

The processor that we are designing here at UCLA has a large register
file (64 locals + 10 globals) with overlapping windows.  Because of the
fault-tolerance aspect of the chip, we have to decode the full address
(7 bits) of the register operands every time an access is made to the
register file; the window pointer cannot be pre-decoded.  This means
that our critical path is composed of a fairly long decode and a quite
long register fetch.  It turned out that, using an efficient decoder
(dedicating some extra area), the decode part represents about 12% of
the critical path.  The fact is that using a simple instruction
encoding we are able to overlap the register decoding/fetching with the
instruction decoding done by a separate unit.  Adding complexity to the
register address decoding would affect the critical path directly.

>p.s.  If we only have 2 formats, we can specify which one by using the
>first bit of the instruction (much as CRISP uses the first bit(s)).

I assume that this "first" bit is available for encoding...

>This should (?) let us select between the 2 possible register
>encodings with only a single additional gate delay (and some more
>silicon devoted to doing it).

Using a tighter encoding, the time spent decoding the rest of the
instruction (the opcode) may turn out to shift the critical path to the
control unit instead of the register file.

    Marc Tremblay
    marc@CS.UCLA.EDU
    ...!(ihnp4,ucbvax)!ucla-cs!marc
    Computer Science Department, UCLA
henry@utzoo.uucp (Henry Spencer) (03/23/88)
> how hard/expensive is it to decode 16 & 32 bit instructions vs 32 bit only?
Of course, if you cache decoded instructions, as some machines do, the
decoding overhead becomes considerably less important.
bcase@Apple.COM (Brian Case) (03/24/88)
In article <1988Mar22.202304.3077@utzoo.uucp> henry@utzoo.uucp (Henry Spencer) writes:
>> how hard/expensive is it to decode 16 & 32 bit instructions vs 32 bit only?
>
>Of course, if you cache decoded instructions, as some machines do, the
>decoding overhead becomes considerably less important.

But miss penalties increase.  This is not trivial.