schow@bnr-public.uucp (Stanley Chow) (03/19/89)
In a thread discussing what to do with all the transistor sites becoming available on big chips, various suggestions of what is good and not good have been offered. It is not often that I disagree with the likes of Tim Olson and Henry Spencer, so I make a lot of splash when it happens. (It's okay, I have my asbestos suit.)

In article <24889@amdcad.AMD.COM> tim@amd.com (Tim Olson) writes:
>In article <1989Mar16.190043.23227@utzoo.uucp> henry@utzoo.uucp (Henry Spencer) writes:
>| In article <37196@bbn.COM> slackey@BBN.COM (Stan Lackey) writes:
>| >I predict that the next hardware features to come back will be
>| >auto-increment addressing and the hardware handling of unaligned data.
>|
>| Again, why? Auto-increment addressing is useful only if instructions
>| are expensive, because it sneaks two instructions into one. However,
>| the trend today is just the opposite: the CPUs are outrunning the
>| main memory. Since instructions can be cached fairly effectively,
>| they are getting cheaper and data is getting more expensive. Doing
>| the increment by hand often costs you almost nothing, because it can
>| be hidden in the delay slot(s) of the memory access. Autoincrement
>| showed up best in tight loops, exactly where effective caching can be
>| expected to largely eliminate memory accesses for instructions. Why
>| bother with autoincrement?
>
>Also, auto-incrementing addressing modes imply:
>
>    - Another adder (to increment the address register in parallel)
>
>    - Another writeback port to the register file
>
>Unless you wish to sequence the instruction over multiple cycles :-(
>
>I'm certain that most people can find something better to do with these
>resources than auto-increment.

For many people, auto-increment *is* something better! The point of the discussion is that with increasing density, the tendency is to add complexity to the chips.
There can be debates on the trade-offs of different additions, but I doubt that high-performance chips of the (near) future will worry about another adder. Adding more ports to register files is a challenge for the silicon groups, but, hey, they've got to earn their money too!

Caches (especially data caches) top out very easily. For many Unix-box type applications, we have already reached the point of vastly diminished returns: adding more cache won't bring your performance up. (Or I could talk about our application, where the diminishing returns start *before* you start adding cache.)

It is precisely because the CPU is running faster than memory (even cached memory) that one has to maximize the amount of work done in each memory cycle.

Adding addressing modes does not mean a machine that is difficult to pipeline or to design. Don't assume that every architecture with many addressing modes will be as messy as the VAX instruction encoding. It is quite possible to have a good, clean instruction encoding with lots of modes - it just requires lots of gates to be *really* fast. Fortunately, lots of gates is exactly where we are headed.

With enough gates, the CPU will get more functional units; this means small RISC instructions will not be able to keep all the functional units busy. The i860 solves this by essentially having a short VLIW mode (hmmm, Short Very Long?). It is also possible to have bigger instructions that keep more units busy longer. Please note that different implementations of the architecture can have a super-fast version that does all instructions in a single clock (at least in the dispatching of instructions) and also a cheap version that is (here it comes:) *micro-coded* or whatever. Just because DEC could not or would not do it for the VAX, don't conclude that the concept is bad.

>| As for hardware handling of unaligned data, this is purely a concession
>| to slovenly programmers. Those of us who have lived with alignment
>| restrictions all our professional lives somehow don't find them a problem.
>| Mips has done this right: the *compilers* will emit code for unaligned
>| accesses if you ask them to, which takes care of the bad programs, while
>| the *machine* requires alignment. High performance has always required
>| alignment, even on machines whose hardware hid the alignment rules.
>| Again, why bother doing it in hardware?
>
>The R2000/R3000 can also trap unaligned accesses and fix them up in a
>trap handler. This is what the Am29000 does, as well. This is mainly a
>backwards compatibility problem (FORTRAN equivalences, etc.) It is
>infrequent in newer code, mainly appearing in things like packed data
>structures in communication protocols.

It would be more accurate to call this "a concession to past constraints". Remember, many of these old programs were written in the days when memory was not cheap and performance was expensive. It is not fair to call those people "slovenly" just because you now have bigger and faster machines. (If you were talking about some programmers who didn't know what they were doing, then I agree with you.)

Even now, there are real money issues in memory alignment. If you have a system with 100 megabytes of main memory and "correct" alignment makes it 150 MB, you have just made the system 50% more expensive. Or how about alignment bumping your memory requirement from 63K to 65K, causing extra chips and possible board layout problems (not to mention the cost)? Having the H/W be tolerant of alignment means a lot of flexibility in the design trade-offs.

Also, with more and more gates on a chip, it is conceivable that someone will put together a cache that can handle misalignment within a cache line, as long as the whole data item is in the same line. I.e., data can cross a word boundary with little or no penalty, but crossing a line boundary will be slow or disallowed.
With the trend to wider buses (i.e., wider line sizes), this may well make the performance penalty of misalignment negligible.

Stanley Chow       ..!utgpu!bnr-vpa!bnr-fos!schow%bnr-public
(613) 763-2831

Please don't tell Bell Northern Research about these silly ideas; I have them convinced that I know everything about processor architecture. They are even paying me to work on it. [If I don't want to tell them, do you think I could represent them?]
baum@Apple.COM (Allen J. Baum) (03/21/89)
[]

In article <24889@amdcad.AMD.COM> tim@amd.com (Tim Olson) writes:
>In article <1989Mar16.190043.23227@utzoo.uucp> (Henry Spencer) writes:
>| In article <37196@bbn.COM> slackey@BBN.COM (Stan Lackey) writes:
>| >I predict that the next hardware features to come back will be
>| >auto-increment addressing and the hardware handling of unaligned data.
>|
>| Again, why? Auto-increment addressing is useful only if instructions
>| are expensive, because it sneaks two instructions into one. However,
>| the trend today is just the opposite: the CPUs are outrunning the
>| main memory. Since instructions can be cached fairly effectively,
>| they are getting cheaper and data is getting more expensive. Doing
>| the increment by hand often costs you almost nothing, because it can
>| be hidden in the delay slot(s) of the memory access. Autoincrement
>| showed up best in tight loops, exactly where effective caching can be
>| expected to largely eliminate memory accesses for instructions. Why
>| bother with autoincrement?
>
>Also, auto-incrementing addressing modes imply:
>
>    - Another adder (to increment the address register in parallel)
>
>    - Another writeback port to the register file
>
>Unless you wish to sequence the instruction over multiple cycles :-(
>
>I'm certain that most people can find something better to do with these
>resources than auto-increment.

Well, I'll have to slightly disagree here. Auto-increment does not cost another adder (for my particular definition of auto-increment); it just writes the result of the effective-address calculation back to the base register. If you want to be tricky, you can use a multiplexor to select the memory address to be either the base register itself or the effective-address calculation, giving you pre- or post-auto-increment.

It does cost an extra writeback port. This can be finessed, perhaps, by waiting for a cycle not using the writeback port, but you can't count on it.
Now, the question is: can loops profitably use this kind of addressing mode? Or should you just schedule the address updates in branch and load shadows because you can't find anything else to put there?

Note that if you have a superscalar architecture and can do two instructions in parallel (see the Intel 80960CA paper in the newest Compcon proceedings), you can do this kind of thing as a matter of course; but it's a lot more expensive to do it that way - you need a separate read port and adder, as well as a write port.

If, in fact, compilers can generate this code (and I believe they can), and it can be scheduled (i.e. there aren't lots of dead cycles hanging around just waiting to be filled with these address-update instructions), then it looks like a reasonable tradeoff. It's probably time to dust off those benchmarks and see how often it occurs, and how many cycles it will save.

Since this kind of operation is used almost exclusively inside a loop, it has quite a bit of leverage. Yes, instruction caching is most effective there, but that just means it won't cost you additional cycles, above and beyond the separate update instruction - not that it won't save you any cycles. Besides, who says you can't find something else to do with the extra write port when you're not doing address updates?

--
baum@apple.com  (408)974-3385
{decwrl,hplabs}!amdahl!apple!baum
bcase@cup.portal.com (Brian bcase Case) (03/21/89)
>..., but I doubt that high-performance chips of
>the (near) future will worry about another adder. Adding more
>ports to register files is a challenge for the silicon groups, but, hey,
>they've got to earn their money too!

Right, another adder is not a problem. Right, more ports on register files are not too much of a problem - but is adding a port just so that auto-increment goes fast the right thing? I say no. Keep reading.

>Caches (especially data caches) top out very easily. For many Unix-box
>type applications, we have already reached the point of vastly diminished
>returns. Adding more cache won't bring your performance up. (Or I could
>talk about our application, where the diminishing returns start *before*
>you start adding cache.)

Maybe so, but I think that we have a little farther to go than 4K instruction and 8K data, virtual no less. Even if we have 2 million transistors, I don't think the caches are going to be "too big."

>It is precisely because the CPU is running faster than memory (even
>cached memory) that one has to maximize the amount of work done in
>each memory cycle.

"Running faster than memory" is a very misleading statement. Maybe latencies are a problem, but bandwidth can be had in abundance. The problem is what to do with it, not how to get it!

>Adding addressing modes does not mean a machine that is difficult to pipeline
>or to design. Don't assume that every architecture with many addressing modes
>will be as messy as the VAX instruction encoding.

It's true that not all architectures with many addressing modes will be as messy as the VAX. However, you simply have to answer the question: are those addressing modes, beyond register+register and register+offset, really buying you anything? (BTW, the lack of even those doesn't seem to cripple the 29K too much.... If you dislike the 29K because of its lack of addressing modes, blame me. :-))

>With enough gates, the CPU will get more functional units; this means
>small RISC instructions will not be able to keep all the functional
>units busy. The i860 solves this by essentially having a short VLIW mode

I thought RISC instructions were too big! :-) :-) Note that the i860's dual-instruction mode is essentially a VLIW mode.

>(hmmm, Short Very Long?). It is also possible to have bigger instructions
>that keep more units busy longer. Please note that different implementations
>of the architecture can have a super-fast version that does all instructions
>in a single clock (at least in the dispatching of instructions) and also a
>cheap version that is (here it comes:) *micro-coded* or whatever.

I claim a much better use of multiple functional units is to execute many "small" RISC instructions at the same time, i.e., "superscalar" or multiple instructions per clock. It just doesn't make sense to bundle - bind is a better word - many operations into one instruction. Doing so simply thwarts compiler optimization. Addressing modes are probably the worst form of semantic binding, in my opinion. So, if we are going to have "too many transistors," we should use them to realize a superscalar *implementation*, not a complex *architecture*.

>Even now, there are real money issues in memory alignment. If you have
>a system with 100 megabytes of main memory and "correct" alignment makes
>it 150 MB, you have just made the system 50% more expensive. Or how about
>alignment bumping your memory requirement from 63K to 65K, causing extra chips
>and possible board layout problems (not to mention the cost)?

If you can't afford to go to 150 Mbytes (or, more likely, paging) or you can't afford to go to 65K of RAM (try getting such a small amount, I challenge you), then performance must not be the most important thing. By all means, then, you should allow un-naturally aligned data, and you can handle it in hardware or software, as you wish.
If performance is your first priority - which it pretty much is in everything but the cheapest embedded systems (which also account for the highest volume!) - then you *don't* want to allow un-natural alignment.

>Having the H/W be tolerant of alignment means a lot of flexibility in the
>design trade-offs.

?????

>Also, with more and more gates on a chip, it is conceivable that someone
>will put together a cache that can handle misalignment within a cache line,
>as long as the whole data item is in the same line. I.e., data can cross

The problem is not implementing hardware handling of misalignment; the problem is the performance implication. A misaligned load/store takes two accesses; a good compiler or programmer will know this and align the access whether the hardware can handle it or not. So what's the point of having the hardware? If data must be packed as tightly into memory as possible, then fine, but you must know that you are giving up performance. At that point, performance no longer is the first priority, so handling it in software is probably acceptable (with simple primitives like those of the MIPS processor, e.g.).

>a word boundary with little or no penalty, but crossing a line boundary will
>be slow or disallowed. With the trend to wider
>buses (i.e., wider line sizes), this may well make the performance penalty
>of misalignment negligible.

I'm not sure how feasible it is to force the compiler/programmer to know whether or not data is going to cross a cache line. Things like dynamically allocated data structures might be a problem; this would take some thought. But, without much thought needed, I do know that the hardware needed to permit misaligned access within a cache line is likely to make cache access slower. Since cache access is probably the limiting stage in integer pipelines (maybe not in FP, but maybe), this is not a good idea.
stevew@wyse.wyse.com (Steve Wilson xttemp dept303) (03/21/89)
In article <355@bnr-fos.UUCP> schow@bnr-public.UUCP (Stanley Chow) writes:
> ...lotsa stuff deleted.
>
>Caches (especially data caches) top out very easily. For many Unix-box
>type applications, we have already reached the point of vastly diminished
>returns. Adding more cache won't bring your performance up. (Or I could
>talk about our application, where the diminishing returns start *before*
>you start adding cache.)

Before you start giving away the on-chip caches, let's get them to a respectable size first. The average on-chip cache we are seeing now is 512 bytes to 1K bytes for both D and I. Hit rates of 80% for the I-cache and 40-50% for the D-cache are prevalent. We still have a way to go in this arena from what I can see. (Yeah, I know the 88200 has 16K, but it isn't the same chip as the 88100... ;-)

Caches are a proven method of keeping off-chip accesses to a minimum, and we still haven't been able to put really large caches on the same chip as the CPU. Let's really saturate this path first...

Steve Wilson

These are my opinions, not those of my employer.
w-colinp@microsoft.UUCP (Colin Plumb) (03/21/89)
schow@bnr-public.UUCP (Stanley Chow) wrote:
> [A lot of things about addressing modes I agree with]
> Even now, there are real money issues in memory alignment. If you have
> a system with 100 megabytes of main memory and "correct" alignment makes
> it 150 MB, you have just made the system 50% more expensive. Or how about
> alignment bumping your memory requirement from 63K to 65K, causing extra
> chips and possible board layout problems (not to mention the cost)?
>
> Having the H/W be tolerant of alignment means a lot of flexibility in the
> design trade-offs.

And a lot of headaches in the cache-miss and page-fault recovery departments.

In any structure, if you rearrange the components, you can lose at most n-1 bytes to padding, where n is the strictest alignment restriction. For most processors, the worst case is a double and a char: 7 bytes out of 16 wasted. But if this is a major concern, rewrite the code to use two parallel arrays. You'll waste at most 7 bytes total (in your 100 meg). In C, this is a bit of a bother, but not too bad.

I think requiring alignment is one thing that'll never go out of style. On any chip, you want to do it because it's more efficient, anyway. The only need for unaligned accesses is to handle old data formats, which presumably need old programs run on them, which will (except in pathological cases) run faster on the new machine anyway.

--
-Colin (uunet!microsoft!w-colinp)

"Don't listen to me.  I never do." - The Doctor
henry@utzoo.uucp (Henry Spencer) (03/22/89)
In article <355@bnr-fos.UUCP> schow@bnr-public.UUCP (Stanley Chow) writes:
>With enough gates, the CPU will get more functional units; this means
>small RISC instructions will not be able to keep all the functional
>units busy....

Right, we will find things like VLIW getting more popular, assuming that the compiler technology is up to it. (It's not clear that Intel's is.) However, we will *not* find dedicated adders being thrown in just for address arithmetic -- we will find extra ALUs that *can* be used for that but can also be used for other things. We will not find autoincrement addressing modes coming back, but we may find VLIWish machines that can do the memory access and address-register increment in the same cycle, as two separate operations independently controlled by the program. Autoincrement addressing modes simply aren't a worthwhile investment.

>Having the H/W be tolerant of alignment means a lot of flexibility in the
>design trade-offs.

It's just as easy to have the software cope with it (either directly or via the compiler generating special code) in the rare cases where it is needed. This lets occasional needy software use it, *without* investing any hardware complexity in it.

>Also, with more and more gates on a chip, it is conceivable that someone
>will put together a cache that can handle misalignment in the cache...

Sure. But who would *bother*? It's just not worth it.

--
Welcome to Mars! Your         | Henry Spencer at U of Toronto Zoology
passport and visa, comrade?   | uunet!attcan!utzoo!henry henry@zoo.toronto.edu
bcase@cup.portal.com (Brian bcase Case) (03/22/89)
>Well, I'll have to slightly disagree here. Auto-increment does not cost
>another adder (for my particular definition of auto-increment); it just
>writes the result of the effective-address calculation back to the base
>register. If you

Yes, this is the i860's way of doing autoincrement for the floating-point memory-reference instructions.

> Note that if you have a superscalar architecture, and can do two inst.
>in parallel (see the Intel 80960CA paper in the newest Compcon proceedings),
>you can do this kind of thing as a matter of course; but it's a lot more
>expensive to do it that way - you need a separate read port and adder, as
>well as a write port.

Exactly my point about superscalar. But note that for the expense of the added data path (I assume it is essentially a duplicate of the primary integer (and/or) floating-point pipe), you can now execute *any two* instructions that don't have deleterious dependencies. Sure, adding only the hardware needed for auto-increment is cheaper, but do you really want that garbage in your architecture forever? When you do go to a superscalar implementation (and you will, whoever you are, just to keep up with the Joneses), you now have two data paths that have the added complexity of auto-increment. Superscalar is a good argument for simple architectures.

>It's probably time to dust off those
>benchmarks and see how often it occurs, and how many cycles it will save.

Well, I'm all for simulation and experimentation. If it is better and the cost now and in the future is not prohibitive, then great. But it certainly isn't clear that auto-increment is the right thing! My position is that it is reasonably clear that one should be skeptical.

> Since this kind of operation is used almost exclusively inside a loop,
>it has quite a bit of leverage.

Yes, this is true. This is why one would like to look at the idea seriously.
> Besides, who says you can't find something else to
>do with the extra write port when you're not doing address updates?

I'm surprised to hear you say that! I think a more realistic outlook is to say, "Besides, who says you *can* find something else to do with the extra write port?" Conjecturing, instead of proving, that an added feature (with a significant cost) can be used for something else does not constitute the rigorous pursuit of good computer architecture. Shame! :-) :-) :-) :-) :-)
stevew@wyse.wyse.com (Steve Wilson xttemp dept303) (03/22/89)
In article <16058@cup.portal.com> bcase@cup.portal.com (Brian bcase Case) writes:
>I claim a much better use of multiple functional units is to execute many
>"small" RISC instructions at the same time, i.e., "superscalar" or
>multiple instructions per clock. It just doesn't make sense to bundle -
>bind is a better word - many operations into one instruction. Doing so
>simply thwarts compiler optimization. Addressing modes are probably the
>worst form of semantic binding, in my opinion. So, if we are going to
>have "too many transistors," we should use them to realize a superscalar
>*implementation*, not a complex *architecture.*

If anything, VLIW is more RISCy than RISC, in the sense that it exposes all of the functional units' pipelines to the compiler. You just can't say VLIW in the same sentence as "simply thwarts compiler optimization." You're denying the basis of why to go to VLIW in the first place.

The compiler has the option to "optimize" along straight-line code just as you suggest, or, in the innermost loop of an application, you can have multiple instances of the same loop running simultaneously. All due to the compiler! If anything, you've got more opportunity for compiler optimizations, not less.

Steve Wilson

The above opinions are mine, not those of my employer.
schow@bnr-public.uucp (Stanley Chow) (03/22/89)
In article <2160@wyse.wyse.com> stevew@wyse.UUCP (Steve Wilson xttemp dept303) writes:
>In article <355@bnr-fos.UUCP> schow@bnr-public.UUCP (Stanley Chow) writes:
>> ...lotsa stuff deleted.
>>
>>Caches (especially data caches) top out very easily. For many Unix-box
>>type applications, we have already reached the point of vastly diminished
>>returns. Adding more cache won't bring your performance up. (Or I could
>>talk about our application, where the diminishing returns start *before*
>>you start adding cache.)
>
>Before you start giving away the on-chip caches, let's get them to a
>respectable size first. The average on-chip cache we are seeing now is
>512 bytes to 1K bytes for both D and I. Hit rates of 80% for the I-cache
>and 40-50% for the D-cache are prevalent. We still have a way to go in
>this arena from what I can see. (Yeah, I know the 88200 has 16K, but it
>isn't the same chip as the 88100... ;-)
>
>Caches are a proven method of keeping off-chip accesses to a minimum,
>and we still haven't been able to put really large caches on the same
>chip as the CPU. Let's really saturate this path first...
>
>Steve Wilson
>
>These are my opinions, not those of my employer.

Different people have different ideas of what a "respectable" cache is. Most Unix-type programs/filters/... are pretty happy with small caches (by small, I mean <64K). This is of course how RISC (and CISC) workstations get their performance. Someone mentioned (in an article about vector processing and the i860, I think) that the 32-meg cache in an ETA machine is not enough for some applications.

Realistically, I do not foresee on-chip cache being expanded to a megabyte anytime soon. Going from 512 bytes to 4K to 16K to 64K will buy performance for some (even many) applications, but other applications will still be killed by the miss rate. I am suggesting that instead of spending the silicon real estate on bigger caches, other options may be open.
In particular, I am suggesting that more complex instructions can be a useful way to decrease bandwidth demand on the cache/memory system. John Mashey talks about the almost constant bandwidth needed per MIP. Caches are a way to raise the perceived bandwidth of the memory; I am suggesting that silicon can also be used to decrease the demand.

Stanley Chow       ..!utgpu!bnr-vpa!bnr-fos!schow%bnr-public
(613) 763-2831

Cache? Don't carry it, I use plastic money. [This way, my employer knows I have to go to conferences and courses to learn about this stuff. Unfortunately, this means they don't let me represent them either.]
mash@mips.COM (John Mashey) (03/22/89)
In article <1989Mar21.194914.3284@utzoo.uucp> henry@utzoo.uucp (Henry Spencer) writes:
...
>Autoincrement addressing modes simply aren't a worthwhile investment.

Really, this isn't an absolutely obvious conclusion. There are certainly instruction encodings, pipelines, and register file designs where the incremental cost of this might be almost zero, in which case one would certainly put it in. Of course, there might be later implementations where you'd be sorry. It's just like anything else: you have to simulate it and see if the cost is worth it. Depending on what you already have, it might or might not be worth it. At least HP thought it was OK, I think (HP PA).

...
>>Having the H/W be tolerant of alignment means a lot of flexibility in the
>>design trade-offs.
...
>It's just as easy to have the software cope with it (either directly or
>via the compiler generating special code) in the rare cases where it is
>needed. This lets occasional needy software use it, *without* investing
>any hardware complexity in it.

The alignment issue is MUCH nastier and more impactful on a design than auto-increment, because it's much more likely to complexify critical paths, the pipeline, and the cache design. (Yes, we certainly prefer to provide a few primitives and let the software do it!)

--
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
baum@Apple.COM (Allen J. Baum) (03/23/89)
[]

>In article <16080@cup.portal.com> bcase@cup.portal.com (Brian bcase Case) writes:
(after quoting me about auto-increment, its cost, etc. ....)
>Exactly my point about superscalar. But note that for the expense of the
>added data path (I assume it is essentially a duplicate of the primary
>integer (and/or) floating-point pipe), you can now execute *any two*
>instructions that don't have deleterious dependencies. Sure, adding only
>the hardware needed for auto-increment is cheaper, but do you really want
>that garbage in your architecture forever? When you do go to a super-
>scalar implementation (and you will, whoever you are, just to keep up
>with the Joneses), you now have two data paths that have the added
>complexity of auto-increment. Superscalar is a good argument for simple
>architectures.

If auto-increment is frequent enough, then it can be done in addition to executing *any two* operations at once. The leverage really hits you: a 10-instruction loop, including a couple of auto-incs, shrinks to 5 cycles if you can average two instructions/cycle. At very little cost in hardware (I assert this as a hardware-design type), maybe this shrinks to 4 cycles, a 20% saving. Try to get 20% some other way - it's real tough! Your mileage may vary, of course.

>>It's probably time to dust off those
>>benchmarks and see how often it occurs, and how many cycles it will save.
>
>Well, I'm all for simulation and experimentation. If it is better and the
>cost now and in the future is not prohibitive, then great. But it certainly
>isn't clear that auto-increment is the right thing! My position is that it
>is reasonably clear that one should be skeptical.

Um, that was my point also, although perhaps I lean towards less skeptical than you.

>> Since this kind of operation is used almost exclusively inside a loop,
>>it has quite a bit of leverage.
>
>Yes, this is true. This is why one would like to look at the idea seriously.
>
>> Besides, who says you can't find something else to
>>do with the extra write port when you're not doing address updates?
>
>I'm surprised to hear you say that! I think a more realistic outlook is
>to say, "Besides, who says you *can* find something else to do with
>the extra write port?" Conjecturing, instead of proving, that an added
>feature (with a significant cost) can be used for something else does not
>constitute the rigorous pursuit of good computer architecture. Shame! :-)
>:-) :-) :-) :-)

I did have something in mind for that hardware. I dispute the significant-cost issue: it is roughly equivalent to register-scoreboarding logic, and if you have that, the additional cost is small (again, I assert this in my capacity as a hardware-design type who has gone through the exercise). I didn't conjecture that it might be used for something else; I know it can, and I know the kind of speedup it will give me, as well as the extra cost to use it for that something else.

This is an exercise for the reader. Part A: what can an extra write port to a register file be used for (and what other hardware is required to make it useful)? Part B: now suppose this extra write port can be a read/write port?

--
baum@apple.com  (408)974-3385
{decwrl,hplabs}!amdahl!apple!baum
bcase@cup.portal.com (Brian bcase Case) (03/23/89)
>>It just doesn't make sense to bundle -
>>bind is a better word - many operations into one instruction. Doing so
>>simply thwarts compiler optimization. Addressing modes are probably the
>>worst form of semantic binding, in my opinion.
>
>If anything, VLIW is more RISCy than RISC in the sense that it exposes
>all of the functional units' pipelines to the compiler. You just can't
>say VLIW in the same sentence with "simply thwarts compiler optimization."

I didn't!!!! I'm sorry if there was some confusion. My reply, from which the above quote is taken, was to a plea for auto-increment. I would certainly never say that VLIW thwarts compiler optimization. I'm sorry if you misunderstood, but I was not knocking "semantic bundling" in the context of VLIW, but in the context of complex addressing modes and the like. VLIWs are sorta like superscalars, and for that similarity, I like them.
dps@halley.UUCP (David Sonnier) (03/24/89)
In article <21931@agate.BERKELEY.EDU> matloff@iris.ucdavis.edu (Norm Matloff) writes:
>
>The cost is certainly far from 0 if an incrementation has to be "undone"
>when a conditional branch is taken.
>
>   Norm

This is actually the critical point. It is very difficult to make auto-increment idempotent. However, the branch case is not the critical case (the compiler could handle that). The critical case is exception processing. In PDP-11 unix(TM), a great deal of software effort is required to un-increment registers in case of exceptions. The VAX adds a great deal of hardware to solve the same problem. The MIPS Rn000, on the other hand, simply defines all instructions to be idempotent, which means that if you get an exception, you can simply restart the instruction. With interrupt processing time being such a critical path in the system, I think not having auto-increment is a definite win (IMHO).

I think that the most important thing RISC has done for us is to move system design from "gut feeling" and "neat features" towards real engineering: measure the cost of each feature and quantify the trade-offs.

--
David Sonnier @ Tandem Computers, Inc.  14231 Tandem Blvd.  Austin, Texas 78728
...!{rutgers,ames,ut-sally}!cs.utexas.edu!halley!dps  (512) 244-8394
baum@Apple.COM (Allen J. Baum) (03/24/89)
[]
>In article <21931@agate.BERKELEY.EDU> matloff@iris.ucdavis.edu (Norm Matloff) writes:
>>In article xxx henry@utzoo.uucp (Henry Spencer) writes:
>>>Autoincrement addressing modes simply aren't a worthwhile investment.
>
>>Really, this isn't an absolutely obvious conclusion. There are certainly
>>intruction encodings, pipelines, register file designs where the incremental
> ^^^^^^^^^ ^^^^^^^^^^^
>>cost of this might be almost zero, in which case one would certainly put it in
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>
>The cost is certainly far from 0 if an incrementation has to be "undone"
>when a conditional branch is taken.
>
>   Norm

Why would an autoincrement be undone on a branch? Do you mean on a trap?
Besides, who says that the hardware cost for undoing the autoincrement is
significant? It could be that the register isn't actually updated until
all possible traps have been evaluated, in which case the cost IS zero.
His statement stands as it is written: there ARE pipelines where the
incremental cost IS zero (oh yes, not counting that pesky extra register
file port; that's when you have to decide if it is worth it).

It seems to me that optimizing compilers can figure out when to use
autoincrement, so by that (RISC) criterion, it shouldn't be counted out.
Only the statistics should count it out, and I ain't seen none (in either
direction, to be sure).
--
baum@apple.com  (408)974-3385
{decwrl,hplabs}!amdahl!apple!baum
bcase@cup.portal.com (Brian bcase Case) (03/24/89)
>If auto-increment is frequent enough, then it can be done in addition to
>executing *any two* operations at once. The leverage really hits you- a
>10 inst. loop, including a couple of auto-incs, shrinks to 5 if you can
>average two instructions/cycle. At very little cost in hardware (I assert
>this as a hardware design type), maybe this shrinks to 4 insts., a 20%
>saving. Try to get 20% some other way- it's real tough! Your mileage may
>vary, of course.

Allen, you got me. (This is quite fair since I said the same thing, "try
getting 20% some other way," about something I felt strongly about in an
internal report when Allen and I worked for the same company!) You are
quite right, if it really does make a 20% difference. However, "your
mileage may vary" is the right caveat: can that loop really be executed
at 2 inst. per cycle if some of the parallelism is taken away by adding
the autoinc? I don't know the answer; I am just in violent agreement with
you (and John Mashey): you must simulate and measure and think, and then
be able to predict the future :-).

>I did have something in mind for that hardware. I dispute the significant
>cost issue- it is roughly equivalent to register scoreboarding logic, and
>if you have that, the additional cost is small (again, I assert this in
>my capacity as a hardware design type that has gone through the exercise).
>I didn't conjecture that it might be used for something else, I know it
>can, and I know the kind of speedup it will give me, as well as the extra
>cost to use it for that something else. This is an exercise for the
>reader- Part A: what can an extra write port to a register file be used
>for (and what other hardware is required to make it useful)? Part B: Now,
>suppose this extra write port can be a read/write port?

Oh, if you already have the answer, what it will also speed up, then I
stand corrected. Maybe I am thinking about a different set of
implementation trade-offs. What is the answer to your exercise? Does it
have anything to do with loads/stores?

I don't mean to say that autoincrement is *absolutely* wrong; I don't
know all the possible implications for every architectural-cross-implementation
approach. But without proof that it is good, I tend to be skeptical.
(I guess you can tell! ;-) :-) Will/can you say how it fits in and why it
is very good to have?
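Baum's 20% figure in the quote above can be checked with a little
arithmetic. The instruction counts are his; the helper function is a
sketch of mine:

```python
import math

def loop_cycles(n_insts, issue_width):
    # cycles per loop iteration at a given sustained issue rate
    return math.ceil(n_insts / issue_width)

base   = loop_cycles(10, 2)      # 10-inst loop at 2 insts/cycle: 5 cycles
folded = loop_cycles(10 - 2, 2)  # fold 2 auto-incs into other insts: 4
saving = 1 - folded / base       # the 20% Baum claims
```

Whether the remaining eight instructions really still sustain two per
cycle is exactly Case's caveat about lost parallelism.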
news@bnr-fos.UUCP (news) (03/24/89)
In article <13@microsoft.UUCP> w-colinp@microsoft.uucp (Colin Plumb) writes:
>schow@bnr-public.UUCP (Stanley Chow) wrote:
>
>In any structure, if you rearrange the components, you can lose at most
>n-1 bytes to padding, where n is the strictest alignment restriction. For
>most processors, the worst case is a double and a char, 7 bytes out of 16
>wasted. But if this is a major concern, rewrite the code to use two parallel
>arrays. You'll waste at most 7 bytes total (in your 100Meg).

I only wish this were true. Many, many applications have "natural" data
structures that are inconvenient to align. Using parallel arrays means
more obscure code and more indexing time. There are of course other
problems in multi-processing (like keeping different bits of a word from
being written by different processors or processes).

Typically, the worst problems come from many copies of a small data
structure that is not a nice multiple of the word size. A million copies
of a 33-bit structure wastes 4 megabytes, since it is not possible to
pack them together. For example, gate-level simulation typically has an
array of gate descriptions with connectivity and state information. The
natural (or logically clear) ordering of the fields is probably not the
most compact ordering.

>In C, this is a bit of a bother, but not too bad. I think requiring alignment
>is one thing that'll never go out of style. On any chip, you want to do it
>because it's more efficient, anyway. The only need for unaligned accesses is
>to handle old data formats, which presumably need old programs run on them,
>which will (except in pathological cases) run faster on the new machine
>anyway.

Ah, but this is precisely the point. Many old programs *need* misaligned
accesses. If you don't allow that, the old programs will not run at all!
Incidentally, the historical trend is to be progressively more tolerant
of misalignment, e.g. the IBM /360 and /370 and Motorola 68K families.
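Plumb's padding arithmetic quoted above can be checked directly. This
sketch uses Python's ctypes to stand in for a C compiler's layout rules;
the exact sizes are ABI-dependent, which is part of Chow's point:

```python
import ctypes

class Mixed(ctypes.Structure):
    # a double and a char interleaved: padded to the double's alignment
    _fields_ = [("d", ctypes.c_double), ("c", ctypes.c_char)]

N = 1_000_000
interleaved = ctypes.sizeof(Mixed) * N         # typically 16 bytes each
parallel = (ctypes.sizeof(ctypes.c_double)
            + ctypes.sizeof(ctypes.c_char)) * N  # 9 bytes per element
# On a typical 64-bit ABI the parallel layout saves ~7 MB per million
# elements -- at the cost Chow notes: obscurer code and extra indexing.
```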
All the "tolerant" machines always attach a *penalty* to misalignment. It
is only the very recent crop of so-called RISC chips that is requiring
alignment again. (Please note that I said historical *trend*, not *all*
CPU families.)

------------------

In article <16058@cup.portal.com> bcase@cup.portal.com (Brian bcase Case)
writes: ('>>' is Case quoting from my article)

>>Caches (especially Data-cache) top out very easily. For many Unix-box
>>type applications, we have already reached the point of vastly-diminished
>>return. Adding more cache won't bring your performance up. (Or I could
>>talk about our application where the diminishing return starts *before*
>>you start adding cache.)
>
>Maybe so, but I think that we have a little farther to go than 4K
>instruction and 8K data, virtual no less. Even if we have 2 million
>transistors, I don't think the caches are going to be "too big."

The point is that for small applications, the existing workstations are
already getting hit rates in the high 90's. Some big applications will
thrash any cache, no matter how big; some will only be happy with
megabyte caches. I am not saying the caches will be too big, I am saying
that there are different kinds of applications:

 - the small ones that are already running very fast with a 64K cache
 - the medium ones that will run faster with a bigger (say 256K) cache
 - the large ones that will not run any faster until you get to 128M

By putting more cache on chip (and this obviously depends on the
board-level cache and memory system), the small ones will not speed up
and the large ones will not speed up. Only a portion of the medium
applications will run faster, by varying amounts. Depending on which
applications you care about, the bigger cache may or may not be worth it
to you.

>>It is precisely because the CPU is running faster than memory (even
>>cached memory) that one has to maximize the amount of work done in
>>each memory cycle.
>
>"Running faster than memory" is a very misleading statement. Maybe
>latencies are a problem, but bandwidth can be had in abundance. The
>problem is what to do with it, not how to get it!

By "the CPU running faster than memory", I mean that within the current
semiconductor and PCB processes, we can build:

 - very fast ALU functions
 - very fast on-chip register files
 - not so fast on-chip cache (I & D)
 - slow off-chip (but on-board) memory
 - very slow off-board memory

As a result, the problem is getting instructions into the pipeline, not
the execution of them. Latency is a very bad problem; bandwidth is merely
a bad problem. I'd like to know why you think "bandwidth can be had in
abundance".

>>Adding address modes does not mean a difficult machine to pipeline or
>>to design. Don't assume that all architectures with many addressing modes
>>will be as messy as the VAX instruction encoding.
>
>It's true that not all architectures with many addressing modes will be as
>messy as the VAX. However, you simply have to answer the question: are
>those addressing modes, beyond register+register and register+offset,
>really buying you anything?

That is of course the $64K question. As other people have pointed out,
the question is worth at least serious simulation. I suspect everyone
will come up with different answers anyway, but it seems to me premature
to dismiss it.

>>With enough gates, the CPU will get more functional units; this means
>>small RISC instructions will not be able to keep all the functional
>>units busy. The i860 solves this by essentially having a short VLIW mode
>
>I thought RISC instructions were too big! :-) :-) Note that the i860's
>dual-instruction mode is essentially a VLIW mode.

Do I hear an echo here? :-)

>I claim a much better use of multiple functional units is to execute many
>"small" RISC instructions at the same time, i.e., "super scalar" or
>multiple instructions per clock.
>It just doesn't make sense to bundle, bind is a better word, many
>operations into one instruction. Doing so simply thwarts compiler
>optimization. Addressing modes are probably the worst form of semantic
>binding, in my opinion. So, if we are going to have "too many
>transistors," we should use them to realize a superscalar
>*implementation*, not a complex *architecture.*

Since you believe there is lots of bandwidth to burn, your conclusions
above are quite logical. Since I believe there is insufficient bandwidth
now, I disagree.

>>Even now, there are real money issues in memory alignment. If you have
>>a system with 100 megabytes of main memory and "correct" alignment makes
>>it 150 MB, you have just made the system 50% more expensive. Or how about
>>alignment bumps your memory requirement from 63K to 65K, causing extra
>>chips and possible board layout problems (not to mention the cost)?
>
>If you can't afford to go to 150 Mbytes (or more likely, paging) or you
>can't afford to go to 65K of RAM (try getting such a small amount, I
>challenge you), then performance must not be the most important thing.
>By all means, then, you should allow un-naturally aligned data and you
>can handle it in hardware or software, as you wish. If performance is
>your first priority, which it pretty much is in everything but the
>cheapest embedded systems (which also accounts for the highest volume!),
>then you *don't* want to allow un-natural alignment.

Am I the only one who worries about production cost and performance/cost
ratios? I thought only the US government gets to go for maximum
performance at any cost. If I can put out a product at maximum
performance with 150 Mbytes, or another product at 95% performance with
100 Mbytes (thereby costing only 75%), how do you think the decision will
go? Performance may or may not be "the most important" criterion; I have
never worked on a project where performance is the *only* criterion.
From: schow@bnr-public.uucp (Stanley Chow)
Path: bnr-public!schow

Stanley Chow   ..!utgpu!bnr-vpa!bnr-fos!schow%bnr-public   (613) 763-2831
I do not represent anyone except myself. Even then, I don't often let me
represent myself.
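Chow's diminishing-returns point about caches can be illustrated with the
standard average-memory-access-time formula. The cycle counts below are
illustrative assumptions, not measurements from any machine in this
thread:

```python
def amat(hit_rate, hit_time=1.0, miss_penalty=20.0):
    # average memory access time, in cycles
    return hit_time + (1.0 - hit_rate) * miss_penalty

# Once hit rates are "in the high 90's", each extra point of hit rate
# buys less: going from 90% to 95% saves a full cycle per access, but
# 98% to 99% saves only 0.2 cycles.
gain_low  = amat(0.90) - amat(0.95)
gain_high = amat(0.98) - amat(0.99)
```

This is the shape behind "adding more cache won't bring your performance
up": a bigger cache only moves the hit rate, and the curve flattens.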
aglew@mcdurb.Urbana.Gould.COM (03/25/89)
>With interrupt processing time being such a critical path in the system,
>I think not having auto-increment is a definite win (IMHO).
>--
>David Sonnier @ Tandem Computers, Inc. 14231 Tandem Blvd. Austin, Texas 78728
>...!{rutgers,ames,ut-sally}!cs.utexas.edu!halley!dps (512) 244-8394

What data can people provide that says that interrupts are a critical
path in the system?

That didn't come out quite right - how do you distinguish between [A] the
work that has to be done *some* time, which you can choose to do at
interrupt time or elsewhere, and [B] the work that is inherent in
processing interrupts, which can be minimized by making the interrupt
dispatch/return more efficient?

Several comments to the effect that "the work the hardware does in
interrupt dispatch is only a minuscule fraction of our interrupt
bottleneck" make me think that [A] is the bottleneck. If so, then how
important would a bit of work in the interrupt handler, to decode and
back up autoincremented registers, really be? I.e., how important is [B]
if [B] << [A]? (Especially if you have an instruction set that is easy to
decode, unlike the VAX's, so the autoincrement would be easy to back up.
BTW, on Gould machines we had to do a little bit of instruction decoding
to restore state on an interrupt (usually just +/- 2 or 4 on the PC) --
it was fast and easy.)

Andy "Krazy" Glew   aglew@urbana.mcd.mot.com   uunet!uiucdcs!mcdurb!aglew
Motorola Microcomputer Division, Champaign-Urbana Design Center
1101 E. University, Urbana, Illinois 61801, USA.

My opinions are my own, and are not the opinions of my employer, or any
other organisation. I indicate my company only so that the reader may
account for any possible bias I may have towards our products.
alan@rnms1.paradyne.com (0000-Alan Lovejoy(0000)) (03/27/89)
An idea: Suppose every machine instruction for some hypothetical machine
were of the form

    <load/store op><non-load/store op> Ra, Rb, Rc, Rx, Ry, Rz

where Ra is the source/destination of a load/store, Rb is a base address,
Rc is an optional index, Rx is the destination of some operation (e.g.,
addition), and Ry and Rz are operands. For example:

    LD.B/AND.L R1, (R2), (R3), R4, R4, R5;  R1.B := *(R2 + R3), R4.L &= R5.L

Such instructions name six registers. If there are 32 (visible)
registers, then 30 bits are required per instruction just to specify all
the registers. If all instructions were...say...forty-eight bits long,
then this machine should achieve 25% more work per instruction bit than
conventional RISCs. If the machine fetches 48 bits of instruction per
cycle, and can execute both the load/store and register-data operations
in one cycle, then it also gets UP TO twice as much work done per cycle
as conventional RISCs. Since the ALU operations and the data access
function can easily be made to operate in parallel, executing such
instructions in one cycle should not be difficult. Given the frequency of
load/store operations, this could be a very big win. Comments?

Alan Lovejoy; alan@pdn; 813-530-2211; AT&T Paradyne: 8550 Ulmerton, Largo, FL.
Disclaimer: I do not speak for AT&T Paradyne. They do not speak for me.
__American Investment Deficiency Syndrome => No resistance to foreign invasion.
Motto: If nanomachines will be able to reconstruct you, YOU AREN'T DEAD YET.
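The encoding budget in the proposal above works out as follows. The
register count and instruction width are the poster's; the helper
function is a sketch:

```python
import math

def field_bits(n_regs, n_fields):
    # bits needed to name n_fields registers out of a file of n_regs
    return n_fields * math.ceil(math.log2(n_regs))

reg_bits    = field_bits(32, 6)   # six 5-bit register fields = 30 bits
opcode_room = 48 - reg_bits       # 18 bits left for the two opcodes etc.

# Two operations in one 48-bit instruction, versus two 32-bit RISC
# instructions (64 bits), uses 25% fewer instruction bits for the
# same work -- the "25% more work per instruction bit" claim.
bit_saving = 1 - 48 / 64
```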
bcase@cup.portal.com (Brian bcase Case) (03/29/89)
>If all instructions were...say...forty eight bits long, then this machine
>should achieve 25% more work per instruction bit than conventional RISCs.

Two comments: non-power-of-two-sized instructions are not a good idea
(you can't fit an integral number into a page, unless your pages are
non-power-of-two-sized, and that is a *bad* idea). Also, you should look
at Wulf's WM machine; his architecture is designed along the same lines,
i.e., two things per instruction, but WM has 32-bit instructions and two
ALU ops per instruction and data streaming (i.e., loads and stores happen
without an explicit initiating load or store instruction). Bill, if
you're out there, what's the status of the WM machine?

Your idea is a good one: it is VLIW, in some sense. However, it is
probably not possible to keep the load/store side of the instruction busy
a large fraction of the time in general code, although scientific codes
can probably take advantage of this "vector" capability.
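Case's page-packing objection can be made concrete. The 4 KB page size
below is an assumption for illustration; neither poster names one:

```python
PAGE_BITS = 4096 * 8   # an assumed 4 KB page, in bits

def fits_evenly(inst_bits):
    # does an integral number of instructions fit in one page?
    return PAGE_BITS % inst_bits == 0

packs_32 = fits_evenly(32)  # 32-bit instructions pack a page exactly
packs_48 = fits_evenly(48)  # some 48-bit instruction must straddle a
                            # page boundary, complicating fetch and faults
```

Any power-of-two instruction size divides a power-of-two page; 48 bits
never does, whatever the page size.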