ram@nucsrl.UUCP (05/03/87)
The Japanese are tagging along with 32 bit machines:

This is some info for the interested:

1. From Electronics of Apr 16.

   Japan's first microprocessor with 32-bit data and address buses
   runs at a maximum rate of 6.6 MIPS - 4 MIPS is typical of 32-bit
   chips (Roger, Brian, John - smile here).  Fabricated in 1.5um CMOS,
   the NEC V70 has dynamic bus sizing that enables it to match
   input/output with 8, 16, and 32 bit buses.  Its TRON (we are entering
   the space age) OS will make it shine in real-time control and robot
   applications.  The 20MHz V70 also incorporates FP(!!) facilities
   on-chip and has a function redundancy monitor for fault-tolerant
   computing.  The sample price of the device is $687.52.  Prices will
   be lower for production quantities.

   The spec on FP on-chip is rather confusing.  I doubt that a hardware
   FP unit exists on the chip.  Any clarifications?  Also, will it be
   compatible with the iAPX386 :-).

2. From the same magazine:

   Japan's NTT has the world's fastest Lisp processor.  The machine,
   called ELIS, is a dual-processor machine: one 68010 used as a
   front-end processor and a micro-programmed Lisp processor.  From the
   diagram, I see an FP accelerator included with the machine.  Claimed
   performance: 1 M basic Lisp instructions/sec.  [What the hell is a
   basic Lisp instruction?  (car '(a b c))?  -artificialstone :-)]
   It has 128M of memory and a multiple-paradigm language - Tao (has
   flavors of Common Lisp, Prolog and Smalltalk).  The micro-program
   control store is 64K of 64-bit words plus a 1K register stack (of
   32-bit size).  The 68010's software is written in C; the Lisp
   processor's is written in Tao.

   I lost my data sheets/info on TI's Lisp processor to compare these
   two.  Anybody care to?

3. Mitsubishi's 32-bit mP for their TRON project.  Some details:

   1. No TLB
   2. RISCy (at least adhering to lean cycles, fixed-format simple
      instructions)
   3. 1um CMOS
   4. 25-33 MHz
   5. 5-stage pipeline
   6. Instruction queue size - 16 bytes
   7. 3-bus ALU
   8. Branch prediction ideas seem to be modelled after Hennessy's &
      A.J. Smith's work.
   9. "A high speed memory for saving contexts can also be
      incorporated."

One caveat about complex CPUs.  [Notice how the big semiconductor
manufacturers have started announcing like GM/Ford/Chrysler (1988 car in
1987, 1989 car in 1988 - wonder what type of calendar they use), with the
chip of the future today.]  Judging by AMD and NSC announcements, it
takes at least 1 yr for silicon to be available after the initial
announcement, and given the complexity of these chips, it takes at least
another year for the bugs to be weeded out before the chip hardware
becomes stable (386, 32X32).  Do we need more complex designs on a single
wafer, or should we go for "small is beautiful"?

-------------------
Renu Raman                          UUCP: ...ihnp4!nucsrl!ram
1410 Chicago Ave., #505             ARPA: ram@eecs.nwu.edu
Evanston IL 60201                   AT&T: (312)-869-4276
pec@necntc.NEC.COM (Paul Cohen) (05/04/87)
> This is some info for the interested:
>
> 1. From Electronics of Apr 16.
>
>    Japan's first microprocessor with 32-bit data and address buses
>    runs at a maximum rate of 6.6 MIPS - 4 MIPS is typical of 32-bit
>    chips (Roger, Brian, John - smile here).  Fabricated in 1.5um CMOS,
>    the NEC V70 has dynamic bus sizing that enables it to match
>    input/output with 8, 16, and 32 bit buses.  Its TRON (we are
>    entering the space age) OS will make it shine in real-time control
>    and robot applications.  The 20MHz V70 also incorporates FP(!!)
>    facilities on-chip and has a function redundancy monitor for
>    fault-tolerant computing.  The sample price of the device is
>    $687.52.  Prices will be lower for production quantities.
>
>    The spec on FP on-chip is rather confusing.  I doubt that a
>    hardware FP unit exists on the chip.  Any clarifications?  Also,
>    will it be compatible with the iAPX386 :-).

The V70's floating point is implemented not as a separate unit, but with
additional microcode.  There are some refinements of the ALU to
facilitate this, such as a 64-bit barrel shifter.  32- and 64-bit IEEE
standard basic arithmetic functions (add, subtract, multiply, divide,
convert) are supported on-chip.  A separate floating point unit for
80-bit operations and specialized functions such as transcendental and
matrix operations will be available in the near future.

The V70 is  * * * N O T * * *  386 compatible.  It has thirty-two
general-purpose 32-bit registers, an orthogonal instruction set and
generally a much cleaner architecture that is not hampered by decisions
made years ago for 16-bit machines.  (I am talking about the native-mode
V70: it does have an emulation mode so that it can execute V30 machine
code, which is a superset of 80186 code.)  The V70 has an on-chip MMU so
that virtual memory accesses can take place in two clocks.  The
processor also supports bit addressing and has some interesting (bit,
byte and half-word) string instructions.  The V70 is identical, except
for the bus interface, to the V60.

You can get a V60 programmer's reference manual and a V70 datasheet by
calling

	1-800-NEC-ELEC  (California)
	1-800-NEC-ELE1  (Outside California)

These telephones are supported only during west coast working hours.

> One caveat about complex CPUs.  [Notice how the big semiconductor
> manufacturers have started announcing like GM/Ford/Chrysler (1988 car
> in 1987, 1989 car in 1988 - wonder what type of calendar they use),
> with the chip of the future today.]  Judging by AMD and NSC
> announcements, it takes at least 1 yr for silicon to be available
> after the initial announcement, and given the complexity of these
> chips, it takes at least another year for the bugs to be weeded out
> before the chip hardware becomes stable (386, 32X32).  Do we need more
> complex designs on a single wafer, or should we go for "small is
> beautiful"?

V70 silicon already exists and is fairly functional (after all, the V70
is not much different from the V60).  Customer samples of the V70 should
be available in the third quarter of this year.  If you are seriously
interested, I can arrange for V60 samples now.
mash@mips.UUCP (John Mashey) (05/05/87)
In article <3810030@nucsrl.UUCP> ram@nucsrl.UUCP (Renu Raman) writes:
>...
> 3. Mitsubishi's 32-bit mP for their TRON project.  Some details:
>    1. No TLB
>    2. RISCy (at least adhering to lean cycles, fixed-format simple
>       instructions)....

Re: TRON: See IEEE Micro, April 1987; the whole issue is on the TRON
project; esp. Ken Sakamura, "Architecture of the TRON VLSI CPU", 17-31.

I wouldn't call the TRON architecture RISCy [note that this isn't saying
good or bad, just that it tends not to be much like what most people
think of as RISC machines]:

Fixed-format instructions: not exactly.  There are 16-bit instructions
(short form), and then there are arbitrary-sized ones that are multiples
of 16 bits.  From my reading, they appear to allow arbitrary cascading
of indirect addressing [like the Sperry 1100s, for example], which has
interesting implications for pipelining.  Thus, their addressing appears
more complex than a 68020's.

The architecture specifies a bunch of user-level instructions which
compilers will find difficult to generate: reverse-byte-order,
search-for-zero-or-one, bitmap operations [not just bitfields, bit
maps], string operations [including search for substring!], queue
manipulation [insert, delete, search].  It also specs BCD operations.

Note: I mean no criticism of the design, but if you call it RISC, then
almost no machine is a CISC!  In fact: ``What's a RISC?''  ANS: any
machine announced since 1983.  [This is clearly true; we've even been
reading lately that the Motorola 68030 really has a lot of features
expected to be found only on RISC machines.  In particular, "One of the
most basic concepts of RISC architectures is that of hardware support
for instructions.  The MC68020/MC68030, although not RISC processors,
have an impressive amount of on-chip hardware for special instructions."
T. L. Johnson, "The RISC/CISC Melting Pot", Byte, April 87.  Huh?  I
always thought CPUs were there to provide hardware support for
instructions.... sigh.]
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{decvax,ucbvax,ihnp4}!decwrl!mips!mash, DDD: 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
geo@necis.UUCP (05/05/87)
In article <3810030@nucsrl.UUCP> ram@nucsrl.UUCP (Renu Raman) writes:
>
> The Japanese are tagging along with 32 bit machines:
>
> This is some info for the interested:
>
> 1. From Electronics of Apr 16.
>
>    Japan's first microprocessor with 32-bit data and address buses
>    runs at a maximum rate of 6.6 MIPS - 4 MIPS is typical of 32-bit
>    chips (Roger, Brian, John - smile here).  Fabricated in 1.5um CMOS,
>    the NEC V70 has dynamic bus sizing that enables it to match
>    input/output with 8, 16, and 32 bit buses.  Its TRON (we are
>    entering the space age) OS will make it shine in real-time control
>    and robot applications.  The 20MHz V70 also incorporates FP(!!)
>    facilities on-chip and has a function redundancy monitor for
>    fault-tolerant computing.  The sample price of the device is
>    $687.52.  Prices will be lower for production quantities.
>
>    The spec on FP on-chip is rather confusing.  I doubt that a
>    hardware FP unit exists on the chip.  Any clarifications?  Also,
>    will it be compatible with the iAPX386 :-).

I have some preliminary specs on the V70 and I quote:

	"On-Chip Floating Point Support
	 - IEEE 32 and 64-Bit Data Types"

It also includes:

	On-Chip MMU
	On-Chip instruction and data cache

I wonder when we will see an On-Chip Hard Disk ;-} ?

The V-70 looks like a hell-of-a-chip.  Some other misc. hype:

	- 32 General Registers (all 32 bits, of course)
	- Symmetric Instruction Set
	- 20 Addressing Modes
	- Variable Byte Length Format
	- Virtual Memory
		4.3 GB Virtual Space / task
		2-level paging
		16-Entry Full Association

It's not iAPX386 compatible; this looks like a NICE architecture!!

				--geo

--Opinions??  Yeah, I have that in my car; Rack and Opinion Steering.--
----- george aguiar  < UUCP: necis!geo >
lm@cottage.WISC.EDU (Larry McVoy) (05/05/87)
In article <491@necis.UUCP> geo@necis.UUCP (George Aguiar ext. 219) writes:
>The V-70 looks like a hell-of-a-chip.  Some other misc. hype:
>
>	- 32 General Registers (all 32 bits, of course)
>	- Symmetric Instruction Set
>	- 20 Addressing Modes
>	- Variable Byte Length Format

Ummm, not to rain on your parade or anything - but I have real problems
with the last two.

20 addressing modes?  That's a lot of logic.  And I'll bet they support
stuff like embedded displacements in the instruction stream (I'm not
talking about 4-bit constants, I'm talking about things like National
and Motorola do with the top bits of their byte, word, and long word
displacements).  That can cost you - you might not know up front how
long the instruction is, so your decoder might start decoding the
previous instruction's displacement(s).

Similar problem with the variable length format.  Unless they got smart
and put the whole length in the first byte, you have to delay the logic
that looks at the trailing part of the instruction.  This has messy
implications when you consider the pipeline, does it not?

It looks like the 29K may have made some smart moves....
---
Larry McVoy	        lm@cottage.wisc.edu  or  uwvax!mcvoy

"What a wonderful world it is that has girls in it!"  -L.L.
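[Aside: a small C sketch to make the decoder point concrete.  The
encoding below is made up for illustration - it is not the V70's or
anybody else's - and simply assumes the low four bits of the first byte
give the instruction's total length.  With that property a prefetcher
can find instruction boundaries without decoding any operand
specifiers; if the length only emerges from the operand encodings
themselves, this loop cannot run ahead of the decoder and the pipeline
pays for it.]

	#include <stddef.h>
	#include <stdint.h>

	/* Hypothetical format: low 4 bits of the first byte = total
	 * length in bytes (1..15). */
	static size_t instr_length(const uint8_t *p)
	{
	    size_t len = p[0] & 0x0f;
	    return len ? len : 1;	/* guard against a zero length */
	}

	/* Walk the code stream and record instruction boundaries.
	 * Because the length comes from the first byte alone, this runs
	 * without ever touching the operand bytes, so the fetch logic
	 * can stay ahead of the decoder. */
	size_t mark_boundaries(const uint8_t *code, size_t nbytes,
	                       size_t *bounds, size_t max_bounds)
	{
	    size_t pc = 0, n = 0;
	    while (pc < nbytes && n < max_bounds) {
	        bounds[n++] = pc;
	        pc += instr_length(code + pc);
	    }
	    return n;
	}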
pec@necntc.NEC.COM (Paul Cohen) (05/06/87)
>In article <3810030@nucsrl.UUCP> ram@nucsrl.UUCP (Renu Raman) writes:
>> Japan's first microprocessor with 32-bit data and address buses
>> The NEC V70 has dynamic bus sizing that enables it to match
>> input/output with 8, 16, and 32 bit buses.

>I have some preliminary specs on the V70 and I quote:
>	"On-Chip Floating Point Support
>	 - IEEE 32 and 64-Bit Data Types"
>	On-Chip MMU
>	On-Chip instruction and data cache

Sorry on this one point.  The preliminary specs you quote are in fact
very preliminary.  In order to get the product to market sooner, the V70
was designed without these caches.  Really, this is just a change of
names, since the V70 + cache and other goodies will follow, but with a
different name, possibly V80.

>The V-70 looks like a hell-of-a-chip.  Some other misc. hype:
>	- 32 General Registers (all 32 bits, of course)
>	- Symmetric Instruction Set
>	- 20 Addressing Modes
>	- Variable Byte Length Format
>	- Virtual Memory
>		4.3 GB Virtual Space / task
>		2-level paging
>		16-Entry Full Association
>It's not iAPX386 compatible; this looks like a NICE architecture!!

Say that again:

	*****************************************
	*****************************************
	***                                   ***
	***   THE V70 IS NOT 386 COMPATIBLE   ***
	***                                   ***
	***   THE V60 IS NOT 286 COMPATIBLE   ***
	***                                   ***
	*****************************************
	*****************************************

	this is a NICE architecture
pec@necntc.NEC.COM (Paul Cohen) (05/06/87)
In article <1157@cottage.WISC.EDU> lm@cottage.WISC.EDU (Larry McVoy) writes:
>In article <491@necis.UUCP> geo@necis.UUCP (George Aguiar ext. 219) writes:
>>The V-70 looks like a hell-of-a-chip.  Some other misc. hype:
>>	- 32 General Registers (all 32 bits, of course)
>>	- Symmetric Instruction Set
>>	- 20 Addressing Modes

Actually there are 21 addressing modes for addressing bytes.  Eighteen
of these addressing modes can also be used for addressing bits.

>>	- Variable Byte Length Format
>
>Ummm, not to rain on your parade or anything - but I have real problems
>with the last two.  20 addressing modes?  That's a lot of logic.  And
>I'll bet they support stuff like embedded displacements in the
>instruction stream (I'm not talking about 4-bit constants, I'm talking
>about things like National and Motorola do with the top bits of their
>byte, word, and long word displacements).  That can cost you - you
>might not know up front how long the instruction is, so your decoder
>might start decoding the previous instruction's displacement(s).

The V60/V70 instruction set uses orthogonal encodings.  From the first
byte the decoder can determine the number of operands and where the
operand encodings begin.  The operands themselves are encoded separately
(orthogonally) and may vary considerably in size.  No doubt this
increases the complexity of the decoder, but it also improves code
density (the importance of this is not primarily to save memory space,
but so that it is not necessary to fetch as much code).

As an example, consider the following COMPILED C code for the V60/V70:

	          _______________________________________________
	          | struct sttyp {
	          |     unsigned first, second, third;
	          |     struct sttyp *fourth;
	C code:   |     double fifth, a, b, c, d, e; } *stru;
	          |
	          | stru->fourth->fourth->third = 2;
	          |==============================================
	          | mov.w  _stru+0xc,r0      # &(stru->fourth) in R0
	Assembly  |
	Code:     | mov.w  #2,0x8[0xc[r0]]   # stru->fourth->
	          |                          #   fourth->third = 2;
	          |______________________________________________

No doubt about it, the V70 is a complex chip; it is also fast.  It packs
in a great deal of functionality to provide high performance at a
reasonable cost in a real system.

>It looks like the 29K may have made some smart moves....

It depends on its objectives.  The 29K requires two separate paths to
memory, one for code and another for data.  The memory must be extremely
fast (read: expensive) to service the CPU without wait states.  It also
expects some specialized bus monitoring hardware in the memory system:

	>From: tim@amdcad.AMD.COM (Tim Olson)
	>Subject: Re: AM29000 memory management (was flame)
	>The "best" place for the referenced and changed bits, however,
	>are in an external memory array, which "watches" the bus and
	>automatically updates the R & C bits.  This array can also be
	>read from or written to via I/O space to read or clear the bits.

Also, taking advantage of the AMD RISC-style architecture places some
uncomfortable demands on compiler developers.

I'm not knocking the AMD part.  It is an interesting processor and I'll
be interested in seeing what it does in a real system, but I'll also be
interested in seeing what such a system costs.

If system cost is of no concern to you, then disregard my comments; but
if a good cost/performance ratio seems important to you (not to mention
V30 software compatibility), I suggest that you take another look at the
V70 and the V60.
bcase@amdcad.AMD.COM (Brian Case) (05/06/87)
In article <4016@necntc.NEC.COM> pec@necntc.UUCP (Paul Cohen) writes:
>In article <1157@cottage.WISC.EDU> lm@cottage.WISC.EDU (Larry McVoy) writes:
>
>No doubt about it, the V70 is a complex chip; it is also fast.  It packs
>in a great deal of functionality to provide high performance at a
>reasonable cost in a real system.
>
>>It looks like the 29K may have made some smart moves....
>
>It depends on its objectives.  The 29K requires two separate paths to
>memory, one for code and another for data.  The memory must be
>extremely fast (read: expensive) to service the CPU without wait
>states.  It also expects some specialized bus monitoring hardware in
>the memory system.
>
>Also, taking advantage of the AMD RISC-style architecture places some
>uncomfortable demands on compiler developers.
>
>I'm not knocking the AMD part.  It is an interesting processor and I'll
>be interested in seeing what it does in a real system, but I'll also be
>interested in seeing what such a system costs.
>
>If system cost is of no concern to you, then disregard my comments; but
>if a good cost/performance ratio seems important to you (not to mention
>V30 software compatibility), I suggest that you take another look at
>the V70 and the V60.

First, the memory system for the Am29000 may be as simple as VideoDRAM.
These VDRAMs are, I believe, only marginally more expensive than regular
DRAMs and allow the Am29000 to deliver a fair fraction of its maximum
performance.  I know of one potential customer who is simulating the
Am29000 with VDRAMs and is quite satisfied with the results (frankly, I
was very surprised at the performance, but this may be an isolated
case).  Let's face it: you can try lots of stuff with instruction set
encoding, pipelining tricks, etc. etc., but in the end, the performance
of the CPU comes down to that of the memory hierarchy.  As the designers
of the Am29000, we recognized this fact and did what we could to *solve*
the problem instead of trying our best to *hide* the problem in
highly-encoded instruction formats.  To get the best performance from
the Am29000 probably *does* require an expensive memory system; we look
at it this way: at least the Am29000 gives the system designer a
*chance* to get superior performance.  We feel we have given the
designer the "Max Headroom."  :-) :-)

Second, the Am29000 does not "expect" some sophisticated bus monitoring
hardware.  Maintaining referenced and modified bits in hardware
associated with the memory arrays is what we consider the *best* way;
since the TLB reload routines for the Am29000 can be tailored to
specific needs, it is, of course, quite possible to maintain this
information by software means.  However, there is a performance cost
associated with that.  Even if the TLB reload is done by "hardware"
(really microcode or some state machine) on the CPU chip, there is a
performance cost.  Referenced and modified bits in hardware next to the
memory arrays are probably the best for multiprocessor systems too.  But
there are *lots* of specific tradeoffs to make for a particular system;
again, we feel that we have given the "Max Headroom" since a designer
may choose to maintain referenced and modified information wherever he
(she?) chooses.  When the TLB reload and other VM tasks are done by
fixed routines/state machines in "hardware", there can be problems.

Thirdly, I don't know what architectural features of the Am29000 are
considered to place uncomfortable demands on compiler writers.
Overlapped loads and stores are there to be taken advantage of (and have
been demonstrated, by one customer on one graphics benchmark, to be
worth nearly a factor of two in performance, though I don't think this
will be the case most of the time) if possible; the Am29000 interlocks
to insure correct operation when full overlap isn't possible.  Delayed
branches must be dealt with by software constructors (be they human
beings or compilers), but this is not a big deal (in fact it is, I
believe, one of the simplest optimizations to perform).  For the
Am29000, using the local register file as a stack cache can make
register allocation easy.  Three-address register-register instructions
and the load/store architecture make code generation easy.  The kinds of
optimizations that are important for reaping maximum performance from
the Am29000 are the same ones that are important for reaping maximum
performance from any architecture: loop optimizations, common
subexpression elimination, induction variable elimination, strength
reductions, etc. etc.  We believe that the Am29000 makes these
optimizations *easier*, not more difficult.  I believe that most of the
members of the compiler-writing and architecture community would agree
that a simple architecture with a predictable cost for instructions (in
both time and space) is the best match for automatic code generation.
I wouldn't mind if some of you in the compiler-writing and architecture
community (and OS community too, sorry John) would come to my aid.

I am not trying to say anything bad about the V70.  I just want to set
the record straight about the Am29000.

	bcase
mash@mips.UUCP (John Mashey) (05/07/87)
In article <16561@amdcad.AMD.COM> bcase@amdcad.UUCP (Brian Case) writes:
>In article <4016@necntc.NEC.COM> pec@necntc.UUCP (Paul Cohen) writes:
>>It depends on its objectives.  The 29K requires two separate paths to
>>memory, one for code and another for data.  The memory must be
>>extremely fast (read: expensive) to service the CPU without wait states.

2 paths are usually better than one, as any chips will discover when
they keep pushing clock rates.

>case).  Let's face it: you can try lots of stuff with instruction set
>encoding, pipelining tricks, etc. etc., but in the end, the performance
>of the CPU comes down to that of the memory hierarchy.

Agreed.  I did have one question: what kind of write buffers do the AMD
simulations use [i.e., how deep], and what kind of % hit is there for:
	a) write stalls [write buffer full]
	b) read/write memory conflicts [i.e., I was already doing a
	   write, and a read comes along that's a cache miss].
If I missed this info published somewhere, just point us at it.

>optimizations *easier*, not more difficult.  I believe that most of
>the members of the compiler-writing and architecture community would
>agree that a simple architecture with a predictable cost for instructions
>(in both time and space) is the best match for automatic code generation.

100%.  We may disagree on other issues, but not this one.

[Following on the earlier comments on V70 addressing modes:] If somebody
says "20 addressing modes are good", then to be convincing, they'd
better be able to show the tradeoffs, and show us the dynamic and static
usage of those things in real compiled code of substantial size.  They
may be worth it, or they may not, but there is substantial data that
says that complex addressing modes just aren't used very much.  Perhaps
this is an exception....
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{decvax,ucbvax,ihnp4}!decwrl!mips!mash, DDD: 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
pec@necntc.NEC.COM (Paul Cohen) (05/07/87)
In article <372@winchester.UUCP> mash@winchester.UUCP (John Mashey) writes:
>In article <16561@amdcad.AMD.COM> bcase@amdcad.UUCP (Brian Case) writes:
>>In article <4016@necntc.NEC.COM> pec@necntc.UUCP (Paul Cohen) writes:
>>>It depends on its objectives.  The 29K requires two separate paths to
>>>memory, one for code and another for data.  The memory must be
>>>extremely fast (read: expensive) to service the CPU without wait states.
>
>2 paths are usually better than one, as any chips will discover when
>they keep pushing clock rates.

I very much agree if you mean better performance.  Do you also mean
better system cost?

>>optimizations *easier*, not more difficult.  I believe that most of
>>the members of the compiler-writing and architecture community would
>>agree that a simple architecture with a predictable cost for instructions
>>(in both time and space) is the best match for automatic code generation.

There is no doubt that having fewer options makes decisions easier.  I
agree that comments from compiler writers would be welcome here.

>If somebody says "20 addressing modes are good", then to be convincing,
>they'd better be able to show the tradeoffs, and show us the dynamic and
>static usage of those things in real compiled code of substantial size.

A quibble here: why is the size of the code important?  I don't
understand why the size of the code has any bearing on the use of
addressing modes, though I do see that it would have some bearing on the
difficulty of determining usage statistics.  Even if there is some
connection, I suspect there are many machines in active use that spend
most of their time executing fairly small programs.

The C compiler that is currently available for the V60/V70 is based on
the Unix Portable C Compiler.  This is admittedly not the best (in terms
of performance, disregarding cost) compiler technology around (there are
two other C compilers under development for the V60/V70 by U.S. compiler
companies), but this compiler does use all of the available addressing
modes (and all of the V60/V70 non-privileged instructions except for
some of the string instructions).

I wish that I had the time to do a study of the sort suggested (though
probably any results that I would get would be suspected of bias).  One
question to ponder in this regard: suppose only 15 (or even only 5) of
the addressing modes were found to be extensively used by SOME compiler.
Could you conclude that you would be better off with only one or two
addressing modes?

If anyone (preferably someone with no axe to grind) would like to
volunteer to do some research of this kind on V60/V70 code, I'd be more
than happy to cooperate.

On another note, in response to an earlier posting, I've had numerous
requests for help in getting documentation on the V60/V70.  I earlier
posted the telephone numbers:

	1-800-NEC-ELEC  (California)
	1-800-NEC-ELE1  (Outside California, during California working
			 hours)

It did not occur to me that I would get requests from Europe as well.
If that is your general location, a better number would be:

	0049-211-6503-333  (Dusseldorf)
alexande@drivax.UUCP (Mark Alexander) (05/07/87)
In article <491@necis.UUCP> geo@necis.UUCP (George Aguiar ext. 219) writes:
>I have some preliminary specs on the V70...
>It's not iAPX386 compatible...

It does, however, have a virtual 8086 mode, kinda-sorta like the 386,
but a little bit easier to work with (mainly because there are still a
LOT of registers left over after you take away those that are mapped to
8086 registers).
-- 
Mark Alexander	...{hplabs,seismo,sun,ihnp4}!amdahl!drivax!alexande
(This space intentionally left blank.)
mash@mips.UUCP (05/08/87)
In article <4070@necntc.NEC.COM> pec@necntc.UUCP (Paul Cohen) writes:
>In article <372@winchester.UUCP> mash@winchester.UUCP (John Mashey) writes:
>>In article <16561@amdcad.AMD.COM> bcase@amdcad.UUCP (Brian Case) writes:
>>>In article <4016@necntc.NEC.COM> pec@necntc.UUCP (Paul Cohen) writes:
>
>>2 paths are usually better than one, as any chips will discover when
>>they keep pushing clock rates.
>
>I very much agree if you mean better performance.  Do you also mean
>better system cost?

Depends; I was talking about performance, and actually, I meant separate
I & D caches.

>>If somebody says "20 addressing modes are good", then to be convincing,
>>they'd better be able to show the tradeoffs, and show us the dynamic and
>>static usage of those things in real compiled code of substantial size.
>
>A quibble here: why is the size of the code important?

All that I meant was: real programs, not toys, and not synthetic
benchmarks.

>I wish that I had the time to do a study of the sort suggested (though
>probably any results that I would get would be suspected of bias).
>One question to ponder in this regard: suppose only 15 (or even only 5)
>of the addressing modes were found to be extensively used by SOME compiler.
>Could you conclude that you would be better off with only one or two
>addressing modes?

As usual, it depends.  Maybe one discovers that no compiler uses all of
the modes, but all the modes are used substantially by something, or
that the unused modes cost more to omit than to include [by the time one
has included the heavily-used ones].  The point is that to show that
"lots of modes" is a good thing, it is not sufficient to show one
example of how a compiler might use them.  What's really needed is a
good tradeoff analysis: it's often hard, after the fact, to know what
the modes cost in terms of cycle time.  What can be measured is the
usage frequency of the modes; this is valuable information, and it is
how we make progress in the architecture area.
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{decvax,ucbvax,ihnp4}!decwrl!mips!mash, DDD: 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
preece@ccvaxa.UUCP (05/11/87)
mash@mips.UUCP:
> If somebody says "20 addressing modes are good", then to be convincing,
> they'd better be able to show the tradeoffs, and show us the dynamic
> and static usage of those things in real compiled code of substantial
> size.  They may be worth it, or they may not, but there is substantial
> data that says that complex addressing modes just aren't used very
> much.  Perhaps this is an exception....
----------
I find it a little amusing that the same people who say "complex feature
X just isn't used very much" tend to be the same people who say "not to
worry, a sufficiently clever compiler will take care of our chip's need
for X".  If compilers can be made smart enough to handle some of the
special things that RISCs need, they could be made smart enough to make
better use of the complex features in CISCs.

The point isn't that RISCs make certain optimizations easier or harder,
but that they make certain optimizations NECESSARY.  Compilers smart
enough to use some of the special features of CISCs haven't been
sufficiently necessary -- they work "well enough" using simple
instruction sequences.  My impression from the literature is that RISCs
demand more compiler optimization to reach the performance that is
expected of them than CISCs do.  Perhaps that simply means we have
higher expectations of them; perhaps it simply means that baseline
compiler performance is better than it used to be and those expectations
are reasonable.  Whatever.
-- 
scott preece
gould/csd - urbana
uucp:	ihnp4!uiucdcs!ccvaxa!preece
arpa:	preece@gswd-vms
baum@apple.UUCP (05/12/87)
--------
[]
>....  Since RISC programs will tend to be substantially
>larger than CISC programs, a RISC system will need more memory than
>a CISC system.
>
>(Disclaimer: I, of course, have no evidence to back up this theory.)
>
>-- Chuck

I have evidence in both directions.  The IBM 801 folks have said that
the 801 code size was about 20% larger than IBM/370 code.  This is
probably smaller than the difference between two different compilers.
An I-cache that is effectively 20% smaller will not significantly affect
your performance (e.g. the delta will be < 20%).

On the other hand, the AT&T CRISP folks claim code size which is equal
to or smaller than VAX code, which is known to be fairly compact.

So much for the canard that RISC code is huge.
--
{decwrl,hplabs,ihnp4}!nsc!apple!baum		(408)973-3385
lm@cottage.WISC.EDU (Larry McVoy) (05/12/87)
In article <28200037@ccvaxa> preece@ccvaxa.UUCP writes:
>I find it a little amusing that the same people who say "complex
>feature X just isn't used very much" tend to be the same people who
>say "not to worry, a sufficiently clever compiler will take care of
>our chip's need for X".  If compilers can be made smart enough to
>handle some of the special things that RISCs need, they could be
>made smart enough to make better use of the complex features in
>CISCs.

This is hogwash.  As someone who has a certain amount of compiler
experience, I can say that a RISC compiler is likely to be much less
intelligent than a CISC one.  The reason (hold the flames a bit, OK?) is
that I have yet to see an orthogonal CISC machine.  The 32000 series is
the closest; the VAX and 68000 don't come very close.

The problem is this: you're generating code for a particular action,
right?  And this special instruction looks like just the ticket.  And
then (or maybe three months later) you realize that you needed a signed
displacement and they give you an unsigned displacement.  Or something
similar.  So you end up going in and generating the code in the "stupid"
straightforward manner.  Or - maybe you're really dedicated and you add
another 200 lines of code to the compiler to catch this special case.
And in another week you find out...

The problem can be summarized as follows: provide 0, 1 or infinity.  No
exceptions.  CISC is trying to approximate infinity.  As it gets closer
the chip gets slower.  The infinity choice is clearly wrong.  Admit it.

Larry McVoy	        lm@cottage.wisc.edu  or  uwvax!mcvoy
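[Aside: the "special instruction almost fits" problem can be made
concrete with a hypothetical code-generator fragment in C.  The
mnemonics, the 8-bit unsigned displacement field, and the emit routine
below are all invented for illustration; they are not taken from any
real compiler or chip.]

	#include <stdio.h>

	#define FANCY_DISP_MAX 255	/* assumed: unsigned 8-bit displacement */

	/* Emit "load register from base+displacement".  The clever
	 * single-instruction form only applies when the displacement is
	 * non-negative and small; everything else falls back to the
	 * "stupid" straightforward sequence described above. */
	void emit_load(int reg, int base_reg, long disp)
	{
	    if (disp >= 0 && disp <= FANCY_DISP_MAX) {
	        /* the one case the special addressing mode covers */
	        printf("\tload.d8  r%d,%ld(r%d)\n", reg, disp, base_reg);
	    } else {
	        /* signed or large displacement: compute the address the
	         * long way with simple instructions */
	        printf("\tmovi     r%d,%ld\n", reg, disp);
	        printf("\tadd      r%d,r%d,r%d\n", reg, reg, base_reg);
	        printf("\tload     r%d,0(r%d)\n", reg, reg);
	    }
	}

Every such exception either costs the compiler another special case or
costs the generated code the benefit of the mode - which is exactly the
tradeoff being argued here.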
tim@amdcad.AMD.COM (Tim Olson) (05/12/87)
In article <28200037@ccvaxa> preece@ccvaxa.UUCP writes:
+-----
| The point isn't that RISCs make certain optimizations easier or harder,
| but that they make certain optimizations NECESSARY. ...
+-----

No, the point is that RISCs make certain optimizations *POSSIBLE* -- by
using only simple, single-cycle instructions, optimization opportunities
are uncovered which may not be available with complex, multi-cycle
instructions -- especially in loops.

+-----
| ... Compilers smart
| enough to use some of the special features of CISCs haven't been
| sufficiently necessary -- they work "well enough" using simple
| instruction sequences. ...
+-----

And those simple instruction sequences allow higher levels of one of the
most beneficial optimizations -- code motion out of loops.

	-- Tim Olson
	Advanced Micro Devices
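[Aside: a small (and contrived) C illustration of the code-motion point.
The two functions below are a sketch invented here, not anything from
AMD.  When an access is expressed as simple steps, the loop-invariant
parts become visible and can be hoisted out of the loop; if the whole
access were one complex, multi-cycle operation, the invariant address
arithmetic would be redone on every iteration.]

	struct rec { int *data; int scale; };

	/* Straightforward form: the address arithmetic p->data + offset
	 * and the load of p->scale are hidden inside each array access
	 * and conceptually redone on every trip around the loop. */
	int sum_naive(const struct rec *p, int offset, int n)
	{
	    int s = 0;
	    for (int i = 0; i < n; i++)
	        s += p->data[offset + i] * p->scale;
	    return s;
	}

	/* Decomposed form: the invariant pieces are explicit, so they are
	 * computed once and kept in registers across the loop - the kind
	 * of opportunity that simple single-cycle instructions expose. */
	int sum_hoisted(const struct rec *p, int offset, int n)
	{
	    int *base  = p->data + offset;	/* hoisted: computed once */
	    int  scale = p->scale;		/* hoisted: loaded once   */
	    int  s = 0;
	    for (int i = 0; i < n; i++)
	        s += base[i] * scale;
	    return s;
	}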
aeusesef@csun.UUCP (Sean Eric Fagan) (05/14/87)
In article <277@astroatc.UUCP> johnw@astroatc.UUCP (John F. Wardale) writes:
>Now the trend is to build simpler, faster REGULAR machines, and the only
>thing that falls into the category of "fashionability" is the terms
>"RISC" and "CISC".  Seymore (sp?) made a RISCy machine called the
>"Cray-1" before anyone started to use the term "RISC".
>
>A wonderful appearing "middle ground" part is the "Clipper", which is a
>basic RISC machine with an additional set of "macro" instructions.
>(So, we can't do MULT or DIV in one clock, we'll give you a MULT and DIV
>macro in "hardware".  I like this approach, but then I'm firmly rooted
>in the RISC camp.)
>
>Name:	John F. Wardale
>UUCP:	...{seismo | harvard | ihnp4}!uwvax!astroatc!johnw
>arpa:	astroatc!johnw@rsch.wisc.edu

Not to be nitpicky or anything, but old Seymore did a RISC long before
the Cray-1.  Ever hear of the CDC Cyber line?  The 6600, 7600, 170
lines?  All RISC, with a grand total of about, oh, 70+ instructions.
Two instructions to directly access memory, and a few instructions to
indirectly access memory.  Execution of non-divide, non-context-save
instructions takes less than 10 clock cycles, generally under 5.
Double precision multiply takes about 5 clock cycles, I believe (worst
case).  Single precision is the same, only it returns the low half.  (I
should mention that double precision on a Cyber is 120 bits.)  Very fast
floating point, slightly slower integer (yeah, slower), very nice
instruction set.  Lousy operating system, though, but that is not
Seymore's fault.

If you ever get the hardware reference manuals for both the Cray-1 and
the Cyber 7600 (or a 170 model), you will notice that they are
*extremely* similar, almost the same except for small things like word
size, lack of a divide instruction on the Cray, no vectors on the Cyber,
etc.

Sorry, but I tend to ramble about the Cyber, especially when people
bring up the Cray.  If I misspelled Mr. Cray's first name, please
forgive me...

-----
Sean Eric Fagan                 Office of Computing/Communications Resources
(213) 852 5086                  Suite 2600
AGTLSEF@CALSTATE.BITNET         5670 Wilshire Boulevard
                                Los Angeles, CA 90036
{litvax, rdlvax, psivax, hplabs, ihnp4}!csun!{aeusesef,titan!eectrsef}
--------------------------------------------------------------------------------
My employers do not endorse my  | "I may be slow, but I'm not stupid.
opinions, and, at least in my   |  I can count up to five *real* good."
preference of Unix, heartily    |                  The Great Skeeve
disagree.                       |                  (Robert Asprin)
baum@apple.UUCP (Allen J. Baum) (05/14/87)
--------
[]
>(Has anyone ever made a machine that pre-fetches I-cache lines from memory?)
I believe that the big Amdahl machines can do that (conditioned on some strange
status bit somewhere), as can the Fairchild Clipper.
--
{decwrl,hplabs,ihnp4}!nsc!apple!baum (408)973-3385
henry@utzoo.UUCP (Henry Spencer) (05/14/87)
> ...On the other hand, the AT&T CRISP folks claim code size which is
> equal to or smaller than VAX code, which is known to be fairly
> compact...

Not everyone agrees that VAX code is "fairly compact"...
-- 
"The average nutritional value    Henry Spencer @ U of Toronto Zoology
of promises is roughly zero."     {allegra,ihnp4,decvax,pyramid}!utzoo!henry
brucek@hpsrla.HP.COM (Bruce Kleinman) (05/15/87)
+-----
| Any operation can be done faster if implemented at a 'lower' level in
| the machine.
+-----

Completely true.  The problem is that there is only so much 'lower'
level.  Chip real estate isn't unlimited, at least it wasn't last time I
checked.

CISC chips frequently offer hundreds of instructions with a dozen or
more addressing modes.  This usually necessitates the use of one,
sometimes two, levels of microcode.  The richness/complexity of the
instruction set requires a mass of pipeline interlock logic.  CISC CPUs
tend to be, uh, rather complex.  And, therefore, the CPU usually
dominates the chip.

RISC chips generally offer a hundred or fewer instructions with a few
addressing modes.  This usually allows the instruction set to be
hardwired.  The orthogonal nature of the instruction set requires very
little pipeline interlock logic.  RISC CPUs tend to be rather simple.
And, therefore, the CPU is usually a small portion of the chip.

The CISC advocate interprets the quote at the top of this note and says,
"Putting 'register indirect with base + offset' addressing in the
instruction set is a win, because my chip will be able to do it with a
single instruction -- which will make my chip really fast."  This
approach buys you some very useful operations at the expense of real
estate.  Your instruction set soon becomes less orthogonal, more
exceptions are introduced, and you have to handle them in the pipeline
logic.  Pretty soon you are out of real estate, because you have a big
chunk or two of microcode, a large area for decode, and a gate array's
worth of logic to glue your pipeline together....

The RISC advocate interprets the quote and says, "Leaving 'register
indirect with base + offset' addressing out of the instruction set is a
win, because I will be able to hardwire the CPU -- which will make my
chip really fast."  This approach buys you a very orthogonal instruction
set, while using up relatively little real estate.  Your CPU is
hardwired, most instructions execute in a single cycle, and your
pipeline can be balanced more easily.  And you've got a bunch of real
estate left over....

Who wins?  My (completely unbiased) answer: the RISC chip wins, because
the extra real estate can be used for a massive register file, or an
on-chip floating point unit, or a *real* cache (i.e. one bigger than 256
bytes), etc, etc.

Yes, any operation can be done faster if implemented at a 'lower' level
in the machine.  And a RISC chip leaves you a lot of 'lower' level to
work with after the CPU is complete.  And I'll take 192 registers or a
4K cache over 'register indirect with base + offset' any day.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Bruce Kleinman
Hewlett Packard -- Network Measurements Division
Santa Rosa, California
	....hplabs!hpsrla!brucek
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
steve@edm.UUCP (05/15/87)
In article <754@apple.UUCP>, baum@apple.UUCP (Allen J. Baum) writes:
> --------
> []
> >....  Since RISC programs will tend to be substantially
> >larger than CISC programs, a RISC system will need more memory than
> >a CISC system.
>
> the 801 code size was about 20% larger than IBM/370 code.
> ... So much for the canard that RISC code is huge.

One thing to consider is that often the DATA space is larger than the
I-space (this, of course, may not apply to the kernel).  I don't know
just how widespread this is, but if it is relatively common (I would
guess it is especially common with number-crunching type applications,
where RISC speeds are so useful), then the expanded code size of RISC
may not be all that expensive.
-- 
-------------
 Stephen Samuel			Disclaimer: You betcha!
  {ihnp4,ubc-vision,seismo!mnetor,vax135}!alberta!edm!steve
meissner@dg_rtp.UUCP (05/17/87)
> In article <277@astroatc.UUCP> johnw@astroatc.UUCP (John F. Wardale) writes:
> >Now the trend is to build simpler, faster REGULAR machines, and the only
> >thing that falls into the category of "fashionability" is the terms
> >"RISC" and "CISC".  Seymore (sp?) made a RISCy machine called the
> >"Cray-1" before anyone started to use the term "RISC" ...

And in article <609@csun.UUCP> aeusesef@csun.UUCP (Sean Eric Fagan) replies:
>
> Not to be nitpicky or anything, but old Seymore did a RISC long before
> the Cray-1.  Ever hear of the CDC Cyber line?  The 6600, 7600, 170
> lines?  All RISC, with a grand total of about, oh, 70+ instructions.
> Two instructions to directly access memory, and a few instructions to
> indirectly access memory. ....
> Very fast floating point, slightly slower integer (yeah, slower), very
> nice instruction set.

While the 6600 and 7600 lines had a sparse number of instructions, I
doubt whether they would qualify as RISC machines (I don't know about
the CRAY).

In the first place, the main thing about ALL of the RISCs that are true
RISCs is that EVERY instruction takes one cycle.  No exceptions.  The
CDC machine instructions could take multiple cycles (divide in
particular).  One of the things that highlighted the machines was
multiple parallel functional units (i.e., you would typically fire off a
divide, and then multiplies, each with different accumulators, and as
long as you did not issue an instruction using that accumulator until
the unit was done, you could proceed and do something else).

Another thing that the RISC philosophy has come to mean is regularity
(i.e., no special-purpose accumulators; any register can be the target
or source(s) of any instruction).  I dare you to call this machine
regular.  To load from main memory, you stored the address you wanted
into an A register (A1-A5 only), and in a few cycles the value would
appear in the corresponding X register.  To store a value, you would
store the address into an A register (A6-A7 only), and the corresponding
X register would be stored.  Thus the X registers were special purpose
(X1-X5 could read from memory, X6-X7 could write to memory, X0 was
scratch).  The A registers were tied to the corresponding X registers,
and were only 18 bits wide.  The B registers were also 18 bits wide,
with B0 hardwired to 0 (you could store into B0, but the machine would
ignore it).

Given the parallel functional units, the machine kept track of which
accumulator was in use, and would pend any instruction that referenced
it until the functional unit was done (unlike some recent machines,
where the compiler has to do the scheduling itself because the hardware
interlock was removed for speed).  The Fortran compiler (FTN) would
attempt to keep the different functional units busy.

Also, given that the machine word size (60 bits) was much larger than
the instruction size (mostly 15 or 30 bits), any instruction that was
the target of a branch had to begin on a word boundary.  Because of the
parallel functional units and up to 4 instructions per word, faults were
only approximate (you knew which word faulted, but not which instruction
in the word).  It also became a game among assembler hackers to concoct
sequences that would raise three different faults (overflow, underflow,
etc.) in the same machine word.

It was an interesting machine.
-- 
Michael Meissner, Data General		Uucp: ...mcnc!rti!dg_rtp!meissner

It is 11pm, do you know what your sendmail and uucico are doing?
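[Aside: the A/X register coupling described above is easy to misread, so
here is a toy model of just that convention in C.  It is a sketch
reconstructed from the paragraph above, not CDC documentation; the
memory size and types are arbitrary, and the real machine's timing and
functional-unit behavior are ignored.]

	#include <stdint.h>

	#define MEMWORDS 4096

	static uint64_t mem[MEMWORDS];	/* 60-bit words, modeled as 64-bit */
	static uint64_t X[8];		/* operand (X) registers */
	static uint32_t A[8];		/* 18-bit address (A) registers */

	/* Setting an A register is what triggers memory traffic:
	 * A1-A5 cause a load into the matching X register,
	 * A6-A7 cause a store from the matching X register,
	 * A0/X0 are scratch and touch no memory. */
	void set_A(int i, uint32_t addr)
	{
	    if (i < 0 || i > 7)
	        return;
	    A[i] = addr & 0x3ffff;		/* A registers are 18 bits */
	    if (i >= 1 && i <= 5)
	        X[i] = mem[A[i] % MEMWORDS];	/* implicit load  */
	    else if (i == 6 || i == 7)
	        mem[A[i] % MEMWORDS] = X[i];	/* implicit store */
	}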
baum@apple.UUCP (Allen J. Baum) (05/18/87)
--------
[]
>In the first place, the main thing about ALL of the RISCs that are true
>RISCs is that EVERY instruction takes one cycle.  No exceptions.
>Another thing that the RISC philosophy has come to mean is regularity
>(i.e., no special-purpose accumulators; any register can be the target
>or source(s) of any instruction).

I hate to disappoint you, but I can't think of any RISC machines that
meet these (rather ad hoc) criteria.  Most of the RISC processors out
there have multi-cycle instructions, including the original RISC I,
which had a two-cycle load, and most of them have a register which is a
hard-wired zero.  The HP Spectrum has at least two instructions with
hardwired register destinations.

The definition of 'RISC' is in the mind of the beholder.  There is no
'agreed' definition of what a RISC processor is, except that maybe you
know one when you see one.....
--
{decwrl,hplabs,ihnp4}!nsc!apple!baum		(408)973-3385
jps@apollo.uucp (Jeffrey P. Snover) (05/19/87)
>The definition of 'RISC' is in the mind of the beholder.  There is no
>'agreed' definition of what a RISC processor is, except that maybe you
>know one when you see one.....

It seems to me that the progression of things goes something like this:

	- Come up with a new angle on things.

	- Pick a flashy aspect and name the set of ideas after it
	  (RISC is easier to say than OIPC [One Instruction Per Cycle]
	  and flashier than NM [No Microcode]).

	- Evolve the concept, expanding on the good ideas and passing
	  on the not-so-good ideas.

	- Spend years complaining when the evolved idea doesn't conform
	  to the flashy name.
		o the idea was bast***ized
		o that's not a *REAL* xxxx
		o the damn trilateral commission is at it again!
mash@mips.UUCP (John Mashey) (05/21/87)
I think this has mostly been answered by other posters, so I've missed
most of the discussion, having been off in New Zealand.  [P.S., if you
ever get a chance to attend the N.Z. UNIX conference, DO IT!  Well run,
great bunch of people, lots of fun, wonderful sightseeing, sheep
jokes....]

However, let me add a few notes on the above comments, plus a few more
examples not already given by the other posters on this topic.

In article <28200037@ccvaxa> preece@ccvaxa.UUCP writes:
> mash@mips.UUCP:
>> If somebody says "20 addressing modes are good", to be convincing,....
>----------
>I find it a little amusing that the same people who say "complex
>feature X just isn't used very much" tend to be the same people who
>say "not to worry, a sufficiently clever compiler will take care of
>our chip's need for X".  If compilers can be made smart enough to
>handle some of the special things that RISCs need, they could be
>made smart enough to make better use of the complex features in
>CISCs.

Good compiler technology is always useful; it's merely more useful on
some machines than on others.  Let's assume that one can have the same
compilers on a variety of machines [both Stanford and IBM have done
this].  The point is that good optimizing compilers change the
statistics of what's going on, at least somewhat, on ANY machine, and if
the statistics say that such compilers greatly lessen the use of a
feature, you might think of eliminating the feature entirely, if there
is a nonzero cost for it, and if the compilers have reasonable
alternatives.

Let's go back to the example that started all this, which was me
claiming that "lots of addressing modes" needed justification as a good
feature.  [This was NOT a statement that lots of modes was necessarily
bad, merely that it needed to be justified, because the published data
seemed not to justify it.  BTW, does anybody have some dynamic
statistics on the multi-level indirect modes?  What we've got so far is
mostly static counts, which can be misleading.]

For example, consider code like:

	if (a->b && a->b->c) ...	[not uncommon]

Suppose you have a machine that has all the indirect modes.  If you have
a non-optimizing compiler, but you special-case it to pick up the
indirect modes, you can use them [and as somebody has pointed out, on
some machines there may be an implicit reference thru a frame pointer to
get to a, and I'll assume that].  Depending on the machine, this could
get you something like:

	fetch a->b	1 memory ref to offset.of.a + (fp)
			1 memory reference to offset.of.b + above
	test		branch around if zero
	fetch a->b->c	1 memory ref to offset.of.a + (fp)
			1 memory reference to offset.of.b + above
			1 memory reference to offset.of.c + above
	test		....

Suppose you have an optimizing compiler, which will surely do common
subexpression elimination and serious register allocation, or it has no
business calling itself an optimizer.  What would it do?

	fetch a into r1	1 memory ref to offset.of.a + (fp)
	fetch b into r2	1 memory reference to offset.of.b + (r1)
	test r2		branch around if zero
	fetch c		1 memory reference to offset.of.c + (r2)
	test		...

There are all sorts of variants, depending on the machine, and of
course, it's quite possible that the optimizer might have decided "a"
was a good thing to have in a register long before anyway, and amortized
the cost of getting it there over several references.

The point is that the first example has 5 address specifiers, and the
second one has 3, and if the optimizer is at all lucky, it moved the
first fetch away from the second one and got to re-use the value.  On
most machines I've seen, the second case will go faster than the first,
so what's happened is that some good machine-independent optimizations
have reduced the utility of the specific machine feature [multi-level
indirect addressing].

It's not that compilers can't be smart enough to take advantage of
special features [I've done some ferocious hacking on compilers to do
just that: once you have a machine, you do whatever you can!], but that
given good optimizers, some features are of less use than others,
because the optimizers change the statistics.  At that point, you can
make reasoned tradeoffs, but it's hard to do without a good
understanding of what's likely to be possible for the compilers to do.

>The point isn't that RISCs make certain optimizations easier or harder,
>but that they make certain optimizations NECESSARY.  Compilers smart
>enough to use some of the special features of CISCs haven't been
>sufficiently necessary -- they work "well enough" using simple
>instruction sequences.  My impression from the literature is that RISCs
>demand more compiler optimization to reach the performance that is
>expected of them than CISCs do.  Perhaps that simply means we have
>higher expectations of them; perhaps it simply means that baseline
>compiler performance is better than it used to be and those
>expectations are reasonable.  Whatever.

Optimizations are by definition NEVER necessary, only desirable.  We see
something like a 20% improvement from the more global optimizations,
which is well worthwhile, since that's adding a few Mips to the
performance, and some important cases sometimes get more.  Nevertheless,
the machines are still OK without this, and there's far less weird
machine-specific hackery than things I've seen done on many other
machines.
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{decvax,ucbvax,ihnp4}!decwrl!mips!mash, DDD: 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
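[Aside: restating the walkthrough above at the source level may help.
The C fragment below is a sketch of roughly what common subexpression
elimination plus register allocation amounts to for that example; it is
invented for illustration and is not output from, or a claim about, any
particular compiler.]

	struct node { struct node *b; int c; };

	/* What the non-optimizing compiler effectively executes: each
	 * arrow in the source turns into its own chain of memory
	 * references, so a->b is fetched twice (5 address specifiers). */
	int test_naive(struct node *a)
	{
	    if (a->b && a->b->c)
	        return 1;
	    return 0;
	}

	/* What CSE plus register allocation effectively does: a->b is
	 * fetched once into a register-resident temporary and reused
	 * (3 address specifiers), so multi-level indirect addressing
	 * has that much less left to do. */
	int test_optimized(struct node *a)
	{
	    struct node *t = a->b;	/* common subexpression, kept in a register */
	    if (t && t->c)
	        return 1;
	    return 0;
	}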