[comp.arch] Japanese 32-bit CPUs

ram@nucsrl.UUCP (05/03/87)

    The Japanese are tagging along with 32-bit machines:

    This is some info for the interested:

    1. From Electronics of Apr 16.

	Japan's first microprocessor with 32-bit data and address buses
	runs at a maximum rate of 6.6 MIPS - 4 MIPS is typical of 32-bit
	chips (Roger, Brian, John - smile here).  Fabricated in 1.5um CMOS,
	the NEC V70 has dynamic bus sizing that enables it to match
	input/output with 8-, 16-, and 32-bit buses.  Its TRON (we are entering
	the space age) OS will make it shine in real-time control and robot
	applications.  The 20MHz V70 also incorporates FP(!!) facilities
	on-chip and has a function-redundancy monitor for fault-tolerant
	computing.  The sample price of the device is $687.52.  Prices
	will be lower for production quantities.

    The spec on FP on-chip is rather confusing.  I doubt that a hardware
    FP unit exists on the chip.  Any clarifications?  Also, will it be
    compatible with the iAPX386 :-).

    2.  From the same magazine:

	Japan's NTT has the world's fastest Lisp processor.  

	The machine, called ELIS, is a dual-processor machine: one processor
	is a 68010 used as a front end, the other a micro-programmed Lisp processor.
	From the diagram, I see an FP accelerator included with the machine.
	Claimed performance: 1 M basic lisp instructions/sec.  [What the
	hell is a basic lisp instruction? (car '(a b c))? -artificialstone :-)]
	It has 128M of memory and a multiple-paradigm language, Tao (with
	flavors of Common Lisp, Prolog, and Smalltalk).  The micro-program
	control store is 64K of 64-bit words, plus a 1K register stack (of
	32-bit entries).  The 68010's software is written in C; the Lisp
	processor's is written in Tao.  I lost my data sheets/info on TI's
	Lisp processor, so I can't compare the two.  Anybody care to?

    3.  Mitsubishi's 32 bit mP for their TRON project.

	Some details

	    1. No TLB
	    2. RISCy (at least adhering to lean cycles, fixed format
	       simple instructions).
	    3. 1um CMOS
	    4. 25-33 MHz
	    5. 5 stage pipeline.
	    6. Instruction queue size - 16 bytes
	    7. 3 bus ALU
	    8. Branch prediction ideas seem to be modelled after
	       Hennessy's & A.J. Smith's work.
            9. "A high speed memory for saving contexts, can also
	       be incorporated".

    
    One caveat about complex CPUs.  [Notice how the big semiconductor
    manufacturers have started announcing like GM/FORD/CHRYSLER (the 1988 car
    in 1987, the 1989 car in 1988 - wonder what type of calendar they use) with
    the chip of the future today.]  Judging by AMD and NSC announcements, it 
    takes at least 1 yr for Si to be available after the initial announcement, and 
    given the complexity of these chips, it takes at least 1 more year for the bugs
    to be weeded out before the chip hardware becomes stable (386, 32X32). 
    Do we need more complex designs on a single wafer, or should we go for "small is 
    beautiful"? 

-------------------
Renu Raman				UUCP:...ihnp4!nucsrl!ram
1410 Chicago Ave., #505			ARPA:ram@eecs.nwu.edu 
Evanston  IL  60201			AT&T:(312)-869-4276               

pec@necntc.NEC.COM (Paul Cohen) (05/04/87)

>    This is some info for the interested:
>
>    1. From Electronics of Apr 16.
>
>	Japan's first microprocessor with 32-bit data and address buses
>	runs at a maximum rate of 6.6 MIPS - 4 MIPS is typical of 32-bit
>	chips (Roger, Brian, John - smile here).  Fabricated in 1.5um CMOS,
>	the NEC V70 has dynamic bus sizing that enables it to match
>	input/output with 8-, 16-, and 32-bit buses.  Its TRON (we are entering
>	the space age) OS will make it shine in real-time control and robot
>	applications.  The 20MHz V70 also incorporates FP(!!) facilities
>	on-chip and has a function-redundancy monitor for fault-tolerant
>	computing.  The sample price of the device is $687.52.  Prices
>	will be lower for production quantities.
>
>    The spec on FP on-chip is rather confusing.  I doubt that a hardware
>    FP unit exists on the chip.  Any clarifications?  Also, will it be
>    compatible with the iAPX386 :-).
>

The V70's floating point is implemented not as a separate unit, but 
with additional microcode.  There are some refinements of the ALU to
facilitate this, such as a 64 bit barrel shifter.  32 and 64-bit
IEEE standard basic arithmetic functions (add, subtract, multiply,
divide, convert) are supported on-chip.  A separate floating point unit
for 80 bit operations and specialized functions such as transcendental 
and matrix operations will be available in the near future.

The V70 is  * * * N O T * * * 386 compatible.  It has thirty-two general
purpose 32-bit registers, an orthogonal instruction set and generally 
a much cleaner architecture that is not hampered by decisions made years 
ago for 16-bit machines (I am talking about native mode V70: it does have 
an emulation mode so that it can execute V30 machine code (a superset of 
80186 code)).  

The V70 has an on-chip MMU so that virtual memory accesses can take
place in two clocks.  The processor also supports bit addressing and has
some interesting (bit, byte and half-word) string instructions.

The V70 is identical to the V60 except for the bus interface.  You can
get a V60 programmer's reference manual and a V70 datasheet by calling

		1-800-NEC-ELEC (California)
		1-800-NEC-ELE1 (Outside California)

These telephones are supported only during west coast working hours.

>    One caveat about complex CPUs.  [Notice how the big semiconductor
>    manufacturers have started announcing like GM/FORD/CHRYSLER (the 1988 car
>    in 1987, the 1989 car in 1988 - wonder what type of calendar they use) with
>    the chip of the future today.]  Judging by AMD and NSC announcements, it 
>    takes at least 1 yr for Si to be available after the initial announcement, and 
>    given the complexity of these chips, it takes at least 1 more year for the bugs
>    to be weeded out before the chip hardware becomes stable (386, 32X32). 
>    Do we need more complex designs on a single wafer, or should we go for "small is 
>    beautiful"? 
>

V70 silicon already exists and is fairly functional (after all, the V70
is not much different from the V60).  Customer samples of the V70 should
be available in the third quarter of this year.  If you are seriously 
interested I can arrange for V60 samples now.

mash@mips.UUCP (John Mashey) (05/05/87)

In article <3810030@nucsrl.UUCP> ram@nucsrl.UUCP (Renu Raman) writes:
>...    3.  Mitsubishi's 32 bit mP for their TRON project.
>	Some details
>	    1. No TLB
>	    2. RISCy (at least adhering to lean cycles, fixed format
>	       simple instructions)....

Re: TRON:
See IEEE Micro, April 1987; whole issue on TRON project; esp.
Ken Sakamura, "Architecture of the TRON VLSI CPU", 17-31.

I wouldn't call the TRON architecture RISCy [note that this isn't
saying good or bad, just that it tends to not be much like what most
people think are RISC machines]:

Fixed-format instructions: not exactly.  There are 16-bit instructions
(short form), and then there are arbitrary-sized ones that are multiples
of 16 bits.  From my reading, they appear to allow arbitrary cascading
of indirect addressing [like the Sperry 1100s, for example],
which has interesting implications for pipelining.  Thus, their addressing
appears more complex than a 68020's.

The architecture specifies a bunch of user-level instructions which
compilers will find it difficult to generate: reverse-byte-order,
search-for-zero-or-one, bitmap operations [not just bitfields, bit maps],
string operations [including search for substring!],
queue manipulation [insert, delete, search].
It also specs BCD operations.

Note: I mean no criticism of the design, but if you call it RISC,
then almost no machine is a CISC! In fact:

``What's a RISC?''
ANS: any machine announced since 1983.

[This is clearly true, we've even been reading lately that the Motorola
68030 really has a lot of features expected to be found only on RISC
machines.  In particular, "One of the most basic concepts
of RISC architectures is that of hardware support for instructions.  The
MC68020/MC68030, although not RISC processors, have an impressive amount of
on-chip hardware for special instructions." T. L. Johnson, "The RISC/CISC
Melting Pot", Byte, April 87.  Huh?  I always thought CPUs were there
to provide hardware support for instructions....sigh.]
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{decvax,ucbvax,ihnp4}!decwrl!mips!mash, DDD:  	408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

geo@necis.UUCP (05/05/87)

In article <3810030@nucsrl.UUCP> ram@nucsrl.UUCP (Renu Raman) writes:
>
>    The Japanese are tagging along with 32 bit machines:
>
>    This is some info for the interested:
>
>    1. From Electronics of Apr 16.
>
>	Japan's first microprocessor with 32-bit data and address buses
>	runs at a maximum rate of 6.6 MIPS - 4 MIPS is typical of 32-bit
>	chips (Roger, Brian, John - smile here).  Fabricated in 1.5um CMOS,
>	the NEC V70 has dynamic bus sizing that enables it to match
>	input/output with 8-, 16-, and 32-bit buses.  Its TRON (we are entering
>	the space age) OS will make it shine in real-time control and robot
>	applications.  The 20MHz V70 also incorporates FP(!!) facilities
>	on-chip and has a function-redundancy monitor for fault-tolerant
>	computing.  The sample price of the device is $687.52.  Prices
>	will be lower for production quantities.
>
>    The spec on FP on-chip is rather confusing.  I doubt that a hardware
>    FP unit exists on the chip.  Any clarifications?  Also, will it be
>    compatible with the iAPX386 :-).

I have some preliminary specs on the V70, and I quote:

	"On-Chip Floating Point Support
		- IEEE 32 and 64-Bit Data Types"

It also includes:
	
	On-Chip MMU
	On-Chip instruction and data cache

I wonder when we will see On-Chip Hard disk ;-} ?

The V-70 looks like a hell-of-a-chip.  Some other misc. hype:

	- 32 General Registers ( all 32 bits of course )
	- Symmetric Instruction Set
	- 20 Addressing modes
	- Variable Byte Length Format
	- Virtual Memory
		4.3 GB Virtual Space / task
		2 level paging
		16-entry fully associative TLB

It's not iAPX386 compatible; this looks like a NICE architecture!!

--geo

Opinions??  Yeah, I have that in my car; Rack and Opinion Steering.-- 
-----
george aguiar	< UUCP: necis!geo >

lm@cottage.WISC.EDU (Larry McVoy) (05/05/87)

In article <491@necis.UUCP> geo@necis.UUCP (George Aguiar ext. 219) writes:
>The V-70 looks like a hell-of-a-chip.  Some other misc. hype:
>
>	- 32 General Registers ( all 32 bits of course )
>	- Symmetric Instruction Set
>	- 20 Addressing modes
>	- Variable Byte Length Format

Ummm, not to rain on your parade or anything - but I have real problems with 
the last two.  20 addressing modes?  That's a lot of logic.  And I'll bet
they support stuff like embedded displacements in the instruction stream (I'm
not talking about 4-bit constants, I'm talking about things like National 
and Motorola do with the top bits of their byte, word, and long-word 
displacements).  That can cost you - you might not know up front how long
the instruction is, so your decoder might start decoding the previous 
instruction's displacement(s).

Similar problem with the variable-length format.  Unless they got smart and 
put the whole length in the first byte, you have to delay the logic 
that looks at the trailing part of the instruction.  This has messy 
implications when you consider the pipeline, does it not?

It looks like the 29K may have made some smart moves....

---
Larry McVoy 	        lm@cottage.wisc.edu  or  uwvax!mcvoy

"What a wonderful world it is that has girls in it!"  -L.L.

pec@necntc.NEC.COM (Paul Cohen) (05/06/87)

>In article <3810030@nucsrl.UUCP> ram@nucsrl.UUCP (Renu Raman) writes:

>>	Japan's first microprocessor with 32-bit data and address buses
>>	The NEC V70 has a dynamic bus sizing that enables it to match
>>	input/output with 8,16, and 32 bit buses.  


>I have  some preliminary specs on the V70 and I quote:
>	"On-Chip Floating Point Support
>		- IEEE 32 and 64-Bit Data Types"
>	On-Chip MMU
>	On-Chip instruction and data cache

Sorry on this one point.  The preliminary specs you quote are in fact
very preliminary.  In order to get the product to market sooner, the V70
was designed without these caches.  Really, this is just a change of
names since the V70 + cache and other goodies will follow, but with a
different name, possibly V80.

>The V-70 looks like a hell-of-a-chip.  Some other misc. hype:

>	- 32 General Registers ( all 32 bits of course )
>	- Symmetric Instruction Set
>	- 20 Addressing modes
>	- Variable Byte Length Format
>	- Virtual Memory
>		4.3 GB Virtual Space / task
>		2 level paging
>		16 Entry Full Association 

>It's not iAPX386 compatible, this looks like a NICE architecture!!



Say that again: 

		*****************************************
		*****************************************
		***				      ***
		***   THE V70 IS NOT 386 COMPATIBLE   ***
		***				      ***
		***   THE V60 IS NOT 286 COMPATIBLE   ***
		***				      ***
		*****************************************
		*****************************************

This is a NICE architecture.

pec@necntc.NEC.COM (Paul Cohen) (05/06/87)

In article <1157@cottage.WISC.EDU> lm@cottage.WISC.EDU (Larry McVoy) writes:

>In article <491@necis.UUCP> geo@necis.UUCP (George Aguiar ext. 219) writes:
>>The V-70 looks like a hell-of-a-chip.  Some other misc. hype:

>>	- 32 General Registers ( all 32 bits of course )
>>	- Symmetric Instruction Set
>>	- 20 Addressing modes

Actually there are 21 addressing modes for addressing bytes.  Eighteen
of these addressing modes can also be used for addressing bits.

>>	- Variable Byte Length Format

>Ummm, not to rain on your parade or anything - but I have real problems with 
>the last two.  20 addressing modes?  That's a lot of logic.  And I'll bet
>they support stuff like embedded displacements in the instruction stream (I'm
>not talking about 4 bit constants, I'm talking about things like National 
>and Motorola do with the top bits of their byte, word, and long word 
>displacements).  That can cost you - you might not know off the top how long
>the instruction is so your decoder might start decoding the previous 
>instructions displacement(s).

The V60/V70 instruction set uses orthogonal encodings.  From the first 
byte the decoder can determine the number of operands and where the operand
encodings begin.  The operands themselves are encoded separately 
(orthogonally) and may vary considerably in size.  No doubt this increases
the complexity of the decoder, but it also improves code density (the
importance of this is not primarily to save on memory space but so that
it is not necessary to fetch as much code).  

As an example, consider the following COMPILED C code for the V60/V70:


		_______________________________________________
		| struct sttyp {
		|	unsigned first, second, third;
		|	struct sttyp *fourth;
		|	double fifth, a,b,c,d,e;} *stru;
C code:		|
		|
		| stru->fourth->fourth->third = 2;
		|==============================================
		| mov.w _stru+0xc,r0	# &(stru->fourth) in R0
Assembly Code:	|
		| mov.w #2,0x8[0xc[r0]]	# stru->fourth->
		|			#    fourth->third = 2;
		|______________________________________________
		

No doubt about it, the V70 is a complex chip; it is also fast.  It packs 
a great deal of functionality to provide high performance at a reasonable 
cost in a real system.  

>It looks like the 29K may have made some smart moves....

It depends on its objectives.  The 29K requires two separate
paths to memory, one for code and another for data.  The memory must be
extremely fast (read expensive) to service the CPU without wait states.
It also expects some specialized bus monitoring hardware in the memory
system:  

  >From: tim@amdcad.AMD.COM (Tim Olson)
  >Subject: Re: AM29000 memory management (was flame)

  >The "best" place for the referenced and changed bits, however, are in 
  >an external memory array, which "watches" the bus and automatically 
  >updates the R & C bits.  This array can also be read from or written
  >to via I/O space to read or clear the bits.

Also, taking advantage of the AMD RISC style architecture places some
uncomfortable demands on compiler developers.

I'm not knocking the AMD part.  It is an interesting processor and I'll
be interested in seeing what it does in a real system but I'll also be
interested in seeing what such a system costs.

If system cost is of no concern to you, then disregard my comments, but 
if a good cost/performance ratio seems important to you (not to mention 
V30 software compatibility), I suggest that you take another look at the 
V70 and the V60.

bcase@amdcad.AMD.COM (Brian Case) (05/06/87)

In article <4016@necntc.NEC.COM> pec@necntc.UUCP (Paul Cohen) writes:
>In article <1157@cottage.WISC.EDU> lm@cottage.WISC.EDU (Larry McVoy) writes:
>No doubt about it, the V70 is a complex chip; it is also fast.  It packs 
>a great deal of functionality to provide high performance at a reasonable 
>cost in a real system.  
>
>>It looks like the 29K may have made some smart moves....
>
>It depends on its objectives.  The 29K requires two separate
>paths to memory, one for code and another for data.  The memory must be
>extremely fast (read expensive) to service the CPU without wait states.
>It also expects some specialized bus monitoring hardware in the memory
>system:  
>
>  >From: tim@amdcad.AMD.COM (Tim Olson)
>  >Subject: Re: AM29000 memory management (was flame)
>
>  >The "best" place for the referenced and changed bits, however, are in 
>  >an external memory array, which "watches" the bus and automatically 
>  >updates the R & C bits.  This array can also be read from or written
>  >to via I/O space to read or clear the bits.
>
>Also, taking advantage of the AMD RISC style architecture places some
>uncomfortable demands on compiler developers.
>
>I'm not knocking the AMD part.  It is an interesting processor and I'll
>be interested in seeing what it does in a real system but I'll also be
>interested in seeing what such a system costs.
>
>If system cost is of no concern to you, then disregard my comments, but 
>if a good cost/performance ratio seems important to you (not to mention 
>V30 software compatibility), I suggest that you take another look at the 
>V70 and the V60.

First, the memory system for the Am29000 may be as simple as VideoDRAM.
These VDRAMs are, I believe, only marginally more expensive than regular
DRAMs and allow the Am29000 to deliver a fair fraction of its maximum
performance.  I know of one potential customer who is simulating the
Am29000 with VDRAMs and is quite satisfied with the results (frankly,
I was very surprised at the performance, but this may be an isolated
case).  Let's face it:  you can try lots of stuff with instruction set
encoding, pipelining tricks, etc. etc., but in the end, the performance
of the CPU comes down to that of the memory hierarchy.  As the designers
of the Am29000, we recognized this fact and did what we could to *solve*
the problem instead of trying our best to *hide* the problem in highly-
encoded instruction formats.  To get the best performance from the Am29000
probably *does* require an expensive memory system; we look at it this
way:  at least the Am29000 gives the system designer a *chance* to get
superior performance.  We feel we have given the designer the "Max
Headroom."  :-) :-)

Second, the Am29000 does not "expect" some sophisticated bus monitoring
hardware.  Maintaining referenced and modified bits in hardware associated
with the memory arrays is what we consider the *best* way; since the TLB
reload routines for the Am29000 can be tailored to specific needs, it is,
of course, quite possible to maintain this information by software means.
However, there is a performance cost associated.  Even if the TLB reload
is done by "hardware" (really microcode or some state machine) on the CPU
chip, there is a performance cost.  Referenced and modified bits in
hardware next to the memory arrays is probably the best for multiprocessor
systems too.  But there are *lots* of specific tradeoffs to make for a
particular system; again we feel that we have given the "Max Headroom"
since a designer may choose to maintain referenced and modified information
wherever he (she?) chooses.  When the TLB reload and other VM tasks are
done by fixed routines/state machines in "hardware", there can be
problems.

Thirdly, I don't know what architectural features of the Am29000 are
considered to place uncomfortable demands on compiler writers.  Overlapped
loads and stores are there to be taken advantage of (and have been
demonstrated, by one customer on one graphics benchmark, to be worth
nearly a factor of two in performance (but I don't think this will be
the case most of the time)) if possible; the Am29000 interlocks to
insure correct operation when full overlap isn't possible.  Delayed
branches must be dealt with by software constructors (be they human
beings or compilers), but this is not a big deal (in fact it is, I
believe, one of the simplest optimizations to perform).  For the Am29000,
using the local register file as a stack cache can make register
allocation easy.  Three address register-register instructions and the
load/store architecture make code generation easy.  The kinds of
optimizations that are important for reaping maximum performance from
the Am29000 are the same ones that are important for reaping maximum
performance from any architecture:  loop optimizations, common sub-
expression elimination, induction variable elimination, strength
reductions, etc. etc.  We believe that the Am29000 makes these
optimizations *easier* not more difficult.  I believe that most of
the members of the compiler-writing and architecture community would
agree that a simple architecture with a predictable cost for instructions
(in both time and space) is the best match for automatic code generation.
I wouldn't mind if some of you in the compiler-writing and architecture
community (and OS community too, sorry John) would come to my aid.

I am not trying to say anything bad about the V70.  I just want to set
the record straight about the Am29000.

   bcase

mash@mips.UUCP (John Mashey) (05/07/87)

In article <16561@amdcad.AMD.COM> bcase@amdcad.UUCP (Brian Case) writes:
>In article <4016@necntc.NEC.COM> pec@necntc.UUCP (Paul Cohen) writes:
>>It depends on its objectives.  The 29K requires two separate
>>paths to memory, one for code and another for data.  The memory must be
>>extremely fast (read expensive) to service the CPU without wait states.

2 paths are usually better than one, as many chips will discover as they
keep pushing clock rates.

>case).  Let's face it:  you can try lots of stuff with instruction set
>encoding, pipelining tricks, etc. etc., but in the end, the performance
>of the CPU comes down to that of the memory hierarchy.  

agree.
I did have one question: what kind of write buffers do the AMD simulations
use [i.e., how deep], and what kind of % hit is there for:
	a) write stalls [write buffer full]
	b) read/write memory conflicts [i.e., I was already doing a write,
	and a read comes along that's a cache miss].
If I missed this info published somewhere, just point us at it.
>
>optimizations *easier* not more difficult.  I believe that most of
>the members of the compiler-writing and architecture community would
>agree that a simple architecture with a predictable cost for instructions
>(in both time and space) is the best match for automatic code generation.

100%.  We may disagree on other issues, but not this one.
[following on the earlier comments on V70 addressing modes]
If somebody says "20 addressing modes are good", to be convincing, they'd
better be able to show tradeoffs, and show us the dynamic and static usages
of those things, in real compiled code of substantial size.  They may be
worth it, or they may be not, but there is substantial data that says that
complex addressing modes just aren't used very much.  Perhaps this is
an exception....
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{decvax,ucbvax,ihnp4}!decwrl!mips!mash, DDD:  	408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

pec@necntc.NEC.COM (Paul Cohen) (05/07/87)

In article <372@winchester.UUCP> mash@winchester.UUCP (John Mashey) writes:
>In article <16561@amdcad.AMD.COM> bcase@amdcad.UUCP (Brian Case) writes:
>>In article <4016@necntc.NEC.COM> pec@necntc.UUCP (Paul Cohen) writes:
>>>It depends on its objectives.  The 29K requires two separate
>>>paths to memory, one for code and another for data.  The memory must be
>>>extremely fast (read expensive) to service the CPU without wait states.

>2 paths are usually better than one, as many chips will discover as they
>keep pushing clock rates.

I very much agree if you mean better performance.  Do you also mean
better system cost?

>>optimizations *easier* not more difficult.  I believe that most of
>>the members of the compiler-writing and architecture community would
>>agree that a simple architecture with a predictable cost for instructions
>>(in both time and space) is the best match for automatic code generation.

There is no doubt that having fewer options makes decisions easier.

I agree that comments from compiler writers would be welcome here.

>If somebody says "20 addressing modes are good", to be convincing, they'd
>better be able to show tradeoffs, and show us the dynamic and static usages
>of those things, in real compiled code of substantial size.  

A quibble here: Why is the size of the code important?  

	I don't understand why the size of the code has any bearing on 
	use of addressing modes, though I do see that it would have some 
	bearing on the difficulty of determining usage statistics.  Even 
	if there is some connection, I suspect there are many machines 
	in active use that spend most of their time executing fairly 
	small programs.  

The C compiler that is currently available for the V60/V70 is based on 
the Unix Portable C Compiler.  This is admittedly not the best (in terms
of performance, disregarding cost) compiler technology around (there are 
two other C compilers under development for the V60/V70 by U.S. compiler 
companies), but this compiler does use all of the available addressing 
modes (and all of the V60/V70 non-privileged instructions except for 
some of the string instructions).  

I wish that I had the time to do a study of the sort suggested (though 
probably any results that I would get would be suspected of bias).
One question to ponder in this regard: suppose only 15 (or even only 5) 
of the addressing modes were found to be extensively used by SOME compiler.
Could you conclude that you would be better off with only one or two 
addressing modes?

If anyone (preferably someone with no axe to grind) would like to 
volunteer to do some research of this kind on V60/V70 code I'd be more 
than happy to cooperate.

On another note, in response to an earlier posting, I've had numerous 
requests for help in getting documentation on the V60/V70.  I earlier 
posted the telephone numbers:

		1-800-NEC-ELEC (California)
		1-800-NEC-ELE1 (Outside California)

It did not occur to me that I would get requests from Europe as well.
If that is your general location, a better number would be:

		0049-211-6503-333 (Dusseldorf)

alexande@drivax.UUCP (Mark Alexander) (05/07/87)

In article <491@necis.UUCP> geo@necis.UUCP (George Aguiar ext. 219) writes:

>I have  some preliminary specs on the V70...
>It's not iAPX386 compatible...

It does, however, have a virtual 8086 mode, kinda-sorta like
the 386, but a little bit easier to work with (mainly because
there are still a LOT of registers left over after you take
away those that are mapped to 8086 registers).
-- 
Mark Alexander	...{hplabs,seismo,sun,ihnp4}amdahl!drivax!alexande
(This space intentionally left blank.)

mash@mips.UUCP (05/08/87)

In article <4070@necntc.NEC.COM> pec@necntc.UUCP (Paul Cohen) writes:
>In article <372@winchester.UUCP> mash@winchester.UUCP (John Mashey) writes:
>>In article <16561@amdcad.AMD.COM> bcase@amdcad.UUCP (Brian Case) writes:
>>>In article <4016@necntc.NEC.COM> pec@necntc.UUCP (Paul Cohen) writes:
>
>>2 paths are usually better than one, as many chips will discover as they
>>keep pushing clock rates.

>I very much agree if you mean better performance.  Do you also mean
>better system cost?
Depends; I was talking about performance, and actually, I meant separate
I & D caches.

>>If somebody says "20 addressing modes are good", to be convincing, they'd
>>better be able to show tradeoffs, and show us the dynamic and static usages
>>of those things, in real compiled code of substantial size.  

>A quibble here: Why is the size of the code important?  

All that I meant was: real programs, not toys, and not synthetic benchmarks.
>
>I wish that I had the time to do a study of the sort suggested (though 
>probably any results that I would get would be suspected of bias).
>One question to ponder in this regard: suppose only 15 (or even only 5) 
>of the addressing modes were found to be extensively used by SOME compiler.
>Could you conclude that you would be better off with only one or two 
>addressing modes?

As usual, it depends.  Maybe one discovers that no compilers use all of
the modes, but all the modes are used substantially by something,
or that the unused modes cost more to omit than include [by the time
one has included the heavily-used ones].  The point is that to show that
"lots of modes" is a good thing, it is not sufficient to show one example
of how a compiler might use them.  What's really needed is a good tradeoff
analysis: it's often hard, after the fact, to know what the modes cost
in terms of cycle time.  What can be measured is the usage frequency
of the modes; this is valuable information, and it is how we make
progress in the architecture area.
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{decvax,ucbvax,ihnp4}!decwrl!mips!mash, DDD:  	408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

preece@ccvaxa.UUCP (05/11/87)

  mash@mips.UUCP:
> If somebody says "20 addressing modes are good", to be convincing,
> they'd better be able to show tradeoffs, and show us the dynamic and
> static usages of those things, in real compiled code of substantial
> size.  They may be worth it, or they may be not, but there is
> substantial data that says that complex addressing modes just aren't
> used very much.  Perhaps this is an exception....
----------
I find it a little amusing that the same people who say "complex
feature x just isn't used very much" tend to be the same people who
say "not to worry, a sufficiently clever compiler will take care of
our chip's need for X".  If compilers can be made smart enough to
handle some of the special things that RISCs need, they could be
made smart enough to make better use of the complex features in
CISCs.

The point isn't that RISCs make certain optimizations easier or harder,
but that they make certain optimizations NECESSARY.  Compilers smart
enough to use some of the special features of CISCs haven't been
sufficiently necessary -- they work "well enough" using simple
instruction sequences.  My impression from the literature is that RISCs
demand more compiler optimization to reach the performance that is
expected of them than do CISCs.  Perhaps that simply means we have
higher expectations of them, perhaps it simply means that baseline
compiler performance is better than it used to be and those expectations
are reasonable.  Whatever.

-- 
scott preece
gould/csd - urbana
uucp:	ihnp4!uiucdcs!ccvaxa!preece
arpa:	preece@gswd-vms

baum@apple.UUCP (05/12/87)

--------
[]
>....  Since RISC programs will tend to be substantially
>larger than CISC programs, a RISC system will need more memory than
>a CISC system.
>
>(Disclaimer:  I, of course, have no evidence to back up this theory.)
>
>-- Chuck

I have evidence in both directions. The IBM801 folks have said that
the 801 code size was about 20% larger than IBM/370 code. This is
probably smaller than the difference between two different compilers.
An I-cache that is effectively 20% smaller will not significantly affect your
performance (e.g. the delta will be < 20%).

On the other hand, the ATT CRISP folks claim code size which is equal
to or smaller than VAX code, which is known to be fairly compact. So
much for the canard that RISC code is huge.
--

{decwrl,hplabs,ihnp4}!nsc!apple!baum		(408)973-3385

lm@cottage.WISC.EDU (Larry McVoy) (05/12/87)

In article <28200037@ccvaxa> preece@ccvaxa.UUCP writes:
>I find it a little amusing that the same people who say "complex
>feature x just isn't used very much" tend to be the same people who
>say "not to worry, a sufficiently clever compiler will take care of
>our chip's need for X".  If compilers can be made smart enough to
>handle some of the special things that RISCs need, they could be
>made smart enough to make better use of the complex features in
>CISCs.

This is hogwash.  As someone who has a certain amount of compiler experience,
I can say that a RISC compiler is likely to be much less intelligent than
a CISC one.  The reason (hold the flames a bit, ok?) is that I have yet
to see an orthogonal CISC machine.  The 32000 series is the closest.  Vax
and 68000 don't come very close.   The problem is this:  you're generating
code for a particular action, right?  And this special instruction looks
like just the ticket.  And then (or maybe three months later) you realize
that you needed a signed displacement and they give you an unsigned 
displacement.  Or something similar.  So you end up going in and 
generating the code in the "stupid" straightforward manner.  Or - maybe
you're really dedicated and you add another 200 lines of code to the
compiler to catch this special case.  And in another week you find out...

The problem can be summarized as follows:

Provide 0, 1 or infinity.  No exceptions.  CISC is trying to approximate
infinity.  As it gets closer the chip gets slower.  The infinity choice
is clearly wrong.  Admit it.


Larry McVoy 	        lm@cottage.wisc.edu  or  uwvax!mcvoy

tim@amdcad.AMD.COM (Tim Olson) (05/12/87)

In article <28200037@ccvaxa> preece@ccvaxa.UUCP writes:
+-----
| The point isn't that RISCs make certain optimizations easier or harder,
| but that they make certain optimizations NECESSARY. ...
+-----
No, the point is that RISCs make certain optimizations *POSSIBLE* -- by
using only simple, single-cycle instructions, optimization opportunities
are uncovered which may not be available with complex, multi-cycle
instructions -- especially in loops.

+-----
|                                                 ...  Compilers smart
| enough to use some of the special features of CISCs haven't been
| sufficiently necessary -- they work "well enough" using simple
| instruction sequences.  ...
+-----
And those simple instruction sequences allow higher levels of one of the
most beneficial optimizations -- code motion out of loops.

	-- Tim Olson
	Advanced Micro Devices

aeusesef@csun.UUCP (Sean Eric Fagan) (05/14/87)

In article <277@astroatc.UUCP> johnw@astroatc.UUCP (John F. Wardale) writes:

>Now the trend is to build simpler, faster REGULAR machines, and the only
>thing that falls into the category of "fashionability" is the terms "RISC" and
>"CISC"  Seymore (sp?) made a RISCy machine called the "Cray-1" before
>anyone started to use the term "RISC"
>A wonderful appearing "middle ground" part is the "Clipper" which is a
>basic RISC machine with an additional set of "macro" instructions.
>(So, we can't do MULT or DIV in one clock, we'll give you a MULT and DIV
>macro in "hardware".  I like this approach, but then I'm firmly rooted in
>the RISC camp)
>Name:	John F. Wardale
>UUCP:	... {seismo | harvard | ihnp4} !uwvax!astroatc!johnw
>arpa:   astroatc!johnw@rsch.wisc.edu
Not to be nitpicky, or anything, but old Seymore did a RISC long before the
Cray-1.  Ever hear of the CDC Cyber line?  The 6600, 7600, 170 lines?  All
RISC, with a grand total of about, oh, 70+ instructions.  Two instructions
to directly access memory, and a few instructions to indirectly access
memory.  Execution of non-divide, non-context save instructions in less than
10 clock cycles, generally under 5.  Double precision multiply takes about 5
clock cycles, I believe (worst case).  Single precision is the same, only it
returns the low half.  (I should mention that double precision on a Cyber is
120 bits.)  Very fast floating point, slightly slower integer (yeah,
slower), very nice instruction set.  Lousy operating system, but
that is not Seymore's fault.  If you ever get the hardware reference manuals
for both the Cray-1 and the Cyber 7600 (or a 170 model), you will notice
that they are *extremely* similar, almost the same except for small things
like word size, lack of a divide instruction on the Cray, no vectors on the
Cyber, etc.
Sorry, but I tend to ramble about the Cyber, especially when people bring up
the Cray.  If I misspelled Mr. Cray's first name, please forgive me...

 -----

 Sean Eric Fagan          Office of Computing/Communications Resources
 (213) 852 5086           Suite 2600
 AGTLSEF@CALSTATE.BITNET  5670 Wilshire Boulevard
                          Los Angeles, CA 90036
{litvax, rdlvax, psivax, hplabs, ihnp4}!csun!{aeusesef,titan!eectrsef}
--------------------------------------------------------------------------------
My employers do not endorse my   | "I may be slow,  but I'm not  stupid.
opinions,  and, at least in my   |  I can count up to five *real* good."
preference  of Unix,  heartily   |      The Great Skeeve
disagree.                        |      (Robert Asprin)

baum@apple.UUCP (Allen J. Baum) (05/14/87)

--------
[]

>(Has anyone ever made a machine that pre-fetches I-cache lines from memory?)

I believe that the big Amdahl machines can do that (conditioned on some strange
status bit somewhere), as can the Fairchild Clipper.

--
{decwrl,hplabs,ihnp4}!nsc!apple!baum		(408)973-3385

henry@utzoo.UUCP (Henry Spencer) (05/14/87)

> ...On the other hand, the ATT CRISP folks claim code size which is equal
> to or smaller than VAX code, which is known to be fairly compact...

Not everyone agrees that VAX code is "fairly compact"...
-- 
"The average nutritional value    Henry Spencer @ U of Toronto Zoology
of promises is roughly zero."     {allegra,ihnp4,decvax,pyramid}!utzoo!henry

brucek@hpsrla.HP.COM (Bruce Kleinman) (05/15/87)

+-----
| Any operation can be done faster if implemented at a 'lower' level in 
| the machine.
+-----

Completely true.  The problem is there is only so much 'lower' level.
Chip real estate isn't unlimited, at least it wasn't last time I checked.

CISC chips frequently offer hundreds of instructions with a dozen or more
address modes.  This usually necessitates the use of one, sometimes two,
levels of microcode.  The richness/complexity of the instruction set
requires a mass of pipeline interlock logic.  CISC CPUs tend to be, uh,
rather complex.  And, therefore, the CPU usually dominates the chip.

RISC chips generally offer a hundred or fewer instructions with a few
address modes.  This usually allows the instruction set to be hardwired.
The orthogonal nature of the instruction set requires very little
pipeline interlock logic.  RISC CPUs tend to be rather simple. And,
therefore, the CPU is usually a small portion of the chip.

The CISC advocate interprets the quote at the top of this note and says,
   " Putting 'register indirect with base + offset' addressing in the
     instruction set is a win, because my chip will be able to do it
     with a single instruction -- which will make my chip really fast. "
This approach buys you some very useful operations at the expense of
real estate.  Your instruction set soon becomes less orthogonal, more
exceptions are introduced, and you have to handle them in the pipeline
logic.  Pretty soon you are out of real estate, because you have a big
chunk or two of microcode, a large area for decode, and a gate array
worth of logic to glue your pipeline together ....

The RISC advocate interprets the quote and says,
   " Leaving 'register indirect with base + offset' addressing out of
     the instruction set is a win, because I will be able to hardwire
     the CPU -- which will make my chip really fast. "
This approach buys you a very orthogonal instruction set, while using
up relatively little real estate.  Your CPU is hardwired, most instructions
execute in a single cycle, and your pipeline can be balanced more easily.
And you've got a bunch of real estate left over ....

Who wins?

My (completely unbiased) answer:
The RISC chip wins because the extra real estate can be used for a massive
register file, or an on chip floating point unit, or a *real* cache
(i.e. one greater than 256 bytes), etc, etc.  

Yes, any operation can be done faster if implemented at a 'lower' level in 
the machine.  And a RISC chip leaves you a lot of 'lower' level to work
with after the CPU is complete.  And I'll take 192 registers or a 4K
cache over 'register indirect with base + offset' any day.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                              Bruce Kleinman
              Hewlett Packard -- Network Measurements Division
                          Santa Rosa, California

                         ....hplabs!hpsrla!brucek
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

steve@edm.UUCP (05/15/87)

In article <754@apple.UUCP>, baum@apple.UUCP (Allen J. Baum) writes:
> --------
> []
> >....  Since RISC programs will tend to be substantially
> >larger than CISC programs, a RISC system will need more memory than
> >a CISC system.
> the 801 code size was about 20% larger than IBM/370 code. This is
> ... so much for canard that RISC code is huge.
 One thing to consider is that often the DATA space is larger than the I-space.
(this, of course, may not apply to the kernel). I don't know just how widespread
this is, but if it is relatively common (I would guess this is especially
common with number-crunching type exploits (where RISC speeds are so useful))
then the expanded code size of RISC may not be all that expensive.

-- 
-------------
 Stephen Samuel 			Disclaimer: You betcha!
  {ihnp4,ubc-vision,seismo!mnetor,vax135}!alberta!edm!steve

meissner@dg_rtp.UUCP (05/17/87)

> In article <277@astroatc.UUCP> johnw@astroatc.UUCP (John F. Wardale) writes:
> 
> >Now the trend is to build simpler, faster REGULAR machines, and the only
> >thing that falls into the category of "fashionability" is the terms "RISC" and
> >"CISC"  Seymore (sp?) made a RISCy machine called the "Cray-1" before
> >anyone started to use the term "RISC"
	...
And in article <609@csun.UUCP> aeusesef@csun.UUCP (Sean Eric Fagan) replies:
> 
> Not to be nitpicky, or anything, but old Seymore did a RISC long before the
> Cray-1.  Ever hear of the CDC Cyber line?  The 6600, 7600, 170 lines?  All
> RISC, with a grand total of about, oh, 70+ instructions.  Two instructions
> to directly access memory, and a few instructions to indirectly access
> memory.
	....
> Very fast floating point, slightly slower integer (yeah,
> slower), very nice instruction set.

While the 6600 and 7600 lines had a sparse number of instructions, I doubt
whether they would qualify as a RISC machine (I don't know about the CRAY).
In the first place, the main thing about ALL of the RISC's that are true
RISC's, is that EVERY instruction takes one cycle.  No exceptions.  The CDC
machine instructions could take multiple cycles (divide in particular).  One
of the things that highlighted the machines was multiple parallel functional
units (ie, you would typically fire off a divide, and then multiplies, each
with different accumulators, and as long as you did not issue an instruction
using that accumulator until the unit was done, you could proceed and do
something else).  Another thing that the RISC philosophy has come to mean
is regular (ie, no special purpose accumulators; any register can be the
target or source(s) of any instruction).  I dare you to call this machine
regular.  To load from main memory, you stored the address you wanted into
an A register (A1-A5 only), and in a few cycles the value would appear in
the corresponding X register.  To store a value, you would store the address
into an A register (A6-A7 only), and the corresponding X register would be
stored.  Thus the X registers were special purpose (X1-X5 could read from
memory, X6-X7 could write to memory, X0 was scratch).  The A registers were
tied to the corresponding X registers, and only 18 bits wide.  The B registers
were also 18 bits wide, with B0 being hardwired to 0 (you could store into
B0, but the machine would ignore it).

Given the parallel functional units, the machine kept track of which
accumulator was in use, and would pend any instruction that would
reference it, until the functional unit was done (unlike some recent
machines, where the compiler would have to do the scheduling itself
because the hardware interlock was removed for speed).  The Fortran
compiler (FTN) would attempt to keep the different functional units busy.
Also, given that the machine wordsize (60 bits) was much larger than the
instruction size (mostly 15 or 30 bits), any instruction that was the target
of a branch had to begin on a word boundary.  Because of parallel functional
units and up to 4 instructions per word, faults were only approximate (you
knew which word faulted, but not which instruction in the word).  It also
became a game for assembler hackers to concoct things like a sequence that
would raise three different faults (overflow, underflow, etc) in the same
machine word.  It was an interesting machine.
-- 
	Michael Meissner, Data General	Uucp: ...mcnc!rti!dg_rtp!meissner

It is 11pm, do you know what your sendmail and uucico are doing?

baum@apple.UUCP (Allen J. Baum) (05/18/87)

--------
[]

>In the first place, the main thing about ALL of the RISC's that are true
>RISC's, is that EVERY instruction takes one cycle.  No exceptions.
>Another thing that the RISC philosophy has come to mean
>is regular (ie, no special purpose accumulator's, any register can be the
>target or source(s) of any instruction).

I hate to disappoint you, but I can't think of any RISC machines that
meet these (rather ad hoc) criteria. Most of the RISC processors out
there have multi-cycle instructions, including the original RISC I,
which had a two cycle load, and most of them have a register which is
a hard-wired zero. The HP Spectrum has at least two instructions with
hardwired register destinations. The definition of 'RISC' is in the mind
of the beholder. There is no 'agreed' definition of what a  RISC processor
is except that maybe you know one when you see one.....

--
{decwrl,hplabs,ihnp4}!nsc!apple!baum		(408)973-3385

jps@apollo.uucp (Jeffrey P. Snover) (05/19/87)

>The definition of 'RISC' is in the mind
>of the beholder. There is no 'agreed' definition of what a  RISC processor
>is except that maybe you know one when you see one.....
 
It seems to me that the progression of things goes something like this:
    - Come up with a new angle on things
    - Pick a flashy aspect and name the set of ideas after it
        ( RISC is easier to say than OIPC [One Instruction Per Cycle] 
          and flashier than NM [No Microcode])
    - Evolve the concept, expanding on the good ideas and passing on the
      not so good ideas.
    - Spend years complaining when the evolved idea doesn't conform 
      to the flashy name.
        o the idea was bast***ized
        o that's not a *REAL* xxxx
        o the damn trilateral commission is at it again!

mash@mips.UUCP (John Mashey) (05/21/87)

I think this has mostly been answered by other posters, so I've missed
most of the discussion, having been off in New Zealand [P.S., if you ever
get a chance to attend the N.Z. UNIX conference, DO IT! well-run, great
bunch of people, lot of fun, wonderful sightseeing, sheep jokes....]
However, let me add a few notes on the above comments, plus a few more examples
not already given by the other posters on this topic.

In article <28200037@ccvaxa> preece@ccvaxa.UUCP writes:
>  mash@mips.UUCP:
>> If somebody says "20 addressing modes are good", to be convincing,....
>----------
>I find it a little amusing that the same people who say "complex
>feature x just isn't used very much" tend to be the same people who
>say "not to worry, a sufficiently clever compiler will take care of
>our chip's need for X".  If compilers can be made smart enough to
>handle some of the special things that RISCs need, they could be
>made smart enough to make better use of the complex features in
>CISCs.

Good compiler technology is always useful; it's merely more useful on some
machines than on others.  Let's assume that one can have the same
compilers on a variety of machines [both Stanford and IBM have done this].
The point is that good optimizing compilers change the statistics of
what's going on, at least somewhat, on ANY machine, and if the statistics
say that such compilers greatly lessen the use of a feature, you might think
of eliminating the feature entirely, if there was a nonzero cost for it,
and if the compilers have reasonable alternatives.
Let's go back to the example that started all this, which was me claiming that
"lots of addressing modes" needed justification as a good feature.  [This was
NOT a statement that lots of modes was necessarily bad, merely that it needed
to be justified, because the published data seemed not to justify it.
BTW, does anybody have some dynamic statistics on the multi-level indirect
modes?  What we've got so far is mostly static counts, which can be misleading.]
For example, consider code like:
	if (a->b && a->b->c) ...  [not uncommon]
Suppose you have a machine that has all the indirect modes.  If you have
a non-optimizing compiler, but you special-case it to pick up the indirect
modes, you can use them [and as somebody has pointed out, on some machines,
there may be an implicit reference thru a frame pointer to get to a, and I'll
assume that].  Depending on the machine, this could get you something like
	fetch a->b	1 memory ref to offset.of.a + (fp)
			1 memory reference to offset.of.b + above
	test
	branch around if zero
	fetch a->b->c	1 memory ref to offset.of.a + (fp)
			1 memory reference to offset.of.b + above
			1 memory reference to offset.of.c + above
	test ....
Suppose you have an optimizing compiler, which will surely do common
subexpression elimination and serious register allocation, or it has no
business calling itself an optimizer.  What would it do?
	fetch a into r1	1 memory ref to offset.of.a + (fp)
	fetch b into r2	1 memory reference to offset.of.b + (r1)
	test r2
	branch around if zero
	fetch c		1 memory reference to offset.of.c + (r2)
	test...

There are all sorts of variants, depending on the machine, and of course,
it's quite possible that the optimizer might have decided "a" was a good
thing to have in a register long before anyway, and amortized the cost
of getting it there over several references.  The point is that the first
example has 5 address specifiers, and the second one has 3, and if the
optimizer is at all lucky, it moved the first fetch away from the second
one and got to re-use the value.  On most machines I've seen, the 2nd
case will go faster than the first, so what's happened is that some good
machine-independent optimizations have reduced the utility of the specific
machine feature [multi-level indirect addressing].

It's not that compilers can't be smart enough to take advantage of special
features [I've done some ferocious hacking on compilers to do just that:
once you have a machine, you do whatever you can!], but that given good
optimizers, some features are of less use than others, because the optimizers
change the statistics.  At that point, you can make reasoned tradeoffs,
but it's hard to do without a good understanding of what's likely
to be possible for the compilers to do.
>
>The point isn't that RISCs make certain optimizations easier or harder,
>but that they make certain optimizations NECESSARY.  Compilers smart
>enough to use some of the special features of CISCs haven't been
>sufficiently necessary -- they work "well enough" using simple
>instruction sequences.  My impression from the literature is that RISCs
>demand more compiler optimization to reach the performance that is
>expected of them than do CISCs.  Perhaps that simply means we have
>higher expectations of them, perhaps it simply means that baseline
>compiler performance is better than it used to be and those expectations
>are reasonable.  Whatever.

Optimizations are by definition NEVER necessary, only desirable.
We see something like 20% improvement from the more global optimizations,
which is well worthwhile, since that's adding a few Mips to the performance,
and some important cases sometimes get more. Nevertheless, the machines
are still OK without this, and there's less weird machine-specific
hackery by far than things I've seen done on many other machines.
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{decvax,ucbvax,ihnp4}!decwrl!mips!mash, DDD:  	408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086