[comp.arch] Why do RISCs use

hoelzle@Neon.Stanford.EDU (Urs Hoelzle) (12/14/90)

The integer part of a typical RISC can be implemented with about 100K
transistors.  Current implementation technology allows >1M
transistors/chip (e.g. i486, 68040).  Why don't current RISC
implementations take advantage of the extra transistors and put a
large cache on the chip?  And/or an FPU?  For example (as far as I
know), typical SPARC chips neither have an on-chip FPU nor a large
cache; same for MIPS. 

One reason for a lower transistor count might be cost - but having
separate FPU or MMU chips doesn't exactly reduce system cost.  Or does
the better yield of the small chips outweigh the extra cost of having
several chips instead of just one?
Another reason would be faster implementation technology which doesn't
allow the same transistor density - but most popular RISC chips don't
run on a faster clock than, say, a 486.

So: why do RISC implementations currently use fewer transistors??
-- 
------------------------------------------------------------------------------
Urs Hoelzle, CS PhD student			       hoelzle@cs.stanford.EDU
Center for Integrated Systems, CIS 42, Stanford University, Stanford, CA 94305

brandis@inf.ethz.ch (Marc Brandis) (12/14/90)

In article <1990Dec14.001129.988@Neon.Stanford.EDU> hoelzle@Neon.Stanford.EDU (Urs Hoelzle) writes:
>The integer part of a typical RISC can be implemented with about 100K
>transistors.  Current implementation technology allows >1M
>transistors/chip (e.g. i486, 68040).  Why don't current RISC
>implementations take advantage of the extra transistors and put a
>large cache on the chip?  And/or an FPU?  For example (as far as I
>know), typical SPARC chips neither have an on-chip FPU nor a large
>cache; same for MIPS. 

This is only true for what you find in current machines, but it is not true
for chips. E.g. the new MIPS R3300 (or maybe R3400, I do not exactly remember
the number) contains the FPU and caches on chip. There are SPARC 
implementations that have cache on chip. There are RISC CPUs like the i860
or the IBM RS/6000 that have around 1 million transistors on a chip (actually,
the RS/6000 uses 9 chips with almost 8 million transistors as a whole).

>One reason for a lower transistor count might be cost - but having
>separate FPU or MMU chips doesn't exactly reduce system cost.  Or does
>the better yield of the small chips outweigh the extra cost of having
>several chips instead of just one?

One or two years ago it may have been that the better yield would result in
lower cost as a whole. But the picture is changing. I think in one or two
years you will find a lot of RISC workstations containing CPUs with 1 million
transistors.
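
The yield trade-off discussed here can be made concrete with a rough sketch.
It assumes a simple Poisson defect model (yield = exp(-area * defect density)),
which is a common textbook approximation and not something stated in this
thread. The defect density, silicon cost, and per-chip packaging cost are
hypothetical parameters, and the function names are made up for illustration.

```python
import math


def die_yield(area_cm2: float, defects_per_cm2: float) -> float:
    """Fraction of good dies under an assumed Poisson defect model."""
    return math.exp(-area_cm2 * defects_per_cm2)


def cost_per_good_die(area_cm2: float, defects_per_cm2: float,
                      cost_per_cm2: float) -> float:
    """Silicon cost of one working die: raw cost divided by yield."""
    return cost_per_cm2 * area_cm2 / die_yield(area_cm2, defects_per_cm2)


def chipset_cost(total_area_cm2: float, n_chips: int, defects_per_cm2: float,
                 cost_per_cm2: float, package_cost: float) -> float:
    """Cost of splitting the same total area across n_chips equal dies.

    package_cost is a hypothetical per-chip charge for packaging, test,
    and board space; it is what makes "several chips" expensive.
    """
    per_die = cost_per_good_die(total_area_cm2 / n_chips,
                                defects_per_cm2, cost_per_cm2)
    return n_chips * (per_die + package_cost)


if __name__ == "__main__":
    # Illustrative numbers only: 2 cm^2 of logic, 1 defect/cm^2, 10 units/cm^2.
    for package in (0.0, 100.0):
        one = chipset_cost(2.0, 1, 1.0, 10.0, package)
        two = chipset_cost(2.0, 2, 1.0, 10.0, package)
        print(f"package={package}: one chip {one:.1f}, two chips {two:.1f}")
```

Under these assumptions the split design wins on silicon alone, and the single
chip wins once the per-chip overhead is large enough; as process defect
density falls, the crossover moves in favor of the single large die.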

Marc-Michael Brandis
Computer Systems Laboratory, ETH-Zentrum (Swiss Federal Institute of Technology)
CH-8092 Zurich, Switzerland
email: brandis@inf.ethz.ch

croft@csusac.csus.edu (Steve Croft) (12/14/90)

Panasonic has put their whole SPARC implementation (FPU, CPU, cache, etc)
on one chip.  And just lookit that heat sink!

Steve
stevec@water.ca.gov

rcpieter@svin02.info.win.tue.nl (Tiggr) (12/15/90)

hoelzle@Neon.Stanford.EDU (Urs Hoelzle) writes:

>So: why do RISC implementations currently use fewer transistors??

The ARM is a good example of a RISC with very few transistors (core is < .5k
transistors).  The reason is that its designers wanted it to be CHEAP.

Tiggr

minich@d.cs.okstate.edu (Robert Minich) (12/15/90)

by rcpieter@svin02.info.win.tue.nl (Tiggr):
| The ARM is a good example of a RISC with very few transistors (core is < .5k
| transistors).  The reason is that its designers wanted it to be CHEAP.

  Wow, 500 transistors for the core. Must be one of those RidiculISC
extremists. :-)

-- 
|_    /| | Robert Minich            |
|\'o.O'  | Oklahoma State University| "I'm a newcomer here, but does the
|=(___)= | minich@d.cs.okstate.edu  |  net ever lay any argument to rest?"
|   U    | - Ackphtth               |                    -- dan herrick

rcpieter@svin02.info.win.tue.nl (Tiggr) (12/15/90)

minich@d.cs.okstate.edu (Robert Minich) writes:

|by rcpieter@svin02.info.win.tue.nl (Tiggr):
|| The ARM is a good example of a RISC with very few transistors (core is < .5k
|| transistors).  The reason is that its designers wanted it to be CHEAP.

|  Wow, 500 transistors for the core. Must be one of those RidiculISC
|extremists. :-)

Oops!  k as in 100000.  Sounds weird, right?

rob@dagmar.coyote.trw.com (Robert Heiss) (12/16/90)

In article <1990Dec15.021854.7613@d.cs.okstate.edu> minich@d.cs.okstate.edu (Robert Minich) writes:
>by rcpieter@svin02.info.win.tue.nl (Tiggr):
>| The ARM is a good example of a RISC with very few transistors (core is < .5k
>| transistors).  The reason is that its designers wanted it to be CHEAP.
>
>  Wow, 500 transistors for the core. Must be one of those RidiculISC
>extremists. :-)

Okay, the software engineers demand a 32 bit processor, and the bean
counters give a 500 transistor budget.  Overconstrained?  Let's give
it a try.

A bit serial approach is selected.  This results in meager instruction
throughput, but makes possible an extremely high clock rate.  Since most
of your market believes computer horsepower is measured in MHz, sales
will take off.  The sluggish performance will have them back as repeat
customers within a year.

The serial adder with carry flip-flop is just a few dozen transistors.
Data path multiplexers and logical instructions cost four transistors each.
Special locations in main memory will substitute for operand registers.
So far it looks like 500 will be easy.  Unfortunately we need a full size
memory address register to form those 32 bit addresses for the bloated
software.  A 32 bit instruction pointer probably requires a hardware
register too, but we can economize by storing it on dynamic nodes next
to the memory address register.  The control section has a 6 bit
counter to keep track of all those shifting microcycles.
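
The bit-serial idea above can be sketched in software. This is only a
behavioral model, assuming the word is processed least significant bit first
with one full adder and one carry flip-flop; the name serial_add and the
width parameter are hypothetical.

```python
def serial_add(a: int, b: int, width: int = 32) -> int:
    """Add two unsigned words one bit per microcycle, LSB first."""
    carry = 0    # models the single carry flip-flop
    result = 0   # models the result shifting into a register
    for i in range(width):           # one microcycle per bit
        x = (a >> i) & 1
        y = (b >> i) & 1
        s = x ^ y ^ carry                     # full-adder sum output
        carry = (x & y) | (carry & (x ^ y))   # full-adder carry output
        result |= s << i
    return result  # carry out of the top bit is dropped (wraps modulo 2**width)


if __name__ == "__main__":
    print(serial_add(123456789, 987654321))
```

Each 32-bit add costs 32 trips through the loop, which is the "meager
instruction throughput" being traded for a tiny datapath.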

Instructions.  Let's splurge and have four independent ones:  ADD, NOR,
STORE, and JMPNEG.  One addressing mode: direct.  The hardware will need
just 2 bits of instruction register, and the rest of the instruction
format can gate directly into the memory address register.  Oops, that
looks like only 30 bits of word addressability.  Here a byte, there a
shift, and in come the orders for the commercial instruction set option.

The customer's consultants must never see the raw hardware instruction
set.  Some of them would complain bitterly about having to teach the
computer to subtract, multiply, and divide (probably because they don't
remember how) and our machine could get an unfavorable reputation.  Base
level assembly language is verbose, convoluted and ugly.  Wrap two layers
of microcode around the machine to emulate a sophisticated instruction
set.
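
The raw four-instruction machine can be sketched as a small interpreter. The
post does not define the instruction semantics, so everything here is an
assumption: a single accumulator, the opcode in the top 2 bits, a 30-bit word
address in the rest, the opcode numbering, and arithmetic that wraps modulo
2**32. The subtract routine shows what "teaching the computer to subtract"
looks like with only ADD and NOR.

```python
MASK = 0xFFFFFFFF
SIGN = 0x80000000
ADD, NOR, STORE, JMPNEG = range(4)   # assumed opcode numbering


def encode(op: int, addr: int) -> int:
    """Pack a 2-bit opcode and a 30-bit word address into one word."""
    return (op << 30) | (addr & 0x3FFFFFFF)


def run(mem: list, pc: int, steps: int) -> int:
    """Execute a fixed number of instructions; return the accumulator."""
    acc = 0
    for _ in range(steps):
        word = mem[pc]
        op, addr = word >> 30, word & 0x3FFFFFFF
        pc += 1
        if op == ADD:
            acc = (acc + mem[addr]) & MASK
        elif op == NOR:
            acc = ~(acc | mem[addr]) & MASK
        elif op == STORE:
            mem[addr] = acc
        elif op == JMPNEG and acc & SIGN:
            pc = addr                 # branch only if accumulator is negative
    return acc


def subtract(a: int, b: int) -> int:
    """Compute a - b (mod 2**32) using only NOR, ADD, and STORE."""
    ONES, ONE, A, B, R = 16, 17, 18, 19, 20   # hypothetical data addresses
    mem = [0] * 32
    mem[ONES], mem[ONE], mem[A], mem[B] = MASK, 1, a & MASK, b & MASK
    program = [
        (NOR, ONES),   # acc = 0 (anything NOR all-ones is zero)
        (NOR, B),      # acc = ~b
        (ADD, ONE),    # acc = -b in two's complement
        (ADD, A),      # acc = a - b
        (STORE, R),
    ]
    for i, (op, addr) in enumerate(program):
        mem[i] = encode(op, addr)
    run(mem, 0, len(program))
    return mem[R]
```

Five instructions for one subtraction gives some idea of why the raw
instruction set is described as verbose.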

The first layer of microcode emulates arithmetic, stack, index registers,
and memory mapping.  Level 1 assembly language is concise and effective.
The second layer of microcode implements call gates, virtual machines,
capabilities, and something we can call object oriented.  Level 2 assembly
language is verbose, convoluted, and ugly, but the consultants would feel
like members of a high priesthood.

PUTTING IT ALL TOGETHER

  Function		Bits@Trans	Subtotal
   full-adder		 1@28		  28
   nor			 1@4		   4
   shift-register	 33@8		 264
   dynamic-latch	 33@2		  66
   multiplexers		 40@2		  80
   counter		 6@18		 108
   control-pla		 n/a		 100 est.

  Total Transistors			 650
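
The subtotals in the table can be checked against the 500-transistor budget
with a few lines of arithmetic. The figures are taken from the table; the
variable and function names are just for this sketch.

```python
BUDGET = 500

# (function, bits, transistors per bit), as listed in the table
PARTS = [
    ("full-adder", 1, 28),
    ("nor", 1, 4),
    ("shift-register", 33, 8),
    ("dynamic-latch", 33, 2),
    ("multiplexers", 40, 2),
    ("counter", 6, 18),
]
CONTROL_PLA_ESTIMATE = 100   # the table gives this as a flat estimate


def total_transistors() -> int:
    """Sum bits * transistors-per-bit, plus the control PLA estimate."""
    return sum(bits * per_bit for _, bits, per_bit in PARTS) + CONTROL_PLA_ESTIMATE


def overrun_fraction(total: int, budget: int = BUDGET) -> float:
    """How far over budget the design is, as a fraction of the budget."""
    return (total - budget) / budget


if __name__ == "__main__":
    total = total_transistors()
    print(total, f"{overrun_fraction(total):.0%} over budget")
```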

The design is running 30% over budget.  The bean counter wants to compare
die size against the competition to see if they have a cost advantage.

  i486 Die:

    +--------------------------------------------------------------------+
    |  0    0    0    0    0    0    0    0    0    0    0    0    0   0 |
    |0################################################################## |
    | ##################################################################0|
    |0################################################################## |
    | ##################################################################0|
    |0################################################################## |
    | ##################################################################0|
    |0################################################################## |
    | ##################################################################0|
    |0################################################################## |
    | ##################################################################0|
    |0################################################################## |
    | ##################################################################0|
    |0################################################################## |
    | ##################################################################0|
    |0################################################################## |
    | ##################################################################0|
    |0################################################################## |
    | ##################################################################0|
    |  0    0    0    0    0    0    0    0    0    0    0    0    0   0 |
    +--------------------------------------------------------------------+

  SPARC Die:

    +-----------------------------+
    | 0 0 0 0 0 0 0 0 0 0 0 0 0 0 |
    |0###########################0|
    |0###########################0|
    |0###########################0|
    |0###########################0|
    |0###########################0|
    |0###########################0|
    |0###########################0|
    |0###########################0|
    |0###########################0|
    |0###########################0|
    |0###########################0|
    |0###########################0|
    |0###########################0|
    |0###########################0|
    |0###########################0|
    | 0 0 0 0 0 0 0 0 0 0 0 0 0 0 |
    +-----------------------------+

  500T Die:

    +-------------+
    | 00000000000 |
    |0           0|
    |0           0|
    |0           0|
    |0     #     0|
    |0           0|
    |0           0|
    |0           0|
    | 00000000000 |
    +-------------+

Our yield is expected to be quite good.  Note that the die size is limited
by the bond pads for the address bus.  There is opportunity for a cost-
reduced variant with multiplexed address bus.

  500SX Die:

    +---+
    | 0 |
    |0#0|
    | 0 |
    +---+

The bean counter is ecstatic!  Yield is 99.9% and a tested die costs only .3
cents.  In quantity 10000 it sells for less than a bypass capacitor.  The
package looks like a transistor.  Rumor has it that some are marked 2N500SX
to get around the export ban to the Soviet Union.

Robert Heiss   rob@wilbur.coyote.trw.com

mash@mips.COM (John Mashey) (12/17/90)

In article <18164@neptune.inf.ethz.ch> brandis@inf.ethz.ch (Marc Brandis) writes:
>In article <1990Dec14.001129.988@Neon.Stanford.EDU> hoelzle@Neon.Stanford.EDU (Urs Hoelzle) writes:
>>The integer part of a typical RISC can be implemented with about 100K
>>transistors.  Current implementation technology allows >1M
>>transistors/chip (e.g. i486, 68040).  Why don't current RISC
>>implementations take advantage of the extra transistors and put a
>>large cache on the chip?  And/or an FPU?  For example (as far as I
>>know), typical SPARC chips neither have an on-chip FPU nor a large
>>cache; same for MIPS. 

>This is only true for what you find in current machines, but it is not true
>for chips. E.g. the new MIPS R3300 (or maybe R3400, I do not exactly remember
>the number) contains the FPU and caches on chip. There are SPARC 
>implementations that have cache on chip. There are RISC CPUs like the i860
>or the IBM RS/6000 that have around 1 million transistors on a chip (actually,
>the RS/6000 uses 9 chips with almost 8 million transistors as a whole).

>>One reason for a lower transistor count might be cost - but having
>>separate FPU or MMU chips doesn't exactly reduce system cost.  Or does
>>the better yield of the small chips outweigh the extra cost of having
>>several chips instead of just one?
>
>One or two years ago it may have been that the better yield would result in
>lower cost as a whole. But the picture is changing. I think in one or two
>years you will find a lot of RISC workstations containing CPUs with 1 million
>transistors.

It might be good to try stepping back up a level to the generic
rules, then look at the specifics again.

GENERIC:
1) For a system (such as a UNIX system), you need:
	1) Overall control
	2) Integer datapath
	3) FPU
	4) MMU
	5) Caches (at least 1 level)
	6) Cache control
	7) Glue&gunk&support to outside world

2) At any given time, within "reasonable" cost and "reasonable"
testability, and "reasonable" time-to-market, there are limits on:
	a) The maximum size of chip you can build
	b) The speed of the internal circuitry
	c) The speed of external interface
Note that c) can often be a worse constraint than b).
Note also, that shrinking a chip increases yield, and over time, lowers
cost, although at some point, the limit becomes the pad-ring, i.e.,
the pads have a minimal possible size, even though you could shrink
the insides of the chip more.


3) There are plenty of different ways to partition this:

           MIPS    SPARC       88K      IBM RS6000
                   (in SS2)
1. cntl    R3000   601 IU      88100    ICU
2. int     R3000   601 IU      88100    FXU
3. fpu     R3010   TI FPU      88100    FPU
4. MMU     R3000   SRAMs       88200s   ICU+FXU
5. cach    SRAMs   SRAMs       88200s   ICU + 2-4 DCUs
6. cc      R3000   extlogic    88200    ICU+FXU+DCU
7. glue    WBs     extlogic    -        SCU

4) In the round in which most MIPS/SPARC/88K chips on the market were
designed, you couldn't get everything on one chip, of course.
The Intel 486 & 860, and Motorola 68040 essentially integrate
all of these things.

5) Amongst the MIPS camp, there are chips with R3000+R3010 stuck
together (Performance Semiconductor), which can help performance & cost;
and embedded control chips
(IDT, LSIL) which integrate up to 8KB I-cache & 2KB D-cache with
an R3000 + buffering (i.e., 1,2,5,6,7, and sometimes 4, but not 3)
together.   Intel 960s come in various combinations, some of
which are similar to these.

6) In SPARCland, the Solbourne million-transistor chip integrates
everything (BTW, can somebody confirm/deny: 12 Specmarks at
25MHz, or at 33MHz?)  Embedded control versions are also
supposed to be coming.

7) As to why there aren't more million-transistor RISCs around,
when there were already 486s, there is a simple answer:
	Some people started earlier, some started later:

	a) Intel certainly knows how to build micros,
	and started on the design (according to my sources, somebody
	at Intel correct me if I'm wrong) in 1H86, getting chips
	out in 1989. In 1H86, there were no SPARC chips, and only
	the earliest MIPS R2000s, with early systems going out in 3Q86.
	b) Next-generation RISC chips from MIPS (R4000) and Sun (Viking)
	should come out in 1991, and are reputed to be pretty aggressive
	chips with a lot of stuff crammed in there.  I don't know
	when they started; we started (effectively) about 30-33 months
	after Intel.


8) The big technology jump is when you get everything on 1 CPU,
i.e., at least 1-6 (sometimes 7 is separate).

9) Finally, note that with a million transistors, you don't get
to have a "big" on-chip cache, i.e., you get 8-16KB, and it's a
horrible impediment to the faster machines.  You'd really like
to have 64KB, but that has to wait a round or so, depending on
how much space the rest of your design takes.
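
A rough count shows why a million transistors buys only 8-16KB of cache. This
sketch assumes a conventional 6-transistor SRAM cell per bit and counts data
bits only; tags, decoders, and sense amplifiers are ignored, so real arrays
cost more. The cell type and the function name are assumptions, not figures
from this post.

```python
def sram_data_transistors(kbytes: int, transistors_per_bit: int = 6) -> int:
    """Transistors in the data array alone for a cache of the given size."""
    bits = kbytes * 1024 * 8
    return bits * transistors_per_bit


if __name__ == "__main__":
    for kb in (8, 16, 64):
        print(f"{kb:2d}KB data array: {sram_data_transistors(kb):,} transistors")
```

Under these assumptions 16KB of data already takes most of a million
transistors, and 64KB needs roughly three million before any logic is added.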
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	 mash@mips.com OR {ames,decwrl,prls,pyramid}!mips!mash 
DDD:  	408-524-7015, 524-8253 or (main number) 408-720-1700
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

philip@beeblebrox.dle.dg.com (Philip Gladstone) (12/18/90)

>>>>> On 15 Dec 90 17:18:20 GMT, rob@dagmar.coyote.trw.com (Robert Heiss) said:

Robert> Okay, the software engineers demand a 32 bit processor, and the bean
Robert> counters give a 500 transistor budget.  Overconstrained?  Let's give
Robert> it a try.

[deleted]

Robert> PUTTING IT ALL TOGETHER

Robert>   Function		Bits@Trans	Subtotal
Robert>    full-adder		 1@28		  28
Robert>    nor			 1@4		   4
Robert>    shift-register	 33@8		 264
Robert>    dynamic-latch	 33@2		  66
Robert>    multiplexers		 40@2		  80
Robert>    counter		 6@18		 108
Robert>    control-pla		 n/a		 100 est.

Robert>   Total Transistors			 650

However, the performance of the above processor sucks. To solve this
problem, I propose placing 1024 of them (the SX variant with the
multiplexed bus) on a die. I estimate that no more than 100 extra
transistors would be required per processor to perform the interconnect. 

This leads to a fairly large die size consuming 800,000 transistors.

The additional complexity of having 1024 processors rather than one
would require an additional layer of micro (pico/nano/milli?) code.
However, this would keep the coders in business for some time yet.

The marketing advantages are also significant: you can multiply the
clock rate by the number of processors to get mips. This yields (at
least) 50 BIPS. Alternatively, you can have a low power version that
you only clock at 1 KHz and still get 1 MIP.
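
The arithmetic behind these figures can be sketched as follows. It assumes
the 650-transistor total from the quoted table as the per-processor count
(the post does not give a separate figure for the SX variant) plus the
estimated 100 interconnect transistors, and it uses the tongue-in-cheek
"clock rate times processor count" definition of mips. Function names are
hypothetical.

```python
def die_transistors(n_processors: int, core: int = 650,
                    interconnect: int = 100) -> int:
    """Total transistors for n processors plus per-processor interconnect."""
    return n_processors * (core + interconnect)


def marketing_mips(clock_hz: float, n_processors: int) -> float:
    """'mips' as clock rate times processor count, in millions per second."""
    return clock_hz * n_processors / 1e6


def clock_for_mips(target_mips: float, n_processors: int) -> float:
    """Clock rate in Hz needed to claim target_mips by the same rule."""
    return target_mips * 1e6 / n_processors


if __name__ == "__main__":
    print(die_transistors(1024))            # compare with the 800,000 quoted
    print(marketing_mips(1_000, 1024))      # the 1 KHz low-power version
    print(clock_for_mips(50_000, 1024))     # clock implied by 50 BIPS
```

With these inputs the die comes to 768,000 transistors, consistent with the
rounded 800,000 above, and 50 BIPS implies a clock of a little under 49 MHz.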

Philip
--
Philip Gladstone         Dev Lab Europe, Data General, Cambridge, UK

    Listen three eyes, don't you try and outweird me, I get
    stranger things than you free with my breakfast cereal.

jurquhart@acorn.co.uk (James S Urquhart) (12/18/90)

In article <1639@svin02.info.win.tue.nl> rcpieter@svin02.info.win.tue.nl (Tiggr) writes:

>hoelzle@Neon.Stanford.EDU (Urs Hoelzle) writes:
>
>>So: why do RISC implementations currently use fewer transistors??
>
>The ARM is a good example of a RISC with very few transistors (core is < .5k
>transistors).  The reason is that its designers wanted it to be CHEAP.
>
>Tiggr

 Although ARM2 (VL86C010) is fairly small, it is not as small as stated
above. In fact the ARM2 chip has approximately 25K transistors and ARM3
(VL86C020) has 312K transistors. In ARM3 90% of the transistors make up the
4K-byte Cache and its associated control logic.

Jamie Urquhart (Advanced RISC Machines)