[comp.arch] RISC as a "technology window"?

rcd@ico.ISC.COM (Dick Dunn) (03/21/89)

In article <37196@bbn.COM>, slackey@bbn.com (Stan Lackey) writes:
> RISC is indeed a technology window, driven largely by the amount of
> stuff you can fit in a chip...

OK, fair 'nuff.  As soon as we can put an unlimited amount of stuff on a
chip (and do it without increasing delays or other limitations), we'll be
beyond that technology window, I guess...

>...Look at what is being added now that you
> can fit more than a simple CPU core in a chip:
[7 examples which are not in the least at variance with the RISC approach]

> The trend in computer evolution is truly toward greater hardware
> complexity.  This has been demonstrated countless times...

Sure.  Just look how much more complex a CDC 6600 is than, say, a 7090.
(For the younger set:  It's several times less complex.)

No, the trends for faster machines have frequently involved producing
*simpler* designs because it wasn't possible to make a machine fast with
all the extra baggage.

> There is a true need for complexity...

not demonstrated.

>...How many times when reading this
> newsgroup do you see things like, "Yes but that chip doesn't have <my
> favorite feature>"...

Rarely, if ever...but even if I did, so what?  <my favorite feature> is a
lousy criterion for what to put in hardware.  Folks, we're not discussing
machines to indulge your whims; we're talking about what it takes to get a
job done.

> Companies must make money.  They will do this by making not tiny
> low-cost RISC micros, but the most complex thing they can fit in a
> chip...

Sure.  That's why Sun introduced the very complex SPARC as a successor
to the much simpler 680x0 machines...or why Intel came out with the more
complex 860 to up the ante over the RISCy 386, right???
-- 
Dick Dunn      UUCP: {ncar,nbires}!ico!rcd           (303)449-2870
   ...Never offend with style when you can offend with substance.

jrg@Apple.COM (John R. Galloway) (03/22/89)

In article <15702@clover.ICO.ISC.COM>, rcd@ico.ISC.COM (Dick Dunn) writes:
> In article <37196@bbn.COM>, slackey@bbn.com (Stan Lackey) writes:
> > RISC is indeed a technology window, driven largely by the amount of
> > stuff you can fit in a chip...
> 
> OK, fair 'nuff.  As soon as we can put an unlimited amount of stuff on a
> chip (and do it without increasing delays or other limitations), we'll be
> beyond that technology window, I guess...

Well, actually, only while the "extra" space is less than a full cpu; as soon
as it is, we will just get multiple cpus on a chip, and they may well still be
RISC oriented.  In fact, with the extra cost of packaging, I could imagine
that as soon as this point is approached, all the extras will be stripped off
to squeeze the extra cpu in.
apple!jrg	John R. Galloway, Jr.       contract programmer, San Jose, Ca

These are my views, NOT Apple's, I am a GUEST here, not an employee!!

mash@mips.COM (John Mashey) (03/22/89)

In article <27681@apple.Apple.COM> jrg@Apple.COM (John R. Galloway) writes:
>In article <15702@clover.ICO.ISC.COM>, rcd@ico.ISC.COM (Dick Dunn) writes:
>> In article <37196@bbn.COM>, slackey@bbn.com (Stan Lackey) writes:
>> > RISC is indeed a technology window, driven largely by the amount of
>> > stuff you can fit in a chip...
>> 
>> OK, fair 'nuff.  As soon as we can put an unlimited amount of stuff on a
>> chip (and do it without increasing delays or other limitations), we'll be
>> beyond that technology window, I guess...
....
>Well, actually, only while the "extra" space is less than a full cpu; as soon
>as it is, we will just get multiple cpus on a chip, and they may well still be
>RISC oriented.  In fact, with the extra cost of packaging, I could imagine
>that as soon as this point is approached, all the extras will be stripped off
>to squeeze the extra cpu in.

Can anyone tell us where to get some of this kind of silicon?
(the kind you can put unlimited stuff on :-)  We want some.  I'm sure
Intel, Moto, and Sun would like some also.

Seriously, I doubt that anyone has silicon to burn.  In particular,
the faster the chips get, the more it hurts you to go off chip.
Bigger on-chip caches [I, D, or TLB] keep you on-chip more, and are
therefore good.  With more hardware, you can make integer multiplies go
faster, make FP go faster, and maybe put in some multiple FP units,
a la CDC 6600s [and these things chew up area]. Note that Intel, with
a million transistors, said the space budget didn't leave room for
an IEEE divide..... (Compcon paper).

I suspect it will be some time before people replicate the CPUs on
a chip, just because there's nothing else to do with the silicon.
	a) It's hard to get enough bandwidth in and out of these chips,
		i.e., I/Os cost money.
	b) If you replicate CPUs on a chip, each CPU will want even more bandwidth.
	c) If you double the size of a giant-monster-chip, its yield
		might get a lot worse...
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

lamaster@ames.arc.nasa.gov (Hugh LaMaster) (03/23/89)

In article <15695@winchester.mips.COM> mash@mips.COM (John Mashey) writes:
>faster, make FP go faster, and maybe put in some multiple FP units,
>a la CDC 6600s [and these things chew up area]. Note that Intel, with
>a million transistors, said the space budget didn't leave room for
>an IEEE divide..... (Compcon paper).

Things are getting to a very interesting stage, however.  I guess it is just
my big machine history showing again, but I keep wondering why, with a decent
sized register file, it wouldn't make more sense to put all FP/ALU hardware on
the same chip as the control unit, along with the instruction cache, and
leave the MMU and the data cache (which is almost always going to be larger
than what you can put on a chip, no matter how large the chips get) to an
external implementation.  This also leaves more flexibility in cache
design/choice, which is reasonable since data cache design is very dependent
on what the chip is going to be used for anyway.  I would expect to see a
second MMU/cache chip available for people who want to use it.  It would also
make some graphics/vector designs easier to deal with (at least potentially).

Last year, the answer always was: "High speed arithmetic (integer and/or FP)
is a specialty area."  I would think that the success of this year's crop of
high-arithmetic-performance systems would have dispelled that notion by now.

So, my question is:  If you ASSUME that you have to have high speed arithmetic,
what is the best way to partition functions between chips?  I believe that the
best way is Control, ALU/FPU, and instruction cache on one chip, and data
cache/MMU on another chip.  Why doesn't the market agree with me?

  Hugh LaMaster, m/s 233-9,  UUCP ames!lamaster
  NASA Ames Research Center  ARPA lamaster@ames.arc.nasa.gov
  Moffett Field, CA 94035     
  Phone:  (415)694-6117       

lamaster@ames.arc.nasa.gov (Hugh LaMaster) (03/23/89)

In article <22974@ames.arc.nasa.gov> lamaster@ames.arc.nasa.gov (Hugh LaMaster) writes:

>
> (About some of his favorite topics.)
>

Of course the Motorola 88K
does essentially what I asked, although the cache setup is slightly different.
They don't have AS MUCH FPU hardware as was being talked about in what I was
responding to, but conceptually they are already doing it, so my point was
muddled.  The Clipper does it also.  My point was to explore which designs
minimize the off-chip bandwidth required, and why, in the context of 1M+
transistor chips, assuming that high performance arithmetic is a given.

  Hugh LaMaster, m/s 233-9,  UUCP ames!lamaster
  NASA Ames Research Center  ARPA lamaster@ames.arc.nasa.gov
  Moffett Field, CA 94035     
  Phone:  (415)694-6117       

davidsen@steinmetz.ge.com (Wm. E. Davidsen Jr) (03/24/89)

  It's getting harder to tell RISC from CISC in some cases. If a
computer has one instruction to do something like:
	*(++a) = (*b)++
I would feel that it is CISC. If it executes that instruction in one
cycle and doesn't use microcode, I would find it hard to argue that it
is not RISC. Processors like the N10 are approaching 1 op/cycle, and the
i80486 is rumored to average < 2 cycles for non-F.P. operations.
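
To make that concrete, here is roughly how a load/store machine has to
break such a statement down -- a sketch in C, one simple RISC-style
operation per line (an illustration, not any particular machine's code):

	/* The single CISC instruction *(++a) = (*b)++ decomposed
	   into the simple operations a load/store RISC would use. */
	void cisc_as_risc(int **ap, int *b)
	{
	    int *a = *ap;    /* load:  pointer a                     */
	    int t;
	    a = a + 1;       /* add:   pre-increment the pointer     */
	    t = *b;          /* load:  fetch the old value of *b     */
	    *b = t + 1;      /* store: write back the incremented *b */
	    *a = t;          /* store: *(++a) gets the old value     */
	    *ap = a;         /* store: updated pointer back to a     */
	}

Half a dozen simple operations versus one complex instruction; the
argument is over which encoding the hardware can actually run faster.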

  I believe that people are taking RISC as a personal issue in some
cases, rather than as a method of getting real work (i.e. that done by the
people who pay for the computer, or on their behalf) done in less time and/or
for less money. If the processor becomes so fast that it requires memory
bandwidth which is unachievable or unaffordable, then the true speed of
the processor is reduced by the wait states introduced. I think that in
the next five years we will see processors which outrun off-chip memory,
and certainly now there are a lot of processors running at less than
full speed due to the cost of fast memory.

  There will continue to be a demand for processors with a very high
instruction rate (call them RISC if you will), and also for processors
which will perform a given task faster with limited memory bandwidth.
Vendors will continue to pick the complexity which they feel provides
the greatest cost effectiveness for the entire product based on the CPU.
Cycles per operation will come down for all vendors, because they have
the techniques to use that approach.

  I also think that vendors who are now regarded as RISC vendors will
add complexity to their instruction sets, provided (1) it doesn't slow
the chip on other operations, (2) it doesn't take real estate which
could be used for things which would improve overall performance more,
and (3) the benefit in code size and performance (due to fewer
instructions) is readily measurable.
________________________________________________________________

The ultimate RISC machine: a one bit opcode; 0 = conditional branch, 1 =
nop to fill the delay slots ;-)


-- 
	bill davidsen		(wedu@crd.GE.COM)
  {uunet | philabs}!steinmetz!crdos1!davidsen
"Stupidity, like virtue, is its own reward" -me

w-colinp@microsoft.UUCP (Colin Plumb) (03/24/89)

lamaster@ames.arc.nasa.gov (Hugh LaMaster) wrote:
> So, my question is:  If you ASSUME that you have to have high speed
> arithmetic, what is the best way to partition functions between chips?
> I believe that the best way is Control, ALU/FPU, and instruction cache
> on one chip, and data cache/MMU on another chip.  Why doesn't the market
> agree with me?

Well, given that latency to memory is a serious problem these days, and
that MMU address translation is often on the critical path, moving
it off-chip doesn't sound like such a good idea.

I think the MIPS approach is the best: MMU and cache *control* on chip;
the actual data (which can be a trifle slower) can be put in external
SRAM.  SRAMs have a large market, so even ultra-fast ones are comparatively
cheap.  Associative memory is much more expensive.

I've said it before: I'm *astounded* nobody else has used this idea.
It's such a great Win.  Cache control is the custom bit, so do it
in custom logic.  With all the rest of the custom logic: on the
microprocessor.  Cache RAM is very generic.  So don't re-invent the
wheel.
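
For concreteness, the division of labor looks something like the toy
direct-mapped lookup below (a C sketch; all sizes invented).  The two
arrays stand in for generic off-chip SRAMs; the line select and tag
compare are the custom part that belongs on the microprocessor:

	#define LINES 4096                     /* 16KB cache, 4-byte lines */

	static unsigned long tag_ram[LINES];   /* stands in for off-chip SRAM */
	static unsigned long data_ram[LINES];  /* ditto                       */
	static char valid[LINES];

	/* The "custom bit": line select, tag compare, hit/miss decision. */
	int cache_read(unsigned long addr, unsigned long *word)
	{
	    unsigned long index = (addr >> 2) & (LINES - 1);
	    unsigned long tag   = addr >> 14;  /* bits above index+offset */

	    if (valid[index] && tag_ram[index] == tag) {
	        *word = data_ram[index];       /* one generic SRAM read   */
	        return 1;                      /* hit                     */
	    }
	    return 0;                          /* miss: go to main memory */
	}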

Has anyone out there (other than MIPS, of course) considered this scheme
and then rejected it?  Is my enthusiasm blind to some Great Problem?
-- 
	-Colin (uunet!microsoft!w-colinp)

"Don't listen to me.  I never do." - The Doctor

marc@oahu.cs.ucla.edu (Marc Tremblay) (03/25/89)

In article <51@microsoft.UUCP> w-colinp@microsoft.uucp (Colin Plumb) writes:
>lamaster@ames.arc.nasa.gov (Hugh LaMaster) wrote:
>> So, my question is:  If you ASSUME that you have to have high speed
>> arithmetic, what is the best way to partition functions between chips?
>> I believe that the best way is Control, ALU/FPU, and instruction cache
>> on one chip, and data cache/MMU on another chip.  Why doesn't the market
>> agree with me?

I also believe that putting the Integer unit and the FPU on the same 
chip makes sense. These two units have to communicate quickly, possibly 
sharing registers, and the FPU depends on the core section for its 
flow of instructions. I think that the trend is toward putting them
on the same chip anyway. Floating-point coprocessors were very
detached from the processor when they first came out (although
surprisingly enough the 8087 was a little closer), especially when 
you think that just setting up FPU instructions could take around
10 cycles!  The MIPS approach, i.e. making the coprocessor (R3010)
closely coupled, is a huge improvement, especially regarding the
instruction-issuing overhead.
The new trend? Since the FPU needs the core unit, put it on-chip
(both the Motorola 88000 and the Intel i860 have the FPU on-chip).
Since you *currently* have to go off-chip to access reasonably large
caches, you might as well put the MMU with the caches.
The idea in Hugh LaMaster's comment above may introduce problems for
accessing the instruction cache, though, especially if it is a physical
(i.e., physically addressed) cache.

>Well, given that latency to memory is a serious problem these days, and
>that MMU address translation is often on the critical path, moving
>it off-chip doesn't sound like such a good idea.

My reasoning is:
	access to reasonable cache -> need to go off-chip
	MMU is used to access cache -> need to go off-chip
	since you need to go off-chip anyway -> put MMU off-chip

	floating-point computations -> can be done internally
	FPU *needs* the integer unit -> put it close to the processor
	close to the processor -> at least closely coupled, better on-chip.

>I've said it before: I'm *astounded* nobody else has used this idea.
>It's such a great Win.  Cache control is the custom bit, so do it
>in custom logic.  With all the rest of the custom logic: on the
>microprocessor.  Cache RAM is very generic.  So don't re-invent the
>wheel.

FPU is also quite custom! :-)  --> put it on the same chip!

>Has anyone out there (other than MIPS, of course) considered this scheme
>and then rejected it?  Is my enthusiasm blind to some Great Problem?

I think that one of the reasons why some companies have rejected it
is that the size of a chip with integer + FPU is HUGE.  The R3010, a great
FPU coprocessor, with all its custom logic and its 75,000 transistors,
is quite large (about 8.4 x 8.8 mm), especially when you compare it to an MMU.
It is easier (in terms of area) to put an MMU on-chip than an FPU on-chip,
at least for a good FPU!
					Marc Tremblay
					marc@CS.UCLA.EDU
					Computer Science Department, UCLA

alan@rnms1.uucp (0000-Alan Lovejoy(0000)) (03/25/89)

In article <13404@steinmetz.ge.com> davidsen@crdos1.UUCP (bill davidsen) writes:
> If [a processor] executes [high semantic content] instruction[s] in one
>cycle and doesn't use microcode, I would find it hard to argue that it
>is not RISC. Processors like the N10 are approaching 1 op/cycle, and the
>i80486 is rumored to average < 2 cycles for non-F.P. operations.

The issue here seems to be whether RISC is defined by the physical
characteristics of a CPU implementation (microcode, cycles per instruction,
caching, pipelining, number of instructions, instruction lengths...) or by 
the logical characteristics of an architecture (number of registers, addressing
modes, instruction semantics...).  The general consensus seems to be that
the primary determinant is the physical implementation, but that the logical
architecture heavily influences to what extent RISCy implementation techniques
can be used. Except that RISC is not just a set of implementation techniques,
but a design methodology and philosophy:  objectively determine what the
cost/benefit ratio of each proposed feature or mechanism is, and use this
ratio as the priority for deciding what to put in the architecture or in
the implementation.

> If the processor becomes so fast that it requires memory
>bandwidth which is unachievable or unaffordable[,] then the true speed of
>the processor is reduced by the wait states introduced. 

>  There will continue to be a demand for processors with a very high
>instruction rate (call them RISC if you will), and also for processors
>which will perform a given task faster with limited memory bandwidth.
>Vendors will continue to pick the complexity which they feel provides
>the greatest cost effectiveness for the entire product based on the CPU.
>Cycles per operation will come down for all vendors, because they have
>the techniques to use that approach.

Single-cycle instructions are not just a function of how much hardware,
or how many parallel functional units, you can put on a chip. Instructions
whose semantics require off-chip data accesses cannot be completed until
the off-chip data is fetched, no matter how many transistors you put on the
chip.  What should the CPU be doing while it waits for the off-chip data?
With a Harvard architecture, data and instruction fetching are independent,
so the fact that you fetched one instruction to do the work of three doesn't
help your off-chip data-access bandwidth at all.  The biggest performance 
bottleneck is in data fetching, not instruction fetching.

Instruction-fetching bottlenecks that do exist are much more easily addressed
by caches and pipelines than data-fetching bottlenecks are.
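
To see why the data traffic is irreducible, count the accesses in a
simple vector add (a C sketch): no instruction encoding changes the
three data accesses per element.

	/* Whether this compiles to one memory-to-memory instruction
	   or to load/load/add/store, each element still costs two
	   data reads and one data write.  Fancier instructions save
	   instruction fetches, not data fetches.                    */
	void vadd(double *a, double *b, double *c, int n)
	{
	    int i;
	    for (i = 0; i < n; i++)
	        a[i] = b[i] + c[i];    /* 2 loads + 1 store, always */
	}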

The other issue is the granularity of instruction semantics.  The finer
the granularity, the more opportunities there are for code optimization.
It also increases the generality of the instruction set, making it more
likely that all instructions will be used by (and useful to) more applications.

The greater the difference in semantic level (primitiveness) between
the machine instructions and high-level language statements, the easier it is
to have high-quality code generation for a wide variety of languages.



Alan Lovejoy; alan@pdn; 813-530-2211; AT&T Paradyne: 8550 Ulmerton, Largo, FL.
Disclaimer: I do not speak for AT&T Paradyne.  They do not speak for me. 
__American Investment Deficiency Syndrome => No resistance to foreign invasion.
Motto: If nanomachines will be able to reconstruct you, YOU AREN'T DEAD YET.

tada@athena.mit.edu (Michael Zehr) (03/25/89)

In article <15695@winchester.mips.COM> mash@mips.COM (John Mashey) writes:
>In article <27681@apple.Apple.COM> jrg@Apple.COM (John R. Galloway) writes:
>>In article <15702@clover.ICO.ISC.COM>, rcd@ico.ISC.COM (Dick Dunn) writes:
>>> In article <37196@bbn.COM>, slackey@bbn.com (Stan Lackey) writes:
>>> > RISC is indeed a technology window, driven largely by the amount of
>>> > stuff you can fit in a chip...
>>> OK, fair 'nuff.  As soon as we can put an unlimited amount of stuff on a
>>Well actually only while the "extra" space is less than a full cpu, as soon
>I suspect it will be some time before people replicate the CPUs on
>a chip, just because there's nothing else to do with the silicon.
>	a) It's hard to get enough bandwidth in and out of these chips,
>		i.e., I/Os cost money.

Professor Dally of MIT has been saying something along those lines
for a couple years.  Instead of having a whole bunch of memory chips
with one path to a fast CPU, and a cache to prevent slow accesses,
take each of those memory chips, halve the amount of memory on them, and
put a CPU on each.  You no longer have a memory bandwidth problem, because
the memory and CPU are on the same (small, easy-to-make) chip.  Instead
of putting a cache on the chip (you don't need one), put in a
communications circuit to transfer data to the other chips.

If you're interested in more information, he's working on something
called a J-machine, which will probably be in prototype stage sometime
this summer (I think?).

Which would you rather have -- one CPU that runs at 50 MIPS with 72
1-Mbit memory chips (8 megabytes, at 9 chips per byte for parity), or
72 10-MIPS processors with 4 megabytes of memory split among them?

-michael j zehr

rodman@mfci.UUCP (Paul Rodman) (03/25/89)

In article <22974@ames.arc.nasa.gov> lamaster@ames.arc.nasa.gov (Hugh LaMaster) writes:
>
>So, my question is:  If you ASSUME that you have to have high speed arithmetic,
>what is the best way to partition functions between chips?  I believe that the
>best way is Control, ALU/FPU, and instruction cache on one chip, and data
>cache/MMU on another chip.  Why doesn't the market agree with me?
>

Personally, I think the optimal partitioning for large f.p. problems would
be to split the f.p. unit and registers onto another chip.  The amount of
communication required between the integer domain and the floating domain is
very small, and extra cycles to go from one to the other aren't a problem
(speaking from our experience partitioning the cpu in just this way).

I haven't thought about how to solve the problems in splitting integer data
caches and floating data caches, but I'm sure there would be an acceptable
solution.  Assuming your compiler guys are up to it, :-)

The main advantages here are:

      - You can get more pins for the f.p. chip for more loads/stores per
        clock on the f-unit. Also you can get more than 16 d.p. registers
        (which isn't enough, in our experience, for two piped fu's).

      - The i-chip, which makes no use of the f-unit hardware, has more area
        for integer goodies, including a larger on-chip data cache for
        integer data. I would rather have the MMU on this chip to make sure
        that the memory pipeline for explicit loads is one cycle shorter,
        i.e. save a chip crossing here.

Now the guys that don't use floating point can just buy the i-chip, those
that want screaming f.p. perf buy both. 

I just don't see the point in doing hairy-chested cramming of f.p. hardware
on the same chip as the integer stuff, when the two functional units
are so nicely separable, to the benefit of each.

    Paul K. Rodman 
    rodman@mfci.uucp
    __... ...__    _.. .   _._ ._ .____ __.. ._
    

wendyt@pyrps5 (Wendy Thrash) (03/25/89)

In article <717@m3.mfci.UUCP> rodman@mfci.UUCP (Paul Rodman) writes:
>Now the guys that don't use floating point can just buy the i-chip, those
>that want screaming f.p. perf buy both. 

There's one large hidden cost here that people never seem to acknowledge:
If you sell even one system without floating-point hardware, some poor
programmer (or group) will spend the next twenty years (your product should
last so long) supporting floating-point operations on systems without the
floating-point hardware.

Hardware costs don't factor in the cost of phone calls from customers
complaining about slow (trapped and simulated) software floating point or
slow (-fswitch) hardware floating point or outmoded (someone wrote the
microcode years ago, and nobody understands it well enough to fix it now)
firmware floating point, or the costs of maintaining separate versions of
libraries, additional code in compilers, etc. (for compiler-generated
software floating point).

It's like the guy says on the commercial: You can pay me now (for extra
hardware) or pay me later (for extra support).  If you sell enough systems
at the lower price to cover the hidden costs, then you've made a good
decision, but do remember that the costs are merely hidden, not nonexistent.

carlton@betelgeuse (Mike Carlton) (03/25/89)

In article <51@microsoft.UUCP> w-colinp@microsoft.uucp (Colin Plumb) writes:
...
>I think the MIPS approach is the best: MMU and cache *control* on chip;
>the actual data (which can be a trifle slower) can be put in external
>SRAM.  SRAMs have a large market, so even ultra-fast ones are comparatively
>cheap.  Associative memory is much more expensive.
>
>I've said it before: I'm *astounded* nobody else has used this idea.
>It's such a great Win.  Cache control is the custom bit, so do it
>in custom logic.  With all the rest of the custom logic: on the
>microprocessor.  Cache RAM is very generic.  So don't re-invent the
>wheel.
>
>Has anyone out there (other than MIPS, of course) considered this scheme
>and then rejected it?  Is my enthusiasm blind to some Great Problem?
>-- 
>	-Colin (uunet!microsoft!w-colinp)
>
>"Don't listen to me.  I never do." - The Doctor

I agree that the MIPS scheme is nice, but it does have its drawbacks.  In
particular, they've fixed the cache control.  If they got it right (for your
application) then no problem.  Otherwise you're out of luck.  It would be
possible to make some of the details configurable, but I believe that MIPS 
doesn't allow this.

If I remember right (somebody borrowed my MIPS book so I can't verify), the 
MIPS cache control requires a write-through cache.  Personally, I don't want 
a write-through cache.  Another aspect is the write latency (i.e. how many 
cycles does your cache take to handle a write-hit?).  I think the MIPS 
controller assumes a single cycle.  This implies you've got to build a cache
to handle this, and this will get trickier when you can buy a 40 or 50MHz MIPS.

--
Mike Carlton, UC Berkeley Computer Science
Home: carlton@ji.berkeley.edu or ...!ucbvax!ji!carlton

henry@utzoo.uucp (Henry Spencer) (03/25/89)

In article <63866@pyramid.pyramid.com> wendyt@pyrps5.pyramid.com (Wendy Thrash) writes:
>If you sell even one system without floating-point hardware, some poor
>programmer (or group) will spend the next twenty years (your product should
>last so long) supporting floating-point operations on systems without the
>floating-point hardware...

The Sun 2/3 line is something of a worst case of this, with something like
four different floating-point-hardware configurations.  This is obviously
an enormous headache for Sun and third-party software suppliers.  Some
of the third-party suppliers simply support the 68881 and nothing else,
so their stuff won't run any faster with an FPA and won't run at all on
a 3/50 without 68881.  Sun hasn't got that escape.  It shows, too:  the
specs for the SPARC (at least, the ones I saw, some time ago) say that
there is *one* floating-point architecture, which the kernel must fake
if the hardware isn't there.  (Sun has sort of blown it on the hardware
end for the Sun 4, I gather, but architecturally it makes sense.)
-- 
Welcome to Mars!  Your         |     Henry Spencer at U of Toronto Zoology
passport and visa, comrade?    | uunet!attcan!utzoo!henry henry@zoo.toronto.edu

cik@l.cc.purdue.edu (Herman Rubin) (03/25/89)

In article <717@m3.mfci.UUCP>, rodman@mfci.UUCP (Paul Rodman) writes:
> In article <22974@ames.arc.nasa.gov> lamaster@ames.arc.nasa.gov (Hugh LaMaster) writes:
> >
> >So, my question is:  If you ASSUME that you have to have high speed arithmetic,
> >what is the best way to partition functions between chips?  I believe that the
> >best way is Control, ALU/FPU, and instruction cache on one chip, and data
> >cache/MMU on another chip.  Why doesn't the market agree with me?
> >
> 
> Personally, I think the optimal partitioning for large f.p. problems would
> be to split the f.p. unit and registers onto another chip.  The amount of
> communication required between the integer domain and the floating domain is
> very small, and extra cycles to go from one to the other aren't a problem
> (speaking from our experience partitioning the cpu in just this way).
> 
> I haven't
> thought about how to solve the problems in splitting integer data caches
> and floating data caches, but I'm sure there would be an acceptable solution.
> Assuming your compiler guys are up to it, :-)
> 
> The main advantages here are:
> 
>       - You can get more pins for the f.p. chip for more loads/stores per
>         clock on the f-unit. Also you can get more than 16 d.p. registers
>         (which isn't enough, in our experience, for two piped fu's).
> 
>       - The i-chip, which makes no use of the f-unit hardware, has more area
>         for integer goodies, including a larger on-chip data cache for
>         integer data. I would rather have the MMU on this chip to make sure
>         that the memory pipeline for explicit loads is one cycle shorter,
>         i.e. save a chip crossing here.
> 
> Now the guys that don't use floating point can just buy the i-chip, those
> that want screaming f.p. perf buy both.
> 
> I just don't see the point in doing hairy-chested cramming of f.p. hardware
> on the same chip as the integer stuff, when the two functional units
> are so nicely separable, to the benefit of each.

I can see the point of having separate address arithmetic and low-precision
multiplication for address purposes.  But restricting the term "integer
arithmetic" to that is destructive of computing power.

I am not arguing one way or the other on partitioning functions among chips.
I suspect it is a good idea, but this is not the point.  A floating point
operation consists of separating the exponents from the mantissas, differencing
the exponents and shifting for addition and subtraction, performing the
fixed point operation, and performing the necessary shifting and exponent
calculation.  The cost is greatest for multiplication and division, where
the similarities between fixed and floating point are greatest.  Indeed,
many architectures with a floating point accelerator do integer multiplication
in that unit.

But suppose you want high precision arithmetic, integer, fixed point, or
floating point?  You now want a good integer arithmetic machine; if floating
point arithmetic must be used, integer arithmetic must be emulated in it,
which is quite clumsy.  The computational equipment for high precision
multiplication and division is largely the same for integer, fixed point,
and floating point.  For high-precision addition and subtraction, the overlap
is still great.  An architecture, language, or programmer not capable of
taking advantage of this must be considered limited.
-- 
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907
Phone: (317)494-6054
hrubin@l.cc.purdue.edu (Internet, bitnet, UUCP)

bcase@cup.portal.com (Brian bcase Case) (03/26/89)

>The general consensus seems to be that
>the primary determinant is the physical implementation, but that the logical
>architecture heavily influences to what extent RISCy implementation techniques
>can be used.

Right, except that I don't agree that the primary determinant is the
physical implementation.  A RISC is an architecture that *permits* a *clean*,
high-performance implementation.  A CISC architecture might be able to use
some high-performance implementation tricks, but cleanliness is next to
RISCyness.  The cleanliness becomes very important when superscalar
implementations, and probably multiple-processor-per-chip implementations,
are designed.

>The other issue is the granularity of instruction semantics.  The finer
>the granularity, the more opportunities there are for code optimization.
>It also increases the generality of the instruction set, making it more
>likely that all instructions will be used by (and useful to) more applications.
>The greater the difference in semantic level (primitiveness) between
>the machine instructions and high-level language statements, the easier it is
>to have high-quality code generation for a wide variety of languages.

I've never thought to use such pro-active phrasing in explaining the
advantage to software of simplicity, but I like it.  This is an excellent
way of saying it.

mash@mips.COM (John Mashey) (03/26/89)

In article <63866@pyramid.pyramid.com> wendyt@pyrps5.pyramid.com (Wendy Thrash) writes:
>In article <717@m3.mfci.UUCP> rodman@mfci.UUCP (Paul Rodman) writes:
>There's one large hidden cost here that people never seem to acknowledge:
>If you sell even one system without floating-point hardware, some poor
>programmer (or group) will spend the next twenty years (your product should
>last so long) supporting floating-point operations on systems without the
>floating-point hardware.

>Hardware costs don't factor in the cost of phone calls from customers
>complaining about slow (trapped and simulated) software floating point or
>slow (-fswitch) hardware floating point or outmoded (someone wrote the
>microcode years ago, and nobody understands it well enough to fix it now)
>firmware floating point, or the costs of maintaining separate versions of
>libraries, additional code in compilers, etc. (for compiler-generated
>software floating point).

>It's like the guy says on the commercial: You can pay me now (for extra
>hardware) or pay me later (for extra support).  If you sell enough systems
>at the lower price to cover the hidden costs, then you've made a good
>decision, but do remember that the costs are merely hidden, not nonexistent.

These comments are well-taken, i.e., worry about the cost of the entire
system and its support.

Fortunately, this is much less of an issue than it used to be, simply
because VLSI FPUs now have good performance, at low cost.
It has never really been that much of an issue in the larger-systems arena,
in that the FP was a small part of the entire product.
It used to be a serious issue in the small-systems arena, especially when:
	a) The integer unit was cheap.
	b) The VLSI FPU was either nonexistent [like for the 68010], or
	"slow" (i.e., relative to the integer unit's performance).
	c) A fast FPU was a whole logic board.
In this case, the difference between b) and c) was a lot of $$$ as a fraction
of the entire product.

Both Sun and MIPS came to the same conclusion, albeit from different directions,
for their RISC products: generate exactly 1 form of FP code,
and if no FPU is present, emulate it in the kernel.  Given that FP chips
are reasonably inexpensive, most people just buy them, but if somebody
really wants to save money, where they're deploying a bunch of machines in
a commercial application that doesn't use FP, they can.

This leads to an interesting question: the 80387 and Weitek (1167? whatever it
is that you use to replace a 387 at higher speed) are not directly
binary-compatible?  Can anybody out there give a comprehensive tutorial
on:
	a) How 386 UNIX systems deal with this?
	b) What compilers support both flavors?
	c) What sorts of software packages handle both?
Are there two separate versions?  Or is there code that checks at run-time
for the presence of the FPU?  What is typically done when there is no FPU at
all?
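
(For what it's worth, the obvious run-time scheme would look something
like the C sketch below: probe once at startup, then dispatch through a
function pointer.  This is speculation, not a description of any actual
386 UNIX, and fpu_present() is a made-up probe the startup code would
have to supply.)

	#include <stdio.h>

	static int fpu_present(void) { return 0; }   /* hypothetical probe */

	static double fadd_hw(double x, double y)    /* would use the FPU  */
	{ return x + y; }
	static double fadd_sw(double x, double y)    /* software emulation */
	{ return x + y; }

	static double (*fadd)(double, double) = fadd_sw;

	int main(void)
	{
	    fadd = fpu_present() ? fadd_hw : fadd_sw;   /* choose once */
	    printf("%g\n", fadd(1.5, 2.25));
	    return 0;
	}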
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

mash@mips.COM (John Mashey) (03/26/89)

In article <11402@pasteur.Berkeley.EDU> carlton@betelgeuse (Mike Carlton) writes:

>I agree that the MIPS scheme is nice, but it does have its drawbacks.  In
>particular, they've fixed the cache control.  If they got it right (for your
>application) then no problem.  Otherwise you're out of luck.  It would be
>possible to make some of the details configurable, but I believe that MIPS 
>doesn't allow this.
Actually, R3000s allow a fair amount of flexibility.
1) I & D-caches can have different sizes.
2) The number of words refilled into the cache upon cache miss is settable
from 1 to 32 words.
3) You can use instruction-streaming, or not.
4) You can cause partial-word writes to invalidate the corresponding cache
word, or cause it to do a read-modify-write.

>If I remember right (somebody borrowed my MIPS book so I can't verify), the 
>MIPS cache control requires a write-through cache.  Personally, I don't want 
>a write-through cache.  Another aspect is the write latency (i.e. how many 
>cycles does your cache take to handle a write-hit?).  I think the MIPS 
>controller assumes a single cycle.  This implies you've got to build a cache
>to handle this, and this will get trickier when you can buy a 40 or 50MHz MIPS.

The first-level cache is a write-thru cache.  People often build 2nd-level
caches to be write-back for multi-processors.  It does expect single-cycle
caches;  it will get trickier.  Of course, the higher clock rates, sooner or
later, require everybody doing CMOS/BiCMOS micros to build integrated
"superchips" anyway if they want to be competitive.
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

henry@utzoo.uucp (Henry Spencer) (03/26/89)

In article <1188@l.cc.purdue.edu> cik@l.cc.purdue.edu (Herman Rubin) writes:
>... An architecture, language, or programmer not capable of
>taking advantage of this must be considered limited.

All architectures, languages, and programmers are limited.  The question
is whether the limitations interfere with solving the problems you care
about.  Manufacturers, of necessity, care much more about the 5th-95th
percentile requirements than about the outliers (unless the outliers are
likely to buy lots of hardware).
-- 
Welcome to Mars!  Your         |     Henry Spencer at U of Toronto Zoology
passport and visa, comrade?    | uunet!attcan!utzoo!henry henry@zoo.toronto.edu

rodman@mfci.UUCP (Paul Rodman) (03/26/89)

In article <22202@shemp.CS.UCLA.EDU> marc@cs.ucla.edu (Marc Tremblay) writes:

>I also believe that putting the Integer unit and the FPU on the same 
>chip makes sense. These two units have to communicate quickly....
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
I don't understand why.

In most f.p. codes the integer unit generates 
addresses and program counter values, neither of which are needed
by the f.p. unit.  What the f.p. unit needs, on the other hand, is
a decent bandwidth to cache and/or memory. 


    Paul K. Rodman 
    rodman@mfci.uucp
    __... ...__    _.. .   _._ ._ .____ __.. ._
    

rodman@mfci.UUCP (Paul Rodman) (03/26/89)

In article <726@m3.mfci.UUCP> rodman@mfci.UUCP (Paul Rodman) writes:
>In article <22202@shemp.CS.UCLA.EDU> marc@cs.ucla.edu (Marc Tremblay) writes:
>
>>I also believe that putting the Integer unit and the FPU on the same 
>>chip makes sense. These two units have to communicate quickly....
>                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>I don't understand why.
>
>In most f.p. codes the integer unit generates 
>addresses and program counter values, neither of which are needed
>by the f.p. unit.  

And before millions of people trash on me, let me clarify by saying that
obviously an f.p. unit needs a source of control. Marc seems to value
a low latency path between the control part of the machine and the f.p. unit.

Given that the original request was for a partitioning that would lead to
high performance, I think that some of this latency is best traded off for
better bandwidth. For example, one might pipeline the icache address to
the f.p. unit, causing instruction issue to be delayed by a beat for 
f.p. ops. This increases the f.p. latency with respect to the pc, but
does not increase it with respect to other flops or memory loads/stores.

pkr

alan@rnms1.paradyne.com (0000-Alan Lovejoy(0000)) (03/27/89)

In article <16231@cup.portal.com> bcase@cup.portal.com (Brian bcase Case) writes:
>>The general consensus seems to be that
>>the primary determinant is the physical implementation, but that the logical
>>architecture heavily influences to what extent RISCy implementation techniques
>>can be used.

>Right, except that I don't agree that the primary determinant is the
>physical implementation.  A RISC is an architecture that *permits* a *clean*,
>high-performance implementation.  A CISC architecture might be able to use
>some high-performance implementation tricks, but cleanliness is next to
>RISCyness.  The cleanliness becomes very important when superscalar
>implementations, and probably multiple-processor-per-chip implementations,
>are designed.

Hmmm...  Isn't there an important connection between the chip implementation
technology (e.g., silicon transistors, quantum circuits, photonics, vacuum
tubes...) and the problem of designing an efficient implementation of a
logical architecture?  Logical architectures that cannot be efficiently
implemented in NMOS might have no such problems in an optical computer.
If it is the logical architecture that is the primary determinant, then
whether something is a RISC depends on the current implementation technology.

Or am I missing something?


Alan Lovejoy; alan@pdn; 813-530-2211; AT&T Paradyne: 8550 Ulmerton, Largo, FL.
Disclaimer: I do not speak for AT&T Paradyne.  They do not speak for me. 
__American Investment Deficiency Syndrome => No resistance to foreign invasion.
Motto: If nanomachines will be able to reconstruct you, YOU AREN'T DEAD YET.

stevew@wyse.wyse.com (Steve Wilson xttemp dept303) (03/28/89)

In article <10078@bloom-beacon.MIT.EDU> tada@athena.mit.edu (Michael Zehr) writes:
>
>Which would you rather have -- one CPU that runs at 50 MIPS with 72
>1-Mbit memory chips (8 megabytes, at 9 chips per byte for parity), or
>72 10-MIPS processors with 4 megabytes of memory split among them?
>
>-michael j zehr

One CPU that runs at 50 MIPS with 72 1-Mbit memory chips.  I already
know how to program a single CPU ;-)



Steve Wilson

The above is my opinion, not those of my employer.

bcase@cup.portal.com (Brian bcase Case) (03/29/89)

>Hmmm...  Isn't there an important connection between the chip implementation
>technology (e.g., silicon transistors, quantum circuits, photonics, vacuum
>tubes...) and the problem of designing an efficient implementation of a
>logical architecture?  Logical architectures that cannot be efficiently
>implemented in NMOS might have no such problems in an optical computer.
>If it is the logical architecture that is the primary determinant, then
>whether something is a RISC depends on the current implementation technology.
>
>Or am I missing something?

Uh, I don't know, you lost me somewhere here.  All I want to say is that
RISC can be defined by a set of *architectural* features.  To be sure,
that set was constructed with the implementation implications in mind.
If a new set of implementation technologies, or techniques, comes along,
then we'll have to define a new, er, "post-RISC", or whatever, set of
*architectural* guidelines that leads to a set of architectures that is
consistent with the new implementation technology.  If optical technology
calls for instructions that have, oh, I don't know, 14 source operands,
then we need to change things!  However, note that technology differences
do not change the basic state-machine model of computation.  Until we do
change the basic model (to what?  I don't know!  I'm trying to think of
it actually...), instructions can't change a whole lot, at least not
qualitatively.

jesup@cbmvax.UUCP (Randell Jesup) (03/29/89)

In article <15695@winchester.mips.COM> mash@mips.COM (John Mashey) writes:
>>> In article <37196@bbn.COM>, slackey@bbn.com (Stan Lackey) writes:
>>> > RISC is indeed a technology window, driven largely by the amount of
>>> > stuff you can fit in a chip...
>....
>Seriously, I doubt that anyone has silicon to burn.  In particular,
>the faster the chips get, the more it hurts you to go off chip.
>Bigger on-chip caches [I, D, or TLB] keep you on-chip more, and are
>therefore good.  With more hardware, you can make integer multiplies go
>faster, make FP go faster, and maybe put in some multiple FP units,
>a la CDC 6600s [and these things chew up area]. Note that Intel, with
>a million transistors, said the space budget didn't leave room for
>an IEEE divide..... (Compcon paper).

	Quite true.  If CPUs continue to get faster (at the process level -
smaller design rules), the relative overhead for off-chip access will
increase.  I think this will cause one of two things to happen, or maybe
a compromise between them: 1) bigger caches, or more sophisticated caches;
2) more complex (relatively) instructions, either addressing modes or things
like multiple ALUs for address calculation or ..., in an attempt to reduce
the number of off-chip fetches.

	I think the i860 is a step down this path.

>	c) If you double the size of a giant-monster-chip, its yield
>		might get a lot worse...

	P(good 2-cpu chip) = P(good 1-cpu chip) ** 2

or something close to that.
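
Under the usual Poisson defect model that's exact: yield goes as
exp(-area * defect_density), so doubling the area squares the yield.
A back-of-the-envelope sketch in C (both numbers invented):

	#include <stdio.h>
	#include <math.h>

	int main(void)
	{
	    double D  = 1.0;                /* defects per cm^2 (assumed) */
	    double A  = 0.7;                /* 1-cpu die area   (assumed) */
	    double y1 = exp(-A * D);        /* 1-cpu yield: about 0.50    */
	    double y2 = exp(-2.0 * A * D);  /* 2-cpu yield: y1 squared    */

	    printf("1-cpu yield %.2f, 2-cpu yield %.2f\n", y1, y2);
	    return 0;
	}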

-- 
Randell Jesup, Commodore Engineering {uunet|rutgers|allegra}!cbmvax!jesup

jesup@cbmvax.UUCP (Randell Jesup) (03/29/89)

In article <22974@ames.arc.nasa.gov> lamaster@ames.arc.nasa.gov (Hugh LaMaster) writes:
>Things are getting to a very interesting stage, however.  I guess it is just
>my big machine history showing again, but I keep wondering why, with a decent
>sized register file, it wouldn't make more sense to put all FP/ALU hardware on
>the same chip as the control unit, along with the instruction cache, 
>and leave the MMU and the data cache (which is almost always going to be
>larger than what you can put on a chip, no matter how large the chips get)
>to an external implementation.

	The problem with this is that off-chip access is slower than on-chip,
due to signal load and pad static-protection capacitance.  If someone can do
away with some or most of the pad capacitance, then off-chip caches become
the way to go.  As it is, it's a balancing act, and things like branch-target
internal caches become very useful, with external caches to supply straight
instruction streams and data (or internal I & D caches, if you can fit
significant ones on-chip, or it's being designed for a small-chip-count
application).

-- 
Randell Jesup, Commodore Engineering {uunet|rutgers|allegra}!cbmvax!jesup

jesup@cbmvax.UUCP (Randell Jesup) (03/29/89)

In article <13404@steinmetz.ge.com> davidsen@crdos1.UUCP (bill davidsen) writes:
>  There will continue to be a demand for processors with a very high
>instruction rate (call them RISC if you will), and also for processors
>which will perform a given task faster with limited memory bandwidth.

	Quite true.  Not everyone designs workstations bought solely by
larger corporations.  There is _significant_ demand for CPUs that are
a) fairly cheap - say current '030/881 (16MHz) pricing, and b) do the
most _work_ given a specific memory bandwidth, determined by the speed of
jelly-bean memory parts - say 100-120ns currently, and without expensive
external caches.

	You could sell a lot of CPUs like that.

-- 
Randell Jesup, Commodore Engineering {uunet|rutgers|allegra}!cbmvax!jesup

jesup@cbmvax.UUCP (Randell Jesup) (03/29/89)

In article <10078@bloom-beacon.MIT.EDU> tada@athena.mit.edu (Michael Zehr) writes:
>Professor Dally of MIT has been saying something along those lines
>for a couple years.  Instead of having a whole bunch of memory chips
>with one path to a fast CPU, and a cache to prevent slow accesses,
>take each of those memory chips, halve the amount of memory on them, and
>put a CPU on each.  You no longer have a memory bandwidth problem, because
>the memory and CPU are on the same (small, easy-to-make) chip.  Instead
>of putting a cache on the chip (you don't need one), put in a
>communications circuit to transfer data to the other chips.

	Sounds a lot like a transputer... how is it different? (other than
being implemented in a more modern process)

-- 
Randell Jesup, Commodore Engineering {uunet|rutgers|allegra}!cbmvax!jesup

keith@mips.COM (Keith Garrett) (03/30/89)

In article <6416@cbmvax.UUCP> jesup@cbmvax.UUCP (Randell Jesup) writes:
>
>	The problem with this is that off-chip access is slower than on-chip,
>due to signal load and pad static-protection capacitance.

There are also serious speed-of-light considerations.  Chip-to-chip path
lengths are considerably longer than on-chip paths.  This effect will become
worse in the future as transistor speeds become faster and transistor spacings
become smaller.  Interconnect technologies that don't support terminated
transmission lines (can you say TTL??) have worse problems due to long bus
settling times.
-- 
Keith Garrett        "This is *MY* opinion, OBVIOUSLY"
UUCP: keith@mips.com  or  {ames,decwrl,prls}!mips!keith
USPS: Mips Computer Systems,930 Arques Ave,Sunnyvale,Ca. 94086

frazier@oahu.cs.ucla.edu (Greg Frazier) (03/30/89)

In article <6418@cbmvax.UUCP> jesup@cbmvax.UUCP (Randell Jesup) writes:
>In article <10078@bloom-beacon.MIT.EDU> tada@athena.mit.edu (Michael Zehr) writes:
>>Professor Dally of MIT has been saying something along those lines
>>for a couple years.  Instead of having a whole bunch of memory chips
>>with one path to a fast CPU, and a cache to prevent slow accesses,
>>take each of those memory chips, halve the amount of memory on them, and
>>put a CPU on each.  You no longer have a memory bandwidth problem, because
>	Sounds a lot like a transputer... how is it different? (other than
>being implemented in a more modern process)
>

There are several dramatic differences.  The Message Driven Processor
(MDP) directly implements `actors' (I think that's the term).  The
chip actually contains two processors - one to handle incoming
and outgoing msgs, the other to execute code.  Arriving msgs directly
point to code in memory to be executed.  The machine as a whole
supports a global address space, which makes this possible.  Also,
the memory can be used as content-addressable, so that arriving
msgs can refer to code symbolically.  The idea is that this directly
supports object-oriented programming, where each object resides on
a node.  Each node is expected to send and receive msgs on the order
of every 20 inst'ns.  Msg routing and forwarding are handled by
a Torus Routing Chip-style-router which also resides on the chip.
The most obvious problem with this approach is that only 16k memory
can accompany a 4 MIP processor (the general rule of thumb is
1M of memory/1 MIP of processor).  Dally, et al. claim that this
rule of thumb does not apply because of the global nature of the
memory: with 64k nodes they have an address space of 2^30
bytes.  Of course, they also have 256k MIPS of processor, but
they didn't mention that...

So, all in all, this is a VERY different beast from the Transputer.

Greg Frazier
****************************^^^^^^^^^^^^^^^^^^^^!!!!!!!!!!!!!!!!!!!
Greg Frazier	    o	Internet: frazier@CS.UCLA.EDU
CS dept., UCLA	   /\	UUCP: ...!{ucbvax,rutgers}!ucla-cs!frazier
	       ----^/----
		   /

bb@tetons.UUCP (Bob Blau) (03/31/89)

In article <16181@gumby.mips.COM>, keith@mips.COM (Keith Garrett) writes:
: In article <6416@cbmvax.UUCP> jesup@cbmvax.UUCP (Randell Jesup) writes:
: >
: >	The problem with this is that off-chip access is slower than on-chip,
: >due to signal load and pad static-protection capacitance.
: 
: there are also serious speed-of-light considerations. ...
: ...  this effect will become worse in
: the future as transistor speeds become faster ...

This may be true in the far future when everything is optically
interconnected (on and off chip).  Today, at very high speeds, on-chip
delays rise up to bite you too!
Cray can achieve very fast cycle times with low levels of integration
and clever packaging - putting everything on-chip is not a panacea.

Apply: Std. disclaimer

Remember - it is possible for a particle to go faster than the speed
of light, just not in a vacuum!

-- 
  Bob Blau       Amdahl Corporation    143 N. 2 E., Rexburg, Idaho 83440
  UUCP:{ames,decwrl,sun,uunet}!amdahl!tetons!bb           (208) 356-8915
  INTERNET: bb@tetons.idaho.amdahl.com

levy@nsc.nsc.com (Jonathan Levy) (04/01/89)

In article <6417@cbmvax.UUCP> jesup@cbmvax.UUCP (Randell Jesup) writes:
>In article <13404@steinmetz.ge.com> davidsen@crdos1.UUCP (bill davidsen) writes:
>>  There will continue to be a demand for processors with a very high
>>instruction rate (call them RISC if you will), and also for processors
>>which will perform a given task faster with limited memory bandwidth.
>
>	Quite true.  Not everyone designs workstations bought solely by
>larger corporations.  There is _significant_ demand for CPUs that are
>a) fairly cheap - say current '030/881 (16MHz) pricing, and b) do the
>most _work_ given a specific memory bandwidth, determined by the speed of
>jelly-bean memory parts - say 100-120ns currently, and without expensive
>external caches.
>
>	You could sell a lot of CPUs like that.

Funny you should mention that. National Semiconductor has just announced
a new processor for the Embedded market. This is the NS32GX32.
It has the NS32532 pipeline core and the same performance 
(18,335 d/s, 8 to 10 VAX MIPS average at 30 MHz).
The processor is *cheap* both in actual chip cost (for pricing you will need
to contact marketing...), and what is more important, in system cost.

System cost is minimized by providing the system a very simple
interface (non-multiplexed address and data buses, one of each),
a Harvard architecture which is *ON* chip (why burden the designer
with our problems?), two on-chip caches (1K data cache, 0.5K instruction
cache), code density which is second to none, dynamic bus sizing for
peripherals and boot ROMs, long address-to-ready timing, long
address-to-data timing, etc.
All this provides for a CPU which is very insensitive to wait states.
We have a system which was designed with a single bank of DRAM (no interleave)
which provides the CPU with 2 wait states on the first access, and one
wait state on burst accesses. The performance of this system is more than
85% of the maximum! One cannot do that with a generic RISC.
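
The arithmetic behind a claim like that is easy to sketch in C (the
rates below are round numbers for illustration, not our measurements):

	#include <stdio.h>

	int main(void)
	{
	    double refs_per_instr = 1.3;   /* fetch + data refs  (assumed) */
	    double miss_rate      = 0.10;  /* on-chip cache miss (assumed) */
	    double wait_states    = 2.0;   /* penalty per miss   (assumed) */
	    double base_cpi       = 2.0;   /* zero-wait-state CPI(assumed) */

	    double cpi = base_cpi + refs_per_instr * miss_rate * wait_states;
	    printf("fraction of maximum = %.2f\n", base_cpi / cpi);
	    return 0;
	}

With numbers like those, the CPU still delivers almost 90% of its
zero-wait-state performance; that is the insensitivity I mean.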

Jonathan

mac@uvacs.cs.Virginia.EDU (Alex Colvin) (04/06/89)

> In most f.p. codes the integer unit generates 
> addresses and program counter values, neither of which are needed
> by the f.p. unit.  What the f.p. unit needs, on the other hand, is
> a decent bandwidth to cache and/or memory. 

Ah, but the FP has to have addresses to use on its bandwidth to memory.  In
particular, subscripting and such address arithmetic come from the integer
unit.
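
Concretely, every FP load in a subscripted loop rides on integer work
like this (a C rendering of a FORTRAN-style column-major A(I,J)
reference):

	/* Everything here is integer-unit work except the final load. */
	double element(double *a, int lda, int i, int j)
	{
	    int offset = (j - 1) * lda + (i - 1);  /* multiply + adds */
	    return a[offset];                      /* the FP load     */
	}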

Wasn't there a simulation done many years ago of one of the high-end 360s
(the one with the multiple FP stations) which showed that the bottleneck
was the integer unit's generation of addresses?

Of course, the 360 required more address arithmetic than many newer
machines, and FORTRAN programs have to do all their address arithmetic as
subscripting, in contrast to newer languages.  I wonder if anyone has done
a study recently.

rodman@mfci.UUCP (Paul Rodman) (04/07/89)

In article <3070@uvacs.cs.Virginia.EDU> mac@uvacs.cs.Virginia.EDU (Alex Colvin) writes:
>> In most f.p. codes the integer unit generates 
>> addresses and program counter values, neither of which are needed
>> by the f.p. unit.  What the f.p. unit needs, on the other hand, is
>> a decent bandwidth to cache and/or memory. 
>
>Ah, but the FP has to have addresses to use on its bandwidth to memory.  In
>particular, subscripting and such address arithmetic come from the integer
>unit.
>

Oh really? What would I want to do with those addresses in the f.p. unit?
Convert them to floating point? The addresses go to memory on behalf
of the f.p. unit; they don't go *to* the f.p. unit (!?%$*).  No connection
between the two is required for this purpose!


>Wasn't there a simulation done many years ago of one of the high-end 360s
>(the one with the multiple FP stations) which showed that the bottleneck
>was the integer unit's generation of addresses?
>

Probably. And once you have that, you had better have a memory system that
will accept said addresses.

I know that on the Trace 7 we have the ability to do 4 integer ops for
every 2 flops.  2 of the integer ops can be load/stores with a base+scaled-
offset type of add.  The other two integer ops get used for loop exit
tests (you do a lot of them for an unrolled loop), whacking induction
variables, random index arithmetic, random integer arith., and anything else
you need to do without burping the memory load/store addresses being
generated on the other ialu.
It was obvious that this ialu bandwidth was required from very early
simulation results.
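
An unrolled daxpy inner loop shows where those integer ops go -- a C
sketch with the bookkeeping written out (the shape of the work, not
actual Trace code):

	/* Each pair of flops (multiply+add) drags along loads, stores,
	   induction-variable updates, and the loop-exit test.         */
	void daxpy(int n, double a, double *x, double *y)
	{
	    int i = 0;
	    while (i + 1 < n) {
	        y[i]     += a * x[i];      /* 2 loads, 1 store, 2 flops */
	        y[i + 1] += a * x[i + 1];  /* 2 loads, 1 store, 2 flops */
	        i += 2;                    /* induction variable        */
	    }                              /* plus the loop-exit test   */
	    if (i < n)
	        y[i] += a * x[i];          /* odd element               */
	}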

>Of course, the 360 required more address arithmetic than many newer
>machines, and FORTRAN programs have to do all their address arithmetic as
>subscripting, in contrast to newer languages.  I wonder if anyone has done
>a study recently.

I'm curious why you think that Fortran would require more complicated
addressing modes than newer languages.

Bye,



    Paul K. Rodman 
    rodman@mfci.uucp
    __... ...__    _.. .   _._ ._ .____ __.. ._