[comp.arch] 486 and 68040

mslater@cup.portal.com (Michael Z Slater) (04/14/89)

> query about comments on 68040 and 80486

It is curious that the net has been so quiet about these new processors.

Here's my perspective:  The 040 and the 486 are very similar in approach.
Both chips use on-chip caches of 8 Kbytes, include snooping for cache
coherency, have on-chip floating-point coprocessors, use pipelining, bypass
gates and other tricks to reduce the average clocks/instruction, and are
supposed to be fully compatible with their predecessors.  Both are claimed to
be 2.5 to 3 times as fast as the previous versions (030 or 386) at the same
clock rate.

As to which is faster, there's just not enough public information to call
this one.  Intel has not yet released any real performance data.  They
have quoted 37,000 Dhrystones and 6.1 MWhetstones at 25 MHz, and 15 to 20
VAX MIPS. These figures, however, are from simulations.  I'll remain
skeptical until we see some measured data.
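For readers juggling units: "VAX MIPS" are conventionally derived by dividing a Dhrystone score by the VAX-11/780's rate of about 1757 Dhrystones/sec. A quick sanity check on Intel's simulated figures, in Python (the 1757 divisor is the usual convention, not something Intel stated):

```python
# Convert a Dhrystone score to "VAX MIPS" using the conventional
# VAX-11/780 baseline of ~1757 Dhrystones/sec (assumed, not from Intel).
VAX_11_780_DHRYSTONES = 1757

def vax_mips(dhrystones_per_sec):
    return dhrystones_per_sec / VAX_11_780_DHRYSTONES

# Intel's simulated 486 figure at 25 MHz:
print(round(vax_mips(37_000), 1))  # slightly above the quoted 15-20 range
```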

As for the 040, Motorola has not formally introduced the part; they have
made only an "architectural announcement", and have withheld all details.
(They don't even acknowledge the cache sizes officially.)  Until the formal
intro this fall, any real evaluation will be impossible.

The differences between the two chips are in several categories:

- The inherent differences in the efficiency of the instruction sets, which are
  essentially the same as their predecessors'.  This is largely a religious
  argument, which is probably pointless to pursue.

- The degree to which clocks per instruction has been reduced.  Intel's 486
   provides single-clock loads, stores, and moves.  Assuming a cache hit,
   data can be used by the instruction immediately following the load, with
   no stall cycle at all.  It remains to be seen if the 040 will do this.

-  Cache architecture.  Intel uses an 8-Kbyte unified cache, which allows them
   to support self-modifying code, which is quite common in MS-DOS, Windows,
   and OS/2 software.  (No groans, please - despite the desirability or
   lack thereof of this programming technique, it only makes sense for Intel
   to support all the existing code.)  Motorola, on the other hand, uses
   separate 4K caches, which may be less efficient due to the fixed
   partitioning.  Intel avoids the bandwidth problem by using a 128-bit bus
   to read 16 instruction bytes at a time from the cache, and by giving
   priority to data accesses.

-  Multiprocessor support.  Both processors will provide snooping.  There are
   several issues about second-level cache support, etc., which we
   cannot compare until Moto releases full details.
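A quick back-of-the-envelope check on the cache-bandwidth point above, assuming one 16-byte fetch per clock at the quoted 25 MHz (the figures are illustrative, not Intel's):

```python
# Peak instruction-fetch bandwidth from the on-chip cache, assuming one
# 16-byte (128-bit) access per clock cycle on a 25-MHz part.
BYTES_PER_FETCH = 16          # 128-bit internal bus
CLOCK_HZ = 25_000_000         # 25 MHz

peak_bytes_per_sec = BYTES_PER_FETCH * CLOCK_HZ
print(peak_bytes_per_sec)     # 400,000,000 bytes/sec peak
```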

The two chips will not compete head-to-head in many instances.  Obviously,
PC clone vendors will use the 486, and Apple will use the 040.  Vendors of
030-based Unix workstations are likely to use the 040, and Sun has said that
they will use the 486.  (I don't think Sun has said one way or the other
about their plans for the 040.)  HP has committed to using the 040.

Motorola has an edge in the workstation market because there is far more
workstation software for the 68000 architecture than for any other. However,
ISVs are rapidly porting to RISC architectures and to the 386 architecture.
Furthermore, Intel has a very strong edge in being able to run DOS and OS/2
software very quickly, in a Unix window if desired.

Incidentally, it was striking how much Intel emphasized the 386 at the 486
announcement.  As part of the same event, they announced the 33-MHz 386 and
a (slightly) lower-power 386SX.  They drove home the point, over and over
again, that the 486 was a 386-architecture device, and that all software
written for the 386 will run on the 486.

Application programs written for the 486 will also run on the 386.  There is
one new user-mode instruction, which swaps the byte order of a 32-bit word,
but few programs are likely to use this.  Operating system kernels will have
to be modified for the 486, to support additional bits in the page tables for
cacheability control, plus to do things like set the control register bit
that enables the cache.  There are a couple of new instructions, like
compare-and-swap, to support multitasking and multiprocessor operation;
these instructions will also be used in 486 kernels.
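The byte-order-swapping instruction mentioned above (the 486's BSWAP, though the post doesn't name it) does the equivalent of this Python sketch:

```python
# Reverse the byte order of a 32-bit word, as the 486's new user-mode
# byte-swap instruction does in a single register operation.
def bswap32(x):
    return int.from_bytes(x.to_bytes(4, "little"), "big")

print(hex(bswap32(0x12345678)))  # 0x78563412
```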

The lesson of both of these processors is that CISC can catch up to RISC
performance, it just takes a while.  I think RISC will stay a step ahead,
but CISC is not topping out.  To drive this point home, Intel announced an
agreement with Prime Computer to develop an ECL implementation of the
486 architecture, which Intel will sell as a module.  (Think they said
10" x 3" x 3", and 120 MIPS, in 1992.)

In what seemed like a joke, but wasn't, Intel Pres Andy Grove said that in
the year 2000, they would be able to make a processor with tens of millions
of transistors (I forget exactly how many he said), and some thousands of
MIPS --- which will be fully compatible with the 386.  (The cynics among us
might also point out that such a processor will be architecturally compatible
with the 8008.)

Michael Slater, Microprocessor Report
550 California Ave., Suite 320, Palo Alto, CA 94306
415/494-2677   fax: 415/494-3718     mslater@cup.portal.com

mdr@reed.UUCP (Mike Rutenberg) (04/15/89)

Michael Slater writes:
>- The degree to which clocks per instruction has been reduced.  Intel's 486
>   provides single-clock loads, stores, and moves.  Assuming a cache hit,
>   data can be used by the instruction immediately following the load, with
>   no stall cycle at all.  It remains to be seen if the 040 will do this.

From my memory, the other things that stand out about the i486:
	* call and return now take significantly fewer clock cycles.
	* the on-chip fpu is much faster than the 80387.  It was unclear
	  if this was due simply to being on-chip or whether it involved
	  architecture changes to the fpu.

I believe the bus structure for the i860 and i486 is the same.

What support chips for the i486 and mc68040 were announced?

Mike
-- 
Mike Rutenberg      Reed College, Portland Oregon     (503)239-4434 (home)
BITNET: mdr@reed.bitnet      UUCP: uunet!tektronix!reed!mdr
Note: The preceding remarks are intended to represent no known organization

brooks@maddog.llnl.gov (Eugene Brooks) (04/16/89)

In article <12435@reed.UUCP> mdr@reed.UUCP (Mike Rutenberg) writes:
>From my memory, the other things that stand out about the i486:
>	* call and return now take significantly fewer clock cycles.
>	* the on-chip fpu is much faster than the 80387.  It was unclear
>	  if this was due simply to being on-chip or whether it involved
>	  architecture changes to the fpu.
How much faster?  Is it even close to one flop per clock?


brooks@maddog.llnl.gov, brooks@maddog.uucp

mslater@cup.portal.com (Michael Z Slater) (04/17/89)

Mike Rutenberg writes:
> I belive the bus structure for the i860 and i486 is the same.

> What support chips for the i486 and mc68040 were announced?

The i860 bus structure and the i486 bus structure are NOT the same.
The 860 bus is 64 bits wide, and the 486 is 32 bits.  The 860 provides
a "next-near" pin as a hint to page-mode memory controllers, and the
486 does not.  The 486 includes bus snooping, and the 860 does not.
There may be some similarity in the two buses, but clearly there are
many differences.  The forthcoming 960CA, on the other hand, may well
have a bus similar to the 486's.

As for support chips, Intel has announced (but provided no details on)
MCA and EISA chip sets for 486-based PCs, and that's about it.  Oh, there
is a new 32-bit-bus Ethernet chip, the 82596, which is available in a
386 bus flavor and a 486 bus flavor.  (The 486 does not offer a pipelined
bus mode, and is burst transfer oriented.)

Intel has also acknowledged that there will be a cache controller chip
for a second-level cache that will work with both the 860 and the 486, but
the part is far from being announced.  I expect it early next year, as a
guess.

Motorola has not announced any support chips for the 040.  For that matter,
they haven't really announced the 040.  Moto has never been strong on
support chips, though, and I'd be surprised to see much.

Michael Slater, Microprocessor Report       mslater@cup.portal.com
550 California Avenue, Suite 320, Palo Alto, CA 94306   415/494-2677

kds@blabla.intel.com (Ken Shoemaker) (04/18/89)

In article <12435@reed.UUCP> mdr@reed.UUCP (Mike Rutenberg) writes:
>Michael Slater writes:
>>- The degree to which clocks per instruction has been reduced.  Intel's 486
>>   provides single-clock loads, stores, and moves.  Assuming a cache hit,
>>   data can be used by the instruction immediately following the load, with
>>   no stall cycle at all.  It remains to be seen if the 040 will do this.

In addition, register to register "simple" arithmetic ops (i.e., everything
except multiply and divide) take one clock.  Pushes and pops take one clock.
Branch-not-taken takes one clock (if taken it is 3 clocks).  Immediate-to-
register operations take one clock, even though the immediate operand comes
from the instruction stream in memory.

A memory to register operation takes two clocks, but then again, this would
require two instructions plus a load delay (i.e., 3 clocks without code
reorganization) in most RISC processors.  A register to memory operation
takes three clocks, but this would take three instructions plus a load delay
in most RISC processors.  Of course, RISC processors would try to minimize
the number of memory operations required by keeping more results in on-chip
registers.
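Tabulating the clock counts above against the equivalent RISC sequences (the RISC numbers are the post's own, counting one instruction per clock plus a load delay):

```python
# Clocks for memory-operand operations, per the figures in the post.
# 486: memory-to-register op = 2 clocks; register-to-memory op = 3 clocks.
# Typical RISC: separate load/op/store at 1 clock each, plus a load delay.
i486 = {"mem_to_reg": 2, "reg_to_mem": 3}
risc = {"mem_to_reg": 2 + 1,   # load + op, plus 1-clock load delay
        "reg_to_mem": 3 + 1}   # load + op + store, plus load delay

for op in i486:
    print(op, i486[op], "vs", risc[op])
```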

Two advantages of complex instructions that we could exploit here are in the
implementation of push/pop and of immediate operands.  As
said before, push/pop take one clock, even though that operation requires a
memory operation and an increment/decrement of a register.  By knowing the
kind of operation required, we were able to dedicate special hardware to do
these things concurrently.  The same is true with immediate variables.  And
it doesn't add critical paths to the chip.  Because these are defined complex
operations, one can perform multiple operations per clock in special, but
well defined (and frequently occurring) cases.

>
>From my memory, the other things that stand out about the i486:
>	* call and return now take significantly fewer clock cycles.
>	* the on-chip fpu is much faster than the 80387.  It was unclear
>	  if this was due simply to being on-chip or whether it involved
>	  architecture changes to the fpu.
>
>I belive the bus structure for the i860 and i486 is the same.

The bus structures of the i860 and the i486 microprocessors are similar, but
differ in two significant ways.  The first is that the i860 supports a
64-bit external data bus while the i486 supports a 32-bit, 16-bit and 8-bit
external data bus.  The second is that the i860, since it was designed with
an exposed pipeline and large data sets in mind, supports two levels 
of external memory pipelining (which allows the chip to operate at high 
speeds with real memory chips in an arena with many cache misses).  The i486
microprocessor, on the other hand, was designed with the idea that cache
hits would be the norm, and thus doesn't support pipelining, but rather
supports a burst mode on the bus.

However, the types of the signals, i.e., names and functions, and the 
fundamental nature of the bus, i.e., synchronous, 1X clock, are the same.
And they both have a KEN# pin...

>
>What support chips for the i486 and mc68040 were announced?
>

The i486 was announced with an Ethernet controller chip and an external
second-level cache controller.  It shouldn't be too difficult to get it to
work with most any of the other peripheral chips out there.

And now for the legal bit, UNIX is a trademark of whoever it is these days,
and i486 microprocessor is a trademark of Intel.  And I really only speak as
myself, not as a representative of Intel.  No, really!  If you want an
official position, seek elsewhere.  Call Intel.  We're in the phone book.
And if you want the i486 data "book" (at 176 pages, it's difficult to call it
a data sheet), ask for order number 240440-001 "i486TM Microprocessor."
----------
I've decided to take George Bush's advice and watch his press conferences
	with the sound turned down...			-- Ian Shoales
Ken Shoemaker, Microprocessor Design, Intel Corp., Santa Clara, California
uucp: ...{hplabs|decwrl|pur-ee|hacgate|oliveb}!intelca!mipos3!kds

daveh@cbmvax.UUCP (Dave Haynie) (04/19/89)

in article <17131@cup.portal.com>, mslater@cup.portal.com (Michael Z Slater) says:

>> query about comments on 68040 and 80486

> Here's my perspective:  The 040 and the 486 are very similar in approach.
> Both chips use on-chip caches of 8 Kbytes, include snooping for cache
> coherency, have on-chip floating-point coprocessors, use pipelining, bypass
> gates and other tricks to reduce the average clocks/instruction, and are
> supposed to be fully compatible with their predecessors.  Both are claimed to
> be 2.5 to 3 times as fast as the previous versions (030 or 386) at the same
> clock rate.

While that's true, I was a bit more surprised about the '486 announcements
than that of the '040.  It certainly looked, on several major fronts, as if
the '486 was trying to catch up to the Motorola chips, in terms of system
issues (caching, bus sizing, etc) while at the same time coddling the PC
clone industry.  While the clones are obviously the bread and butter for
the '486, it really appears that Intel's designs that work well in an IBM
environment also cripple the chip.  One can't help but wonder if Intel
really wanted the chip crippled anyway.  As long as it's faster than a '386,
they can't help but sell millions, just based on software compatibility. And
they stay comfortably distant from any competition with their RISC efforts.

Motorola, on the other hand, seems to have taken everything they've learned
from RISC and applied that to the design of the '040.  If you don't believe
it, start at the MMU/Cache differences between the '030 and the '040, and
then take a look at the 88100/88200 system.

> As to which is faster, there's just not enough public information to call
> this one.  Intel has not yet released any real performance data.  They
> have quoted 37,000 Dhrystones and 6.1 MWhetstones at 25 MHz, and 15 to 20
> VAX MIPS. These figures, however, are from simulations.  I'll remain
> skeptical until we see some measured data.

That is the bottom line.  Based on what I know, and the fact that I'm
seriously Motorola biased, I'd guess that the '040 could seriously outspeed
the '486, and it's not just because the '040 can fetch from cache twice as
fast (the reason for separate I/D caches, and one of the things I feel was
an Intel "sellout to MS-DOS").

> As for the 040, Motorola has not formally introduced the part; they have
> made only an "architectural announcement", and have witheld all details.
> (They don't even acknowledge the cache sizes officially.)  Until the formal
> intro this fall, any real evaluation will be impossible.

That's true.  Though you really can't judge a chip until it's real, no matter
what.  For example, AMD told us the 29K would do some 42,000 Dhrystones, but
there's still no 29K release, last I heard, that had everything in order to
actually produce such results.  Simulations are a good guideline, but until
you have something really running, it's all basically academic.

> The differences between the two chips are in several categories:

> - The degree to which clocks per instruction has been reduced.  Intel's 486
>    provides single-clock loads, stores, and moves.  Assuming a cache hit,
>    data can be used by the instruction immediately following the load, with
>    no stall cycle at all.  It remains to be seen if the 040 will do this.

Regardless of what new things Moto's done on the '040, if you extend the '030
architecture, you get simultaneous fetch of instruction and data if both
caches hit.  That's impossible with the single ported cache of the '486.

> -  Cache architecture.  Intel uses an 8K bytes unified cache, which allows them
>    to support self-modifying code, which is quite common in MS-DOS, Windows,
>    and OS/2 software.  

That makes perfect sense, since that's what 95% of the '486 users would be doing
with the chip anyway.  Though it certainly can have an effect on the chip's
performance, which certainly makes Intel's RISC efforts look better (not that
they look bad -- I think everyone thinks the 80860 is a neat looking chip,
mainly because it doesn't look at all like a traditional Intel chip).  Most if
not all 680x0 systems outlawed self-modifying code many many moons ago, so the
new chips can resort to much more clever caching schemes.
> 
> -  Multiprocessor support.  Both processors will provide snooping.  There are
>    several issues about second-level cache support, etc., which we
>    cannot compare until Moto releases full details.

The announced '486 "snooping" is pretty primitive.  I would have liked it
(basically, the ability to invalidate an entry based on a hardware signal)
in the '030, but I expect considerably more, something on the order of the
88200 bus snooping (full cache consistency, not just write-through) on the
'040.  I'll take bets, if anyone's interested....

> Obviously, PC clone vendors will use the 486, and Apple will use the 040.  

We too.

> Vendors of 030-based Unix workstations are likely to use the 040, 

Especially now that the two top ones, Apollo and HP, are one.

> Sun has said that they will use the 486.  (I don't think Sun has said one 
> way or the other about their plans for the 040.)  

Same thing they did with the '030 vs. '386 -- they made a big splash about
the 386i machine, and quietly introduced the '030 versions of the Sun 3.
They were also about the last on the market with '030 machines; maybe a
little afraid to compete with their current SPARC, whereas PRISM or the
HP Precision Architecture seem significantly distanced from the '030.

> Motorola has an edge in the workstation market because there is by far more
> workstation software for the 68000 architecture than for any other. However,
> ISVs are rapidly porting to RISC architectures and to the 386 architecture.
> Furthermore, Intel has a very strong edge in being able to run DOS and OS/2
> software very quickly, in a Unix window if desired.

I think that Motorola really has only RISC to worry about, but it's nothing
to slough off, it's a serious threat.  Intel's '486 design shows that they're
unquestionably targeting the '486 for PCs.

> Incidentally, it was striking how much Intel emphasized the 386 at the 486
> announcement.  

Didn't surprise me a bit.  This is the absolute first time that Intel has
announced a new 80x86 design that didn't have a completely new architecture
piggy-backed on top of an old one.  The '486 is really a better '386, whereas
the '386 is really a better CISC microprocessor that happens to be able to
emulate the '286 and the '086.  That's significant; Motorola's been doing
virtually the same kind of compatible upgrade thing for years.

> The lesson of both of these processors is that CISC can catch up to RISC
> performance, it just takes a while.  I think RISC will stay a step ahead,
> but CISC is not toppin out.  

And both of these new guys have adopted techniques heretofore only associated
with RISC chips.  I could be wrong, but I think the culmination of the varied
RISC techniques really gives you one thing, when applied right -- a CPU that
can be implemented in substantially fewer gates.  That's really all that
matters.  Using the same basic techniques, a 1 micron CMOS 68040 could 
probably go as fast (or thereabouts) as a 1 micron 88k or MIPS or whatever.
But if I come up with some new and better process, there's no question that
a 100K design is going to fit in that process long before a 1.2M design.  I
think that's where RISC is really going to pay off, especially considering
how quickly process technology has been moving.

> Michael Slater, Microprocessor Report

			^^^^ Good Rag!

-- 
Dave Haynie  "The 32 Bit Guy"     Commodore-Amiga  "The Crew That Never Rests"
   {uunet|pyramid|rutgers}!cbmvax!daveh      PLINK: D-DAVE H     BIX: hazy
              Amiga -- It's not just a job, it's an obsession

daveh@cbmvax.UUCP (Dave Haynie) (04/19/89)

in article <12435@reed.UUCP>, mdr@reed.UUCP (Mike Rutenberg) says:

> From my memory, the other things that stand out about the i486:

> 	* the on-chip fpu is much faster than the 80387.  It was unclear
> 	  if this was due simply to being on-chip or whether it involved
> 	  architecture changes to the fpu.

If you did nothing else, moving the math chip on-board would speed things
up several times.  Just the delays in crossing an external bus make a big
difference.  Inside a chip, gate delays are on the order of 1ns or better,
outside on the order of 10ns.
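The 1 ns vs. 10 ns figures make the arithmetic easy; a rough sketch of the cost of crossing the chip boundary (the stage count is a pure assumption for illustration):

```python
# Rough latency comparison for handing operands to an on-chip vs. off-chip
# FPU, using the gate-delay figures above (on-chip ~1 ns, off-chip ~10 ns).
ON_CHIP_NS = 1
OFF_CHIP_NS = 10

# Suppose a handoff involves, say, 5 delay stages each way (assumption):
stages = 5
print("on-chip: ", 2 * stages * ON_CHIP_NS, "ns")
print("off-chip:", 2 * stages * OFF_CHIP_NS, "ns")   # ~10x slower path
```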

> Mike Rutenberg      Reed College, Portland Oregon     (503)239-4434 (home)
-- 
Dave Haynie  "The 32 Bit Guy"     Commodore-Amiga  "The Crew That Never Rests"
   {uunet|pyramid|rutgers}!cbmvax!daveh      PLINK: D-DAVE H     BIX: hazy
              Amiga -- It's not just a job, it's an obsession

pb@idca.tds.PHILIPS.nl (Peter Brouwer) (04/21/89)

> As to which is faster, there's just not enough public information to call
> this one.  Intel has not yet released any real performance data.  They
> have quoted 37,000 Dhrystones and 6.1 MWhetstones at 25 MHz, and 15 to 20
> VAX MIPS. These figures, however, are from simulations.  I'll remain
> skeptical until we see some measured data.
Dhrystone figures are dangerous to use when comparing machine performance.
My experience is that 386 Dhrystone figures are much better than Dhrystone
figures measured on a 68020-based machine, but the average performance of
Unix on a 68020-based machine is better than on a 386-based machine.
I also have the impression that generated code is bigger on a 386 machine
than on a 68020.
So a 68030-based Unix machine might perform even better than a 386-based
Unix machine.
Is there any information on how the 486 would compare to a 680xx-based machine?
Would it be comparable to a 68040, or to a 68030 machine?



-- 
#  Peter Brouwer,                     ##
#  Philips TDS, Dept SSP-V2           ## voice +31 55 432523
#  P.O. Box 245                       ## UUCP address ..!mcvax!philapd!pb
#  7300 AE Apeldoorn, The Netherlands ## Internet pb@idca.tds.philips.nl

mash@mips.COM (John Mashey) (04/24/89)

In article <3913@mipos3.intel.com> kds@blabla.UUCP (Ken Shoemaker) writes:
>In article <12435@reed.UUCP> mdr@reed.UUCP (Mike Rutenberg) writes:
>>Michael Slater writes:
>>>- The degree to which clocks per instruction has been reduced.  Intel's 486
>>>   provides single-clock loads, stores, and moves.  Assuming a cache hit,
>>>   data can be used by the instruction immediately following the load, with
>>>   no stall cycle at all.  It remains to be seen if the 040 will do this.
>
>In addition, register to register "simple" arithmetic ops (i.e., everything
>except multiply and divide) take one clock.  Pushes and pops take one clock.
>Branch-not-taken takes one clock (if taken it is 3 clocks)....

All of this is indeed impressive (really).  Having been out of the country
awhile, I may have missed some things; I am curious about one thing:
how is it that, with apparently the same technology:

	the i860, with split I & D caches (2-set assoc), and a RISC-style
	instruction set,
	has a 1-cycle stall following a load if the data is referenced,
and
	the 486, with a joint cache (4-set), and more complex decoding,
	has no such stall.

The potential answers would appear to be:
	1) the i860 folks screwed up, and didn't take advantage of the
	same cache technology.
	OR
	2) The i860 folks were aiming for a higher potential clock rate,
	and although they could have built no-stall loads at 33MHz, they
	couldn't at 40/50, and so built it to go with coming cycle-time
	improvements, whereas the 486 folks didn't, or weren't aiming for
	as high eventual clock-rates.
	OR
	3) The 486 claims of 1-cycle loads included zero impact for
	instruction-fetching (from the joint cache).  (likewise on stores,
	pops,pushes, etc).  Note, of course, that we all beat up SPARC
	implementations for having a 2-cycle load / 3-cycle store for
	a similar (although not identical) reason......
	OR
	4) Somehow, the cache speed is so fast that there is plenty of
	time to do everything, i.e., the critical paths are elsewhere.

Can somebody who knows (KS?)  say anything about 3); in particular, there's
a note in EETimes article (April 17, p. 36) about "aligned instruction
access: 3-clock penalty for nonalignment"  (which sounds like a branch to
something not aligned on a quad-word boundary costs 3 cycles?)
Also, can anybody say anything about the cache-access, i.e., to get
16 bytes in one cycle, it presumably has a 128-bit bus to the decode unit.
(Does it?  or is it 2 8-byte accesses per pre-fetch? I'd guess 1 16-byte
access, but I haven't seen anything yet that says one way or another.)

(GUESS: above: 1) seems very unlikely.  2) seems possible.  3) seems likely.
4) Seems possible, but unlikely, unless there is really a LONG critical
path somewhere else, and this seems unlikely.)
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

kds@blabla.intel.com (Ken Shoemaker) (04/26/89)

In article <17999@winchester.mips.COM> mash@mips.COM (John Mashey) writes:
>In article <3913@mipos3.intel.com> kds@blabla.UUCP (Ken Shoemaker) writes:
>>In article <12435@reed.UUCP> mdr@reed.UUCP (Mike Rutenberg) writes:
>>
>> blah blah blah
>>
>All of this is indeed impressive (really).  Having been out of the country
>awhile, I may have missed some things; I am curious about one thing:
>how is it that, with apparently the same technology:
>
>	the i860, with split I & D caches (2-set assoc), and a RISC-style
>	instruction set,
>	has a 1-cycle stall following a load if the data is referenced,
>and
>	the 486, with a joint cache (4-set), and more complex decoding,
>	has no such stall.
>
>The potential answers would appear to be:
>	1) the i860 folks screwed up, and didn't take advantage of the
>	same cache technology.

Not too likely

>	2) The i860 folks were aiming for a higher potential clock rate,
>	and although they could have built no-stall loads at 33MHz, they
>	couldn't at 40/50, and so built it to go with coming cycle-time
>	improvements, whereas the 486 folks didn't, or weren't aiming for
>	as high eventual clock-rates.

Not that either

>	3) The 486 claims of 1-cycle loads included zero impact for
>	instruction-fetching (from the joint cache).  (likewise on stores,
>	pops,pushes, etc).  Note, of course, that we all beat up SPARC
>	implementations for having a 2-cycle load / 3-cycle store for
>	a similar (although not identical) reason......

Well, data accesses have higher priority to the cache than instruction
accesses.  Instruction accesses happen 16 bytes at a time, and fill up a
32-byte circular instruction queue.  The actual instruction decoder works out
of this queue.  Because of the size of the queue and the speed at which it is
filled from the cache, the number of instruction/data conflicts for the cache
is relatively small.  However, best performance is achieved if branch and
especially subroutine jump targets are 16-byte aligned.  Also, the fetching
of the instructions at the target of a branch doesn't conflict with any data
accesses, since the "data" access slot of the branch instruction is taken by
a speculative access of the instructions at the target of the branch.  The
comparison with the SPARC isn't especially relevant, since those machines
have only a single 32-bit path to memory, i.e., cache, and need access to
that path to fetch an instruction every clock they are going to execute a new
instruction.  We don't, because of the 32-byte queue and because we fetch an
average of 4 instructions every clock the instruction fetcher gets access to
the cache.

>	4) Somehow, the cache speed is so fast that there is plenty of
>	time to do everything, i.e., the critical paths are elsewhere.

The cache access path isn't the most critical path on the chip.

There is a fifth answer which wasn't advanced, however (and probably many
more, for that matter).  The one I'd like to mention is that the pipelines
are organized differently.  In most risc machines, you have a load delay
slot and a branch delay slot.  Both give you an idle clock that you attempt
to fill in with something that doesn't have anything to do with the branch
or the load.  On the i486, on the other hand, you don't get load delay
slots, and you don't get deferred branches.  You also get a two stage
instruction decode.  This means that you can run the memory cycle one clock
earlier with respect to the execution stage in the pipeline than you can on
most risc machines because the execution stage is one clock later in the
pipeline.  Thus no load delay slot.  This also means that you take another 
clock on branches taken, which is why a branch taken on the i486 requires 
3 clocks, whereas on most risc machines it takes 2 clocks (the second 
being the branch delay slot).  We think that this is a good tradeoff, since
we need the extra clock to decode the instructions anyway, and it also
improves the performance of all that object code out there for the x86
architecture which isn't going to get recompiled to take advantage of the
load delay slot if it were there.  This is simplified, and probably isn't
very clear.  I will try to put together a longer description of the i486
pipeline sometime and post it on the network.  In the meantime, the April
and May issues of Michael Slater's Microprocessor Report should have most 
of the gory details in John Wharton's articles.  Should have pictures and
diagrams and all that stuff!
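The tradeoff described above can be caricatured with a toy pipeline model (the stage positions are my assumption, chosen only to reproduce the clock counts given):

```python
# Toy model: a load's result becomes available at the end of its memory
# stage; a dependent instruction, issued one clock later, needs the value
# in its execute stage.  Running the memory access one clock earlier
# relative to execute (as described for the i486) removes the stall.
def load_use_stall(mem_stage, exec_stage):
    producer_ready = mem_stage        # available at the end of this cycle
    consumer_needs = 1 + exec_stage   # consumer is issued one clock later
    return max(0, producer_ready - consumer_needs + 1)

# Classic 5-stage RISC: execute in stage 2, memory in stage 3 -> 1 stall.
print(load_use_stall(mem_stage=3, exec_stage=2))   # 1
# i486 per the description: memory one clock earlier relative to execute.
print(load_use_stall(mem_stage=2, exec_stage=2))   # 0

# Taken-branch cost: one clock per decode stage beyond the branch itself.
def taken_branch_clocks(decode_stages):
    return 1 + decode_stages

print(taken_branch_clocks(2))   # i486 two-stage decode -> 3 clocks
print(taken_branch_clocks(1))   # typical RISC -> 2 clocks (w/ delay slot)
```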

>Can somebody who knows (KS?)  say anything about 3); in particular, there's
>a note in EETimes article (April 17, p. 36) about "aligned instruction
>access: 3-clock penalty for nonalignment"  (which sounds like a branch to
>something not aligned on a quad-word boundary costs 3 cycles?)

This has nothing to do with branches.  The i486 supports accesses to
non-aligned objects in memory, just like all other x86 machines.  You will
get better performance if you keep all your objects in memory aligned.  That
is all it means.  The i486 also adds a segment attribute that will cause the
processor to trap all unaligned accesses, however.  You can use this to make
sure that you don't have any of these, to ensure "portability" of your
databases with most risc processors, to ensure that you are getting the most
performance from your application, to give you cheap run-time tag checking,
etc.

>Also, can anybody say anything about the cache-access, i.e., to get
>16 bytes in one cycle, it presumably has a 128-bit bus to the decode unit.
>(Does it?  or is it 2 8-byte accesses per pre-fetch? I'd guess 1 16-byte
>access, but I haven't seen anything yet that says one way or another.)

I think this is covered above.  128-bits in one clock.  You want to use as
much of this as possible, especially at the target of a branch, so you want
to try to 16 byte align your branch targets.
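The 16-byte alignment recommended above is the usual power-of-two rounding a linker or assembler would apply; e.g., in Python:

```python
# Round an address up to the next 16-byte boundary, as one would align
# branch and subroutine targets for the i486's 16-byte cache fetches.
def align16(addr):
    return (addr + 15) & ~15

print(hex(align16(0x1003)))  # 0x1010
print(hex(align16(0x1010)))  # already aligned: 0x1010
```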
---------------
I've decided to take George Bush's advice and watch his press conferences
	with the sound turned down...			-- Ian Shoales
Ken Shoemaker, Microprocessor Design, Intel Corp., Santa Clara, California
uucp: ...{hplabs|decwrl|pur-ee|hacgate|oliveb}!intelca!mipos3!kds

mash@mips.COM (John Mashey) (04/27/89)

In article <3975@mipos3.intel.com> kds@blabla.UUCP (Ken Shoemaker) writes:
(description of 486 stuff)
good comments.  thanx.  at least some of my guesses were right :-)

>of this queue.  Because of the size of the queue and speed it is filled from
>the cache, the amount of instruction/data conflicts with the cache are
>relatively small.  However, best performance is achieved if branch and
>especially subroutine jump targets are 16-byte aligned.
	This is where I'd gotten confused with the "aligned" penalties.
Also, the fetching
>of the instructions at the target of a branch don't conflict with any data
>accesses since the "data" access slot of the branch instruction is taken by
>a speculative access of the instructions at the target of the branch....
>....  We don't, because of the 32 byte queue and that we fetch an
>average of 4 instructions every clock the instruction fetcher gets access to
>the cache.
	Can you say anything about the actual conflict penalties, i.e., the
	percentage of time a load or store stalls due to this?  I.e.,
	one would grossly guess 25% of the time, but it wouldn't surprise me
	if the number was lower than that, given the things that could be done.

>There is a fifth answer which wasn't advanced, however (and probably many
>more, for that matter).  The one I'd like to mention is that the pipelines
>are organized differently....
>....  On the i486, on the other hand, you don't get load delay
>slots, and you don't get deferred branches.  You also get a two stage
>instruction decode.  This means that you can run the memory cycle one clock
>earlier with respect to the execution stage in the pipeline than you can on
>most risc machines because the execution stage is one clock later in the
>pipeline.  Thus no load delay slot.  This also means that you take another 
>clock on branches taken, which is why a branch taken on the i486 requires 
>3 clocks, whereas on most risc machines it takes 2 clocks (the second 
>being the branch delay slot).  We think that this is a good tradeoff, since
>we need the extra clock to decode the instructions anyway....
Yes, certainly a good tradeoff; loads are more frequent than branches.
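A back-of-the-envelope sketch of that tradeoff (every fraction below is an
assumption for illustration, not measured i486 data): paying one extra clock
on taken branches beats paying a load-use stall once loads dominate the mix:

```python
def extra_cpi_load_delay(f_load, f_use_next):
    """Added clocks/instruction if every load whose result is needed by the
    very next instruction stalls one clock (no delay-slot scheduling)."""
    return f_load * f_use_next

def extra_cpi_branch_clock(f_branch, f_taken):
    """Added clocks/instruction if every taken branch costs one extra clock."""
    return f_branch * f_taken

# Assumed x86-flavored mix: 1 in 3 loads, 1 in 6 branches (figures quoted
# later in the thread), with guessed use-next and taken fractions.
load_cost = extra_cpi_load_delay(1/3, 0.6)      # ~0.20 clk/instr
branch_cost = extra_cpi_branch_clock(1/6, 0.7)  # ~0.12 clk/instr
print(load_cost > branch_cost)  # prints: True
```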
>
>>Can somebody who knows (KS?)  say anything about 3); in particular, there's
>>a note in EETimes article (April 17, p. 36) about "aligned instruction
>>access: 3-clock penalty for nonalignment"  (which sounds like a branch to
>>something not aligned on a quad-word boundary costs 3 cycles?)
>
>This has nothing to do with branches.  The i486 supports accesses to
>non-aligned objects in memory, ....
From your comment above, re subr. calls to 16-byte aligned things,
it sounds like the article may have gotten the 2 things mixed in together.

I'll look forward to the further postings, especially on the pipeline.
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

tim@crackle.amd.com (Tim Olson) (04/27/89)

In article <18201@winchester.mips.COM> mash@mips.COM (John Mashey) writes:
| In article <3975@mipos3.intel.com> kds@blabla.UUCP (Ken Shoemaker) writes:
| >....  On the i486, on the other hand, you don't get load delay
| >slots, and you don't get deferred branches.  You also get a two stage
| >instruction decode.  This means that you can run the memory cycle one clock
| >earlier with respect to the execution stage in the pipeline than you can on
| >most risc machines because the execution stage is one clock later in the
| >pipeline.  Thus no load delay slot.  This also means that you take another 
| >clock on branches taken, which is why a branch taken on the i486 requires 
| >3 clocks, whereas on most risc machines it takes 2 clocks (the second 
| >being the branch delay slot).  We think that this is a good tradeoff, since
| >we need the extra clock to decode the instructions anyway....
|
| Yes, certainly a good tradeoff; loads are more frequent than branches.

Interesting -- what kind of numbers do you see?  On the Am29000, we tend
to see just the opposite, although they are somewhat close:

--- compress (loads > branches) ---
          0.36% Calls
         12.51% Jumps
         16.17% Loads
          9.29% Stores

--- dhrystone 1.1 ---
          2.84% Calls
         13.97% Jumps
         13.76% Loads

--- diff ---
          0.34% Calls
         17.50% Jumps
         15.44% Loads
          6.31% Stores

--- grep ---
          1.13% Calls
         15.07% Jumps
         13.57% Loads
          3.04% Stores

--- nroff ---
          1.86% Calls
         15.65% Jumps
         10.73% Loads
          3.73% Stores

--- 29k assembler ---
          1.58% Calls
         19.21% Jumps
         10.95% Loads
          6.14% Stores

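Folding the calls and jumps together, the mix above can be tallied directly
(a quick sketch; "branches" here just means calls + jumps as reported):

```python
# Instruction-mix percentages from the Am29000 traces quoted above.
mix = {
    "compress":      {"calls": 0.36, "jumps": 12.51, "loads": 16.17},
    "dhrystone 1.1": {"calls": 2.84, "jumps": 13.97, "loads": 13.76},
    "diff":          {"calls": 0.34, "jumps": 17.50, "loads": 15.44},
    "grep":          {"calls": 1.13, "jumps": 15.07, "loads": 13.57},
    "nroff":         {"calls": 1.86, "jumps": 15.65, "loads": 10.73},
    "29k assembler": {"calls": 1.58, "jumps": 19.21, "loads": 10.95},
}

for name, m in mix.items():
    branches = m["calls"] + m["jumps"]
    flag = "loads > branches" if m["loads"] > branches else "branches >= loads"
    print(f"{name:14s} {flag}")
```

Only compress comes out with loads ahead of branches, which matches the
"just the opposite, although somewhat close" observation.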

	-- Tim Olson
	Advanced Micro Devices
	(tim@amd.com)

mash@mips.COM (John Mashey) (04/28/89)

In article <25428@amdcad.AMD.COM> tim@amd.com (Tim Olson) writes:
>In article <18201@winchester.mips.COM> mash@mips.COM (John Mashey) writes:
>| Yes, certainly a good tradeoff; loads are more frequent than branches.

>Interesting -- what kind of numbers do you see?  On the Am29000, we tend
>to see just the opposite, although they are somewhat close:

On R3000s, we see grossly similar effects, but the comment was directed to
the 386/486 chips.  I.e.:
	a) Across typical micro architectures, the NUMBER of branches
	would be grossly equal, even with different compiler technology,
	with the major exception of loop-unrolling effects in real loopy code.
	b) The NUMBER of loads/stores, however, can vary quite a bit,
		affected by:
		1) The number of registers available at once
		2) Register windows/stack caches/etc for subroutine calls
		3) Global optimization technology
		4) The nature of the program, i.e., some loads and stores
		can be eliminated by optimizers or windows, some won't go
		away no matter what you do.
	c) The PERCENTAGES of such things depend a lot on the remainder of the
		instruction set architecture and compiler quality, i.e.,
		a good optimizer often drives the percentages of branches UP,
		because it generates better code for expression evaluation,
		for example, and the branches generally refuse to go away.
		The percentage of loads/stores can go up or down.
		Note that, although no one does this of course, if you want
		to drive the percentages of loads and branches down, just
		cripple your code generator! :-)
	d) Although I have no data on X86 instruction streams, I'd guess
		(Intel guys?) that the 486 made the correct choice for running
		X86 code, since, with fewer effective registers to play with, it
		would tend to have a higher (# loads) / (# branches) ratio
		than the current crop of RISC machines.  For example, take
		this to an extreme: suppose you had 2 registers only: you'd
		do almost anything to avoid an extra cycle of load latency,
		even if it cost you in branches, because your load/store numbers
		would be rather high.

KDS's comment on preferring to optimize loads seems appropriate; it is also
a GOOD example of why architecture isn't a cookbook set of rules, especially
when continually improving architectures that started from different places.
I.e., what might be the wrong tradeoff in a MIPS R????, might be the correct
one for an 80?86.
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

kenm@sci.UUCP (Ken McElvain) (04/30/89)

In article <18253@winchester.mips.COM>, mash@mips.COM (John Mashey) writes:
> In article <25428@amdcad.AMD.COM> tim@amd.com (Tim Olson) writes:
> >In article <18201@winchester.mips.COM> mash@mips.COM (John Mashey) writes:
> >| Yes, certainly a good tradeoff; loads are more frequent than branches.
> 
> >Interesting -- what kind of numbers do you see?  On the Am29000, we tend
> >to see just the opposite, although they are somewhat close:
> 
> On R3000s, we see grossly similar effects, but the comment was directed to
> the 386/486 chips.  I.e.:
> 	a) Across typical micro architectures, the NUMBER of branches
> 	would be grossly equal, even with different compiler technology,
> 	with the major exception of loop-unrolling effects in real loopy code.
> 	b) The NUMBER of loads/stores, however, can vary quite a bit,
> 		affected by:
> 		1) The number of registers available at once
> 		2) Register windows/stack caches/etc for subroutine calls
> 		3) Global optimization technology
> 		4) The nature of the program, i.e., some loads and stores
> 		can be eliminated by optimizers or windows, some won't go
> 		away no matter what you do.

One measurement that I have always disliked because of its incompleteness
is the cache hit percentage.  A better measure is the number of cache
misses during a run of a given program.  With a given cache size and
organization this should be relatively constant across different CPU
architectures (Within limits, changing sizes of data types would affect things).

Architectural changes that affect the number of load/stores would pretty
directly affect the cache hit percentage.  A small number of registers would
lead to an inflated percentage of hits.  Better use of registers should
lead to lower hit rates.

One should be able to get an idea of how well registers are doing
their job for a particular architecture/compiler from the cache size and
organization, the number of cache misses, and the percentage of hits.
A comparison of register windows and simple register files comes to mind.
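The inflation effect is easy to see numerically (a sketch with made-up
counts, not measured data): hold the miss count fixed and vary how many
references the register set lets you avoid:

```python
def hit_rate(references, misses):
    """Cache hit fraction for a run with a given number of references."""
    return 1.0 - misses / references

misses = 1_000      # assumed roughly fixed by cache size and organization
many_refs = 20_000  # few registers: lots of loads/stores (assumed count)
fewer_refs = 10_000 # better register use halves the references (assumed)

print(hit_rate(many_refs, misses))   # prints: 0.95
print(hit_rate(fewer_refs, misses))  # prints: 0.9
```

Same misses, very different hit percentages, which is exactly why the miss
count is the more comparable number across architectures.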

How about some data?

Ken McElvain
{decwrl,weitek}!sci!kenm

kds@blabla.intel.com (Ken Shoemaker) (05/02/89)

With X86 code we have looked at, about 1 in 3 instructions does a memory
load, 1 in 6 does a memory write, and 1 in 6 is a branch.  Note that this
may not be the same as many risc machines because:

	1) we have fewer registers, thus more of the context is in memory
	2) we can use memory locations directly as operands in operations
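By those fractions the x86 load/branch ratio works out to 2:1 (exact
arithmetic below; compare the Am29000 mixes posted earlier in the thread,
where jumps + calls often outnumber loads):

```python
from fractions import Fraction

loads = Fraction(1, 3)     # ~1 in 3 instructions does a memory load
stores = Fraction(1, 6)    # ~1 in 6 does a memory write
branches = Fraction(1, 6)  # ~1 in 6 is a branch

print(loads / branches)  # loads per branch -- prints: 2
print(loads + stores)    # fraction of instructions touching data memory -- prints: 1/2
```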

As John said, you need to look at the problem you are solving before you can
solve it.  Using someone else's solution may not be appropriate because they
may not have the same problem!

In the i486, a prefetch will never "bounce" a memory access: there is 1 clock
throughput for all memory targeting instructions.  We know enough in advance
when a memory read or write is going to occur that we can keep any prefetch
requests out of the way.  Any stalls that would occur with respect to reads
should only happen in relation to cache misses.

This isn't really new: the 386 does a similar thing, except that it has a
2-clock throughput for memory targeting instructions, because all
references to external memory take at least two clocks.  But in this case,
too, we will stop any prefetch/instruction access bus cycle if they would
remain on the bus beyond when we are ready to send the read/write addresses.
This is true only for a zero wait state bus, however.  But it is one way
that the i486 is simpler than the 386.

And I lied.  The i486 article in Microprocessor Report won't be coming out
until the June issue.  The May issue will describe the i486 bus.
-----------
I've decided to take George Bush's advice and watch his press conferences
	with the sound turned down...			-- Ian Shoales
Ken Shoemaker, Microprocessor Design, Intel Corp., Santa Clara, California
uucp: ...{hplabs|decwrl|pur-ee|hacgate|oliveb}!intelca!mipos3!kds