mslater@cup.portal.com (Michael Z Slater) (04/14/89)
> query about comments on 68040 and 80486
It is curious that the net has been so quiet about these new processors.
Here's my perspective: The 040 and the 486 are very similar in approach.
Both chips use on-chip caches of 8 Kbytes, include snooping for cache
coherency, have on-chip floating-point coprocessors, use pipelining, bypass
gates and other tricks to reduce the average clocks/instruction, and are
supposed to be fully compatible with their predecessors. Both are claimed to
be 2.5 to 3 times as fast as the previous versions (030 or 386) at the same
clock rate.
As to which is faster, there's just not enough public information to call
this one. Intel has not yet released any real performance data. They
have quoted 37,000 Dhrystones and 6.1 MWhetstones at 25 MHz, and 15 to 20
VAX MIPS. These figures, however, are from simulations. I'll remain
skeptical until we see some measured data.
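For perspective, "VAX MIPS" figures are conventionally derived by dividing a Dhrystone score by that of the VAX-11/780 (about 1,757 Dhrystones/second on Dhrystone 1.1 -- a conventional baseline, not a number from Intel's announcement). A quick sanity check of the quoted figures:

```python
# Hedged sanity check: convert Intel's quoted (simulated) Dhrystone figure
# to "VAX MIPS".  The ~1757 Dhrystones/sec VAX-11/780 baseline is the
# commonly cited Dhrystone 1.1 figure; treat it as an assumption here.
VAX_11_780_DHRYSTONES = 1757       # Dhrystones/sec, conventional baseline
i486_dhrystones = 37_000           # Intel's quoted figure at 25 MHz

vax_mips = i486_dhrystones / VAX_11_780_DHRYSTONES
print(f"{vax_mips:.1f} VAX MIPS")  # prints 21.1 VAX MIPS
```

The result lands slightly above Intel's quoted 15-20 VAX MIPS range, which is unsurprising: Dhrystone tends to flatter a machine relative to broader MIPS ratings.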
As for the 040, Motorola has not formally introduced the part; they have
made only an "architectural announcement", and have withheld all details.
(They don't even acknowledge the cache sizes officially.) Until the formal
intro this fall, any real evaluation will be impossible.
The differences between the two chips are in several categories:
- The inherent differences in the efficiency of the instruction sets, which
are the same as their predecessors'. This is essentially a religious
argument, and probably pointless to pursue.
- The degree to which clocks per instruction has been reduced. Intel's 486
provides single-clock loads, stores, and moves. Assuming a cache hit,
data can be used by the instruction immediately following the load, with
no stall cycle at all. It remains to be seen if the 040 will do this.
- Cache architecture. Intel uses an 8-Kbyte unified cache, which allows them
to support self-modifying code, which is quite common in MS-DOS, Windows,
and OS/2 software. (No groans, please - despite the desirability or
lack thereof of this programming technique, it only makes sense for Intel
to support all the existing code.) Motorola, on the other hand, uses
separate 4K caches, which may be less efficient due to the fixed
partitioning. Intel avoids the bandwidth problem by using a 128-bit bus
to read 16 instruction bytes at a time from the cache, and by giving
priority to data accesses.
- Multiprocessor support. Both processors will provide snooping. There are
several issues about second-level cache support, etc., which we
cannot compare until Moto releases full details.
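To make the single-clock-load claim concrete, here is a toy cycle-count model (my own illustration; the 15% load-use figure is assumed, and neither vendor has published such numbers):

```python
# Toy model (my own illustration, not vendor data): effect of a one-cycle
# load-use stall on average clocks per instruction (CPI), assuming every
# instruction otherwise takes one clock and all accesses hit the cache.
def average_cpi(load_use_fraction, stall_cycles):
    """CPI = 1 base clock, plus a stall whenever a load's result is
    consumed by the very next instruction."""
    return 1.0 + load_use_fraction * stall_cycles

# Suppose 15% of instructions are loads whose data is used immediately
# (an assumed figure, for illustration only).
with_stall    = average_cpi(0.15, 1)   # a design that stalls one clock
without_stall = average_cpi(0.15, 0)   # the 486's claimed behavior on hits
print(with_stall, without_stall)       # 1.15 vs 1.0
```

Even under these made-up fractions, the difference is a double-digit percentage of total clocks, which is why the no-stall load is worth bragging about.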
The two chips will not compete head-to-head in many instances. Obviously,
PC clone vendors will use the 486, and Apple will use the 040. Vendors of
030-based Unix workstations are likely to use the 040, and Sun has said that
they will use the 486. (I don't think Sun has said one way or the other
about their plans for the 040.) HP has committed to using the 040.
Motorola has an edge in the workstation market because there is by far more
workstation software for the 68000 architecture than for any other. However,
ISVs are rapidly porting to RISC architectures and to the 386 architecture.
Furthermore, Intel has a very strong edge in being able to run DOS and OS/2
software very quickly, in a Unix window if desired.
Incidentally, it was striking how much Intel emphasized the 386 at the 486
announcement. As part of the same event, they announced the 33-MHz 386 and
a (slightly) lower-power 386SX. They drove home the point, over and over
again, that the 486 was a 386-architecture device, and that all software
written for the 386 will run on the 486.
Application programs written for the 486 will also run on the 386. There is
one new user-mode instruction, which swaps the byte order of a 32-bit word,
but few programs are likely to use this. Operating system kernels will have
to be modified for the 486, to support additional bits in the page tables for
cacheability control, plus to do things like set the control register bit
that enables the cache. There are a couple of new instructions, like
compare-and-swap, to support multitasking and multiprocessor operation;
these instructions will also be used in 486 kernels.
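For illustration, the semantics of those two additions (the instructions Intel documents as BSWAP and CMPXCHG) can be sketched as follows. This is my paraphrase of their visible effect, not Intel's definition, and the real instructions are single atomic operations in hardware:

```python
# Illustrative models (my paraphrase, not Intel's spec) of the two 486
# additions mentioned above: a 32-bit byte-order swap (BSWAP) and a
# compare-and-swap (CMPXCHG).  Real atomicity comes from the hardware;
# Python here only shows the before/after values.

def bswap32(x):
    """Reverse the four bytes of a 32-bit value."""
    return int.from_bytes((x & 0xFFFFFFFF).to_bytes(4, "little"), "big")

def compare_and_swap(memory, addr, expected, new):
    """If memory[addr] holds `expected`, store `new`; return the old value.
    On the real chip this read-modify-write is one locked operation."""
    old = memory[addr]
    if old == expected:
        memory[addr] = new
    return old

print(hex(bswap32(0x12345678)))        # 0x78563412
mem = {0x1000: 0}
compare_and_swap(mem, 0x1000, 0, 42)   # succeeds: location now holds 42
print(mem[0x1000])                     # 42
```

The compare-and-swap shape is what lets a kernel grab a lock or update a shared counter without disabling interrupts or locking the bus for long stretches.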
The lesson of both of these processors is that CISC can catch up to RISC
performance, it just takes a while. I think RISC will stay a step ahead,
but CISC is not topping out. To drive this point home, Intel announced an
agreement with Prime Computer to develop an ECL implementation of the
486 architecture, which Intel will sell as a module. (I think they said
10" x 3" x 3", and 120 MIPS, in 1992.)
In what seemed like a joke, but wasn't, Intel Pres Andy Grove said that in
the year 2000, they would be able to make a processor with tens of millions
of transistors (I forget exactly how many he said), and some thousands of
MIPS --- which will be fully compatible with the 386. (The cynics among us
might also point out that such a processor will be architecturally compatible
with the 8008.)
Michael Slater, Microprocessor Report
550 California Ave., Suite 320, Palo Alto, CA 94306
415/494-2677 fax: 415/494-3718 mslater@cup.portal.com
mdr@reed.UUCP (Mike Rutenberg) (04/15/89)
Michael Slater writes:
>- The degree to which clocks per instruction has been reduced. Intel's 486
>  provides single-clock loads, stores, and moves. Assuming a cache hit,
>  data can be used by the instruction immediately following the load, with
>  no stall cycle at all. It remains to be seen if the 040 will do this.
From my memory, the other things that stand out about the i486:
 * call and return now take significantly fewer clock cycles.
 * the on-chip fpu is much faster than the 80387. It was unclear
   if this was due simply to being on-chip or whether it involved
   architecture changes to the fpu.
I believe the bus structure for the i860 and i486 is the same.
What support chips for the i486 and mc68040 were announced?
Mike
--
Mike Rutenberg  Reed College, Portland Oregon  (503)239-4434 (home)
BITNET: mdr@reed.bitnet   UUCP: uunet!tektronix!reed!mdr
Note: The preceding remarks are intended to represent no known organization
brooks@maddog.llnl.gov (Eugene Brooks) (04/16/89)
In article <12435@reed.UUCP> mdr@reed.UUCP (Mike Rutenberg) writes:
>From my memory, the other things that stand out about the i486:
> * call and return now take significantly fewer clock cycles.
> * the on-chip fpu is much faster than the 80387. It was unclear
>   if this was due simply to being on-chip or whether it involved
>   architecture changes to the fpu.
How much faster? Is it even close to one flop per clock?
brooks@maddog.llnl.gov, brooks@maddog.uucp
mslater@cup.portal.com (Michael Z Slater) (04/17/89)
Mike Rutenberg writes:
> I believe the bus structure for the i860 and i486 is the same.
> What support chips for the i486 and mc68040 were announced?
The i860 bus structure and the i486 bus structure are NOT the same. The 860
bus is 64 bits wide, and the 486 is 32 bits. The 860 provides a "next-near"
pin as a hint to page-mode memory controllers, and the 486 does not. The 486
includes bus snooping, and the 860 does not. There may be some similarity in
the two buses, but clearly there are many differences. The forthcoming
960CA, on the other hand, may well have a bus similar to the 486's.
As for support chips, Intel has announced (but provided no details on) MCA
and EISA chip sets for 486-based PCs, and that's about it. Oh, there is a
new 32-bit-bus Ethernet chip, the 82596, which is available in a 386 bus
flavor and a 486 bus flavor. (The 486 does not offer a pipelined bus mode,
and is burst transfer oriented.) Intel has also acknowledged that there will
be a cache controller chip for a second-level cache that will work with both
the 860 and the 486, but the part is far from being announced. I expect it
early next year, as a guess.
Motorola has not announced any support chips for the 040. For that matter,
they haven't really announced the 040. Moto has never been strong on support
chips, though, and I'd be surprised to see much.
Michael Slater, Microprocessor Report   mslater@cup.portal.com
550 California Avenue, Suite 320, Palo Alto, CA 94306   415/494-2677
kds@blabla.intel.com (Ken Shoemaker) (04/18/89)
In article <12435@reed.UUCP> mdr@reed.UUCP (Mike Rutenberg) writes:
>Michael Slater writes:
>>- The degree to which clocks per instruction has been reduced. Intel's 486
>>  provides single-clock loads, stores, and moves. Assuming a cache hit,
>>  data can be used by the instruction immediately following the load, with
>>  no stall cycle at all. It remains to be seen if the 040 will do this.
In addition, register to register "simple" arithmetic ops (i.e., everything
except multiply and divide) take one clock. Pushes and pops take one clock.
Branch-not-taken takes one clock (if taken it is 3 clocks). Immediate to
register operations take one clock, even though one operand is in memory.
A memory to register operation takes two clocks, but then again, this would
require two instructions plus a load delay (i.e., 3 clocks without code
reorganization) in most RISC processors. A register to memory operation
takes three clocks, but this would take three instructions plus a load delay
in most RISC processors. Of course, RISC processors would try to minimize
the number of memory operations required by keeping more results in on-chip
registers.
Two advantages of complex instructions that we were able to exploit here are
in the implementation of push/pop and of immediate operands. As said before,
push/pop take one clock, even though that operation requires a memory
operation and an increment/decrement of a register. By knowing the kind of
operation required, we were able to dedicate special hardware to do these
things concurrently. The same is true with immediate operands. And it
doesn't add critical paths to the chip. Because these are defined complex
operations, one can perform multiple operations per clock in special, but
well defined (and frequently occurring) cases.
>From my memory, the other things that stand out about the i486:
> * call and return now take significantly fewer clock cycles.
> * the on-chip fpu is much faster than the 80387. It was unclear
>   if this was due simply to being on-chip or whether it involved
>   architecture changes to the fpu.
>I believe the bus structure for the i860 and i486 is the same.
The bus structures of the i860 and the i486 microprocessors are similar, but
differ in two significant ways. The first is that the i860 supports a 64-bit
external data bus, while the i486 supports 32-bit, 16-bit, and 8-bit
external data buses. The second is that the i860, since it was designed with
an exposed pipeline and large data sets in mind, supports two levels of
external memory pipelining (which allows the chip to operate at high speeds
with real memory chips in an arena with many cache misses). The i486
microprocessor, on the other hand, was designed with the idea that cache
hits would be the norm, and thus doesn't support pipelining, but rather
supports a burst mode on the bus. However, the types of the signals, i.e.,
names and functions, and the fundamental nature of the bus, i.e.,
synchronous, 1X clock, are the same. And they both have a KEN# pin...
>What support chips for the i486 and mc68040 were announced?
The i486 was announced with an Ethernet controller chip and an external
second-level cache controller. It shouldn't be too difficult to get it to
work with most any of the other peripheral chips out there.
And now for the legal bit: UNIX is a trademark of whoever it is these days,
and i486 microprocessor is a trademark of Intel. And I really only speak as
myself, not as a representative of Intel. No, really! If you want an
official position, seek elsewhere. Call Intel. We're in the phone book. And
if you want the i486 data "book" (at 176 pages, it's difficult to call it a
data sheet), ask for order number 240440-001, "i486TM Microprocessor."
----------
I've decided to take George Bush's advice and watch his press conferences
with the sound turned down...  -- Ian Shoales
Ken Shoemaker, Microprocessor Design, Intel Corp., Santa Clara, California
uucp: ...{hplabs|decwrl|pur-ee|hacgate|oliveb}!intelca!mipos3!kds
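Ken's clock counts can be collected into a small comparison sketch. The figures are exactly those from his posting; the "RISC" numbers are his generalization (unscheduled code, unfilled load delay slot), not measurements of any particular chip:

```python
# Clock counts as given in Ken Shoemaker's post, i486 vs. a generic RISC
# needing separate load + op (plus a load delay slot) for a memory operand.
# These are the posting's figures, not independent measurements.
i486_clocks = {
    "reg-reg ALU op": 1,
    "push/pop":       1,
    "mem-to-reg op":  2,   # one instruction, cache hit assumed
    "reg-to-mem op":  3,   # one instruction
}
risc_clocks = {
    "mem-to-reg op":  3,   # load + unfilled delay slot + ALU op
    "reg-to-mem op":  4,   # per the post: three instructions plus load delay
}

for op in ("mem-to-reg op", "reg-to-mem op"):
    print(f"{op}: i486 {i486_clocks[op]} clocks vs RISC {risc_clocks[op]} clocks")
```

As Ken notes, the comparison is tilted: a RISC compiler would keep more values in registers and try to fill the delay slot, so the per-operation gap overstates the end-to-end difference.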
daveh@cbmvax.UUCP (Dave Haynie) (04/19/89)
in article <17131@cup.portal.com>, mslater@cup.portal.com (Michael Z Slater) says:
>> query about comments on 68040 and 80486
> Here's my perspective: The 040 and the 486 are very similar in approach.
> Both chips use on-chip caches of 8 Kbytes, include snooping for cache
> coherency, have on-chip floating-point coprocessors, use pipelining, bypass
> gates and other tricks to reduce the average clocks/instruction, and are
> supposed to be fully compatible with their predecessors. Both are claimed to
> be 2.5 to 3 times as fast as the previous versions (030 or 386) at the same
> clock rate.
While that's true, I was a bit more surprised by the '486 announcement than
by that of the '040. It certainly looked, on several major fronts, as if the
'486 was trying to catch up to the Motorola chips in terms of system issues
(caching, bus sizing, etc.) while at the same time coddling the PC clone
industry. While the clones are obviously the bread and butter for the '486,
it really appears that the Intel design choices that work well in an IBM
environment also cripple the chip. One can't help but wonder if Intel really
wanted the chip crippled anyway. As long as it's faster than a '386, they
can't help but sell millions, just based on software compatibility. And they
stay comfortably distant from any competition with their RISC efforts.
Motorola, on the other hand, seems to have taken everything they've learned
from RISC and applied it to the design of the '040. If you don't believe it,
start with the MMU/cache differences between the '030 and the '040, and then
take a look at the 88100/88200 system.
> As to which is faster, there's just not enough public information to call
> this one. Intel has not yet released any real performance data. They
> have quoted 37,000 Dhrystones and 6.1 MWhetstones at 25 MHz, and 15 to 20
> VAX MIPS. These figures, however, are from simulations. I'll remain
> skeptical until we see some measured data.
That is the bottom line. Based on what I know, and the fact that I'm
seriously Motorola biased, I'd guess that the '040 could seriously outspeed
the '486, and it's not just because the '040 can fetch from cache twice as
fast (the reason for separate I/D caches, and one of the things I feel was
an Intel "sellout to MS-DOS").
> As for the 040, Motorola has not formally introduced the part; they have
> made only an "architectural announcement", and have withheld all details.
> (They don't even acknowledge the cache sizes officially.) Until the formal
> intro this fall, any real evaluation will be impossible.
That's true. Though you really can't judge a chip until it's real, no matter
what. For example, AMD told us the 29K would do some 42,000 Dhrystones, but
there's still no 29K release, last I heard, that had everything in order to
actually produce such results. Simulations are a good guideline, but until
you have something really running, it's all basically academic.
> The differences between the two chips are in several categories:
> - The degree to which clocks per instruction has been reduced. Intel's 486
>   provides single-clock loads, stores, and moves. Assuming a cache hit,
>   data can be used by the instruction immediately following the load, with
>   no stall cycle at all. It remains to be seen if the 040 will do this.
Regardless of what Moto's done new for the '040, if you extend the '030
architecture, you get simultaneous fetch of instruction and data if both
caches hit. That's impossible with the single-ported cache of the '486.
> - Cache architecture. Intel uses an 8K bytes unified cache, which allows them
>   to support self-modifying code, which is quite common in MS-DOS, Windows,
>   and OS/2 software.
That makes perfect sense, since that's what 95% of the '486 users would be
doing with the chip anyway. Though it certainly can have an effect on the
chip's performance, which certainly makes Intel's RISC efforts look better
(not that they look bad -- I think everyone thinks the 80860 is a neat
looking chip, mainly because it doesn't look at all like a traditional Intel
chip). Most if not all 680x0 systems outlawed self-modifying code many many
moons ago, so the new chips can resort to much more clever caching schemes.
> - Multiprocessor support. Both processors will provide snooping. There are
>   several issues about second-level cache support, etc., which we
>   cannot compare until Moto releases full details.
The announced '486 "snooping" is pretty primitive. I would have liked it
(basically, the ability to invalidate an entry based on a hardware signal)
in the '030, but I expect considerably more on the '040, something on the
order of the 88200 bus snooping (full cache consistency, not just
write-through). I'll take bets, if anyone's interested....
> Obviously, PC clone vendors will use the 486, and Apple will use the 040.
We too.
> Vendors of 030-based Unix workstations are likely to use the 040,
Especially now that the two top ones, Apollo and HP, are now one.
> Sun has said that they will use the 486. (I don't think Sun has said one
> way or the other about their plans for the 040.)
Same thing they did with the '030 vs. '386 -- they made a big splash about
the 386i machine, and quietly introduced the '030 versions of the Sun 3.
They were also about the last on the market with '030 machines; maybe a
little afraid to compete with their current SPARC, whereas PRISM or the HP
Precision Architecture seem significantly distanced from the '030.
> Motorola has an edge in the workstation market because there is by far more
> workstation software for the 68000 architecture than for any other. However,
> ISVs are rapidly porting to RISC architectures and to the 386 architecture.
> Furthermore, Intel has a very strong edge in being able to run DOS and OS/2
> software very quickly, in a Unix window if desired.
I think that Motorola really has only RISC to worry about, but that's
nothing to slough off; it's a serious threat. Intel's '486 design shows that
they're unquestionably targeting the '486 at PCs.
> Incidentally, it was striking how much Intel emphasized the 386 at the 486
> announcement.
Didn't surprise me a bit. This is the absolute first time that Intel has
announced a new 80x86 design that didn't have a completely new architecture
piggy-backed on top of an old one. The '486 is really a better '386, whereas
the '386 is really a better CISC microprocessor that happens to be able to
emulate the '286 and the '086. That's significant; Motorola's been doing
virtually the same kind of compatible upgrade thing for years.
> The lesson of both of these processors is that CISC can catch up to RISC
> performance, it just takes a while. I think RISC will stay a step ahead,
> but CISC is not topping out.
And both of these new guys have adopted techniques heretofore only
associated with RISC chips. I could be wrong, but I think the culmination of
the varied RISC techniques really gives you one thing, when applied right --
a CPU that can be implemented in substantially fewer gates. That's really
all that matters. Using the same basic techniques, a 1-micron CMOS 68040
could probably go as fast (or thereabouts) as a 1-micron 88k or MIPS or
whatever. But if I come up with some new and better process, there's no
question that a 100K-gate design is going to fit in that process long before
a 1.2M-gate design. I think that's where RISC is really going to pay off,
especially considering how quickly process technology has been moving.
> Michael Slater, Microprocessor Report
  ^^^^ Good Rag!
--
Dave Haynie  "The 32 Bit Guy"  Commodore-Amiga  "The Crew That Never Rests"
{uunet|pyramid|rutgers}!cbmvax!daveh   PLINK: D-DAVE H   BIX: hazy
Amiga -- It's not just a job, it's an obsession
daveh@cbmvax.UUCP (Dave Haynie) (04/19/89)
in article <12435@reed.UUCP>, mdr@reed.UUCP (Mike Rutenberg) says:
> From my memory, the other things that stand out about the i486:
> * the on-chip fpu is much faster than the 80387. It was unclear
>   if this was due simply to being on-chip or whether it involved
>   architecture changes to the fpu.
If you did nothing else, moving the math chip on-board would speed things up
several times. Just the delays in crossing an external bus make a big
difference. Inside a chip, gate delays are on the order of 1 ns or better;
outside, on the order of 10 ns.
> Mike Rutenberg  Reed College, Portland Oregon  (503)239-4434 (home)
--
Dave Haynie  "The 32 Bit Guy"  Commodore-Amiga  "The Crew That Never Rests"
{uunet|pyramid|rutgers}!cbmvax!daveh   PLINK: D-DAVE H   BIX: hazy
Amiga -- It's not just a job, it's an obsession
pb@idca.tds.PHILIPS.nl (Peter Brouwer) (04/21/89)
> As to which is faster, there's just not enough public information to call
> this one. Intel has not yet released any real performance data. They
> have quoted 37,000 Dhrystones and 6.1 MWhetstones at 25 MHz, and 15 to 20
> VAX MIPS. These figures, however, are from simulations. I'll remain
> skeptical until we see some measured data.
Dhrystone figures are dangerous to use in comparing machine performance. My
experience is that 386 Dhrystone figures are much better than Dhrystone
figures measured on a 68020-based machine, but the average performance of
Unix on a 68020-based machine is better than on a 386-based machine. I also
have the impression that code generated for a 386 machine is bigger than
code generated for a 68020. So a 68030-based Unix machine would perform even
better than a 386-based Unix machine.
Is there any information on how the 486 would compare to a 680xx-based
machine? Would it be comparable to a 68040, or would it be comparable to a
68030 machine?
--
# Peter Brouwer,                     ##
# Philips TDS, Dept SSP-V2           ## voice +31 55 432523
# P.O. Box 245                       ## UUCP address ..!mcvax!philapd!pb
# 7300 AE Apeldoorn, The Netherlands ## Internet pb@idca.tds.philips.nl
mash@mips.COM (John Mashey) (04/24/89)
In article <3913@mipos3.intel.com> kds@blabla.UUCP (Ken Shoemaker) writes:
>In article <12435@reed.UUCP> mdr@reed.UUCP (Mike Rutenberg) writes:
>>Michael Slater writes:
>>>- The degree to which clocks per instruction has been reduced. Intel's 486
>>>  provides single-clock loads, stores, and moves. Assuming a cache hit,
>>>  data can be used by the instruction immediately following the load, with
>>>  no stall cycle at all. It remains to be seen if the 040 will do this.
>
>In addition, register to register "simple" arithmetic ops (i.e., everything
>except multiply and divide) take one clock. Pushes and pops take one clock.
>Branch-not-taken takes one clock (if taken it is 3 clocks)....
All of this is indeed impressive (really). Having been out of the country
awhile, I may have missed some things; I am curious about one thing:
how is it that, with apparently the same technology:
	the i860, with split I & D caches (2-set assoc), and a RISC-style
	instruction set, has a 1-cycle stall following a load if the data
	is referenced,
and
	the 486, with a joint cache (4-set), and more complex decoding,
	has no such stall.
The potential answers would appear to be:
	1) The i860 folks screwed up, and didn't take advantage of the
	same cache technology.
OR	2) The i860 folks were aiming for a higher potential clock rate,
	and although they could have built no-stall loads at 33 MHz, they
	couldn't at 40/50, and so built it to go with coming cycle-time
	improvements, whereas the 486 folks didn't, or weren't aiming for
	as high eventual clock rates.
OR	3) The 486 claims of 1-cycle loads included zero impact for
	instruction fetching (from the joint cache); likewise on stores,
	pops, pushes, etc. Note, of course, that we all beat up SPARC
	implementations for having a 2-cycle load / 3-cycle store for
	a similar (although not identical) reason......
OR	4) Somehow, the cache speed is so fast that there is plenty of
	time to do everything, i.e., the critical paths are elsewhere.
Can somebody who knows (KS?) say anything about 3)? In particular, there's
a note in an EETimes article (April 17, p. 36) about "aligned instruction
access: 3-clock penalty for nonalignment" (which sounds like a branch to
something not aligned on a quad-word boundary costs 3 cycles?)
Also, can anybody say anything about the cache access, i.e., to get 16 bytes
in one cycle, it presumably has a 128-bit bus to the decode unit. (Does it?
Or is it 2 8-byte accesses per prefetch? I'd guess 1 16-byte access, but I
haven't seen anything yet that says one way or another.)
(GUESS on the above: 1) seems very unlikely. 2) seems possible. 3) seems
likely. 4) seems possible, but unlikely, unless there is really a LONG
critical path somewhere else, and this seems unlikely.)
--
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
kds@blabla.intel.com (Ken Shoemaker) (04/26/89)
In article <17999@winchester.mips.COM> mash@mips.COM (John Mashey) writes:
>In article <3913@mipos3.intel.com> kds@blabla.UUCP (Ken Shoemaker) writes:
>>In article <12435@reed.UUCP> mdr@reed.UUCP (Mike Rutenberg) writes:
>>
>> blah blah blah
>>
>All of this is indeed impressive (really). Having been out of the country
>awhile, I may have missed some things; I am curious about one thing:
>how is it that, with apparently the same technology:
>
>	the i860, with split I & D caches (2-set assoc), and a RISC-style
>	instruction set,
>	has a 1-cycle stall following a load if the data is referenced,
>and
>	the 486, with a joint cache (4-set), and more complex decoding,
>	has no such stall.
>
>The potential answers would appear to be:
>	1) the i860 folks screwed up, and didn't take advantage of the
>	same cache technology.
Not too likely.
>	2) The i860 folks were aiming for a higher potential clock rate,
>	and although they could have built no-stall loads at 33MHz, they
>	couldn't at 40/50, and so built it to go with coming cycle-time
>	improvements, whereas the 486 folks didn't, or weren't aiming for
>	as high eventual clock-rates.
Not that either.
>	3) The 486 claims of 1-cycle loads included zero impact for
>	instruction-fetching (from the joint cache). (likewise on stores,
>	pops, pushes, etc). Note, of course, that we all beat up SPARC
>	implementations for having a 2-cycle load / 3-cycle store for
>	a similar (although not identical) reason......
Well, data accesses have higher priority to the cache than instruction
accesses. Instruction accesses happen 16 bytes at a time, and fill up a
32-byte circular instruction queue. The actual instruction decoder works out
of this queue. Because of the size of the queue and the speed at which it is
filled from the cache, the number of instruction/data conflicts at the cache
is relatively small. However, best performance is achieved if branch and
especially subroutine jump targets are 16-byte aligned. Also, the fetching
of the instructions at the target of a branch doesn't conflict with any data
accesses, since the "data" access slot of the branch instruction is taken by
a speculative access of the instructions at the target of the branch.
The comparison with SPARC isn't especially relevant, since those
implementations have only a single 32-bit path to memory, i.e., cache, and
need access to that path to fetch an instruction every clock they are going
to execute a new instruction. We don't, because of the 32-byte queue and
because we fetch an average of 4 instructions every clock the instruction
fetcher gets access to the cache.
>	4) Somehow, the cache speed is so fast that there is plenty of
>	time to do everything, i.e., the critical paths are elsewhere.
The cache access path isn't the most critical path on the chip.
There is a fifth answer which wasn't advanced, however (and probably many
more, for that matter). The one I'd like to mention is that the pipelines
are organized differently. In most RISC machines, you have a load delay slot
and a branch delay slot. Both give you an idle clock that you attempt to
fill with something that doesn't have anything to do with the branch or the
load. On the i486, on the other hand, you don't get load delay slots, and
you don't get deferred branches. You also get a two-stage instruction
decode. This means that you can run the memory cycle one clock earlier with
respect to the execution stage in the pipeline than you can on most RISC
machines, because the execution stage is one clock later in the pipeline.
Thus no load delay slot. This also means that you take another clock on
branches taken, which is why a taken branch on the i486 requires 3 clocks,
whereas on most RISC machines it takes 2 clocks (the second being the branch
delay slot). We think that this is a good tradeoff, since we need the extra
clock to decode the instructions anyway, and it also improves the
performance of all that object code out there for the x86 architecture which
isn't going to get recompiled to take advantage of the load delay slot if it
were there.
This is simplified, and probably isn't very clear. I will try to put
together a longer description of the i486 pipeline sometime and post it on
the network. In the meantime, the April and May issues of Michael Slater's
Microprocessor Report should have most of the gory details in John Wharton's
articles. Should have pictures and diagrams and all that stuff!
>Can somebody who knows (KS?) say anything about 3); in particular, there's
>a note in EETimes article (April 17, p. 36) about "aligned instruction
>access: 3-clock penalty for nonalignment" (which sounds like a branch to
>something not aligned on a quad-word boundary costs 3 cycles?)
This has nothing to do with branches. The i486 supports accesses to
non-aligned objects in memory, just like all other x86 machines. You will
get better performance if you keep all your objects in memory aligned. That
is all it means. The i486 also adds a segment attribute that will cause the
processor to trap all unaligned accesses, however. You can use this to make
sure that you don't have any of these: to ensure "portability" of your
databases with most RISC processors, to ensure that you are getting the most
performance from your application, to give you cheap run-time tag checking,
etc.
>Also, can anybody say anything about the cache-access, i.e., to get
>16 bytes in one cycle, it presumably has a 128-bit bus to the decode unit.
>(Does it? or is it 2 8-byte accesses per pre-fetch? I'd guess 1 16-byte
>access, but I haven't seen anything yet that says one way or another.)
I think this is covered above: 128 bits in one clock. You want to use as
much of this as possible, especially at the target of a branch, so you want
to try to 16-byte align your branch targets.
---------------
I've decided to take George Bush's advice and watch his press conferences
with the sound turned down...  -- Ian Shoales
Ken Shoemaker, Microprocessor Design, Intel Corp., Santa Clara, California
uucp: ...{hplabs|decwrl|pur-ee|hacgate|oliveb}!intelca!mipos3!kds
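The tradeoff Ken describes -- no load delay slot in exchange for a 3-clock taken branch -- can be sketched numerically. This toy model uses only the clock counts from this thread; the dynamic-mix fractions are assumed, and it pessimistically assumes RISC delay slots go unfilled:

```python
# Toy comparison built only from the clock counts in this thread: the i486
# pays an extra clock on taken branches but has no load-use stall, while a
# generic delay-slot RISC pays the reverse.  The 15%/10% fractions are
# assumptions for illustration, and unfilled delay slots are assumed.
def extra_clocks(load_use_frac, taken_branch_frac, load_stall, branch_cost):
    """Average extra clocks per instruction beyond the 1.0 base."""
    return load_use_frac * load_stall + taken_branch_frac * (branch_cost - 1)

# Assumed dynamic mix: 15% immediate load-use, 10% taken branches.
i486 = extra_clocks(0.15, 0.10, load_stall=0, branch_cost=3)  # taken branch: 3 clocks
risc = extra_clocks(0.15, 0.10, load_stall=1, branch_cost=2)  # branch delay slot
print(f"i486 overhead {i486:.2f}, RISC overhead {risc:.2f} clocks/instr")
```

Under these made-up fractions the two come out close, with the i486 slightly ahead; a compiler that fills most delay slots would tip the result the other way, which is why the "good tradeoff" argument leans on existing, unrecompiled x86 code.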
mash@mips.COM (John Mashey) (04/27/89)
In article <3975@mipos3.intel.com> kds@blabla.UUCP (Ken Shoemaker) writes: (description of 486 stuff) good comments. thanx. at least some of my guesses were right :-) >of this queue. Because of the size of the queue and speed it is filled from >the cache, the amount of instruction/data conflicts with the cache are >relatively small. However, best performance is achieved if branch and >especially subroutine jump targets are 16-byte aligned. This is where I'd gotten confused with the "aligned" penalties. Also, the fetching >of the instructions at the target of a branch don't conflict with any data >accesses since the "data" access slot of the branch instruction is taken by >a speculative access of the instructions at the target of the branch.... >.... We don't, because of the 32 byte queue and that we fetch an >average of 4 instructions every clock the instruction fetcher gets access to >the cache. Can you say anything about the actual conflict penalties, i.e., the percentage of time a load or store stalls due to this? I.e., one would grossly guess 25% of the time, but it wouldn't surprise me if the number was lower than that, given the things that could be done. >There is a fifth answer which wasn't advanced, however (and probably many >more, for that matter). The one I'd like to mention is that the pipelines >are organized differently.... >.... On the i486, on the other hand, you don't get load delay >slots, and you don't get deferred branches. You also get a two stage >instruction decode. This means that you can run the memory cycle one clock >earlier with respect to the execution stage in the pipeline than you can on >most risc machines because the execution stage is one clock later in the >pipeline. Thus no load delay slot. This also means that you take another >clock on branches taken, which is why a branch taken on the i486 requires >3 clocks, whereas on most risc machines it takes 2 clocks (the second >being the branch delay slot). 
>We think that this is a good tradeoff, since
>we need the extra clock to decode the instructions anyway....

Yes, certainly a good tradeoff; loads are more frequent than branches.

>>Can somebody who knows (KS?) say anything about 3); in particular, there's
>>a note in EETimes article (April 17, p. 36) about "aligned instruction
>>access: 3-clock penalty for nonalignment" (which sounds like a branch to
>>something not aligned on a quad-word boundary costs 3 cycles?)
>
>This has nothing to do with branches.  The i486 supports accesses to
>non-aligned object in memory, ....

From your comment above, re subr. calls to 16-byte aligned things, it
sounds like the article may have gotten the 2 things mixed in together.
I'll look forward to the further postings, especially on the pipeline.
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
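[The load-delay vs. extra-branch-clock tradeoff above lends itself to a
back-of-the-envelope check.  A Python sketch; the instruction-mix and
fill-rate fractions below are invented for illustration, not measurements
from either machine.]

```python
def extra_cycles(n_instr, f_load, f_branch_taken, f_load_use,
                 load_stall, branch_extra):
    """Extra clocks beyond 1 clock/instruction for a toy pipeline.

    f_load         -- fraction of instructions that are loads
    f_branch_taken -- fraction of instructions that are taken branches
    f_load_use     -- fraction of loads whose result is used immediately
                      (i.e., the delay slot could not be filled)
    load_stall     -- stall clocks on such a load-use pair
    branch_extra   -- clocks per taken branch beyond the first
    """
    load_penalty = n_instr * f_load * f_load_use * load_stall
    branch_penalty = n_instr * f_branch_taken * branch_extra
    return load_penalty + branch_penalty

# Hypothetical mix: 30% loads (half used immediately), 10% taken branches.
# RISC-style: 1-clock load-use stall, 2-clock taken branch.
risc_style = extra_cycles(1000, 0.30, 0.10, 0.5, 1, 1)  # 250 extra clocks
# i486-style: no load-use stall, 3-clock taken branch.
i486_style = extra_cycles(1000, 0.30, 0.10, 0.5, 0, 2)  # 200 extra clocks
```

With this (made-up) mix the i486 arrangement comes out ahead, which is the
"loads are more frequent than branches" point in one line of arithmetic.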
tim@crackle.amd.com (Tim Olson) (04/27/89)
In article <18201@winchester.mips.COM> mash@mips.COM (John Mashey) writes:
| In article <3975@mipos3.intel.com> kds@blabla.UUCP (Ken Shoemaker) writes:
| >.... On the i486, on the other hand, you don't get load delay
| >slots, and you don't get deferred branches.  You also get a two stage
| >instruction decode.  This means that you can run the memory cycle one clock
| >earlier with respect to the execution stage in the pipeline than you can on
| >most risc machines because the execution stage is one clock later in the
| >pipeline.  Thus no load delay slot.  This also means that you take another
| >clock on branches taken, which is why a branch taken on the i486 requires
| >3 clocks, whereas on most risc machines it takes 2 clocks (the second
| >being the branch delay slot).  We think that this is a good tradeoff, since
| >we need the extra clock to decode the instructions anyway....
| 
| Yes, certainly a good tradeoff; loads are more frequent than branches.

Interesting -- what kind of numbers do you see?  On the Am29000, we tend
to see just the opposite, although they are somewhat close:

	--- compress (loads > branches) ---
	 0.36% Calls
	12.51% Jumps
	16.17% Loads
	 9.29% Stores

	--- dhrystone 1.1 ---
	 2.84% Calls
	13.97% Jumps
	13.76% Loads

	--- diff ---
	 0.34% Calls
	17.50% Jumps
	15.44% Loads
	 6.31% Stores

	--- grep ---
	 1.13% Calls
	15.07% Jumps
	13.57% Loads
	 3.04% Stores

	--- nroff ---
	 1.86% Calls
	15.65% Jumps
	10.73% Loads
	 3.73% Stores

	--- 29k assembler ---
	 1.58% Calls
	19.21% Jumps
	10.95% Loads
	 6.14% Stores
-- 
	Tim Olson
	Advanced Micro Devices
	(tim@amd.com)
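[The posted Am29000 figures can be tallied mechanically; a Python sketch.
The numbers come straight from the post (dhrystone is omitted because its
store percentage wasn't given), but comparing calls+jumps against loads is
my own framing of "branches vs. loads", not Tim's.]

```python
# Posted Am29000 instruction mix: (calls%, jumps%, loads%, stores%).
MIX = {
    "compress":      (0.36, 12.51, 16.17, 9.29),
    "diff":          (0.34, 17.50, 15.44, 6.31),
    "grep":          (1.13, 15.07, 13.57, 3.04),
    "nroff":         (1.86, 15.65, 10.73, 3.73),
    "29k assembler": (1.58, 19.21, 10.95, 6.14),
}

def branches_vs_loads(mix):
    """Return {program: (control-transfer %, load %)} for comparison."""
    return {name: (calls + jumps, loads)
            for name, (calls, jumps, loads, stores) in mix.items()}
```

Tallied this way, compress is the only program of the five where loads
clearly exceed control transfers, which matches the "(loads > branches)"
annotation on that entry.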
mash@mips.COM (John Mashey) (04/28/89)
In article <25428@amdcad.AMD.COM> tim@amd.com (Tim Olson) writes:
>In article <18201@winchester.mips.COM> mash@mips.COM (John Mashey) writes:
>| Yes, certainly a good tradeoff; loads are more frequent than branches.
>
>Interesting -- what kind of numbers do you see?  On the Am29000, we tend
>to see just the opposite, although they are somewhat close:

On R3000s, we see grossly similar effects, but the comment was directed to
the 386/486 chips.  I.e.:
	a) Across typical micro architectures, the NUMBER of branches
	would be grossly equal, even with different compiler technology,
	with the major exception of loop-unrolling effects in real loopy code.
	b) The NUMBER of loads/stores, however, can vary quite a bit,
	affected by:
		1) The number of registers available at once
		2) Register windows/stack caches/etc for subroutine calls
		3) Global optimization technology
		4) The nature of the program, i.e., some loads and stores
		can be eliminated by optimizers or windows, some won't go
		away no matter what you do.
	c) The PERCENTAGES of such things depend a lot on the remainder
	of the instruction set architecture and compiler quality, i.e., a
	good optimizer often drives the percentage of branches UP, because
	it generates better code for expression evaluation, for example,
	and the branches generally refuse to go away.  The percentage of
	loads/stores can go up or down.  Note that, although no one does
	this of course, if you want to drive the percentages of loads and
	branches down, just cripple your code generator! :-)
	d) Although I have no data on X86 instruction streams, I'd guess
	(Intel guys?) that the 486 made the correct choice for running X86
	code, since it has fewer effective registers to play with, and
	would tend to have a higher (# loads) / (# branches) ratio than
	the current crop of RISC machines.
For example, take this to an extreme: suppose you had 2 registers only:
you'd do almost anything to avoid an extra cycle of load latency, even if
it cost you in branches, because your load/store numbers would be rather
high.  KDS's comment on preferring to optimize loads seems appropriate;
it is also a GOOD example of why architecture isn't a cookbook set of
rules, especially when continually improving architectures that started
from different places.  I.e., what might be the wrong tradeoff in a
MIPS R????, might be the correct one for an 80?86.
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
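[The few-registers extreme can be made concrete with a toy spill model;
a Python sketch.  The model (one store to spill a value plus one load per
later use) and every number in it are hypothetical, chosen only to show
the direction of the effect, not taken from any real machine or compiler.]

```python
def spill_traffic(live_values, n_regs, uses_per_value):
    """Toy model of register pressure: each simultaneously-live value
    that doesn't fit in a register costs one store (the spill) plus one
    load per later use.  Returns extra memory operations."""
    spilled = max(0, live_values - n_regs)
    return spilled * (uses_per_value + 1)

# Same hypothetical code (20 live values, 3 uses each) on two register files:
x86_like  = spill_traffic(20, 8, 3)   # small file: 12 spilled values
risc_like = spill_traffic(20, 32, 3)  # large file: everything fits
```

The branch count of that code is unchanged in either case, so the
(# loads) / (# branches) ratio rises as the register file shrinks, which
is the point of guess d) above.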
kenm@sci.UUCP (Ken McElvain) (04/30/89)
In article <18253@winchester.mips.COM>, mash@mips.COM (John Mashey) writes:
> In article <25428@amdcad.AMD.COM> tim@amd.com (Tim Olson) writes:
> >In article <18201@winchester.mips.COM> mash@mips.COM (John Mashey) writes:
> >| Yes, certainly a good tradeoff; loads are more frequent than branches.
> >
> >Interesting -- what kind of numbers do you see?  On the Am29000, we tend
> >to see just the opposite, although they are somewhat close:
> 
> On R3000s, we see grossly similar effects, but the comment was directed to
> the 386/486 chips.  I.e.:
> 	a) Across typical micro architectures, the NUMBER of branches
> 	would be grossly equal, even with different compiler technology,
> 	with the major exception of loop-unrolling effects in real loopy code.
> 	b) The NUMBER of loads/stores, however, can vary quite a bit,
> 	affected by:
> 		1) The number of registers available at once
> 		2) Register windows/stack caches/etc for subroutine calls
> 		3) Global optimization technology
> 		4) The nature of the program, i.e., some loads and stores
> 		can be eliminated by optimizers or windows, some won't go
> 		away no matter what you do.

One measurement that I have always disliked because of its incompleteness
is the cache hit percentage.  A better measure is the number of cache
misses during a run of a given program.  With a given cache size and
organization this should be relatively constant across different CPU
architectures (within limits; changing sizes of data types would affect
things).  Architectural changes that affect the number of loads/stores
would pretty directly affect the cache hit percentage.  A small number of
registers would lead to an inflated percentage of hits.  Better use of
registers should lead to lower hit rates.  One should be able to get an
idea of how well registers are doing their job for a particular
architecture/compiler from the cache size and organization, the number of
cache misses, and the percentage of hits.
A comparison of register windows and simple register files comes to mind.
How about some data?

Ken McElvain
{decwrl,weitek}!sci!kenm
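[The inflated-hit-rate effect is one line of arithmetic; a Python sketch.
The reference and miss counts are made up for illustration: the claim is
only that holding misses constant while adding spill references raises the
hit percentage without the cache doing any better.]

```python
def hit_rate(references, misses):
    """Cache hit fraction for a run with the given totals."""
    return 1.0 - misses / references

# Same program, same cache, so assume the same 5,000 misses either way
# (hypothetical numbers).  The few-register machine issues extra
# spill/reload references, all of which hit.
few_regs  = hit_rate(1_200_000, 5_000)  # higher hit %, same misses
many_regs = hit_rate(  800_000, 5_000)  # lower hit %, same misses
```

So a higher hit percentage here signals worse register allocation, not a
better cache, which is why the raw miss count is the more honest figure.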
kds@blabla.intel.com (Ken Shoemaker) (05/02/89)
With X86 code we have looked at, about 1 in 3 instructions does a memory
load, 1 in 6 does a memory write, and 1 in 6 is a branch.  Note that this
may not be the same as on many risc machines because:
	1) we have fewer registers, thus more of the context is in memory
	2) we can use memory locations directly as operands in operations
As John said, you need to look at the problem you are solving before you
can solve it.  Using someone else's solution may not be appropriate
because they may not have the same problem!

In the i486, a prefetch will never "bounce" a memory access: there is
1-clock throughput for all memory-targeting instructions.  We know enough
in advance when a memory read or write is going to occur that we can keep
any prefetch requests out of the way.  Any stalls that would occur with
respect to reads should only happen in relation to cache misses.  This
isn't really new: the 386 does a similar thing, only there the throughput
for memory-targeting instructions is 2 clocks, because all references to
external memory take at least two clocks.  But in this case, too, we will
stop any prefetch/instruction access bus cycle if it would remain on the
bus beyond when we are ready to send the read/write addresses.  This is
true only for a zero-wait-state bus, however.  But it is one way that the
i486 is simpler than the 386.

And I lied.  The i486 article in Microprocessor Report won't be coming out
until the June issue.  The May issue will describe the i486 bus.
-----------
I've decided to take George Bush's advice and watch his press conferences
with the sound turned down... -- Ian Shoales
Ken Shoemaker, Microprocessor Design, Intel Corp., Santa Clara, California
uucp: ...{hplabs|decwrl|pur-ee|hacgate|oliveb}!intelca!mipos3!kds
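[KS's posted mix (1/3 loads, 1/6 writes, 1/6 branches) can be plugged
into the earlier pipeline tradeoff; a Python sketch.  The mix fractions
are from the post, but the load-use and branch-taken fractions are my
guesses, not Intel data, so only the direction of the result matters.]

```python
# Posted i486 instruction mix.
F_LOAD, F_STORE, F_BRANCH = 1 / 3, 1 / 6, 1 / 6

def penalty_per_instr(load_stall, branch_extra, taken_frac, load_use_frac):
    """Average extra clocks per instruction from load-use stalls and
    extra taken-branch clocks, using the posted mix above."""
    return (F_LOAD * load_use_frac * load_stall
            + F_BRANCH * taken_frac * branch_extra)

# Hypothetical: half of loads feed the next instruction, 60% of branches taken.
delay_slot_style = penalty_per_instr(1, 1, 0.6, 0.5)  # 1-clock load stall, 2-clock branch
i486_style       = penalty_per_instr(0, 2, 0.6, 0.5)  # no load stall, 3-clock branch
```

With loads twice as frequent as branches, the no-load-stall arrangement
comes out ahead even though every taken branch costs an extra clock,
consistent with the "look at your own problem first" point above.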