clif@intelca.intel.com (Ken Shoemaker) (03/04/89)
The following information is taken from the i860 TM 64-Bit Microprocessor data sheet, order number 240296-001. I hope that this posting does not generate a meta-discussion about the appropriateness of the posting. I believe that it contains more technical information than a typical comp.arch posting. We, Intel, will try to answer questions regarding the architecture. However, due to work pressures and the need for approval prior to posting non-technical information, there will probably be a delay.

i860 64-bit Microprocessor Highlights:

Parallel Architecture: 3 Instructions/Clock
 - one integer or control instruction
 - up to two floating-point instructions

High Performance Design
 - 33.3/40 MHz Clock Rate
 - 80 MFLOPs Peak Single Precision
 - 60 MFLOPs Peak Double Precision
 - 64-bit External Data Bus
 - 64-bit Internal Instruction Cache Bus
 - 128-bit Internal Data Cache Bus

Measured Performance with Current Compilers
 - 24 Megawhetstones (40 MHz)
 - 83K Dhrystones (40 MHz)

Highly Integrated
 - 32/64-bit Pipelined Floating-Point Adder and Multiplier
 - 32-bit Integer and Control Unit
 - 64-Bit 3-D Graphics Unit
 - Paging Unit with TLB
 - 4K Byte Instruction Cache
 - 8K Byte Data Cache

The core execution unit controls overall operation of the i860 TM CPU. The core unit executes load, store, integer, bit, and control-transfer operations, and fetches instructions for the floating-point unit as well. A set of 32 32-bit general-purpose registers is provided for the manipulation of integer data. Load and store instructions move 8-, 16-, and 32-bit data to and from these registers. Its full set of integer, logical, and control-transfer instructions gives the core unit the ability to execute complete systems software and applications programs. A trap mechanism provides rapid response to exceptions and external interrupts. Debugging is supported by the ability to trap on data or instruction references.

The floating-point hardware is connected to a separate set of floating-point registers, which can be accessed as 16 64-bit registers or 32 32-bit registers. Special load and store instructions can also access these same registers as 8 128-bit registers. All floating-point instructions use these registers as their source and destination operands. The floating-point control unit controls both the floating-point adder and the floating-point multiplier, issuing instructions, handling all source and result exceptions, and updating status bits in the floating-point status register. The adder and multiplier can operate in parallel, producing up to two results per clock. The floating-point data types, floating-point instructions, and exception handling all support the IEEE Standard for Binary Floating-Point Arithmetic (ANSI/IEEE Std 754-1985).

The floating-point adder performs addition, subtraction, comparison, and conversions on 64- and 32-bit floating-point values. An adder instruction executes in three to four clocks; however, in pipelined mode, a new result is generated every clock. The floating-point multiplier performs floating-point and integer multiply and floating-point reciprocal operations on 64- and 32-bit floating-point values. A multiplier instruction executes in three to four clocks; however, in pipelined mode, a new result can be generated every clock for single precision and every other clock for double precision.
The graphics unit has special integer logic that supports three-dimensional drawing in a graphics frame buffer, with color intensity shading and hidden surface elimination via the Z-buffer algorithm. The graphics unit recognizes the pixel as an 8-, 16-, or 32-bit data type. It can compute individual red, blue, and green color intensity values within a pixel; but it does so with parallel operations that take advantage of the 64-bit internal word size and 64-bit external bus. The graphics features of the i860 microprocessor assume that the surface of a solid object is drawn with polygon patches whose shapes approximate the original object. The color intensities of the vertices of the polygon and their distances from the viewer are known, but the distances and intensities of the other points must be calculated by interpolation. The graphics instructions of the i860 CPU directly aid such interpolation.

The paging unit implements protected, paged, virtual memory via a 64-entry, four-way set-associative memory called the TLB (Translation Lookaside Buffer). The paging unit uses the TLB to perform the translation of logical addresses to physical addresses, and to check for access violations. The access protection scheme employs two levels of privilege: user and supervisor. {Editor's note: the i860 CPU's paging mechanism is the same as the 386 CPU's.}

The instruction cache is a two-way set-associative memory of four Kbytes, with 32-byte blocks. It transfers up to 64 bits per clock (266 Mbyte/sec at 33.3 MHz). The data cache is a two-way set-associative memory of eight Kbytes, with 32-byte blocks. It transfers up to 128 bits per clock (533 Mbyte/sec at 33.3 MHz). The i860 CPU normally uses writeback caching, i.e., memory writes update the cache (if applicable) without necessarily updating memory immediately; however, caching can be inhibited by software where necessary.

The bus and cache control unit performs data and instruction accesses for the core unit. It receives cycle requests and specifications from the core unit, performs the data-cache or instruction-cache miss processing, controls TLB translation, and provides the interface to the external bus. Its pipelined structure supports up to three outstanding bus cycles.

Clif Purkiser
Intel Corp, Santa Clara Microcomputer Division
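{Editor's note: the interpolation described above can be sketched in scalar C. This is a minimal illustration, not Intel's code: the function and buffer names are invented, the Z-buffer is assumed initialized to a far value, and the chip itself steps several pixels at once within a 64-bit word.}

    /* Scanline interpolation of the kind the i860's graphics
     * instructions accelerate: depth (z) and intensity (i) are known
     * at the span endpoints and stepped linearly across the scanline,
     * with the Z-buffer deciding visibility per pixel. */
    void span(int x0, int x1, int y, float z0, float z1,
              float i0, float i1,
              float *zbuf, unsigned char *fbuf, int width)
    {
        if (x1 <= x0)
            return;                          /* degenerate span */
        float dz = (z1 - z0) / (x1 - x0);
        float di = (i1 - i0) / (x1 - x0);
        float z = z0, inten = i0;
        for (int x = x0; x <= x1; x++, z += dz, inten += di) {
            if (z < zbuf[y * width + x]) {   /* in front: visible */
                zbuf[y * width + x] = z;
                fbuf[y * width + x] = (unsigned char)inten;
            }
        }
    }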
mark@mips.COM (Mark G. Johnson) (03/05/89)
Thanks to Clif Purkiser for an informative posting! <208@intelca.intel.com> did raise a question, though:

>Highly Integrated
> - 32/64-bit Pipelined Floating-Point Adder and Multiplier
> - 32-bit Integer and Control Unit
> - 64-Bit 3-D Graphics Unit
> - Paging Unit with TLB
> - 4K Byte Instruction Cache
> - 8K Byte Data Cache

Perhaps the list above is simply incomplete; by an omission it leads to speculations like:

 1. Is there a Floating-Point Divider in hardware?
 2. Are there Floating-Point Divide instructions (IEEE 32b & 64b)
    in the 80860 architecture?
 3. How many clocks does it take to do an IEEE 32b divide? 64b?

Thanks.
-- 
 -- Mark Johnson
MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
...!decwrl!mips!mark        (408) 991-0208
mash@mips.COM (John Mashey) (03/05/89)
In article <208@intelca.intel.com> clif@intelca.intel.com (Ken Shoemaker) writes:
>The following information is taken from the i860 TM 64-Bit Microprocessor
>data sheet order number 240296-001. I hope that this posting does not
>generate a meta-discussion about appropriateness of the posting....

Appropriate posting; thanx; it's much better than seeing random rumors and misinformation, and there's plenty of technical content. There are a few questions, though. I suspect this was just an oversight, as somebody MUST know the answers, but 2 of the numbers need clarification, or they are almost meaningless:

>Measured Performance with Current Compilers

I assume this was measured on real hardware, so are you allowed to say what the memory system looks like? i.e., read latency and write retirement rates, for example? (Of course, for these particular benchmarks it probably doesn't matter too much, since their cache miss rates are negligible :-)

> - 24 Megawhetstones (40 MHz)

1) Was this single precision or double precision?
2) Whichever it was, what was the other one?

> - 83K Dhrystones (40 MHz)

1) Which version: 1.1 or 2.1? I assume this wasn't 1.0, whose numbers are 15% better than 1.1.
2) What level of optimization? Any inlining? Any unusual options? (Like, for example: the manual shows normal use of a frame pointer, which costs 4 cycles/call, but could be suppressed if you know things like alloca won't be used. Since a typical 32-bit RISC would use 30-40 cycles/call, suppressing the fp-manipulation gains about 10%.)
-- 
-john mashey    DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP:    {ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:     408-991-0253 or 408-720-1700, x253
USPS:    MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
seanf@sco.COM (Sean Fagan) (03/06/89)
In article <14616@obiwan.mips.COM> mark@mips.COM (Mark G. Johnson) writes:
>Perhaps the list above is simply incomplete; by an omission it leads to
>speculations like:
> 1. Is there a Floating-Point Divider in hardware?

No.

> 2. Are there Floating-Point Divide instructions (IEEE 32b & 64b)
>    in the 80860 architecture?

No.

> 3. How many clocks does it take to do an IEEE 32b divide? 64b?

Depends. I think it might be somewhere around 30-40 (40-50?), but I'm not sure. It doesn't have divide in hardware; what it has is reciprocal approximations (1.0/x), so you do that (plus a little bit to get rid of the errors), then multiply. Kinda like a Cray, right? 8-)
-- 
Sean Eric Fagan  | "What the caterpillar calls the end of the world,
seanf@sco.UUCP   |  the master calls a butterfly." -- Richard Bach
(408) 458-1422   | Any opinions expressed are my own, not my employers'.
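{Editor's note: the "little bit to get rid of the errors" is Newton-Raphson refinement of the reciprocal seed. A minimal C sketch, assuming an ~8-bit hardware seed and three refinement steps; the seed function below is a stand-in, not the i860's actual reciprocal instruction, and the iteration count is an assumption.}

    #include <stdio.h>

    /* Refine a reciprocal approximation with Newton-Raphson:
     *     r' = r * (2 - x*r)
     * Each step roughly doubles the number of correct bits, so an
     * ~8-bit seed reaches double precision in three steps
     * (8 -> 16 -> 32 -> 64). */
    static double recip_seed(double x)
    {
        return 1.0 / x;     /* stand-in for a hardware approximation */
    }

    double divide(double a, double b)
    {
        double r = recip_seed(b);
        r = r * (2.0 - b * r);      /* ~16 bits */
        r = r * (2.0 - b * r);      /* ~32 bits */
        r = r * (2.0 - b * r);      /* ~64 bits */
        return a * r;               /* a/b = a * (1/b) */
    }

    int main(void)
    {
        printf("%.17g\n", divide(355.0, 113.0));   /* ~pi */
        return 0;
    }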
earl@wright.mips.com (Earl Killian) (03/11/89)
In article <1895@oakhill.UUCP> tomj@oakhill.UUCP (Tom Johnson) and article <21570@shemp.CS.UCLA.EDU>, marc@oahu (Marc Tremblay) discuss the choice of I-cache and D-cache size on the i860.

The reason that most systems have identically sized caches is just simplicity, and secondarily because it is often the best overall performance choice. On a large suite of benchmarks you'll find a large number of programs that miss more in the i-cache, and a large number that miss more in the d-cache. The average result depends on the set of benchmarks chosen, and on other cache parameters, such as the number of words refilled on a cache miss. There is no universal truth. For example, the M/500 had a 16KB I-cache and 8KB D-cache because its performance was more I-cache limited. But if it had a larger cache refill size, it might have become more D-limited, because large refill works better for I than D.

As for why the i860 D-cache is twice the size of the I-cache, my guess is a very simple explanation: the D-cache had to be twice as wide in order to support 128-bit load/store.

I've now simulated i860-style caches on a wide range of programs, and the size of both caches is a problem, with the I-cache more of a bottleneck. The overall cpi is in the range of 2.2-2.7, resulting in >native< MIPS of 12 to 15 at 33.3MHz. A tad far from "150-mips", eh?
-- 
UUCP: {ames,decwrl,prls,pyramid}!mips!earl
USPS: MIPS Computer Systems, 930 Arques Ave, Sunnyvale CA, 94086
mash@mips.COM (John Mashey) (03/11/89)
In article <21570@shemp.CS.UCLA.EDU> marc@cs.ucla.edu (Marc Tremblay) writes:
>For other systems, such as the MIPS M/2000 processor board, or
>the Tadpole VME board based on the 88000, cache sizes for both
>instructions and data are the same.
>Although the sizes of the caches for these system are independent,
>none of them have the "opposite" (having a larger instruction cache).
>Last time I heard, they intended to run UNIX on these machines :-) .
>Based on the choices made at MIPS (2 64k caches), they probably
>obtained simulations that show that performance/cost was optimized
>for equal size caches.

The MIPS M/500 uses 16K I + 8K D cache, having grown up from the 8K I + 8K D predecessor (about 1 BVUP difference by doubling the I-cache). No, the reason was to get the largest caches that were electrically reasonable. We'd lessen the D-cache first....
-- 
-john mashey    DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP:    {ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:     408-991-0253 or 408-720-1700, x253
USPS:    MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
jeff@Alliant.COM (Jeff Collins) (03/11/89)
In article <21570@shemp.CS.UCLA.EDU> marc@cs.ucla.edu (Marc Tremblay) writes:
:In article <1895@oakhill.UUCP> tomj@oakhill.UUCP (Tom Johnson) writes:
::4) Still on the subject of the caches: There is no way to externally
::   invalidate cache lines. This makes the part virtually unusable in multi-
::   processing configurations, since cache coherency cannot be maintained.
:
:Invalidating cache lines externally is not an absolute requirement for
:using caches in a multi-processor environment.
:There are policies that do not require this feature at all.

Which policies are these? How does one share memory in a multiprocessor when you can't have external bus watchers? Actually, never mind that, how do you switch a process from one processor to another (I don't count flushing the D-cache on each context switch as a viable answer)?
newton@kahuna.UUCP (Mike Newton) (03/13/89)
In article <1895@oakhill.UUCP> tomj@oakhill.UUCP (Tom Johnson) writes:
>2) The ISSCC paper indicates a Dhrystone _1.1_ rating of 105,000 Dhrystones
>   at 50 MHz. Why was the 1.1 version used?

WARNING: (mild) flaming

Intel is (in)famous for this. For some good (though biased!) comments on this aspect of Intel marketing, see Motorola's performance briefs for the 680x0 series micros. Over the years I have repeatedly been shocked by Intel's perverse use of benchmarks. Dhrystone 1.1 can have VERY special optimizations done to it to make it run much faster than you would expect.

Remember that in this industry (currently) rule #2 is: if you are behind, promise everything and confuse everyone as much as possible. By announcing this "fast" micro now, they will put doubts into everyone's minds about using a different machine. Later, when the part is "real", it doesn't matter so much if it isn't really that great. Just look at initial promises of the 386 vs. reality. There were a lot of people who waited for it, rather than use the 68020 -- even though it caused product delays.

Note that I am NOT saying that this (i860) is not a good chip. I don't know: [1] the specs, [2] whether it is "real"!

- mike

PS: Disclaimer Disclaimer
>tomj@oakhill.UUCP
>Disclaimer: Motorola does NOT NECESSARILY ENDORSE ANY OR ALL COMMENTS
>            CONTAINED HEREIN. ALL THOUGHTS/COMMENTS ARE PURELY MY OWN.

ps: from who you appear to work for (Motorola), I suspect (but do not imply!) that you already assumed what I wrote above, but could not say it in those words without being blasted to death. I, however, have no (known to me) stake in this -- and have more friends who have worked for Intel than Motorola.
-- 
From the bit bucket in the middle of the Pacific...
Mike Newton                          newton@csvax.caltech.edu
Caltech Submillimeter Observatory    kahuna!newton@csvax.caltech.edu
Post Office Box 4339                 808 935 1909
Hilo Hawaii 96720
"Reality is a lie that hasn't been found out yet..."
mash@mips.COM (John Mashey) (03/14/89)
In article <3024@alliant.Alliant.COM> jeff@alliant.Alliant.COM (Jeff Collins) writes:
......
>:There are policies that do not require this feature at all.
....
>	Actually never mind that, how do you switch a process from one
>	processor to another (I don't count flushing the D-cache on
>	each context switch as a viable answer)?

As I posted in <15016@winchester>, you have to flush the caches on context-switch in a single CPU, much less a multiprocessor.

As the details of the i860 become clearer, it seems obvious that the Intel engineers designed it as a very powerful numerics coprocessor (and at least the Intel engineers we know say so, also), and all the tradeoffs were made in that direction. Another way to put it is that all the design tradeoffs one sees cry out "mini-super"-style design (not supermini or mainframe-style). I'm hardly privy to the internal decisions, but it wouldn't surprise me if it hadn't been "redesigned" (i.e. relabeled) by marketing very late in the game to become a general-purpose chip, for such things as multi-user systems. :-) But don't zing the engineers too much....
-- 
-john mashey    DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP:    {ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:     408-991-0253 or 408-720-1700, x253
USPS:    MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
mash@mips.COM (John Mashey) (03/15/89)
In article <175@kahuna.UUCP> newton@kahuna.UUCP (Mike Newton) writes:
>In article <1895@oakhill.UUCP> tomj@oakhill.UUCP (Tom Johnson) writes:
>>2) The ISSCC paper indicates a Dhrystone _1.1_ rating of 105,000 Dhrystones
>>   at 50 MHz. Why was the 1.1 version used?

Despite the existence of Dhrystone 2.1, it turns out that most people, seeing unlabeled Dhrystones, think it's 1.1, if only because when you backtrack what they meant, that's what you find. At least it was labeled, and their performance document gives numbers for both, and they have a typical ratio. [Now, I still find the number puzzling, but..]
-- 
-john mashey    DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP:    {ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:     408-991-0253 or 408-720-1700, x253
USPS:    MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
brooks@maddog.llnl.gov (Eugene Brooks) (03/15/89)
In article <3024@alliant.Alliant.COM> jeff@alliant.Alliant.COM (Jeff Collins) writes:
>	Which policies are these? How does one share memory in a
>	multiprocessor when you can't have external bus watchers?
>	Actually never mind that, how do you switch a process from one
>	processor to another (I don't count flushing the D-cache on
>	each context switch as a viable answer)?

Although you don't count flushing the D-cache as a viable answer, for large scalable multiprocessor systems such a solution is indeed viable. In particular, if you are not talking about a bus architecture at all, where snooping is not very easy to do, write-through caches which use the volatile keyword in C as a mechanism to force a cache miss on a read are quite useful. Non-bus systems don't have a memory bandwidth problem; they generally have a memory latency problem. The write-through cache can be tolerated because the needed bandwidth is available, and the cache flush is quick because cache lines only need to be forgotten, not written to memory. A single instruction could nail the whole cache.

This is of course somewhat of a tangent to the issue for the i860. The i860's cache is not write-through with explicit user code management of the cache lines, so it does not fit in the "non-bus" scheme mentioned above very well, and the current on-chip cache does not have a coherence strategy useful for bus-based shared memory multiprocessors (other than to keep shared data out of the on-chip cache). For such a small on-chip cache, the "flush the cache on a context switch" is not really much of an issue. I suspect that you have to do this anyway for a single CPU machine, as the cache is a virtual memory cache.

Is the news software incompatible with your mailer too?
brooks@maddog.llnl.gov, brooks@maddog.uucp, uunet!maddog.llnl.gov!brooks
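{Editor's note: a minimal C sketch of the volatile-based scheme Eugene describes. The producer/consumer framing and the variable names are invented for illustration; the point is that reads of data communicated from another processor go through volatile objects, so they cannot be satisfied from a stale cached copy. On the write-through cache described, every write already reaches memory.}

    /* Shared data communicated between processors.  volatile forces
     * each read to be re-issued rather than reusing a cached value. */
    volatile int flag;          /* set by the producer when data is ready */
    volatile double result;     /* the datum being communicated */

    void producer(double r)
    {
        result = r;             /* write-through: reaches memory at once */
        flag = 1;
    }

    double consumer(void)
    {
        while (flag == 0)       /* volatile read: forced cache miss */
            ;                   /* spin until the producer signals */
        return result;          /* fresh from memory, not a stale line */
    }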
doug@ross.UUCP (doug carmean) (03/15/89)
In article <21923@lll-winken.LLNL.GOV> brooks@maddog.llnl.gov (Eugene Brooks) writes:
>Although you don't count flushing the D-cache as a viable answer, for
>large scalable multiprocessor systems such a solution is indeed viable.
>In particular, if you are not talking about a bus architecture at all
>where snooping is not very easy to do, write-through caches which
>use the volatile keyword in C as a mechanism to force a cache miss
>on a read are quite useful. Non-bus systems don't have a memory
>bandwidth problem, they generally have a memory latency problem.
.
It seems to me that you are proposing a multiprocessing system that implements a cache but never actually uses the cache. Your cache scheme uses write-through and then forces misses on the reads. Why even bother implementing a D-cache?
.
>This is of course somewhat of a tangent to the issue for the i860.
>The i860's cache is not write-through with explicit user code
>management of the cache lines so it does not fit in the "non-bus"
>scheme mentioned above very well, and the current on chip cache
>does not have a coherence strategy useful for bus based shared
>memory multiprocessors (other than to keep shared data out of the
>on chip cache). For such a small on chip cache the "flush
>the cache on a context switch" is not really much of an issue.
.
I think what you really mean here is that using such a cache system in an application with heavy context switching is not really much of an issue - you would never want to do it! From what I understand of the i860, you must flush the I-cache, the D-cache and the TLB on a context switch. This seems like a very big penalty to pay every single time you want to switch contexts.
.
>I suspect that you have to do this anyway for a single cpu machine
>as the cache is a virtual memory cache.
.
A virtual memory cache does not necessarily imply that you must flush the cache on every context switch. An easy solution to this problem is to store the context number along with the cache tag entry. This way the context number is compared along with the tag entry to determine whether a cache entry is a hit or not. Also note that there are alternative solutions for virtual caches in a multiprocessing system other than the solution you have presented here. A virtual cache that implements a copyback scheme with bus snooping is very feasible in a multiprocessing environment.
-- 
-doug carmean
-ROSS Technology, 7748 Hwy 290 West Suite 400, Austin, TX 78736
-ross!doug@cs.utexas.edu
rodman@mfci.UUCP (Paul Rodman) (03/15/89)
In article <15213@winchester.mips.COM> mash@mips.COM (John Mashey) writes:
>In article <3024@alliant.Alliant.COM> jeff@alliant.Alliant.COM (Jeff Collins) writes:
>......
>>:There are policies that do not require this feature at all.
>....
>>	Actually never mind that, how do you switch a process from one
>>	processor to another (I don't count flushing the D-cache on
>>	each context switch as a viable answer)?
>
>As I posted in <15016@winchester>, you have to flush the caches
>on context-switch in a single CPU, much less a multiprocessor.

Hmmmm. I'm not sure I understand this; I haven't read your posting, so forgive my ignorance here.

On our single-cpu system we use a process-id-tagged icache, and we assign the process a new hardware "pid" (or "asid", if you prefer :-), which takes essentially no time. Obviously, on a cache miss the cache is written with the hardware pid, and on a read this tag is checked against the current hardware pid. [Sorry for boring those of you that already know this from basic comp.arch 101.]

Once you've cycled around all the hardware pids, you *then* have to actually flush the cache. But this is only one time in 256, 1024, etc., etc.

So, I assume this is what you guys mean when you say "flush the cache".

Bye,
Paul K. Rodman
rodman@mfci.uucp
__... ...__   _.. .   _._ ._ .____ __.. ._
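{Editor's note: the pid/asid tag check Paul describes, as a C sketch. The geometry (direct-mapped, 8K, 32-byte lines) and the 8-bit asid are illustrative assumptions, not any particular machine's.}

    #include <stdint.h>

    #define NLINES 256                /* 8K direct-mapped, 32-byte lines */

    struct line {
        uint32_t tag;                 /* virtual-address tag */
        uint8_t  asid;                /* hardware pid written on refill */
        int      valid;
    };

    struct line icache[NLINES];
    uint8_t current_asid;             /* reassigned on context switch */

    /* A hit requires BOTH the tag and the asid to match.  A newly
     * dispatched process gets a fresh asid, so the old process's
     * lines simply stop matching; a full flush is needed only when
     * the asid space (256 here) has been cycled through. */
    int lookup(uint32_t vaddr)
    {
        struct line *l = &icache[(vaddr >> 5) % NLINES];
        return l->valid
            && l->tag  == vaddr >> 13
            && l->asid == current_asid;
    }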
marc@oahu.cs.ucla.edu (Marc Tremblay) (03/16/89)
In article <3024@alliant.Alliant.COM> jeff@alliant.Alliant.COM (Jeff Collins) writes:
>In article <21570@shemp.CS.UCLA.EDU> marc@cs.ucla.edu (Marc Tremblay) writes:
>:Invalidating cache lines externally is not an absolute requirement for
>:using caches in a multi-processor environment.
>:There are policies that do not require this feature at all.
>
>	Which policies are these? How does one share memory in a
>	multiprocessor when you can't have external bus watchers?
>	Actually never mind that, how do you switch a process from one
>	processor to another (I don't count flushing the D-cache on
>	each context switch as a viable answer)?

1) As mentioned in another article, making sharable pages uncachable can be a viable answer for various configurations. Remember, not everything is bus-based (e.g. interconnection networks), so broadcasts may not be allowed.

2) Directory-based cache coherency schemes: several papers have been published about this method. One of them is:

	"A New Solution to Coherence Problems in Multicache Systems"
	L.M. Censier and P. Feautrier,
	IEEE Transactions on Computers, Vol. C-27, No. 12, Dec. 1978, pp. 1112-1118.

The basic trick consists of appending a small vector of bits to each main memory block. The vector needs to be N+1 bits long, where N is the number of processors in the system. One bit per cache is necessary to indicate the presence or absence of the block in each cache, and an extra bit indicates whether the block has been modified or not. Reads, writes, and misses are explained in the paper.

A "cheaper" solution has been proposed in:

	"An Economical Solution to the Cache Coherence Problem"
	James Archibald and Jean-Loup Baer,
	11th Annual Symposium on Computer Architecture, June 1984, Ann Arbor MI, pp. 355-371.

In this paper the overhead of adding a vector to each memory block is reduced to 2 bits/block.

Bus watchers are nice, but they are not an absolute necessity for implementing efficient multiprocessor systems.

					Marc Tremblay
					marc@CS.UCLA.EDU
					Computer Science Department, UCLA
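{Editor's note: a sketch of the Censier/Feautrier directory entry just described: N presence bits plus a modified bit per main-memory block. N=8 and the helper names are illustrative assumptions; the invalidate step is exactly what the i860 cannot accept externally, per the discussion above.}

    #include <stdint.h>

    #define NCPUS 8                    /* assumed system size (N) */

    struct dir_entry {                 /* one per main-memory block */
        uint8_t present;               /* bit i set: cached by CPU i */
        int     modified;              /* one CPU holds a dirty copy */
    };

    /* Read miss by CPU c: if some cache holds a dirty copy, it must
     * be written back first; then record CPU c as a sharer. */
    void read_miss(struct dir_entry *d, int c)
    {
        if (d->modified) {
            /* force_writeback(owner(d));  -- hypothetical helper */
            d->modified = 0;
        }
        d->present |= (uint8_t)(1u << c);
    }

    /* Write by CPU c: invalidate every other cached copy, then
     * record CPU c as the sole, dirty owner. */
    void write_miss(struct dir_entry *d, int c)
    {
        /* for each set bit i != c in d->present:
         *     invalidate(i, block);    -- hypothetical helper */
        d->present  = (uint8_t)(1u << c);
        d->modified = 1;
    }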
loving@lanai.cs.ucla.edu (Mike Loving) (03/16/89)
In article <15213@winchester.mips.COM> mash@mips.COM (John Mashey) writes:
>In article <3024@alliant.Alliant.COM> jeff@alliant.Alliant.COM (Jeff Collins) writes:
>...... 
>>:There are policies that do not require this feature at all.
>....
>>	Actually never mind that, how do you switch a process from one
>>	processor to another (I don't count flushing the D-cache on
>>	each context switch as a viable answer)?
>
>As I posted in <15016@winchester>, you have to flush the caches
>on context-switch in a single CPU, much less a multiprocessor.

A popular misconception. It is NOT necessary to flush the cache on a context switch. If your cache is physically addressed and you do not include PIDs in the tags, then yes you do have to. An example circumventing this is the new HP machines which use a virtually addressed (no address xlat delay) cache and do not flush the cache on context switches.
-------------------------------------------------------------------------------
Mike Loving
loving@lanai.cs.ucla.edu
. . . {hplabs,ucbvax,uunet}!cs.ucla.edu!loving
-------------------------------------------------------------------------------
brooks@vette.llnl.gov (Eugene Brooks) (03/16/89)
In article <222@ross.UUCP> doug@ross.UUCP (doug carmean) writes, with regard to a write-through, explicitly managed cache system:
>It seems to me that you are proposing a multiprocessing system that
>implements a cache but never actually uses the cache. Your cache
>scheme uses write through and then forces misses on the reads. Why
>even bother implementing a D-cache?

You don't force misses on all the reads. You force misses only on reads in your shared-memory parallel program that you know are of data communicated from another processor. As an example, consider a parallel linear system solver using Gauss elimination. It is quite easy to write a parallel algorithm which explicitly manages the cache. The last row of the matrix is reused N times, where N is the matrix dimension, before it is communicated to the rest of the processors. Even explicit cache flushing for the communication could be used, but the cost of doing this is horrible context switching overhead for cache sizes large enough to be useful. Your notion of including a context descriptor in the cache line is useful for this, but one will still pay a cost when you need to clear the cache of a specific descriptor upon process death. At least it is a better situation than having to clear the cache on every context switch.

>I think what you really mean here is that using such a cache system
>in an application with heavy context switching is not really much
>of an issue - you would never want to do it! From what I understand
>of the i860, you must flush the I-cache, the D-cache and the TLB on
>a context switch. This seems like a very big penalty to pay every
>single time you want to switch contexts.

Agreed, or as in the i860, the size of the cache you are willing to treat in this manner would be quite small.

>other than the solution you have presented here. A virtual cache
>that implements a copyback scheme with bus snooping is very
>feasible in a multiprocessing environment.

No doubt there are several ways to do this. I made no claim that you couldn't. I only indicated what one might want to do for a multiprocessor using a multistage interconnection network to the memory modules, where snooping might not be that practical an option. Microprocessor speeds are cranking up to the point that the number you could hang on a bus will be very limited, even with the very best of write-back coherent cache protocols. The fact that the whole processor, memory management, cache, etc., is now appearing on one chip is making systems with large numbers of processors VERY feasible. The processors are free, and it's the memory subsystem which will cost the bucks. This will likely drive the commercial development of scalable shared memory systems soon. Scalable message passing systems, of course, are already here.

Is the news software incompatible with your mailer too?
brooks@maddog.llnl.gov, brooks@maddog.uucp, uunet!maddog.llnl.gov!brooks
tim@crackle.amd.com (Tim Olson) (03/16/89)
In article <21795@shemp.CS.UCLA.EDU> loving@cs.ucla.edu (Mike Loving) writes:
| In article <15213@winchester.mips.COM> mash@mips.COM (John Mashey) writes:
| >As I posted in <15016@winchester>, you have to flush the caches
| >on context-switch in a single CPU, much less a multiprocessor.
|
| A popular misconception. It is NOT necessary to flush the cache on a context
| switch. If your cache is physically addressed and you do not include PIDs in
| the tags, then yes you do have to. An example circumventing this is the new
| HP machines which use a virtually addressed (no address xlat delay) cache and
| do not flush the cache on context switches.

The context (no pun intended ;-) of John Mashey's posting was the i860, which must have its caches flushed on context switches, even in uniprocessor configurations.

I believe you have swapped the terms "physically" and "virtually" above -- you need PIDs in *virtually* addressed caches to avoid address aliasing problems. This assumes that you have a write-through cache.

Has anyone tried to use virtual, copy-back caches with PIDs to prevent flushing like the i860 requires? Problems of how to save modified data, as well as the consistency of shared memory, come to mind...
-- 
	Tim Olson
	Advanced Micro Devices
	(tim@amd.com)
mash@mips.COM (John Mashey) (03/16/89)
In article <706@m3.mfci.UUCP> rodman@mfci.UUCP (Paul Rodman) writes:
>In article <15213@winchester.mips.COM> mash@mips.COM (John Mashey) writes:
>>As I posted in <15016@winchester>, you have to flush the caches
>>on context-switch in a single CPU, much less a multiprocessor.
>
>Hmmmm. I'm not sure I understand this, I haven't read your posting
>so forgive my ignorance here.
...Description of classic process-tagged cache, etc...

The <15016@winchester> posting was a discussion of the i860 mechanism. There are, of course, numerous ways to avoid cache flushes per context-switch, or more often, even with virtual caches. I probably should have written the following: "on context-switch in a single i860, much less a multiprocessor i860."
-- 
-john mashey    DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP:    {ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:     408-991-0253 or 408-720-1700, x253
USPS:    MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
mash@mips.COM (John Mashey) (03/16/89)
In article <21795@shemp.CS.UCLA.EDU> loving@cs.ucla.edu (Mike Loving) writes:
>In article <15213@winchester.mips.COM> mash@mips.COM (John Mashey) writes:
>>As I posted in <15016@winchester>, you have to flush the caches
>>on context-switch in a single CPU, much less a multiprocessor.
>
>A popular misconception. It is NOT necessary to flush the cache on a context
>switch. If your cache is physically addressed and you do not include PIDs in
>the tags, then yes you do have to. An example circumventing this is the new
>HP machines which use a virtually addressed (no address xlat delay) cache and
>do not flush the cache on context switches.

Apparently <15016@winchester> got lost, or people 'n'd it. This is NOT a popular misconception. If you have a "simple" virtual-addressed/virtual-tagged cache, i.e., as in the i860, with neither pids/asids nor the (more complex) segment-style scheme of HP PA, then you will flush the caches on context switches, and you might do it more often, depending on the TLB/cache interactions and how tricky the OS wants to get in deferring flushes.

Physically-addressed caches don't need any flushes on context-switch or re-maps; you must have meant "If your cache is virtually addressed".
-- 
-john mashey    DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP:    {ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:     408-991-0253 or 408-720-1700, x253
USPS:    MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
doug@ross.UUCP (doug carmean) (03/17/89)
In article <24869@amdcad.AMD.COM> tim@amd.com (Tim Olson) asks:
>Has anyone tried to use virtual, copy-back caches with PIDs to prevent
>flushing like the i860 requires? Problems of how to save modified data,
>as well as the consistency of shared memory come to mind...
.
Cypress will be offering two parts that both implement a virtual, copy-back cache with context numbers. The CY7C604 supports 4096 contexts in an on-chip 2K tag. The '605 will be a multiprocessing version of the '604 which will have both virtual and physical tags to support bus snooping. Note that both parts support both copy-back and write-through policies.
-- 
-doug carmean
-ROSS Technology, 7748 Hwy 290 West Suite 400, Austin, TX 78736
-ross!doug@cs.utexas.edu
jeff@Alliant.COM (Jeff Collins) (03/17/89)
In article <706@m3.mfci.UUCP> rodman@mfci.UUCP (Paul Rodman) writes:
>In article <15213@winchester.mips.COM> mash@mips.COM (John Mashey) writes:
>>As I posted in <15016@winchester>, you have to flush the caches
>>on context-switch in a single CPU, much less a multiprocessor.
>
>Once you've cycled around all the hardware pids, you *then* have to
>actually flush the cache. But this is only one time in 256, 1024, etc., etc.
>
>So, I assume this is what you guys mean when you say "flush the cache".

Yep, that is the way most people do it, but my understanding (I could be wrong) is that the i860 does not have hardware pids. In other words, the TLB only maintains the virtual address -- meaning that you have to flush on EACH context switch (I wouldn't have brought it up if there were hardware pids). It is hard for me to believe that this is really the case, but that is the impression that I have gotten (my only information is the net and _Microprocessor Report_).
jeff@Alliant.COM (Jeff Collins) (03/17/89)
In article <21784@shemp.CS.UCLA.EDU> marc@cs.ucla.edu (Marc Tremblay) writes:
>In article <3024@alliant.Alliant.COM> jeff@alliant.Alliant.COM (Jeff Collins) writes:
>>In article <21570@shemp.CS.UCLA.EDU> marc@cs.ucla.edu (Marc Tremblay) writes:
>>:Invalidating cache lines externally is not an absolute requirement for
>>:using caches in a multi-processor environment.
>>
>>	Which policies are these? How does one share memory in a
>>	multiprocessor when you can't have external bus watchers?
>
>1) As mentioned in another article, making sharable pages uncachable
>   can be a viable answer for various configurations. Remember, not
>   everything is bus-based (e.g. interconnection networks),
>   so broadcasts may not be allowed.

You are right. My bias for shared memory, symmetric, bus-based multiprocessors with snoopy caches is so strong that I forget about interconnection networks.

>2) Directory-based cache coherency schemes: several papers have been
>   published about this method. One of them is:
>
>Bus watchers are nice, but they are not an absolute necessity for
>implementing efficient multiprocessor systems.

Actually, I know how directory-based cache coherency works, but you still need the ability to invalidate the internal cache from external logic. The memory will detect that a cache location on a processor needs to be invalidated and send an invalidate signal (with the physical address) to the CPU card. The CPU card must then (because the internal cache is virtual) do a reverse TLB lookup and then signal the internal cache to invalidate. All of this works just fine, with two exceptions on the i860: 1) it is not possible to do the reverse TLB lookup (not enough information external to the processor), and 2) there is no way to send the invalidate (even if you could figure out the correct virtual address to invalidate).

I stand by my claim. There is no way to put this processor into a shared memory, bus-based, symmetric multiprocessor unless you disable the D-cache or make all software that might share data manage the D-cache.
viggy@hpsal2.HP.COM (Viggy Mokkarala) (03/18/89)
John Mashey writes:
>This is NOT a popular misconception. If you have a "simple"
>virtual-addressed/virtual-tagged cache, i.e., as in the i860, with
>neither pids/asids nor the (more complex) segment-style scheme of HP PA,
>then you will flush the caches on context switches, and you might do it
>more often, depending on the TLB/cache interactions and how tricky the
>OS wants to get in deferring flushes.
>Physically-addressed caches don't need any flushes on context-switch
>or re-maps; you must have meant "If your cache is virtually addressed"
>-- 
>-john mashey    DISCLAIMER: <generic disclaimer, I speak for me only, etc>

Just to set the record straight about HP PA, I'd like to make the following observations:

1. The "synonym" problem is usually not present. Only the most privileged software can create a synonym situation, and it has the responsibility of managing the aliasing situations it creates.

2. The "segment-style" scheme that John refers to actually alludes to the fact that all processes share the same global virtual address space, but that individual processes are given their own SPACES (segments, if you want to think of them that way) for instructions and data, except for when they share data (in which case the virtual address is the same). Specified in space registers, each space is 4 GBytes long (in HP PA, there can be up to 4 G spaces). The space portion of the address is a part of the virtual address, which can be up to 64 bits wide.

3. Since the caches are virtually indexed, this scheme prevents the occurrence of the synonym problem (no two virtual addresses will map to the same physical address) and therefore the need to sweep the caches on a context switch.

Viggy Mokkarala, Hewlett Packard Company
viggy@hpda.hp.com
robertb@june.cs.washington.edu (Robert Bedichek) (03/19/89)
In article <3040@alliant.Alliant.COM> jeff@alliant.Alliant.COM (Jeff Collins) writes:
>	Yep, that is the way most people do it, but my understanding
>(I could be wrong) is that the i860 does not have hardware pids. In
>other words the TLB only maintains the virtual address - meaning that
>you have to flush on EACH context switch (I wouldn't have brought it
>up if there were hardware pids). It is hard for me to believe that
>this is really the case, but that is the impression that I have gotten
>(my only information is the net and _Microprocessor Report_).

What benefit would hardware PIDs give the i860? Hardware PIDs make sense if you have a large cache and frequent context switches, where it's likely that data from a process will stay in the cache long enough so that it is still in the cache when the process is resumed, IMHE (In My Humble Estimation). Otherwise, the hardware PID decreases performance slightly because you have to maintain them, and just spreads the cache-flush and reload time over a longer period.

If you find your system spending a lot of time flushing the cache, then try to decrease the number of context switches per second. I think some workstations have a higher context switch rate than is necessary for good response. Also, if you have a multiprocessor, you don't need as high a switch rate. In fact, while the number of processes is less than or equal to the number of processors, you don't need to context switch at all. OS people find it too easy to solve response time problems by just turning up the context switch rate. There are other ways to do it, like figuring out which task that is ready to run is likely to be the one that the user is waiting for.

Now this doesn't mean that you can't interrupt frequently for the purposes of, say, profiling. You don't have to flush the i860's cache/tlb when you switch to supervisor mode, do you? That would be pretty bad, IMHE.

	Rob Bedichek  (robertb@cs.washington.edu)
"When the last snickerdoodle is eaten, and the last
Safeway is closed, you will discover that you can not eat money."
marc@oahu.cs.ucla.edu (Marc Tremblay) (03/19/89)
In article <3041@alliant.Alliant.COM> jeff@alliant.Alliant.COM (Jeff Collins) writes:
>	Actually I know how directory based cache coherency works, but you
>still need the ability to invalidate the internal cache from external logic.
>The memory will detect that a cache location on a processor needs to be
>invalidated and send an invalidate signal (with the physical address) to
>the CPU card. The CPU card must then (because the internal cache is virtual)
>do a reverse TLB lookup and then signal the internal cache to invalidate.

Using a directory-based cache coherency method with processors that do not have the capability to have their internal cache entries invalidated externally, the memory controller has to send a vectored interrupt to one of the processors, which can then invalidate the correct cache line. I don't know how much flexibility the "flush" instruction of the i860 offers (the data sheet does not say much), but it should be possible to do it; if not, then too bad :-) .

					Marc Tremblay
					marc@CS.UCLA.EDU
					Computer Science Department, UCLA
mash@mips.COM (John Mashey) (03/19/89)
In article <2280004@hpsal2.HP.COM> viggy@hpsal2.HP.COM (Viggy Mokkarala) writes:
>John Mashey writes:
>>This is NOT a popular misconception. If you have a "simple"
>>virtual-addressed/virtual-tagged cache, i.e., as in the i860, with
>>neither pids/asids nor the (more complex) segment-style scheme of HP PA,
>>then you will flush the caches on context switches, and you might do it
>>more often, depending on the TLB/cache interactions and how tricky the
>>OS wants to get in deferring flushes.
>Just to set the record straight about HP PA, I'd like to make the following
>observations:
(description of HP PA scheme - thanx)

Sorry if my complex sentence caused confusion. It was trying to say:
	IF "simple" scheme, THEN flush caches on context switch
	ELSE /* PIDs or segment/space scheme of HP PA */ don't flush caches.

HP PA is clearly one of the RISCs designed with attention to multi-tasking, since it, for example, manages to use virtual-addressed caches while still sharing sharable code & data in the cache efficiently, unlike simple PID schemes.

For example, suppose, in a multi-user system using a simple cache-PID scheme, you have several people using the same program (such as an editor, compiler, DBMS client, etc., or one other program described later). Suppose you have executed process A, using that program, and you context-switch to process B, using that same program. You switch to the new process's PID, and even though it may be executing the exact same code, it I-cache-misses on all of it, because it has the wrong PID. Same thing happens for shared libraries. Same thing happens if it uses shared data, which things like DBMSs do, or perhaps, on some systems, X-clients & X-server; only this time it D-cache misses. This is especially exciting for dirty data: A writes some data. B attempts to read it. Since the PIDs mismatch, you flush it to memory, then you re-read it to get one there with the right copy.

Does this matter? If the caches are really small, then not much, as far as I can tell, although there isn't a lot of data published on this kind of thing, for good reason. {If you know a lot about this, you're probably a vendor who treats this as serious competitive info.} However, high-performance systems don't have small caches, and the caches MUST continue to grow, since DRAM refuses to offer seriously-better access times.

Note that HP PA avoids most of this problem via the space identifiers, which are, in some sense, a collection of PIDs or ASIDs. Thus, if you switch to a different process that's using the same program, the second process's space-maps (or whatever they call them) can at least point at the same code as did the first process, and there can be shared data regions, etc., without requiring cache-misses to refill the cache. The obvious issue that's left is doing fork-with-copy-on-write, as that looks a bit difficult to do in the classical way. Still, the scheme OBVIOUSLY was thinking of multi-tasking environments with efficiently-shared code and data, in the presence of large caches. (Ross Bott of Pyramid gave a good talk at Uniforum that included some discussions of cache issues in OLTP environments.)

Oh, the one other program that's real important is the UNIX kernel itself, which, in some sense, is often treated as a giant shared library. It has terrible locality, and so you have to kick and scratch to get every % of hit rate you can. Note that the most straightforward of PID schemes causes considerable extra thrashing around in the presence of serious context-switch rates. The typical choices come down to letting the kernel use the same PID as the current user program, which means you communicate well with it, or using a specific pid for the kernel itself, which means better hit rates for the kernel code itself, but more overhead in dealing with user processes, or combinations that switch back and forth. (Maybe somebody that does this kind of thing can post some useful info.)

Of course, numerous variations are possible, but the basic idea is: if you're using simple virtual caches, with no PIDs, or even with PIDs, you're probably thinking more about single-user performance than you are about multi-tasking performance. (There's nothing wrong with that tradeoff, of course, if that's what you're trying to do.)
-- 
-john mashey    DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP:    {ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:     408-991-0253 or 408-720-1700, x253
USPS:    MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
jeff@Alliant.COM (Jeff Collins) (03/20/89)
In article <7630@june.cs.washington.edu> robertb@uw-june.UUCP (Robert Bedichek) writes:
>In article <3040@alliant.Alliant.COM>
> jeff@alliant.Alliant.COM (Jeff Collins) writes:
>>	Yep, that is the way most people do it, but my understanding
>>(I could be wrong) is that the i860 does not have hardware pids.
>
>What benefit would hardware PIDs give the i860? Hardware PIDs make
>sense if you have a large cache and frequent context switches, where
>it's likely that data from a process will stay in the cache long enough
>so that it is still in the cache when the process is resumed, IMHE (In My
>Humble Estimation). Otherwise, the hardware PID decreases performance
>slightly because you have to maintain them, and just spreads the
>cache-flush and reload time over a longer period.
>
> <comments on increasing context switch time deleted>

It is true that an 8k data cache is fairly small, and hardware PIDs for a cache of that size may not be a win - all my mail was attempting to say, however, was that the cache must be flushed at each context switch. I, personally, am not a fan of caches that must be flushed at each context switch - this is independent of the rate of context switches (having to flush the caches will always be slower than not having to flush them). Yes, you can decrease the average cost of the cache flush by having longer times between context switches. The real objection is more from an architectural point of view: what happens if/when Intel increases the size of the D-cache? Will they then add hardware PIDs?
doug@ross.UUCP (doug carmean) (03/21/89)
In article <15531@winchester.mips.COM> John Mashey writes:
...
>	Suppose you have executed process A, using that program,
>	and you context-switch to process B, using that same program.
>	You switch to the new process's PID, and even though it may
>	be executing the exact same code, it I-cache-misses on all of it,
>	because it has the wrong PID. Same thing happens for shared libraries.
>	Same thing happens if it uses shared data, which things like
>	DBMSs do, or perhaps, on some systems, X-clients & X-server;
>	only this time it D-cache misses. This is especially exciting
>	for dirty data: A writes some data. B attempts to read it.
>	Since the PIDs mismatch, you flush it to memory, then you
>	re-read it to get one there with the right copy.
...

Mr. Mashey's description is entirely accurate for a system that uses a very simple cache controller, i.e. one that does not detect aliases. Higher performance systems will check the physical address of the requested virtual address against the physical address of the cache line that is to be replaced. If the two physical addresses match, the cache is used to handle the access; else the cache miss is processed as a normal miss. In most implementations, the aliased data/instruction can be provided with only a one cycle penalty over the cache hit access.

An implementation might choose to overwrite the cache tag context with the new context so that subsequent accesses will hit cache without the alias detection penalty (assuming the context switching scenario that Mr. Mashey proposed). If yet another context switch occurs, the same sequence of operations would occur, with the first access incurring the alias detection penalty and subsequent accesses hitting the cache.

This approach may not offer quite the performance or the glamour that the HP PA offers, but it is considerably higher performance than the approach Mr. Mashey outlined.
-- 
-doug carmean                        ross!doug@cs.utexas.edu
-ROSS Technology, 7748 Hwy 290 West Suite 400, Austin, TX 78736
-disclaimer: <insert anything you like here>
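{Editor's note: the alias check Doug describes, sketched in C. The structure layout, the tag arithmetic, and the translate() helper are assumptions for illustration; the essential point is comparing physical addresses before treating a tag/context mismatch as a real miss.}

    /* Each line keeps the physical address alongside the virtual tag
     * and context, so an apparent miss can be checked for aliasing. */
    struct vline {
        unsigned vtag, context;
        unsigned paddr;                /* physical address of the line */
        int      valid;
    };

    extern unsigned translate(unsigned vaddr);  /* TLB lookup (assumed) */

    int access(struct vline *l, unsigned vaddr, unsigned ctx)
    {
        if (l->valid && l->vtag == vaddr >> 13 && l->context == ctx)
            return 1;                  /* ordinary hit */
        if (l->valid && l->paddr == translate(vaddr)) {
            l->vtag = vaddr >> 13;     /* alias: data is already here;  */
            l->context = ctx;          /* retag it -- one extra cycle,  */
            return 1;                  /* no refill from memory         */
        }
        return 0;                      /* genuine miss: refill as usual */
    }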
w-colinp@microsoft.UUCP (Colin Plumb) (03/21/89)
jeff@alliant.Alliant.COM (Jeff Collins) wrote:
> Yep, that is the way most people do it, but my understanding
> (I could be wrong) is that the i860 does not have hardware pids. In
> other words the TLB only maintains the virtual address - meaning that
> you have to flush on EACH context switch (I wouldn't have brought it
> up if there were hardware pids). It is hard for me to believe that
> this is really the case, but that is the impression that I have gotten
> (my only information is the net and _Microprocessor Report_).

This is the case. You have to flush the cache every context switch. (Since it's a write-back cache, you have to at least flush the pending writes before messing with the page table, anyway.)

For those interested, as far as I can tell, the i860's flush instruction is half a load, that forces the write-back (you play with bits in a status register to force a given set to be used) but loads a bogus value into the cache and the destination register. I don't think the cache has valid bits.
-- 
	-Colin (uunet!microsoft!w-colinp)
"Don't listen to me. I never do." - The Doctor
jeff@Alliant.COM (Jeff Collins) (03/21/89)
In article <21972@shemp.CS.UCLA.EDU> marc@cs.ucla.edu (Marc Tremblay) writes:
>Using a directory based cache coherency method with processors that do not
>have the capability to have their internal cache entries invalidated
>externally, the memory controller has to send a vectored interrupt to one of
>the processors, which can then invalidate the correct cache line.
>I don't know how much flexibility the "flush" instruction of the i860 offers
>(the data sheet does not say much), but it should be possible to do it;
>if not, then too bad :-) .

Good point. But still, all that the memory can do is send the physical address to invalidate (it is all that it knows). This still means that the interrupt handler has to perform a reverse translation from physical to virtual (after determining which virtual space to use - actually I guess the current virtual space is all that needs to be looked at, as the D-cache must have been flushed at the last context switch), and only then can it flush the appropriate cache line. (I infer, from the fact that a complete D-cache flush is a loop, that it is possible to flush a particular line from the D-cache.) Doing this conversion is possible, but I think the conversion involves an exhaustive search of the process's page table (this can probably be optimized).

Oh, by the way, while you are doing the vectored interrupt, the reverse translation and the invalidate, don't you have to hold off the memory request that initiated this whole process (the memory has to wait until the flush or invalidate happens before it can reply)? Can you say memory latency?
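{Editor's note: the reverse translation Jeff describes amounts to a search of the current page table for the physical frame. A sketch, assuming a simple one-level table with a present bit in bit 0; the i860's actual tables are two-level, 386-style, so the real search is nested.}

    #define NPTES   1024
    #define PGSHIFT 12                 /* 4K pages */

    unsigned page_table[NPTES];        /* pte i maps virtual page i */

    /* Find a virtual address mapping the given physical address, so
     * the right cache line can be flushed.  Exhaustive, O(table size),
     * which is why holding off the memory request meanwhile is so
     * painful for latency. */
    long phys_to_virt(unsigned paddr)
    {
        unsigned frame = paddr >> PGSHIFT;
        for (int i = 0; i < NPTES; i++)
            if ((page_table[i] & 1) &&                   /* present? */
                (page_table[i] >> PGSHIFT) == frame)
                return ((long)i << PGSHIFT)
                     | (paddr & ((1 << PGSHIFT) - 1));
        return -1;                                       /* not mapped */
    }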
andrew@frip.wv.tek.com (Andrew Klossner) (03/22/89)
[] "Still on the subject of the caches: There is no way to externally invalidate cache lines. This makes the part virtually unusable in multi-processing configurations, since cache coherency cannot be maintained." "Invalidating cache lines externally is not an absolute requirement for using caches in a multi-processor environment. There are policies that do not require this feature at all." Furthermore, you can achieve the necessary effect by sending the other CPU a message telling it to invalidate its cache lines. If exception handling (get in, do it, get out) is fast enough, and if you don't do this every few nanoseconds, then the performance degradation shouldn't be a big deal. You don't absolutely need an external invalidation signal in hardware for multiprocesisng. -=- Andrew Klossner (uunet!tektronix!orca!frip!andrew) [UUCP] (andrew%frip.wv.tek.com@relay.cs.net) [ARPA]
w-colinp@microsoft.UUCP (Colin Plumb) (03/22/89)
marc@cs.ucla.edu (Marc Tremblay) wrote:
> Using a directory based cache coherency method with processors that do not
> have the capability to have their internal cache entries invalidated
> externally, the memory controller has to send a vectored interrupt to one
> of the processors, which can then invalidate the correct cache line.
> I don't know how much flexibility the "flush" instruction of the i860 offers
> (the data sheet does not say much), but it should be possible to do it;
> if not, then too bad :-) .

The flush instruction is half a load; it basically forces the data in the cache line it conflicts with to be written out and removed from the cache. I believe doing a flush specifying an address already in the cache is useless; the point is to create a fake address that conflicts with the real one (using knowledge of how the cache works) and some magic bits in a status register that let you control which set a cache miss chooses to replace. (Default is random replacement.)

So, to force a flush of the data at a certain address, you'd have to flush all 4 lines that might contain that address. (There's no way to examine the cache's contents.) With the status register munging, a bit ugly. Add my existing comments about the horribleness of the i860's exception handling, and I don't think the software approach would be acceptably fast. (BTW, no vectored interrupts.)
-- 
	-Colin (uunet!microsoft!w-colinp)
"Don't listen to me. I never do." - The Doctor
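{Editor's note: the fake-conflicting-address trick Colin describes, sketched in C. The geometry (8K, two-way, 32-byte lines, hence 128 sets) is taken from the data-sheet excerpt at the top of the thread; the i860_flush() wrapper and the fixed fake base address are hypothetical, and the status-register write that selects the replacement way is left as a comment since its encoding isn't given here.}

    #define LINE  32
    #define NSETS 128                 /* 8K / (2 ways * 32 bytes) */
    #define WAYS  2

    extern void i860_flush(unsigned long addr);  /* hypothetical wrapper
                                                    around the flush
                                                    instruction */

    /* Evict the line holding vaddr by replacement: flush addresses
     * that index the same cache set but carry a different tag, once
     * per way, so whichever way holds the victim gets written back. */
    void flush_line(unsigned long vaddr)
    {
        unsigned long set  = (vaddr / LINE) % NSETS;
        unsigned long fake = 0x80000000ul + set * LINE;  /* same set,
                                                            new tag */
        for (int w = 0; w < WAYS; w++) {
            /* select replacement way w via the status register
               (encoding elided) */
            i860_flush(fake);
        }
    }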