sritacco@hpdml93.HP.COM (Steve Ritacco) (11/10/89)
The R3000 (mips) has support for necessary multiprocessing features built in. There is support for reading from the data cache and invalidating data cache entries. This would allow a snooping system to be built fairly efficiently.
aperez@cvbnet.UUCP (Arturo Perez x6739) (11/10/89)
From article <30969@winchester.mips.COM>, by mash@mips.COM (John Mashey): > In article <516@baird.cs.strath.ac.uk> jim@cs.strath.ac.uk writes: > ....... > This is all true, but just to make sure there is no ambiguity, > and to head off a potential argument: > 1) FAST processors require faster busses, or else getting away > from busses in the direction of mainframe-style architectures. Mainframes don't have busses? What do they use instead? Or do you mean that they don't have mini- or micro- style busses? That I can understand but I would still like to know what they use; I've only ever worked on minis and workstations. Arturo Perez aperez@cvbnet.prime.com ComputerVision, a Prime Business Unit (617) 275-1800 x6739 "Too much information, like a bullet through my brain!" - The Police
rec@dg.dg.com (Robert Cousins) (11/10/89)
In article <1989Nov8.185006.13346@Solbourne.COM> stevec@momma.UUCP (Steve Cox) writes: >In article <1015@maxim.erbe.se> prc@erbe.se (Robert Claeson) writes: >>Which brings up another question -- which RISC chip is the best one to >>build a shared-memory parallel machine around? >the m88k of course! 8-) >what other risc chips (sets) where designed from the outset with shared >memory multiprocessing as a system parameter? >at least the m88k wins for simplicity of hardware design. Actually, our experiences with the 88K have been quite positive. The 88K supports both 'private' and 'global' data transactions on the bus, a burst-mode bus, and the standard MP features (cache coherency and interlocked bus operations). However, it is important to remember that any Harvard-architecture machine with independent caches has many of the characteristics of a multiprocessor machine. The bottom line is that the 88K has been successfully used in a number of multiprocessor machines, that there exist generally available MP operating systems which are compliant with the Binary Compatibility Standard, and that these machines have been shipping for (in computer terms) a long time now. >- stevec > >-- >-------------------------------------------------------------------------------- >steve cox >solbourne computer, inc. >1900 pike, longmont, co GO BUFFS !!! Robert Cousins Dept. Mgr, Workstation Dev't. Data General Corp. Speaking for myself alone.
mac@ardent.com (11/11/89)
In article <13319@pur-ee.UUCP> csh@pur-ee.UUCP (Craig S Hughes) writes: >Does anyone know of commericial implementations of bus-based >multiprocessor RISC machines? > There are a number of commercial implementations of bus-based multiprocessor MIPS R2000 & MIPS R3000 based machines available. Some even come with Vector units and Graphic processors. That particular one has been shipping since May, 1988. -mac -- Michael McNamara (St)ardent, Inc. mac@ardent.com
mac@ardent.com (11/11/89)
In article <280004@hpdml93.HP.COM> sritacco@hpdml93.HP.COM (Steve Ritacco) writes: > > The R3000 (mips) has support for necessary multiprocessing features > built in. There is support for reading from the data cache and invalidating > data cache entries. This would allow a snooping system to be built fairly > efficiently. > Yeah, the cache works, but MIPSCo provides no support for MP synchronization, forcing you to either do it as a system call (slow, and the OS doesn't get to use it) or in hardware (what we did). Things will (presumably) get better with newer chips... -- Michael McNamara (St)ardent, Inc. mac@ardent.com
irf@kuling.UUCP (Bo Thide') (11/12/89)
In article <516@baird.cs.strath.ac.uk> jim@cs.strath.ac.uk writes: >In article <1012@maxim.erbe.se> prc@erbe.se (Robert Claeson) writes: >>In article <13319@pur-ee.UUCP> csh@pur-ee.UUCP (Craig S Hughes) writes: [text deleted] >A RISC based multiprocessor machine would be an exciting prospect, but >is likely to be difficult and expensive to build. If it used the same [text deleted] Hewlett-Packard has announced that they will introduce a multiprocessor HP-PA (RISC) HP9000 machine during 1990, and that this is the reason they will not provide a pure OSF/1, but rather make HP-UX OSF/1 compliant. OSF multiprocessor support will not be present until OSF/2 (by utilising MACH?). With this capability and the Hitachi CMOS (?) HP-PA processor, they say that they are aiming at 400 MIPS for their top-level workstations. This is a very bold statement and I am really curious to hear more about the details. Anybody from HP, or anyone else, who would care to comment on this? Bo ^ Bo Thide'-------------------------------------------------------------- | | Swedish Institute of Space Physics, S-755 91 Uppsala, Sweden |I| [In Swedish: Institutet f|r RymdFysik, Uppsalaavdelningen (IRFU)] |R| Phone: (+46) 18-403000. Telex: 76036 (IRFUPP S). Fax: (+46) 18-403100 /|F|\ INTERNET: bt@irfu.se UUCP: ...!uunet!sunic!irfu!bt ~~U~~ -----------------------------------------------------------------sm5dfw
mslater@cup.portal.com (Michael Z Slater) (11/13/89)
>The R3000 (mips) has support for necessary multiprocessing features >built in. There is support for reading from the data cache and invalidating >data cache entries. This would allow a snooping system to be built fairly >efficiently. Snooping directly on the primary R3000 cache is indeed possible, but as I understand it, the degradation of CPU performance due to contention for the cache is significant. Plus, the R3000 cache is write-through, and a reasonable multiprocessor system needs write-back caches. All multiprocessor R3000 systems I'm aware of use second-level write-back caches. Michael Slater, Microprocessor Report mslater@cup.portal.com
stan@squazmo.solbourne.com (Stan Hanks) (11/13/89)
In article <516@baird.cs.strath.ac.uk> jim@cs.strath.ac.uk writes: >A RISC based multiprocessor machine would be an exciting prospect, but >is likely to be difficult ... Excuse me, but we seem to have done that some time back. A couple of different models, even.... 8{) >... it would need a *very* fast bus. Yup. And it has one! Well, pretty fast anyway -- ~128 MB/sec. Alas, other factors (like max backplane length, etc) tend to limit the number of processors and otherwise constrain bus speeds, but it can be pushed up significantly from that point. Or at least some variant could. >Of course, fancy cacheing can reduce the demands on bus bandwidth, but >that will make cache consistency harder And we have "fancy cacheing" and cache consistency. Of course, you don't *NEED* cache consistency when you build multiprocessors with cache, but determinism is SUCH a nice feature.... Seriously, it's not too hard to do something like this. The real trick is going to be supporting an arbitrarily large number of processors with interconnections fast enough so as to make all memory appear to be shared. Eugene Brooks' "KILLER MICROS" type scenario. Just imagine: a 64K-node hypercube with one's-of-nanoseconds message times.... Now THAT would be a seriously difficult to construct AND exciting prospect.... Regards, -- Stanley P. Hanks Science Advisor Solbourne Computer, Inc. Phone: Corporate: (303) 772-3400 Houston: (713) 964-6705 E-mail: ...!{boulder,sun,uunet}!stan!stan stan@solbourne.com
swarren@eugene.uucp (Steve Warren) (11/13/89)
In article <183@cvbnet.Prime.COM> aperez@cvbnet.UUCP (Arturo Perez x6739) writes: >From article <30969@winchester.mips.COM>, by mash@mips.COM (John Mashey): >> In article <516@baird.cs.strath.ac.uk> jim@cs.strath.ac.uk writes: >> ....... >> This is all true, but just to make sure there is no ambiguity, >> and to head off a potential argument: >> 1) FAST processors require faster busses, or else getting away >> from busses in the direction of mainframe-style architectures. > >Mainframes don't have busses? > >What do they use instead? A bus allows more than one device to communicate over the same path. This is done for economic reasons: connectivity easily dominates the expense of a system where busses are not used. The 'bus-less' connecting scheme provides a separate port for each device, so that each device has a dedicated communication path it shares with no other device. If there is no provision to hang more than one device off of it, then it's not a bus, it's just a point-to-point connection. --Steve ------------------------------------------------------------------------- {uunet,sun}!convex!swarren; swarren@convex.COM
bader+@andrew.cmu.edu (Miles Bader) (11/13/89)
mslater@cup.portal.com (Michael Z Slater) writes: > Plus, the R3000 cache is write-through, and a reasonable > multiprocessor system needs write-back caches. Why? -Miles
jas@postgres.uucp (James Shankland) (11/14/89)
In article <sZLhtay00UkaIj6al2@andrew.cmu.edu> bader+@andrew.cmu.edu (Miles Bader) writes: >mslater@cup.portal.com (Michael Z Slater) writes: >> Plus, the R3000 cache is write-through, and a reasonable >> multiprocessor system needs write-back caches. > >Why? Bus bandwidth limits. (Hey, at least we're being economical with *net* bandwidth here :-)). jas
stevec@momma.Solbourne.COM (Steve Cox) (11/15/89)
In article <23963@cup.portal.com> mslater@cup.portal.com (Michael Z Slater) writes: >Snooping directly on the primary R3000 cache is indeed possible, but as I >understand it, the degradation on CPU performance due to contention for the >cache is significant. Plus, the R3000 cache is write-through, and a reasonable >multiprocessor system needs write-back caches. All multiprocessor R3000 systems >I'm aware of use second-level write-back caches. second-level write-back caches? so (correct me if i am wrong), there is a first level cache that is not connected to the shared memory bus. how do these systems support cache coherency for data that is cached in the first level cache? sounds pretty hairy to me. or am i missing something? -- steve cox stevec@solbourne.com solbourne computer, inc. 1900 pike, longmont, co GO BUFFS !!! ... (303)772-3400
alan@encore (Alan Langerman) (11/15/89)
In article <1236@kuling.UUCP>, irf@kuling (Bo Thide') writes: >Hewlett-Packard has announced that they will introduce a multi HP-PA >(RISC) processor HP9000 machine during 1990 and that this is the reason they >will not provide a pure OSF/1, but rather make HP-UX OSF/1 compliant. >OSF multiprocessor support will not be present until OSF/2 (by utilising >MACH?). Pure OSF/1 will have rather extensive tightly-coupled, shared-memory multiprocessor support based on Mach. Perhaps HP will re-analyze their situation. Alan
dfields@urbana.mcd.mot.com (David Fields) (11/15/89)
In article <sZLhtay00UkaIj6al2@andrew.cmu.edu>, bader+@andrew.cmu.edu (Miles Bader) writes: > mslater@cup.portal.com (Michael Z Slater) writes: > > Plus, the R3000 cache is write-through, and a reasonable > > multiprocessor system needs write-back caches. > > Why? If you don't have a write-back cache, then you will stall every time you fill up your write-post buffer. Think about the number of cycles to real memory, the burstiness of write traffic (consider a function call that saves several args and register variables; although one would hope the args are passed in registers, some of them will probably need to be written), and the depth of the write-post buffer (2-4 words is reasonable). Then play around with the numbers and you will understand. Dave Fields // Motorola MCD // !uiucuxc!udc!dfields
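Fields' "play around with the numbers" exercise can be sketched as a toy cycle-by-cycle model. All parameters here are illustrative assumptions, not measurements of any machine discussed in this thread:

```python
# Toy model of write-through behavior: during a burst the CPU issues one
# write per cycle, stalling whenever its write-post buffer is full, while
# memory retires one buffered write every mem_cycles cycles.
# Every number below is made up purely for illustration.

def stall_cycles(burst_writes, depth, mem_cycles):
    """Return how many cycles the CPU stalls while pushing burst_writes
    consecutive writes through a write buffer of the given depth."""
    pending, buffered, timer, stalls = burst_writes, 0, 0, 0
    while pending:
        # Memory side: count down the access time of the write at the head.
        if buffered:
            if timer == 0:
                timer = mem_cycles
            timer -= 1
            if timer == 0:
                buffered -= 1
        # CPU side: post the next write, or stall if the buffer is full.
        if buffered < depth:
            buffered += 1
            pending -= 1
        else:
            stalls += 1
    return stalls

# A call spilling six words through a 2-deep buffer, memory 8 cycles away,
# versus the same burst through a buffer deep enough to absorb it whole:
print(stall_cycles(6, 2, 8))   # substantial stalling
print(stall_cycles(6, 8, 8))   # buffer never fills: zero stalls
```

The model makes Fields' point concrete: once the burst outruns the buffer, the CPU is effectively running at memory speed, which is exactly the cost a write-back cache avoids.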
aglew@urbana.mcd.mot.com (Andy-Krazy-Glew) (11/15/89)
>If you don't have a write-back cache then you will stall every time >you fill up your write-post buffer. So make the write-post buffers big enough that you don't stall very much. (I rather like the idea of a trickle-back cache, where write-back data is left in the cache for subsequent access, with only a tag put in the write-back queue. Add combining in the write-back queue (two writes to the same location do not both need to go through, modulo your consistency model) and this gets close to write-back, possibly with less control.) But then, write-back isn't *that* hard to do. Norm Jouppi says it's easier than write buffering... -- Andy "Krazy" Glew, UIUC ECE aglew@uiuc.edu (afgg6490@uxa.cso.uiuc.edu) (Formerly of Motorola MCD Urbana)
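Glew's combining idea (two writes to the same queued location collapse into one memory transaction) can be sketched as follows. The class, its interface, and the depth are illustrative assumptions, not a description of any shipped design:

```python
# Sketch of a combining write-post buffer: a write to an address that is
# already queued simply overwrites the queued data, consuming no extra
# slot and generating no extra bus transaction. Purely illustrative.

class CombiningWriteBuffer:
    def __init__(self, depth):
        self.depth = depth
        self.slots = {}          # addr -> data; dicts preserve FIFO order

    def write(self, addr, data):
        """Return False when the CPU would have to stall (buffer full)."""
        if addr in self.slots:   # combine with the earlier queued write
            self.slots[addr] = data
            return True
        if len(self.slots) >= self.depth:
            return False
        self.slots[addr] = data
        return True

    def drain_one(self):
        """Memory side: retire the oldest queued write."""
        addr = next(iter(self.slots))
        return addr, self.slots.pop(addr)
```

Note the combined write keeps its original queue position here; as the parenthetical above warns, a real design must still check this against the machine's consistency model.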
jmb@patton.sgi.com (Jim Barton) (11/16/89)
In article <1989Nov15.040039.28570@Solbourne.COM>, stevec@momma.Solbourne.COM (Steve Cox) writes: > In article <23963@cup.portal.com> mslater@cup.portal.com (Michael Z Slater) writes: > >Snooping directly on the primary R3000 cache is indeed possible, but as I > >understand it, the degradation on CPU performance due to contention for the > >cache is significant. Plus, the R3000 cache is write-through, and a reasonable > >multiprocessor system needs write-back caches. All multiprocessor R3000 systems > >I'm aware of use second-level write-back caches. > > second-level write-back caches? so (correct me if i am wrong), there > is a first level cache that is not connected to the shared memory bus. > how do these systems support cache coherency for data that is > cached in the first level cache? sounds pretty hairy to me. > or am i missing something? The Stardent Titan machines snoop directly on the first level cache. The R3000 has explicit lines (if you give up 128K caches and stick to 64K) to stall the processor and to allow you to invalidate the caches. These machines also suffer a significant penalty for invalidate traffic, causing less than stellar (pun intended) performance. The effect is mitigated by the dual-bus scheme of the Titan. Instructions and read-only data pass on a separate bus which is not snooped, and read/write data passes on a bus which is. For instance, the vector units pick up their operands from the read-only bus and write them to the read/write bus. Obviously, this scheme doesn't work too well. We may note also that there is only one R3000-based multiprocessor announced and shipping. The Titan III has been announced in Japan, but not here. The current Titan products are R2000 based. The SGI POWERSeries has a second-level cache which performs all the snooping operations and does the writeback. 
In effect, it acts as a "filter" which operates asynchronously to the processor. When a hit occurs on an invalidate, and since the first level cache is (necessarily) a subset of the second level cache, the second level cache turns around and invalidates the first level cache. So, stevec missed something too. -- Jim Barton Silicon Graphics Computer Systems "UNIX: Live Free Or Die!" jmb@sgi.sgi.com, sgi!jmb@decwrl.dec.com, ...{decwrl,sun}!sgi!jmb
lindsay@MATHOM.GANDALF.CS.CMU.EDU (Donald Lindsay) (11/16/89)
In article <1989Nov15.040039.28570@Solbourne.COM> stevec@solbourne.com (Steve Cox) writes: >second-level write-back caches? so (correct me if i am wrong), there >is a first level cache that is not connected to the shared memory bus. >how do these systems support cache coherency for data that is >cached in the first level cache? sounds pretty hairy to me. >or am i missing something? Yes, it's reasonably hairy. A good introduction would be the paper by Baer's group, which appeared in this year's Computer Architecture Conference proceedings (i.e. the June 1989 SigARCH). His scheme is not the only possible one, but the other schemes have roughly similar complexity. As for the data in the first level cache: there are two answers. One, make the first level use writethrough, so that the second level always gets a copy. This gives the "inclusion" property, whereby the second level always contains a strict superset of the first level. The second level occasionally has to invalidate data which is in both levels, and this means that it has to be able to reach in and nuke something that is in the first level. Two, make the first level use writeback, but inform the second level of each write. The second level creates a hole (if necessary), which the first level can later write the data to. This allows the second level to do all the snoopy/coherence things, as before. Another fun issue is the question of synonyms. Some operating systems (such as Mach) want nonunique inverse mappings: that is, one physical page present in N virtual spaces, N > 1. If the cache(s) use physical addresses, no problem. If the cache(s) are flushed on context switch, no problem. Otherwise, there is a nasty problem: the same data could be in two places in the same cache! -- Don D.C.Lindsay Carnegie Mellon Computer Science
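Lindsay's first answer (a write-through first level kept inside a second level that does the snooping, with the second level reaching in to "nuke" first-level copies) can be sketched with sets standing in for tag stores. This is a toy illustration of the inclusion property, not any vendor's design:

```python
# Toy model of the "inclusion" scheme: L1 is write-through and kept a
# strict subset of L2, so L2 can filter bus snoops and only disturbs
# the L1 cache (and hence the CPU) on an actual hit.

class TwoLevelCache:
    def __init__(self):
        self.l1 = set()            # line addresses held in the first level
        self.l2 = set()            # invariant ("inclusion"): l1 <= l2

    def cpu_fill(self, addr):
        self.l2.add(addr)          # fill L2 first so inclusion always holds
        self.l1.add(addr)

    def l2_evict(self, addr):
        self.l2.discard(addr)
        self.l1.discard(addr)      # preserving inclusion forces this nuke

    def snoop(self, addr):
        """Called for every write observed on the shared bus."""
        if addr in self.l2:        # hit: reach in and invalidate both levels
            self.l2.discard(addr)
            self.l1.discard(addr)
            return True
        return False               # miss: the CPU is never disturbed

c = TwoLevelCache()
c.cpu_fill(0x40)
print(c.snoop(0x40), c.snoop(0x80))   # hit, then a filtered-out miss
```

The value of the filter is visible in snoop(): only the (comparatively rare) second-level hits ever touch l1, so most bus traffic costs the processor nothing.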
toms@omews44.intel.com (Tom Shott) (11/16/89)
I can think of two methods of supporting cache coherence for multi-level caches. Both assume that there is a first level internal cache with some method of invalidating (or flushing) a line, and a second level cache connected to the bus. The easy method is to use write-through caching for both levels. Every time a write occurs on the system bus, flush the address from both caches. This has performance penalties because typically every invalidate cycle on the internal cache blocks the execution unit from accessing it. A second, harder method is to keep the internal cache a subset of the external cache. Every time a line is removed from the external cache, invalidate that line in the internal cache. All system bus accesses are looked up in the second level cache directory. Only those accesses that are contained in the external cache are invalidated from the internal cache, so there is less contention for the internal cache. The second method could be expanded to keep track of what's in the internal cache in the external directory, so that only lines actually in the internal cache are invalidated. The problem with this is knowing what's in the internal cache. You don't see all the processor accesses, so if you're using an LRU-type replacement strategy you have no idea outside the chip what's going to be replaced in the internal cache. If the internal cache always signaled the outside cache on replacements, the outside cache would know and could filter the invalidate traffic. You can layer a write-back external cache on this protocol, and even layer an internal write-back cache. -- ----------------------------------------------------------------------------- Tom Shott INTeL, 2111 NE 25th Ave., Hillsboro, OR 97123, (503) 696-4520 toms@omews44.intel.com OR toms%omews44.intel.com@csnet.relay.com INTeL.. Designers of the 960 Superscalar uP and other uP's
freudent@eric.nyu.edu (Eric Freudenthal) (11/16/89)
There is another well-known solution to the problem of dual-porting a cache between a shared bus and a PE (processor). The idea is to use some sort of filter to keep bus transactions which do not affect the cache from reaching the cache. This solution is cheaper than building two identical caches and is equally effective. Build a conventional cache augmented with an extra copy of the tag-store, which will be used as a filter. This copy is updated every time the real one is. Clearly, in the absence of cache misses, the extra tag-store is never updated. Bus transactions are looked up in this extra tag-store without disturbing the real cache if the address does not match. If it does, then the real cache entry is updated or invalidated (similarly changing the tag-store copy). -- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - Eric Freudenthal NYU Ultracomputer Lab 715 Broadway, 10th floor New York, NY 10012 Phone:(212) 998-3345 work (718) 789-4486 home Email:freudent@ultra.nyu.edu
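The duplicate-tag-store filter described above might look like the following for a direct-mapped cache. The sizes, names, and interface are assumptions made for illustration only:

```python
# Sketch of a snoop filter built from an extra copy of the tag store:
# bus traffic probes only the duplicate copy, and the real cache is
# disturbed only when the probe actually matches. Toy illustration.

class DupTagCache:
    def __init__(self, n_sets=64, line_bytes=16):
        self.n_sets, self.line_bytes = n_sets, line_bytes
        self.tags = [None] * n_sets      # real tag store, used by the CPU
        self.dup_tags = [None] * n_sets  # filter copy, probed by the bus

    def _split(self, addr):
        line = addr // self.line_bytes
        return line % self.n_sets, line // self.n_sets

    def cpu_fill(self, addr):
        idx, tag = self._split(addr)
        self.tags[idx] = tag
        self.dup_tags[idx] = tag   # both copies change only on a miss/fill

    def bus_write(self, addr):
        idx, tag = self._split(addr)
        if self.dup_tags[idx] == tag:    # match: now (and only now) the
            self.tags[idx] = None        # real cache entry is invalidated
            self.dup_tags[idx] = None
            return True
        return False                     # no match: real cache undisturbed
```

Since the duplicate is written only on fills, the common case (a bus write that misses) never steals a cycle from the execution unit, which is exactly the point of the filter.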
deraadt@cpsc.ucalgary.ca (Theo Deraadt) (11/20/89)
In article <23963@cup.portal.com>, mslater@cup.portal.com (Michael Z Slater) writes: > Plus, the R3000 cache is write-through, and a reasonable > multiprocessor system needs write-back caches. > Michael Slater, Microprocessor Report mslater@cup.portal.com As long as they are physical write-back caches and not virtual. Unless you like flushing huge virtual write-back caches every time you context switch and map in your next process. <tdr.
petolino@joe.Sun.COM (Joe Petolino) (11/23/89)
>> Plus, the R3000 cache is write-through, and a reasonable >> multiprocessor system needs write-back caches. >As long as they are physical write-back caches and not virtual. Unless you >like flushing huge virtual write-back caches every time you context switch >and map in your next process. The solution to this problem is obvious: use the process ID (or something smaller which maps to the process ID) as part of the Virtual Address tag. People have been doing it for years. -Joe
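Petolino's fix can be sketched as follows: fold an address-space ID (ASID) into each tag, so a context switch just changes the current ASID instead of flushing the cache. Sizes and names below are illustrative assumptions:

```python
# Toy virtual cache whose tags include an address-space ID: after a
# context switch, the previous process's lines simply fail to match,
# so no flush is ever required. Purely illustrative structure.

class AsidTaggedCache:
    def __init__(self, n_sets=256, line_bytes=16):
        self.n_sets, self.line_bytes = n_sets, line_bytes
        self.store = {}                  # set index -> (asid, vtag, data)

    def _split(self, vaddr):
        line = vaddr // self.line_bytes
        return line % self.n_sets, line // self.n_sets

    def fill(self, asid, vaddr, data):
        idx, vtag = self._split(vaddr)
        self.store[idx] = (asid, vtag, data)

    def lookup(self, asid, vaddr):
        idx, vtag = self._split(vaddr)
        entry = self.store.get(idx)
        # A hit requires BOTH the virtual tag and the ASID to match, so
        # one process can never see another's cached lines by accident.
        if entry and entry[0] == asid and entry[1] == vtag:
            return entry[2]
        return None                      # miss: refill, but never a flush
```

Two processes using the same virtual address now coexist in the cache; the ASID field disambiguates them, which is the "something smaller which maps to the process ID" in Petolino's description.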