[comp.arch] RISC multiprocessors

sritacco@hpdml93.HP.COM (Steve Ritacco) (11/10/89)

The R3000 (mips) has support for necessary multiprocessing features
built in.  There is support for reading from the data cache and invalidating
data cache entries.  This would allow a snooping system to be built fairly
efficiently.

aperez@cvbnet.UUCP (Arturo Perez x6739) (11/10/89)

From article <30969@winchester.mips.COM>, by mash@mips.COM (John Mashey):
> In article <516@baird.cs.strath.ac.uk> jim@cs.strath.ac.uk writes:
> .......
> This is all true, but just to make sure there is no ambiguity,
> and to head off a potential argument:
> 	1) FAST processors require faster busses, or else getting away
> 	from busses in the direction of mainframe-style architectures.

Mainframes don't have busses?  

What do they use instead?

Or do you mean that they don't have mini- or micro- style busses?  That I can
understand but I would still like to know what they use; I've only
ever worked on minis and workstations.


Arturo Perez						aperez@cvbnet.prime.com
ComputerVision, a Prime Business Unit			(617) 275-1800 x6739
"Too much information, like a bullet through my brain!" - The Police

rec@dg.dg.com (Robert Cousins) (11/10/89)

In article <1989Nov8.185006.13346@Solbourne.COM> stevec@momma.UUCP (Steve Cox) writes:
>In article <1015@maxim.erbe.se> prc@erbe.se (Robert Claeson) writes:
>>Which brings up another question -- which RISC chip is the best one to
>>build a shared-memory parallel machine around?
>the m88k of course!  8-)
>what other risc chips (sets) were designed from the outset with shared
>memory multiprocessing as a system parameter?
>at least the m88k wins for simplicity of hardware design.

Actually, our experiences with the 88K have been quite positive.  The 88K
supports both 'private' and 'global' data transactions on the bus, a burst mode
bus and the standard MP features (cache coherency and interlocked bus operations).
However it is important to remember that any Harvard architecture machine with
independent caches has many of the characteristics of a multiprocessor machine.

The bottom line is that the 88K has been successfully used in a number of multiprocessor
machines, that there exist generally available MP operating systems which are
compliant with the Binary Compatibility Standard, and that these machines have
been shipping for (in computer terms) a long time now.

>- stevec
>
>-- 
>--------------------------------------------------------------------------------
>steve cox
>solbourne computer, inc.
>1900 pike, longmont, co			GO BUFFS !!!

Robert Cousins
Dept. Mgr, Workstation Dev't.
Data General Corp.

Speaking for myself alone.

mac@ardent.com (11/11/89)

In article <13319@pur-ee.UUCP> csh@pur-ee.UUCP (Craig S Hughes) writes:

>Does anyone know of commercial implementations of bus-based 
>multiprocessor RISC machines?
>

	There are a number of commercial implementations of bus-based
multiprocessor MIPS R2000 & MIPS R3000 based machines available.  Some
even come with Vector units and Graphic processors.  That particular
one has been shipping since May, 1988.

	-mac



--
Michael McNamara	(St)ardent, Inc.		mac@ardent.com

mac@ardent.com (11/11/89)

  In article <280004@hpdml93.HP.COM> sritacco@hpdml93.HP.COM (Steve
  Ritacco) writes:

>  
>  The R3000 (mips) has support for necessary multiprocessing features
>  built in.  There is support for reading from the data cache and invalidating
>  data cache entries.  This would allow a snooping system to be built fairly
>  efficiently.
>
	Yeah, the cache works, but MIPSCo provides no support for MP
synchronization, forcing you to do it either as a system call (slow,
and the OS itself doesn't get to use it) or in hardware (what we did).
	
	Things will (presumably) get better with newer chips...
--
Michael McNamara	(St)ardent, Inc.		mac@ardent.com

irf@kuling.UUCP (Bo Thide') (11/12/89)

In article <516@baird.cs.strath.ac.uk> jim@cs.strath.ac.uk writes:
>In article <1012@maxim.erbe.se> prc@erbe.se (Robert Claeson) writes:
>>In article <13319@pur-ee.UUCP> csh@pur-ee.UUCP (Craig S Hughes) writes:

[text deleted]

>A RISC based multiprocessor machine would be an exciting prospect, but
>is likely to be difficult and expensive to build. If it used the same

[text deleted]

Hewlett-Packard has announced that they will introduce a multi HP-PA
(RISC) processor HP9000 machine during 1990 and that this is the reason they
will not provide a pure OSF/1, but rather make HP-UX OSF/1 compliant.
OSF multiprocessor support will not be present until OSF/2 (by utilising
MACH?).  With this capability and the Hitachi CMOS (?) HP-PA processor
they say that they are aiming at 400 MIPS for their top level workstations.
This is a very bold statement and I am really curious to hear more about
the details.  Anybody from HP, or anyone else, who would care to comment
on this?

Bo

   ^   Bo Thide'--------------------------------------------------------------
  | |       Swedish Institute of Space Physics, S-755 91 Uppsala, Sweden
  |I|    [In Swedish: Institutet f|r RymdFysik, Uppsalaavdelningen (IRFU)]
  |R|  Phone: (+46) 18-403000.  Telex: 76036 (IRFUPP S).  Fax: (+46) 18-403100 
 /|F|\        INTERNET: bt@irfu.se       UUCP: ...!uunet!sunic!irfu!bt
 ~~U~~ -----------------------------------------------------------------sm5dfw

mslater@cup.portal.com (Michael Z Slater) (11/13/89)

>The R3000 (mips) has support for necessary multiprocessing features
>built in.  There is support for reading from the data cache and invalidating
>data cache entries.  This would allow a snooping system to be built fairly
>efficiently.

Snooping directly on the primary R3000 cache is indeed possible, but as I
understand it, the degradation in CPU performance due to contention for the
cache is significant.  Plus, the R3000 cache is write-through, and a reasonable
multiprocessor system needs write-back caches.  All multiprocessor R3000 systems
I'm aware of use second-level write-back caches.
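
To put rough numbers on the write-through versus write-back point, here
is a minimal sketch, not a model of any real R3000 system.  It counts
only the write transactions that reach the bus for a stream of stores.
The line size, the direct-mapped organisation and the store pattern are
all assumptions made for illustration, and line fills on a write miss
are ignored.

```python
# Hypothetical model: count bus write transactions caused by a store stream.
LINE_SIZE = 16  # bytes per cache line (assumed)


def write_through_bus_writes(stores):
    """Write-through: every store is sent to memory over the bus."""
    return len(stores)


def write_back_bus_writes(stores, num_lines):
    """Write-back, direct-mapped: a dirty line reaches the bus only when
    it is evicted, plus one final flush of whatever is still dirty."""
    dirty = {}      # cache index -> line address held dirty at that index
    writes = 0
    for addr in stores:
        line = addr // LINE_SIZE
        index = line % num_lines
        if index in dirty and dirty[index] != line:
            writes += 1          # evict the old dirty line
        dirty[index] = line
    return writes + len(dirty)   # flush the remaining dirty lines


# A loop that repeatedly updates the same 8 words (2 lines of 16 bytes).
stores = [0x100 + 4 * (i % 8) for i in range(1000)]
wt = write_through_bus_writes(stores)
wb = write_back_bus_writes(stores, num_lines=64)
print(wt, wb)   # prints "1000 2"
```

Each write-back transaction moves a whole line rather than one word, so
the byte counts are closer than the transaction counts suggest, but for
stores with this much locality the gap stays large.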

Michael Slater, Microprocessor Report   mslater@cup.portal.com

stan@squazmo.solbourne.com (Stan Hanks) (11/13/89)

In article <516@baird.cs.strath.ac.uk> jim@cs.strath.ac.uk writes:
>A RISC based multiprocessor machine would be an exciting prospect, but
>is likely to be difficult ...

Excuse me, but we seem to have done that some time back. A couple of
different models, even.... 8{)

>... it would need a *very* fast bus.

Yup. And it has one! Well, pretty fast anyway -- ~128 MB/sec. Alas, other
factors (like max backplane length, etc) tend to limit the number of processors
and otherwise constrain bus speeds, but it can be pushed up significantly
from that point. Or at least some variant could.

>Of course, fancy cacheing can reduce the demands on bus bandwidth, but
>that will make cache consistency harder 

And we have "fancy cacheing" and cache consistency. Of course, you don't 
*NEED* cache consistency when you build multiprocessors with cache, but
determinism is SUCH a nice feature....

Seriously, it's not too hard to do something like this. The real trick
is going to be supporting an arbitrarily large number of processors
with interconnections fast enough so as to make all memory appear to
be shared. Eugene Brooks' "KILLER MICROS" type scenario. Just imagine: 
a 64K-node hypercube with one's-of-nanoseconds message times....
Now THAT would be a seriously difficult to construct AND exciting
prospect....

Regards,

-- 
Stanley P. Hanks   Science Advisor                    Solbourne Computer, Inc.
Phone:             Corporate: (303) 772-3400           Houston: (713) 964-6705
E-mail:            ...!{boulder,sun,uunet}!stan!stan        stan@solbourne.com

swarren@eugene.uucp (Steve Warren) (11/13/89)

In article <183@cvbnet.Prime.COM> aperez@cvbnet.UUCP (Arturo Perez x6739) writes:
>From article <30969@winchester.mips.COM>, by mash@mips.COM (John Mashey):
>> In article <516@baird.cs.strath.ac.uk> jim@cs.strath.ac.uk writes:
>> .......
>> This is all true, but just to make sure there is no ambiguity,
>> and to head off a potential argument:
>> 	1) FAST processors require faster busses, or else getting away
>> 	from busses in the direction of mainframe-style architectures.
>
>Mainframes don't have busses?  
>
>What do they use instead?

The bus allows more than one device to communicate over the same path.  This
technology is used for improved economics.  Connectivity will easily dominate
the expense of a system where busses are not used.

The 'bus-less' connecting scheme provides separate ports for each device.
Thus there is a dedicated communication path so that no device has to share
its path with other devices.  If there is no provision to hang more than
one device off of it, then it's not a bus, it's just a point-to-point
connection.
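
A small sketch of why connectivity dominates the cost once the bus is
gone.  It assumes, purely for illustration, that every processor gets a
dedicated path to every memory bank; real mainframe interconnects are
not necessarily fully connected, and the function names are made up
here.

```python
def shared_bus_paths(processors, memory_banks):
    """One shared path, however many devices hang off it."""
    return 1


def dedicated_paths(processors, memory_banks):
    """Hypothetical fully connected scheme: one point-to-point path
    from each processor to each memory bank."""
    return processors * memory_banks


for p in (2, 4, 8, 16):
    # Assume as many memory banks as processors.
    print(p, shared_bus_paths(p, p), dedicated_paths(p, p))
```

The dedicated count grows with the product of the two device counts,
while the bus stays at one path that everybody has to share.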

--Steve
-------------------------------------------------------------------------
	  {uunet,sun}!convex!swarren; swarren@convex.COM

bader+@andrew.cmu.edu (Miles Bader) (11/13/89)

mslater@cup.portal.com (Michael Z Slater) writes:
> Plus, the R3000 cache is write-through, and a reasonable
> multiprocessor system needs write-back caches.

Why?

-Miles

jas@postgres.uucp (James Shankland) (11/14/89)

In article <sZLhtay00UkaIj6al2@andrew.cmu.edu> bader+@andrew.cmu.edu (Miles Bader) writes:
>mslater@cup.portal.com (Michael Z Slater) writes:
>> Plus, the R3000 cache is write-through, and a reasonable
>> multiprocessor system needs write-back caches.
>
>Why?

Bus bandwidth limits.

(Hey, at least we're being economical with *net* bandwidth here :-)).

jas

stevec@momma.Solbourne.COM (Steve Cox) (11/15/89)

In article <23963@cup.portal.com> mslater@cup.portal.com (Michael Z Slater) writes:
>Snooping directly on the primary R3000 cache is indeed possible, but as I
>understand it, the degradation on CPU performance due to contention for the
>cache is significant.  Plus, the R3000 cache is write-through, and a reasonable
>multiprocessor system needs write-back caches.  All multiprocessor R3000 systems
>I'm aware of use second-level write-back caches.

second-level write-back caches?  so (correct me if i am wrong), there 
is a first level cache that is not connected to the shared memory bus.
how do these systems support cache coherency for data that is 
cached in the first level cache?  sounds pretty hairy to me. 
or am i missing something?

--
steve cox			stevec@solbourne.com
solbourne computer, inc.
1900 pike, longmont, co			GO BUFFS !!!   ...
(303)772-3400

alan@encore (Alan Langerman) (11/15/89)

In article <1236@kuling.UUCP>, irf@kuling (Bo Thide') writes:
>Hewlett-Packard has announced that they will introduce a multi HP-PA
>(RISC) processor HP9000 machine during 1990 and that this is the reason they
>will not provide a pure OSF/1, but rather make HP-UX OSF/1 compliant.
>OSF multiprocessor support will not be present until OSF/2 (by utilising
>MACH?).

Pure OSF/1 will have rather extensive tightly-coupled, shared-memory
multiprocessor support based on Mach.  Perhaps HP will re-analyze their
situation.

Alan

dfields@urbana.mcd.mot.com (David Fields) (11/15/89)

In article <sZLhtay00UkaIj6al2@andrew.cmu.edu>, bader+@andrew.cmu.edu
(Miles Bader) writes:
> mslater@cup.portal.com (Michael Z Slater) writes:
> > Plus, the R3000 cache is write-through, and a reasonable
> > multiprocessor system needs write-back caches.
> 
> Why?

If you don't have a write-back cache then you will stall every time
you fill up your write-post buffer.  Think about the number of cycles
to real memory, the burstiness of write traffic (a function call with
several args and register variables, although one would hope the args
are in registers, some of them will probably need to be written) and
the depth of the write-post buffer (2-4 words are reasonable).

Then play around with the numbers and you will understand.
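
For anyone who wants to play with the numbers, here is a minimal sketch
of a write-post buffer.  The model is an assumption, not a description
of any particular machine: one instruction issues per cycle, memory
retires one buffered write every mem_cycles cycles, and the CPU stalls
whenever it wants to store into a full buffer.  depth and mem_cycles
must both be at least 1.

```python
def _tick(queue, timer, mem_cycles):
    """Advance memory by one cycle; retire the head write when done."""
    if queue > 0:
        timer -= 1
        if timer == 0:
            queue -= 1
            timer = mem_cycles if queue > 0 else 0
    return queue, timer


def stall_cycles(store_pattern, depth, mem_cycles):
    """store_pattern: one boolean per instruction, True for a store.
    Returns the number of cycles the CPU spends waiting on a full
    write-post buffer."""
    queue = 0    # writes waiting in the buffer
    timer = 0    # cycles until the head write retires
    stalls = 0
    for is_store in store_pattern:
        if is_store:
            while queue == depth:        # buffer full: CPU waits
                stalls += 1
                queue, timer = _tick(queue, timer, mem_cycles)
            queue += 1
            if queue == 1:               # buffer was empty: start memory
                timer = mem_cycles
        queue, timer = _tick(queue, timer, mem_cycles)
    return stalls


burst = [True] * 8    # e.g. a call that saves eight registers
shallow = stall_cycles(burst, depth=4, mem_cycles=6)
deep = stall_cycles(burst, depth=8, mem_cycles=6)
print(shallow, deep)  # prints "17 0"
```

With these assumed figures a 4-deep buffer costs 17 stall cycles on an
8-store burst, while an 8-deep buffer absorbs the burst completely.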

Dave Fields // Motorola MCD //  !uiucuxc!udc!dfields

aglew@urbana.mcd.mot.com (Andy-Krazy-Glew) (11/15/89)

>If you don't have a write-back cache then you will stall every time
>you fill up your write-post buffer. 

So make the write post buffers big enough that you don't stall very much.

(I rather like the idea of a trickle-back cache, where write-back data
is left in the cache for subsequent access, with only a tag put in the
write-back queue.  Add combining in the write-back queue (two writes
to the same location do not both need to go through, modulo your
consistency model) and this gets close to write-back, possibly with
less control.)

But then, write-back isn't *that* hard to do. Norm Jouppi says it's
easier than write buffering...
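
A minimal sketch of the combining idea, assuming a queue keyed by
address in which a later write to a queued address simply replaces the
pending value.  The class and method names are invented for this
example.  Note that combining changes the order in which writes become
visible, which is exactly the "modulo your consistency model" caveat.

```python
from collections import OrderedDict


class CombiningWriteQueue:
    """Hypothetical write queue that merges writes to the same address."""

    def __init__(self):
        self.pending = OrderedDict()  # address -> latest value, oldest first
        self.bus_writes = 0

    def post(self, addr, value):
        # Re-assigning an existing key keeps its place in the queue,
        # so the second write rides along with the first.
        self.pending[addr] = value

    def drain(self, memory):
        """Retire every queued write, oldest address first."""
        while self.pending:
            addr, value = self.pending.popitem(last=False)
            memory[addr] = value
            self.bus_writes += 1


memory = {}
q = CombiningWriteQueue()
q.post(0x10, 1)
q.post(0x14, 2)
q.post(0x10, 3)     # combines with the first write
q.drain(memory)
print(q.bus_writes, memory[0x10])   # prints "2 3"
```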
--
Andy "Krazy" Glew,  UIUC ECE
aglew@uiuc.edu  (afgg6490@uxa.cso.uiuc.edu)

(Formerly of Motorola MCD Urbana)

jmb@patton.sgi.com (Jim Barton) (11/16/89)

In article <1989Nov15.040039.28570@Solbourne.COM>, stevec@momma.Solbourne.COM (Steve Cox) writes:
> In article <23963@cup.portal.com> mslater@cup.portal.com (Michael Z Slater) writes:
> >Snooping directly on the primary R3000 cache is indeed possible, but as I
> >understand it, the degradation on CPU performance due to contention for the
> >cache is significant.  Plus, the R3000 cache is write-through, and a reasonable
> >multiprocessor system needs write-back caches.  All multiprocessor R3000 systems
> >I'm aware of use second-level write-back caches.
> 
> second-level write-back caches?  so (correct me if i am wrong), there 
> is a first level cache that is not connected to the shared memory bus.
> how do these systems support cache coherency for data that is 
> cached in the first level cache?  sounds pretty hairy to me. 
> or am i missing something?
> 
> 
> --
> steve cox			stevec@solbourne.com
> solbourne computer, inc.
> 1900 pike, longmont, co			GO BUFFS !!!   ...
> (303)772-3400

The Stardent Titan machines snoop directly on the first level cache.  The
R3000 has explicit lines (if you give up 128K caches and stick to 64K) to
stall and to allow you to invalidate the caches.  These machines also
suffer a significant penalty for invalidate traffic, causing less than
stellar (pun intended) performance.  The effect is mitigated by the
dual-bus scheme of the Titan.  Instructions and read-only data pass
on a separate bus which is not snooped, and read/write data passes on a bus
which is.  For instance, the vector units pick up their operands from
the read-only bus and write them to the read/write bus.  Obviously, this
scheme doesn't work too well.

We may note also that there is only one R3000 based multiprocessor
announced and shipping.  The Titan III has been announced in Japan,
but not here.  The current Titan products are R2000 based.

The SGI POWERSeries has a second level cache which performs all the 
snooping operations and does the writeback.  In effect, it acts as a
"filter" which operates asynchronously to the processor.  When an
invalidate hits in the second level cache, then since the first level
cache is (necessarily) a subset of the second level cache, the second
level cache turns around and invalidates the first level cache.  So,
stevec missed something too.
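
The filtering effect can be sketched as follows.  This is a toy model
under stated assumptions, not the POWERSeries design: caches are plain
sets of line addresses, and every name here is hypothetical.  What it
shows is that the first level cache is only disturbed by invalidates
that hit in the second level.

```python
class FilteredHierarchy:
    """Hypothetical two-level cache; L1 is kept a subset of L2."""

    def __init__(self):
        self.l1 = set()          # line addresses in the first-level cache
        self.l2 = set()          # line addresses in the second-level cache
        self.l1_invalidates = 0  # cycles stolen from the processor's cache

    def fill(self, line):
        # A line enters L2 whenever it enters L1 (inclusion).
        self.l2.add(line)
        self.l1.add(line)

    def replace_in_l2(self, line):
        # Dropping a line from L2 must also drop it from L1.
        self.l2.discard(line)
        self.l1.discard(line)

    def snoop_invalidate(self, line):
        """Bus invalidate.  Returns True if it had to disturb L1."""
        if line not in self.l2:
            return False         # filtered: L1 never sees it
        self.l2.discard(line)
        self.l1.discard(line)    # L2 turns around and invalidates L1
        self.l1_invalidates += 1
        return True


h = FilteredHierarchy()
h.fill(0x40)
h.fill(0x80)
missed = h.snoop_invalidate(0xC0)   # not cached here: filtered out
hit = h.snoop_invalidate(0x40)      # cached: forwarded to L1
print(missed, hit, h.l1_invalidates)
```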

-- Jim Barton
Silicon Graphics Computer Systems    "UNIX: Live Free Or Die!"
jmb@sgi.sgi.com, sgi!jmb@decwrl.dec.com, ...{decwrl,sun}!sgi!jmb

lindsay@MATHOM.GANDALF.CS.CMU.EDU (Donald Lindsay) (11/16/89)

In article <1989Nov15.040039.28570@Solbourne.COM> stevec@solbourne.com 
	(Steve Cox) writes:
>second-level write-back caches?  so (correct me if i am wrong), there 
>is a first level cache that is not connected to the shared memory bus.
>how do these systems support cache coherency for data that is 
>cached in the first level cache?  sounds pretty hairy to me. 
>or am i missing something?

Yes, it's reasonably hairy. A good introduction would be the paper by
Baer's group, which appeared in this year's Computer Architecture
Conference proceedings (i.e. the June 1989 SigARCH). His scheme is
not the only possible one, but the other schemes have roughly similar
complexity.

As for the data in the first level cache: there are two answers.

One, make the first level use writethrough, so that the second level
always gets a copy. This gives the "inclusion" property, whereby the
second level always contains a strict superset of the first level.
The second level occasionally has to invalidate data which is in both
levels, and this means that it has to be able to reach in and nuke
something that is in the first level.

Two, make the first level use writeback, but inform the second level
of each write. The second level creates a hole (if necessary), which
the first level can later write the data to. This allows the second
level to do all the snoopy/coherence things, as before.

Another fun issue is the question of synonyms. Some operating systems
(such as Mach) want nonunique inverse mappings: that is, one physical
page present in N virtual spaces, N > 1.  If the cache(s) use
physical addresses, no problem.  If the cache(s) are flushed on
context switch, no problem.  Otherwise, there is a nasty problem: the
same data could be in two places in the same cache!
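
A toy illustration of the synonym problem, under assumptions chosen to
make it show up: a direct-mapped, virtually indexed and virtually
tagged write-back cache that is larger than a page, with one value per
line.  Sizes, names and the page table are all made up for the example.

```python
LINE = 16          # bytes per line (assumed)
NUM_LINES = 1024   # 16 KB cache, larger than a page (assumed)
PAGE = 4096        # page size in bytes (assumed)


class VirtualCache:
    """Hypothetical virtually addressed write-back cache."""

    def __init__(self, memory, page_table):
        self.memory = memory          # physical address -> value
        self.page_table = page_table  # virtual page -> physical page
        self.lines = {}               # index -> (virtual line, value)

    def _index(self, vaddr):
        return (vaddr // LINE) % NUM_LINES

    def _translate(self, vaddr):
        return self.page_table[vaddr // PAGE] * PAGE + vaddr % PAGE

    def write(self, vaddr, value):
        # Write-back: only the cache is updated, not memory.
        self.lines[self._index(vaddr)] = (vaddr // LINE, value)

    def read(self, vaddr):
        entry = self.lines.get(self._index(vaddr))
        if entry is not None and entry[0] == vaddr // LINE:
            return entry[1]                       # hit
        value = self.memory[self._translate(vaddr)]
        self.lines[self._index(vaddr)] = (vaddr // LINE, value)
        return value                              # miss, filled from memory


# Virtual pages 0 and 1 both map to physical page 8.
cache = VirtualCache(memory={0x8000: 0}, page_table={0: 8, 1: 8})
cache.write(0x0000, 42)        # update through the first mapping
stale = cache.read(0x1000)     # same physical word, second mapping
fresh = cache.read(0x0000)
print(stale, fresh, len(cache.lines))   # prints "0 42 2"
```

The same physical word now sits in two cache lines with two different
values, which is the nasty case described above.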
-- 
Don		D.C.Lindsay 	Carnegie Mellon Computer Science

toms@omews44.intel.com (Tom Shott) (11/16/89)

I can think of two methods of supporting cache coherence for multi-level
caches. Both assume a first-level internal cache with some method of
invalidating (or flushing) a line, and a second-level cache connected to
the bus.

The easy method is to use write-through caching for both levels. Every
time a write occurs on the system bus, flush the address from both caches.
This has performance penalties because typically every invalidate cycle on
the internal cache blocks the execution unit from accessing it.

A second, harder method is to keep the internal cache a subset of the
external cache. Every time a line is removed from the external cache,
invalidate that line in the internal cache. All system bus accesses are
looked up in the second-level cache directory. Only those accesses that
hit in the external cache are invalidated from the internal cache, so
there is less contention for the internal cache.

The second method could be expanded to keep track of what's in the internal
cache in the external directory, so only lines in the internal cache are
invalidated. The problem with this is knowing what's in the internal
directory. You don't see all the processor accesses, so if you're using an
LRU-type replacement strategy you have no idea outside the chip what's
going to be replaced in the internal cache. If the internal cache always
signaled the outside cache on replacements, the outside cache would know
and could filter the invalidate traffic.

You can layer a writeback external cache on this protocol and even layer an
internal writeback cache.

--
-----------------------------------------------------------------------------
Tom Shott    INTeL, 2111 NE 25th Ave., Hillsboro, OR 97123, (503) 696-4520
	     toms@omews44.intel.com OR toms%omews44.intel.com@csnet.relay.com
	INTeL.. Designers of the 960 Superscalar uP and other uP's

freudent@eric.nyu.edu (Eric Freudenthal) (11/16/89)

There is another well known solution to the problem of dual-porting a
cache between a shared bus and a pe (processor).  The idea is to use
some sort of filter to keep bus transactions which do not affect the
cache from reaching the cache.  This solution is cheaper than building
two identical caches and is equally effective.

Build a conventional cache augmented with an extra copy of the tag-store,
which will be used as a filter.  This is updated every time the real
one is.  Clearly, in the absence of cache-misses, the extra tag store
is never updated.  Bus transactions are looked up in this extra
tag-store without disturbing the real cache if the address does not
match.  If it does, then the real cache entry is updated or
invalidated (similarly changing the tag-store copy).
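
The duplicate tag-store scheme can be sketched like this.  It is a
minimal model assuming a direct-mapped cache and invalidate-on-write;
the class and counter names are invented for illustration.

```python
class SnoopFilteredCache:
    """Hypothetical direct-mapped cache with a duplicate tag store."""

    def __init__(self, num_lines):
        self.num_lines = num_lines
        self.tags = [None] * num_lines        # real tags, processor side
        self.snoop_tags = [None] * num_lines  # copy, bus side
        self.cache_interruptions = 0          # times the real cache was hit

    def fill(self, line):
        # Processor miss: both tag stores are updated together.
        i = line % self.num_lines
        self.tags[i] = line
        self.snoop_tags[i] = line

    def bus_write(self, line):
        """Another processor wrote this line.  Returns True if the
        real cache had to be touched."""
        i = line % self.num_lines
        if self.snoop_tags[i] != line:
            return False                      # filtered by the copy
        self.cache_interruptions += 1
        self.tags[i] = None                   # invalidate the real entry
        self.snoop_tags[i] = None             # and the copy
        return True


c = SnoopFilteredCache(num_lines=4)
c.fill(5)
filtered = c.bus_write(9)   # same index as 5, different tag
matched = c.bus_write(5)
print(filtered, matched, c.cache_interruptions)
```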
--
 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
				Eric Freudenthal
				NYU Ultracomputer Lab
				715 Broadway, 10th floor
				New York, NY  10012
				
				Phone:(212) 998-3345 work
				      (718) 789-4486 home
				Email:freudent@ultra.nyu.edu

deraadt@cpsc.ucalgary.ca (Theo Deraadt) (11/20/89)

In article <23963@cup.portal.com>, mslater@cup.portal.com (Michael Z Slater) writes:
> Plus, the R3000 cache is write-through, and a reasonable
> multiprocessor system needs write-back caches. 
> Michael Slater, Microprocessor Report   mslater@cup.portal.com

As long as they are physical write-back caches and not virtual. Unless you
like flushing huge virtual write-back caches every time you context switch
and map in your next process.
 <tdr.

petolino@joe.Sun.COM (Joe Petolino) (11/23/89)

>> Plus, the R3000 cache is write-through, and a reasonable
>> multiprocessor system needs write-back caches. 

>As long as they are physical write back caches and not virtual. Unless you
>like flushing huge virtual writeback caches every time you context switch
>and map in your next process.

The solution to this problem is obvious: use the process ID (or something
smaller which maps to the process ID) as part of the Virtual Address tag.
People have been doing it for years.
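
A minimal sketch of what that looks like, assuming a direct-mapped
virtual cache whose tag is the pair (process ID, virtual line).  Names
and sizes are hypothetical.  It only removes the need to flush on a
context switch; it does not by itself deal with two virtual addresses
naming the same physical data.

```python
class PidTaggedCache:
    """Hypothetical virtual cache tagged with (pid, virtual line)."""

    def __init__(self, num_lines=256, line_size=16):
        self.num_lines = num_lines
        self.line_size = line_size
        self.lines = {}   # index -> (pid, virtual line, value)

    def _split(self, vaddr):
        vline = vaddr // self.line_size
        return vline % self.num_lines, vline

    def fill(self, pid, vaddr, value):
        index, vline = self._split(vaddr)
        self.lines[index] = (pid, vline, value)

    def lookup(self, pid, vaddr):
        """Return the cached value, or None on a miss."""
        index, vline = self._split(vaddr)
        entry = self.lines.get(index)
        if entry is not None and entry[0] == pid and entry[1] == vline:
            return entry[2]
        return None


cache = PidTaggedCache()
cache.fill(pid=1, vaddr=0x100, value="A")
# Context switch to process 2: no flush, the pid mismatch forces a miss.
other = cache.lookup(pid=2, vaddr=0x100)
# Switch back to process 1: its line is still there.
mine = cache.lookup(pid=1, vaddr=0x100)
print(other, mine)   # prints "None A"
```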

-Joe