[comp.arch] shared memory multiproc. question

johnson@uiucdcsp.UUCP (01/31/87)

There are a number of shared memory multiprocessors on the market today
that consist of a number of high-end microprocessors on a single bus.
Decent performance is obtained by using fancy cache technology.  Making
a single-board computer is pretty trivial nowadays, and making a
multiprocessor system without the caches is not much harder but fairly
pointless.  How hard is it to make these caches?  Are there chips on
the market that do most of the work for you?

I am mildly interested in building a shared memory multiprocessor to
attach to a Sun or some other workstation for experimental purposes.
Is anybody selling boards from which one could build such a system
easily and inexpensively?

Ralph Johnson      ihnp4!uiucdcs!johnson      johnson@p.cs.uiuc.edu

crowl@rochester.UUCP (02/02/87)

In article <76700001@uiucdcsp> johnson@uiucdcsp.UUCP writes:
>There are a number of shared memory multiprocessors on the market today that
>consist of a number of high-end microprocessors on a single bus.  

Single bus multiprocessors tend to not scale much past 32 processors.  Other
interconnection topologies scale better.  The Intel Hypercube and the BBN
Butterfly scale with O(n log n) interconnection costs.  Meshes and rings
scale with O(n) interconnection costs.  
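
To put rough numbers on those growth rates, here is a small illustration
that just counts links (assuming the usual figures: a binary hypercube on
n = 2^k nodes has (n/2)*k links, a ring has n; real cost also depends on
link width, switch nodes, and wiring):

    #include <stdio.h>

    /* Count links only: (n/2)*log2(n) for a binary hypercube, n for a
     * ring.  Illustrative; ignores link width, switches, and wiring. */
    int main(void)
    {
        int k, n;

        printf("%6s  %15s  %10s\n", "nodes", "hypercube links", "ring links");
        for (k = 2; k <= 8; k++) {
            n = 1 << k;
            printf("%6d  %15d  %10d\n", n, (n / 2) * k, n);
        }
        return 0;
    }

At 256 nodes the hypercube needs four times as many links as the ring,
and the gap keeps widening with log n.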

>Decent performance is obtained by using fancy cache technology.  Making a
>single board computer is pretty trivial now-a-days, and making a
>multiprocessor system without the caches is not much harder but fairly
>pointless.   

It is pointless only if you have a shared bus.  The BBN Butterfly has up to
256 68000s each with local memory, but without caches.  A switch handles
memory references to remote nodes.  

-- 
  Lawrence Crowl		716-275-5766	University of Rochester
			crowl@rochester.arpa	Computer Science Department
 ...!{allegra,decvax,seismo}!rochester!crowl	Rochester, New York,  14627

johnson@uiucdcsp.UUCP (02/04/87)

/* Written  9:10 am  Feb  2, 1987 by crowl@rochester.ARPA  */

>>Single bus multiprocessors tend to not scale much past 32 processors. 

I would be happy with a dozen.

>>Other interconnection topologies scale better.  The Intel Hypercube and 
>>the BBN Butterfly scale with O(n log n) interconnection costs. 

This is all fine, but the Butterfly is pretty expensive, and the various
hypercubes are hard to use.  The Sequent and Multimax (to name the ones
that I know about) are quite good and much more cost effective than a
supermini for the things that we use them for, i.e. student computing and
text processing.  Different kinds of machines are good for different kinds
of things, and I want a very cheap shared memory multiprocessor with only
a handful of processors.

markp@valid.UUCP (02/05/87)

> 
> There are a number of shared memory multiprocessors on the market today
> that consist of a number of high-end microprocessors on a single bus.
> Decent performance is obtained by using fancy cache technology.  Making
> a single-board computer is pretty trivial nowadays, and making a
> multiprocessor system without the caches is not much harder but fairly
> pointless.  How hard is it to make these caches?  Are there chips on
> the market that do most of the work for you?
> 
> I am mildly interested in building a shared memory multiprocessor to
> attach to a Sun or some other workstation for experimental purposes.
> Is anybody selling boards from which one could build such a system
> easily and inexpensively?
> 
> Ralph Johnson      ihnp4!uiucdcs!johnson      johnson@p.cs.uiuc.edu

Silicon support for caches is close to non-existent, with a few notable
exceptions.  The most obvious are "cache-tag RAMs," which are just fast
static RAMs with built-in expandable comparators for implementing the
cache tag directory.  The only cache management chips I know of offhand
are some Signetics 689xx (sic), which do address translation and virtual
cache management, but only work on 68010's, hardly state-of-the-art.
Some processor designers have actually started to think about smart
caches, but all we see for the most part is a bit in the page table
controlling cacheability.  The 80386 doesn't even have that.  A notable
highlight is the MIPS R2000, which has on-chip cache control for
direct-mapped I/D caches, but which doesn't support hardware invalidates.
The 68030's on-chip data cache won't support invalidates either.  Yech!
Pure software-enforcement.  Might just as well use message-passing. :-)

Now, if you want something off the shelf, check out the Fairchild Clipper.
The Clipper modules have built-in instruction and data write-back caches,
but consistency is only enforced in write-through mode (programmable on a
per-page basis).  The modules are fairly expensive (well over a thousand
dollars in small quantities last time I checked), but are good for about
4-5 MIPS each.  You hook them in parallel on a proprietary synchronous bus
and would then need to build memory boards that plug into the same form
factor.  Unless you REALLY want to spend the time designing a cache-based
CPU, the Clippers might do the trick for you.  They might even give you a
good deal on off-speed parts for research use.  Call your local rep, or
(800)423-5516 (Fairchild advanced processor division in Palo Alto, CA).

It is not inherently difficult or expensive to design a cache-based CPU board,
particularly direct-mapped with a block size of 1 word and write-through.
The only modification to an otherwise-conventional CPU board design that you
make is a bank of cache tag RAMs and cache data RAMs, hooked up in a fashion
left as an exercise for the reader, and a circuit to steal cycles in the
cache directory whenever somebody else does a write on the external bus.
When this happens, you query the cache directory and invalidate.
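
For concreteness, here is a minimal C model of that arrangement: a
direct-mapped, write-through cache with one-word blocks and a bus-watching
invalidate.  The sizes are made up, and the real thing is of course tag
RAM plus a comparator rather than software, but the logic has the same
shape:

    #include <stdio.h>

    #define CACHE_LINES 1024   /* direct-mapped, 1-word blocks; size is illustrative */

    struct line {
        unsigned long tag;
        int           valid;
    };

    static struct line cache[CACHE_LINES];

    /* CPU-side lookup: hit if the line is valid and the tags match. */
    static int cache_lookup(unsigned long addr)
    {
        struct line *l = &cache[addr % CACHE_LINES];
        return l->valid && l->tag == addr / CACHE_LINES;
    }

    /* Fill a line on a miss.  Write-through means memory always has the
     * current value, so nothing dirty ever needs writing back from here. */
    static void cache_fill(unsigned long addr)
    {
        struct line *l = &cache[addr % CACHE_LINES];
        l->tag   = addr / CACHE_LINES;
        l->valid = 1;
    }

    /* Bus-watching side: some other processor wrote this address on the
     * shared bus, so steal a directory cycle and invalidate any match. */
    static void snoop_invalidate(unsigned long addr)
    {
        struct line *l = &cache[addr % CACHE_LINES];
        if (l->valid && l->tag == addr / CACHE_LINES)
            l->valid = 0;
    }

    int main(void)
    {
        cache_fill(0x4000);
        printf("before remote write: %s\n", cache_lookup(0x4000) ? "hit" : "miss");
        snoop_invalidate(0x4000);       /* another CPU wrote 0x4000 */
        printf("after remote write:  %s\n", cache_lookup(0x4000) ? "hit" : "miss");
        return 0;
    }

The point is that a write-through cache never holds the only up-to-date
copy of a word, so invalidation on remote writes is the whole consistency
story.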

The suitability of this approach depends on how many boards you want to
build, and what kind of performance you want.  Write-through is essentially
worthless with more than 2 or 3 of today's 32-bit microprocessors.  If you
want to cascade enough processors to make it really interesting, write-back
is a necessity.  When Futurebus (IEEE P896) is adopted (in May/June), it
will become the first bus standard that supports write-back caches.
However, the design of a smart write-back cache with a finite block size is
not an exercise for the faint-hearted, particularly with no advanced silicon
support.  It is expected that this will change, now that there will be a
standard bus structure to interface to (instead of Sequent/Encore/Convergent
/ad nauseam's proprietary busses), and a well-documented set of bus operations
to support the functioning of smart caches.
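
A back-of-the-envelope check on the write-through figure above.  Every
number here is an assumption picked for illustration (per-processor rate,
store frequency, bus throughput), and read-miss traffic would come on top
of the write traffic counted below:

    #include <stdio.h>

    /* Assumed figures, for illustration only: a 4 MIPS processor storing
     * on about 15% of instructions, each store taking one transfer on a
     * bus good for roughly 4 million transfers per second. */
    int main(void)
    {
        double mips       = 4.0e6;  /* instructions per second per CPU */
        double store_frac = 0.15;   /* fraction of instructions that write */
        double bus_xfers  = 4.0e6;  /* bus transfers per second */
        int ncpu;

        for (ncpu = 1; ncpu <= 6; ncpu++) {
            double writes = ncpu * mips * store_frac;
            printf("%d CPUs: writes alone use %3.0f%% of the bus\n",
                   ncpu, 100.0 * writes / bus_xfers);
        }
        return 0;
    }

With read misses and arbitration added in, the bus is effectively gone
somewhere around three or four processors, which is why write-back (and
the coherence machinery that goes with it) matters for anything bigger.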

I love talking about caches, and could probably continue on forever, but
won't in this public forum.  Feel free to address more specific questions
by email.

	Mark Papamarcos
	Valid Logic Systems, Inc.
	{ihnp4,hplabs}!pesnta!valid!markp
	(member, P896 working group and cache task group)

Generic disclaimer-- I have no vested interest in anyone mentioned above.
(except Futurebus, whose banner I loyally wave)

pase@ogcvax.UUCP (02/18/87)

In article <rocheste.24385> crowl@rochester.UUCP (Lawrence Crowl) writes:
>In article <76700001@uiucdcsp> johnson@uiucdcsp.UUCP writes:
>>There are a number of shared memory multiprocessors on the market today that
>>consist of a number of high-end microprocessors on a single bus.  
>
>Single bus multiprocessors tend to not scale much past 32 processors.  Other
>interconnection topologies scale better.  The Intel Hypercube and the BBN
>Butterfly scale with O(n log n) interconnection costs.  Meshes and rings
>scale with O(n) interconnection costs.  

This statement is quite misleading.  For one thing, the Intel Hypercube is NOT
a shared memory machine.  Thus comparing scalability, cost-per-connection and
so forth is much like comparing apples and oranges.  For example, a certain
popular shared memory machine would cost in excess of $500,000 for a 32-node
configuration.  Another quite popular machine with distributed memory (i.e., NOT
shared) costs substantially less than $200,000 for a similar 32-node
configuration.

IBM's RP3 is proving two points about scaling up shared-memory machines:
	1)  It can be done (at least to 512 processors)
	2)  It is very expensive

The two types of machines do not compare well, as they are best used for
different problems.  If you want many users running independent tasks, sharing
common resources with near constant response, shared memory machines are more
appropriate.  If you want a lower cost-per-node and you have a problem that
can be partitioned into a reasonable number of communicating tasks, a
distributed machine may be more appropriate.

If you want to compare them, decide what you want to compare them for.
In their own domains, each one outperforms the other.  And don't forget to
include cost as part of the comparison.
-- 
Doug Pase   --   ...ucbvax!tektronix!ogcvax!pase   or   pase@Oregon-Grad

pase@ogcvax.UUCP (02/18/87)

In article <uiucdcsp.76700002> johnson@uiucdcsp.cs.uiuc.edu writes:
 [...]
>This is all fine, but the Butterfly is pretty expensive, and the various
>hypercubes are hard to use.  [...]
 ^^^^^^^^^^^^^^^^^^^^^^^^^^
>Different kinds of machines are good for different kinds of things,
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>and I want a very cheap shared memory multiprocessor with only
>a handful of processors.

Hypercubes are hard to use only if you don't know anything about them, or
about what they're good for.  For the type of processing you're doing (many
independent users/tasks), yes, a hypercube (or any other distributed memory
multiprocessor) would be totally inappropriate.  For large scale computing,
a hypercube might actually be easier to use.  It all depends on what you're
doing.
-- 
Doug Pase   --   ...ucbvax!tektronix!ogcvax!pase   or   pase@Oregon-Grad

chuck@amdahl.UUCP (02/20/87)

In article <1205@ogcvax.UUCP> pase@ogcvax.UUCP (Douglas M. Pase) writes:
>The two types of machines do not compare well, as they are best used for
>different problems.  If you want many users running independent tasks, sharing
>common resources with near constant response, shared memory machines are more
>appropriate.  If you want a lower cost-per-node and you have a problem that
>can be partitioned into a reasonable number of communicating tasks, a
>distributed machine may be more appropriate.
>Doug Pase   --   ...ucbvax!tektronix!ogcvax!pase   or   pase@Oregon-Grad

Seeing how two different people suggested that distributed memory is
inappropriate for multiple users running independent tasks, maybe someone
could tell me why.  I find it hard to imagine a problem which is more
partitionable.  Since the tasks are independent, each user should be very
happy to have her very own address space.  Personally, I'm trying to
convince my boss that the productivity of this department would go way
up if each of us had our very own clipper based workstation with 8 Meg
of memory and a 19" high resolution bit-mapped screen communicating
over an ethernet to such common resources as a file system and printers.

(Some of my friends don't think a clipper would be powerful enough, and
they want a hypercube of clippers as their very own workstation.  They
also want to be able to run lots of independent tasks on their workstation
instead of the three or four that I get by with.)

-- Chuck

suhler@im4u.UUCP (02/20/87)

In article <5699@amdahl.UUCP> chuck@amdahl.UUCP (Charles Simmons) writes:
>Seeing how two different people suggested that distributed memory is
>inappropriate for multiple users running independent tasks, maybe someone
>could tell me why.  I find it hard to imagine a problem which is more
>partitionable.  Since the tasks are independent, each user should be very
>happy to have her very own address space.  

This is true only if the machine has a decent I/O architecture, i.e.,
enough paths among local memories and disk(s).  Imagine trying to squeeze
page swaps over multiple inter-node links.  The situation with Intel's
iPSC isn't much better:  a single ethernet linking a single disk with
32 to 128 nodes.  Rumor has it that they're well aware of the problem
and will fix it in the next incarnation.

One visitor to the CS Dept. here described programs and data that took
hours to load and minutes to execute, although that's not exactly what
you're suggesting.

I believe Ncube has a lot of independent I/O paths in their system; does
anyone out there have any experience with their machine?

FPS's T-Series was to have had a disk farm with its own hypercube
connecting the disks and lots of paths up to the processor hypercube.
However, the rumor a few months ago on this newsgroup was that
they'd scrapped the project.  Now I hear that John Gustafson (the
designer of the T-Series) doesn't work there any more.  Does
anyone have information on either of these rumors?
-- 
Paul Suhler        suhler@im4u.UTEXAS.EDU	512-474-9517

ian@loral.UUCP (02/21/87)

In article <5699@amdahl.UUCP> chuck@amdahl.UUCP (Charles Simmons) writes:
>Seeing how two different people suggested that distributed memory is
>inappropriate for multiple users running independent tasks, maybe someone
>could tell me why.  I find it hard to imagine a problem which is more
>partitionable.  

  Someone else in this news group has already suggested that one problem
  that must be solved before a distributed memory parallel processor can 
  be used to support multiple users is access to I/O resources.  I agree 
  with this, but meeting this requirement is not sufficient.  Assuming 
  that the objective is to do the sort of work that is done on the average 
  UN*X system, a distributed memory parallel processor is a poor choice
  for supporting multiple users.

  Charles Simmons suggests in his article that one way to allow multiple
  users to make use of a distributed memory parallel processor is to 
  dedicate individual processors to the system's users.  I have seen 
  "home grown" systems that did just this.  A system that permanently 
  allocates a processor to a user until the user logs out makes very 
  poor use of the system hardware, since it is used only a fraction of 
  the time.  Such a system could not compete with shared memory systems 
  like those offered by Sequent or Encore.

  In a distributed memory system hardware utilization cannot be increased
  by dynamically moving a process between processors, since the overhead
  incurred is so large.  This option is only realistic for tasks that take
  so long that the overhead is a small part of the execution time.  This 
  is not true for many applications.
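
As a rough feel for that break-even point (the image size, link rate, and
the 5% threshold below are assumptions, not measurements of any particular
machine):

    #include <stdio.h>

    /* Break-even sketch for dynamic process migration. */
    int main(void)
    {
        double image_bytes = 512.0e3;  /* process image to move (assumed) */
        double link_rate   = 1.0e6;    /* bytes/sec over inter-node links (assumed) */
        double move_cost   = image_bytes / link_rate;

        /* If migration cost must stay under 5% of run time, the task has
         * to run at least this long before moving it is worthwhile: */
        printf("move cost ~%.1f s; worthwhile only for tasks over ~%.0f s\n",
               move_cost, move_cost / 0.05);
        return 0;
    }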

  Dynamic process "allocation" is practical on a distributed memory system
  that uses the Single Program, Multiple Data (SPMD) model.  This
  model is used on the Intel cube.  SPMD downloads the same code to all
  processors.  To execute a process, a short message is sent to a
  processor telling it what to execute.  There is little overhead since 
  the code is already resident.  However this model introduces some 
  significant costs.  Consider a parallel processor that consists of 32 
  processing elements (PEs), where each one has 512K bytes of memory.  The 
  processor potentially has 16M bytes of memory available.  Under the SPMD 
  model, only 512K bytes is really available for use, since all 
  processors have the same code.
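
A small sketch of what that dispatch loop looks like on one node.  The
message primitives below are stubs standing in for whatever the machine
actually provides (they are not real iPSC calls), and the two task
routines are placeholders for code that is already resident everywhere:

    #include <stdio.h>

    struct work_msg { int which; int arg; };

    static int pending = 1;                      /* pretend one request arrives */

    static int recv_msg(struct work_msg *m)      /* stub message receive */
    {
        if (!pending) return 0;
        pending = 0;
        m->which = 0;
        m->arg = 42;
        return 1;
    }

    static void send_msg(int node, int result)   /* stub message send */
    {
        printf("reply to node %d: %d\n", node, result);
    }

    static int fft_task(int arg)   { return arg * 2; }   /* already-resident code */
    static int solve_task(int arg) { return arg + 1; }

    int main(void)                               /* one node's dispatch loop */
    {
        struct work_msg m;
        int result;

        while (recv_msg(&m)) {                   /* short message, no code download */
            switch (m.which) {
            case 0:  result = fft_task(m.arg);   break;
            case 1:  result = solve_task(m.arg); break;
            default: result = -1;                break;
            }
            send_msg(0, result);                 /* report back to the cube manager */
        }
        return 0;
    }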

  Other problems with trying to use a distributed memory parallel processor
  to support multiple users include developing a distributed operating
  system and assuring deterministic system operation.  These are not
  trivial problems.


		     Ian Kaplan
		     Loral Dataflow Group
		     Loral Instrumentation
	     USENET: {ucbvax,decvax,ihnp4}!sdcsvax!loral!ian
	     ARPA:   sdcc6!loral!ian@UCSD
	     USPS:   8401 Aero Dr. San Diego, CA 92123

chuck@amdahl.UUCP (02/21/87)

In article <1513@im4u.UUCP> suhler@im4u.UUCP (Paul A. Suhler) writes:
>In article <5699@amdahl.UUCP> chuck@amdahl.UUCP (Charles Simmons) writes:
>>Seeing how two different people suggested that distributed memory is
>>inappropriate for multiple users running independent tasks, maybe someone
>>could tell me why.  I find it hard to imagine a problem which is more
>>partitionable.  Since the tasks are independent, each user should be very
>>happy to have her very own address space.  
>
>This is true only if the machine has a decent I/O architecture, i.e.,
>enough paths among local memories and disk(s).  Imagine trying to squeeze
>page swaps over multiple inter-node links.  The situation with Intel's
>iPSC isn't much better:  a single ethernet linking a single disk with
>32 to 128 nodes.  Rumor has it that they're well aware of the problem
>and will fix it in the next incarnation.
>Paul Suhler        suhler@im4u.UTEXAS.EDU	512-474-9517

Let's assume that each workstation has a 20Mb hard disk for caching
I/O to the network.  It may take a few minutes to load a working set
of programs and data into the cache over a 40K baud ethernet link,
but after that, things should hum pretty well.  20Mb is enough to
do a substantial amount of page swapping (even if the 8Mb of local
ram won't hold your programs) as well as hold most of the useful files
associated with Unix.  If you don't turn your machine off at night,
you won't even be slowed down by having to reload your working set
every morning.
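
For what it's worth, here is the arithmetic behind that "few minutes"
figure, taking the 40K baud number at face value and assuming (purely for
illustration) about a megabyte of working set and ten bits on the wire
for every useful byte:

    #include <stdio.h>

    int main(void)
    {
        double working_set   = 1.0e6;  /* bytes to pull over the net (assumed) */
        double link_rate     = 40.0e3; /* bits per second, from the figure above */
        double bits_per_byte = 10.0;   /* framing and protocol overhead (assumed) */

        double seconds = working_set * bits_per_byte / link_rate;
        printf("initial load: about %.0f seconds (%.1f minutes)\n",
               seconds, seconds / 60.0);
        return 0;
    }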

-- Chuck

pase@ogcvax.UUCP (02/23/87)

In article <loral.1379> ian@loral.UUCP (Ian Kaplan) writes:
>In article <5699@amdahl.UUCP> chuck@amdahl.UUCP (Charles Simmons) writes:
>>Seeing how two different people suggested that distributed memory is
>>inappropriate for multiple users running independent tasks, maybe someone
>>could tell me why.  I find it hard to imagine a problem which is more
>>partitionable.  
>
>  Someone else in this news group has already suggested that one problem
>  that must be solved before a distributed memory parallel processor can 
>  be used to support multiple users is access to I/O resources.  I agree 
>  with this, but meeting this requirement is not sufficient.  Assuming 
>  that the objective is to do the sort of work that is done on the average 
>  UN*X system, a distributed memory parallel processor is a poor choice
>  for supporting multiple users.
>  [...]

Distributed memory networks have been used for multi-user systems for several
years now - cf. Apollo networks.  Some, at least, would argue they have been
used successfully.  However, machines like the iPSC were designed to do heavy
computing, and NOT a lot of resource sharing.  The cube manager is too much of
a bottleneck to be used as a resource server to the tower.

At FPS we used a 10-node Apollo network.  5 nodes had disks, but performance
was still poor.  Our best guess was that the network more than anything else
was the limiting factor.  Processors were idle while they waited for the data
to come back over the net.  The Apollo network used a ring, which is slow,
but the point is that half of the nodes had local disks, all nodes had 2 Mbyte
of memory, and performance was often bad.  The iPSC has 512 Kbytes per node, a better
network topology, and a bottleneck called the cube manager.  The I/O throughput
is such that it might successfully support 4 to 8 processors as task servers,
but the cube manager would be absolutely swamped with much more than that.  As
poorly as the Apollo performed, the iPSC would do worse.  (To be perfectly fair
to Apollo, our load was heavier than their system was really designed for.)

The hypercube is set up to do number crunching, with lots of operations per
byte of I/O.  If your application is heavily biased towards I/O, or sharing
a particular resource, the iPSC will not perform as well.  What it does, it
does well.  Supporting lots of users just isn't one of the things it does.
-- 
Doug Pase   --   ...ucbvax!tektronix!ogcvax!pase   or   pase@Oregon-Grad

simoni@Shasta.UUCP (02/25/87)

In article <5713@amdahl.UUCP> chuck@amdahl.UUCP (Charles Simmons) writes:
>Let's assume that each workstation has a 20Mb hard disk for caching
>I/O to the network.  It may take a few minutes to load a working set
>of programs and data into the cache over a 40K baud ethernet link,
>but after that, things should hum pretty well.

Interesting related reference:

"File Access Performance of Diskless Workstations"
Edward D. Lazowska, John Zahorjan, David Cheriton, and Willy Zwaenepoel
Department of Computer Science, Stanford University, June 1984
Technical Report No. STAN-CS-84-1010


Rich

mark@mimsy.UUCP (02/25/87)

In article <1330@Shasta.STANFORD.EDU> simoni@Shasta.UUCP (Richard Simoni) writes:
>"File Access Performance of Diskless Workstations" [etc.]

Also appeared in ACM Transactions on Computer Systems, August 1986.
-- 
Spoken: Mark Weiser 	ARPA:	mark@mimsy.umd.edu	Phone: +1-301-454-7817
After May 1, 1987: weiser@xerox.com