johnson@uiucdcsp.UUCP (01/31/87)
There are a number of shared memory multiprocessors on the market today that consist of a number of high-end microprocessors on a single bus. Decent performance is obtained by using fancy cache technology. Making a single-board computer is pretty trivial nowadays, and making a multiprocessor system without the caches is not much harder but fairly pointless. How hard is it to make these caches? Are there chips on the market that do most of the work for you?

I am mildly interested in building a shared memory multiprocessor to attach to a Sun or some other workstation for experimental purposes. Is anybody selling boards from which one could build such a system easily and inexpensively?

Ralph Johnson    ihnp4!uiucdcs!johnson    johnson@p.cs.uiuc.edu
crowl@rochester.UUCP (02/02/87)
In article <76700001@uiucdcsp> johnson@uiucdcsp.UUCP writes:
>There are a number of shared memory multiprocessors on the market today that
>consist of a number of high-end microprocessors on a single bus.

Single bus multiprocessors tend not to scale much past 32 processors. Other interconnection topologies scale better. The Intel Hypercube and the BBN Butterfly scale with O(n log n) interconnection costs. Meshes and rings scale with O(n) interconnection costs.

>Decent performance is obtained by using fancy cache technology. Making a
>single-board computer is pretty trivial nowadays, and making a
>multiprocessor system without the caches is not much harder but fairly
>pointless.

It is pointless only if you have a shared bus. The BBN Butterfly has up to 256 68000s, each with local memory but without caches. A switch handles memory references to remote nodes.
--
Lawrence Crowl                              716-275-5766    University of Rochester
crowl@rochester.arpa                                        Computer Science Department
...!{allegra,decvax,seismo}!rochester!crowl                 Rochester, New York, 14627
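To make those growth rates concrete, here is a small back-of-envelope program; it is an illustration added here, not part of the original exchange, and the power-of-two node counts and the binary-hypercube link formula n*log2(n)/2 are the only assumptions.

#include <stdio.h>

/* A ring of n nodes needs n links: O(n). */
long ring_links(long n)
{
    return n;
}

/* A binary hypercube of n = 2^d nodes has d links per node and
   n*d/2 links in total: O(n log n).  n is assumed a power of two. */
long hypercube_links(long n)
{
    long d = 0;
    while ((1L << d) < n)
        d++;
    return n * d / 2;
}

int main(void)
{
    long n;
    for (n = 8; n <= 256; n *= 2)
        printf("n = %3ld   ring: %3ld links   hypercube: %4ld links\n",
               n, ring_links(n), hypercube_links(n));
    return 0;
}

For 256 nodes it reports 256 ring links against 1024 hypercube links, which is the O(n) versus O(n log n) difference in miniature.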
johnson@uiucdcsp.UUCP (02/04/87)
/* Written 9:10 am Feb 2, 1987 by crowl@rochester.ARPA */
>>Single bus multiprocessors tend not to scale much past 32 processors.

I would be happy with a dozen.

>>Other interconnection topologies scale better. The Intel Hypercube and
>>the BBN Butterfly scale with O(n log n) interconnection costs.

This is all fine, but the Butterfly is pretty expensive, and the various hypercubes are hard to use. The Sequent and the Multimax (to name the ones that I know about) are quite good and much more cost-effective than a supermini for the things we use them for, i.e., student computing and text processing. Different kinds of machines are good for different kinds of things, and I want a very cheap shared memory multiprocessor with only a handful of processors.
markp@valid.UUCP (02/05/87)
>
> There are a number of shared memory multiprocessors on the market today
> that consist of a number of high-end microprocessors on a single bus.
> Decent performance is obtained by using fancy cache technology. Making
> a single-board computer is pretty trivial nowadays, and making a
> multiprocessor system without the caches is not much harder but fairly
> pointless. How hard is it to make these caches? Are there chips on
> the market that do most of the work for you?
>
> I am mildly interested in building a shared memory multiprocessor to
> attach to a Sun or some other workstation for experimental purposes.
> Is anybody selling boards from which one could build such a system
> easily and inexpensively?
>
> Ralph Johnson    ihnp4!uiucdcs!johnson    johnson@p.cs.uiuc.edu

Silicon support for caches is close to non-existent, with a few notable exceptions. The most obvious are "cache-tag RAMs," which are just fast static RAMs with built-in expandable comparators for implementing the cache tag directory. The only cache management chips I know of offhand are some Signetics 689xx (sic) parts, which do address translation and virtual cache management but only work with 68010s, hardly state of the art. Some processor designers have actually started to think about smart caches, but for the most part all we see is a bit in the page table controlling cacheability. The 80386 doesn't even have that. A notable highlight is the MIPS R2000, which has on-chip cache control for direct-mapped I/D caches but doesn't support hardware invalidates. The 68030's on-chip data cache won't support invalidates either. Yech! Pure software enforcement. Might just as well use message passing. :-)

Now, if you want something off the shelf, check out the Fairchild Clipper. The Clipper modules have built-in instruction and data write-back caches, but consistency is only enforced in write-through mode (programmable on a per-page basis). The modules are fairly expensive (well over a thousand dollars in small quantities last time I checked), but are good for about 4-5 MIPS each. You hook them in parallel on a proprietary synchronous bus and would then need to build memory boards that plug into the same form factor. Unless you REALLY want to spend the time designing a cache-based CPU, the Clippers might do the trick for you. They might even give you a good deal on off-speed parts for research use. Call your local rep, or (800)423-5516 (Fairchild advanced processor division in Palo Alto, CA).

It is not inherently difficult or expensive to design a cache-based CPU board, particularly one that is direct-mapped with a block size of one word and write-through. The only modifications to an otherwise-conventional CPU board design are a bank of cache tag RAMs and cache data RAMs, hooked up in a fashion left as an exercise for the reader, plus a circuit to steal cycles in the cache directory whenever somebody else does a write on the external bus. When this happens, you query the cache directory and invalidate on a match. The suitability of this approach depends on how many boards you want to build and what kind of performance you want. Write-through is essentially worthless with more than 2 or 3 of today's 32-bit microprocessors. If you want to cascade enough processors to make it really interesting, write-back is a necessity. When Futurebus (IEEE P896) is adopted (in May/June), it will become the first bus standard that supports write-back caches.
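For concreteness, a minimal software model of the direct-mapped, one-word-block, write-through organization just described, with the snoop-and-invalidate step, might look like the following sketch. The cache size, the simulated main memory, and all the names are invented for illustration; a real board does this in tag RAMs and comparators, not in C.

#include <stdint.h>
#include <stdio.h>

#define CACHE_LINES 1024                 /* assumed: 1K one-word lines        */
#define MEM_WORDS   (1 << 20)            /* assumed: 4 Mbyte of shared memory */

struct line { int valid; uint32_t tag; uint32_t data; };

static struct line cache[CACHE_LINES];
static uint32_t    memory[MEM_WORDS];    /* stands in for the shared bus + RAM */

#define WORD(addr)  (((addr) >> 2) % MEM_WORDS)
#define INDEX(addr) (((addr) >> 2) % CACHE_LINES)
#define TAG(addr)   (((addr) >> 2) / CACHE_LINES)

/* Processor read: a hit returns the cached word, a miss fills the line. */
uint32_t cpu_read(uint32_t addr)
{
    struct line *l = &cache[INDEX(addr)];
    if (!(l->valid && l->tag == TAG(addr))) {
        l->data  = memory[WORD(addr)];   /* fetch over the bus */
        l->tag   = TAG(addr);
        l->valid = 1;
    }
    return l->data;
}

/* Processor write: write-through, so memory is always updated as well. */
void cpu_write(uint32_t addr, uint32_t value)
{
    struct line *l = &cache[INDEX(addr)];
    if (l->valid && l->tag == TAG(addr))
        l->data = value;
    memory[WORD(addr)] = value;          /* the write appears on the bus */
}

/* Bus watcher: when another processor writes on the external bus,
   steal a directory cycle and invalidate a matching line. */
void snoop_remote_write(uint32_t addr)
{
    struct line *l = &cache[INDEX(addr)];
    if (l->valid && l->tag == TAG(addr))
        l->valid = 0;
}

int main(void)
{
    cpu_write(0x100, 42);                      /* local write goes through to memory */
    snoop_remote_write(0x100);                 /* another CPU writes the same word... */
    printf("%u\n", (unsigned)cpu_read(0x100)); /* ...so this read misses and refetches */
    return 0;
}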
However, the design of a smart write-back cache with a finite block size is not an exercise for the faint-hearted, particularly with no advanced silicon support. It is expected that this will change, now that there will be a standard bus structure to interface to (instead of Sequent/Encore/Convergent/ad nauseam's proprietary buses) and a well-documented set of bus operations to support the functioning of smart caches (a rough sketch of the per-line bookkeeping involved follows this article).

I love talking about caches, and could probably continue on forever, but won't in this public forum. Feel free to address more specific questions by email.

Mark Papamarcos
Valid Logic Systems, Inc.
{ihnp4,hplabs}!pesnta!valid!markp
(member, P896 working group and cache task group)

Generic disclaimer: I have no vested interest in anyone mentioned above (except Futurebus, whose banner I loyally wave).
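As a rough sketch of that per-line bookkeeping, and of the bus operations a standard like Futurebus has to name, here is a simplified invalidation protocol in the MSI style. It is a generic illustration added for this write-up, not the P896 protocol and not any vendor's design.

#include <stdio.h>

enum line_state { INVALID, SHARED, MODIFIED };
enum cpu_event  { CPU_READ, CPU_WRITE };
enum bus_event  { BUS_READ, BUS_WRITE_OR_INVALIDATE };

/* Transition on a local processor access.  *need_bus is set when the
   cache must use the shared bus first. */
enum line_state on_cpu(enum line_state s, enum cpu_event e, int *need_bus)
{
    *need_bus = 0;
    if (e == CPU_READ) {
        if (s == INVALID) { *need_bus = 1; return SHARED; }  /* miss: fetch the block */
        return s;                                            /* hit */
    }
    /* CPU_WRITE: gain exclusive ownership, then dirty the line locally. */
    if (s != MODIFIED)
        *need_bus = 1;            /* invalidate other caches' copies */
    return MODIFIED;              /* no write-through to memory */
}

/* Transition on an operation snooped from another processor.  *flush is
   set when this cache holds the only up-to-date copy and must supply it. */
enum line_state on_snoop(enum line_state s, enum bus_event e, int *flush)
{
    *flush = (s == MODIFIED);
    if (e == BUS_WRITE_OR_INVALIDATE)
        return INVALID;                           /* someone else is writing */
    return (s == INVALID) ? INVALID : SHARED;     /* remote read: demote to shared */
}

int main(void)
{
    int bus, flush;
    enum line_state s = INVALID;
    s = on_cpu(s, CPU_WRITE, &bus);     /* local write: must invalidate others first */
    printf("after local write:  state=%d  bus op needed=%d\n", s, bus);
    s = on_snoop(s, BUS_READ, &flush);  /* remote read: dirty copy must be flushed */
    printf("after remote read:  state=%d  flush needed=%d\n", s, flush);
    return 0;
}

The flush path is the part a write-through design never needs: a dirty line may have to be put back on the bus on someone else's access, which is why the bus protocol, and not just the cache board, has to cooperate.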
pase@ogcvax.UUCP (02/18/87)
In article <rocheste.24385> crowl@rochester.UUCP (Lawrence Crowl) writes:
>In article <76700001@uiucdcsp> johnson@uiucdcsp.UUCP writes:
>>There are a number of shared memory multiprocessors on the market today that
>>consist of a number of high-end microprocessors on a single bus.
>
>Single bus multiprocessors tend not to scale much past 32 processors. Other
>interconnection topologies scale better. The Intel Hypercube and the BBN
>Butterfly scale with O(n log n) interconnection costs. Meshes and rings
>scale with O(n) interconnection costs.

This statement is quite misleading. For one thing, the Intel Hypercube is NOT a shared memory machine, so comparing scalability, cost per connection, and so forth is much like comparing apples and oranges. For example, a certain popular shared memory machine would cost in excess of $500,000 for a 32-node configuration. Another quite popular machine with distributed memory (i.e., NOT shared) costs substantially less than $200,000 for a similar 32-node configuration.

IBM's RP3 is proving two points about scaling up shared-memory machines:
  1) It can be done (at least to 512 processors).
  2) It is very expensive.

The two types of machines do not compare well, as they are best used for different problems. If you want many users running independent tasks, sharing common resources with near-constant response, shared memory machines are more appropriate. If you want a lower cost per node and you have a problem that can be partitioned into a reasonable number of communicating tasks, a distributed machine may be more appropriate.

If you want to compare them, decide what you want to compare them for. In their own domains, each one outperforms the other. And don't forget to include cost as part of the comparison.
--
Doug Pase  --  ...ucbvax!tektronix!ogcvax!pase  or  pase@Oregon-Grad
pase@ogcvax.UUCP (02/18/87)
In article <uiucdcsp.76700002> johnson@uiucdcsp.cs.uiuc.edu writes:
[...]
>This is all fine, but the Butterfly is pretty expensive, and the various
>hypercubes are hard to use. [...]
 ^^^^^^^^^^^^^^^^^^^^^^^^^^
>Different kinds of machines are good for different kinds of things,
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>and I want a very cheap shared memory multiprocessor with only
>a handful of processors.

Hypercubes are hard to use only if you don't know anything about them or about what they're good for. For the type of processing you're doing (many independent users/tasks), yes, a hypercube (or any other distributed memory multiprocessor) would be totally inappropriate. For large-scale computing, a hypercube might actually be easier to use. It all depends on what you're doing.
--
Doug Pase  --  ...ucbvax!tektronix!ogcvax!pase  or  pase@Oregon-Grad
chuck@amdahl.UUCP (02/20/87)
In article <1205@ogcvax.UUCP> pase@ogcvax.UUCP (Douglas M. Pase) writes:
>The two types of machines do not compare well, as they are best used for
>different problems. If you want many users running independent tasks, sharing
>common resources with near-constant response, shared memory machines are more
>appropriate. If you want a lower cost per node and you have a problem that
>can be partitioned into a reasonable number of communicating tasks, a
>distributed machine may be more appropriate.
>Doug Pase  --  ...ucbvax!tektronix!ogcvax!pase  or  pase@Oregon-Grad

Seeing how two different people suggested that distributed memory is inappropriate for multiple users running independent tasks, maybe someone could tell me why. I find it hard to imagine a problem which is more partitionable. Since the tasks are independent, each user should be very happy to have her very own address space.

Personally, I'm trying to convince my boss that the productivity of this department would go way up if each of us had our very own Clipper-based workstation with 8 Meg of memory and a 19" high-resolution bit-mapped screen, communicating over an Ethernet to such common resources as a file system and printers. (Some of my friends don't think a Clipper would be powerful enough; they want a hypercube of Clippers as their very own workstation. They also want to be able to run lots of independent tasks on their workstations instead of the three or four that I get by with.)
--
Chuck
suhler@im4u.UUCP (02/20/87)
In article <5699@amdahl.UUCP> chuck@amdahl.UUCP (Charles Simmons) writes:
>Seeing how two different people suggested that distributed memory is
>inappropriate for multiple users running independent tasks, maybe someone
>could tell me why. I find it hard to imagine a problem which is more
>partitionable. Since the tasks are independent, each user should be very
>happy to have her very own address space.

This is true only if the machine has a decent I/O architecture, i.e., enough paths among local memories and disk(s). Imagine trying to squeeze page swaps over multiple inter-node links. The situation with Intel's iPSC isn't much better: a single Ethernet linking a single disk with 32 to 128 nodes. Rumor has it that they're well aware of the problem and will fix it in the next incarnation. One visitor to the CS department here described programs and data that took hours to load and minutes to execute, although that's not exactly what you're suggesting.

I believe Ncube has a lot of independent I/O paths in their system; does anyone out there have any experience with their machine?

FPS's T-Series was to have had a disk farm with its own hypercube connecting the disks and lots of paths up to the processor hypercube. However, the rumor a few months ago on this newsgroup was that they'd scrapped the project. Now I hear that John Gustafson (the designer of the T-Series) doesn't work there any more. Does anyone have information on either of these rumors?
--
Paul Suhler    suhler@im4u.UTEXAS.EDU    512-474-9517
ian@loral.UUCP (02/21/87)
In article <5699@amdahl.UUCP> chuck@amdahl.UUCP (Charles Simmons) writes:
>Seeing how two different people suggested that distributed memory is
>inappropriate for multiple users running independent tasks, maybe someone
>could tell me why. I find it hard to imagine a problem which is more
>partitionable.

Someone else in this newsgroup has already suggested that one problem that must be solved before a distributed memory parallel processor can be used to support multiple users is access to I/O resources. I agree with this, but meeting this requirement is not sufficient. Assuming that the objective is to do the sort of work that is done on the average UN*X system, a distributed memory parallel processor is a poor choice for supporting multiple users.

Charles Simmons suggests in his article that one way to allow multiple users to make use of a distributed memory parallel processor is to dedicate individual processors to the system's users. I have seen "home grown" systems that did just this. A system that permanently allocates a processor to a user until the user logs out makes very poor use of the system hardware, since each processor is busy only a fraction of the time. Such a system could not compete with shared memory systems like those offered by Sequent or Encore. In a distributed memory system, hardware utilization cannot be increased by dynamically moving a process between processors, since the overhead incurred is so large. This option is only realistic for tasks that take so long that the overhead is a small part of the execution time, which is not true for many applications.

Dynamic process allocation is practical on a distributed memory system that uses the Single Program, Multiple Data (SPMD) model. This model is used on the Intel cube. SPMD downloads the same code to all processors; to execute a process, a short message is sent to a processor telling it what to execute. There is little overhead, since the code is already resident. However, this model introduces some significant costs. Consider a parallel processor that consists of 32 processing elements (PEs), where each one has 512K bytes of memory. The machine potentially has 16M bytes of memory available, but under the SPMD model only 512K bytes is really available for use, since all processors hold the same code. (A schematic sketch of this dispatch scheme follows this article.)

Other problems with trying to use a distributed memory parallel processor to support multiple users include developing a distributed operating system and assuring deterministic system operation. These are not trivial problems.

        Ian Kaplan
        Loral Dataflow Group
        Loral Instrumentation
    USENET: {ucbvax,decvax,ihnp4}!sdcsvax!loral!ian
    ARPA:   sdcc6!loral!ian@UCSD
    USPS:   8401 Aero Dr. San Diego, CA 92123
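The sketch below is a schematic of the SPMD dispatch described above: every node holds the same program, and a short request message selects a resident routine, so starting a task costs only the delivery of that message. The request format and the two stand-in tasks are invented for illustration; this is not the Intel cube's actual interface.

#include <stdio.h>

struct request { int func_id; int arg; };   /* the entire cost of starting a task */

static int task_a(int arg) { return arg * 2; }   /* stand-ins for real resident code */
static int task_b(int arg) { return arg + 1; }

/* Every node runs this same program; no code is downloaded at dispatch
   time, only the small request above is delivered. */
static int dispatch(struct request r)
{
    return (r.func_id == 0) ? task_a(r.arg) : task_b(r.arg);
}

int main(void)
{
    /* Simulate the cube manager handing three requests to one node. */
    struct request work[3] = { {0, 21}, {1, 41}, {0, 5} };
    int i;
    for (i = 0; i < 3; i++)
        printf("request (func %d, arg %d) -> %d\n",
               work[i].func_id, work[i].arg, dispatch(work[i]));
    return 0;
}

The cost Ian points out shows up in the fact that task_a and task_b (and everything else) are replicated on every node, so a 32-node, 512K-per-node machine still offers each program only 512K.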
chuck@amdahl.UUCP (02/21/87)
In article <1513@im4u.UUCP> suhler@im4u.UUCP (Paul A. Suhler) writes:
>In article <5699@amdahl.UUCP> chuck@amdahl.UUCP (Charles Simmons) writes:
>>Seeing how two different people suggested that distributed memory is
>>inappropriate for multiple users running independent tasks, maybe someone
>>could tell me why. I find it hard to imagine a problem which is more
>>partitionable. Since the tasks are independent, each user should be very
>>happy to have her very own address space.
>
>This is true only if the machine has a decent I/O architecture, i.e.,
>enough paths among local memories and disk(s). Imagine trying to squeeze
>page swaps over multiple inter-node links. The situation with Intel's
>iPSC isn't much better: a single Ethernet linking a single disk with
>32 to 128 nodes. Rumor has it that they're well aware of the problem
>and will fix it in the next incarnation.
>Paul Suhler    suhler@im4u.UTEXAS.EDU    512-474-9517

Let's assume that each workstation has a 20Mb hard disk for caching I/O to the network. It may take a few minutes to load a working set of programs and data into the cache over a 40K baud Ethernet link, but after that, things should hum pretty well. 20Mb is enough to do a substantial amount of page swapping (even if the 8Mb of local RAM won't hold your programs) as well as hold most of the useful files associated with Unix. If you don't turn your machine off at night, you won't even be slowed down by having to reload your working set every morning.
--
Chuck
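Taking those figures at face value, here is a quick back-of-envelope on the cold-start time; it is an added illustration, not a measurement, and since "40K baud" is ambiguous both a bytes-per-second and a bits-per-second reading are shown.

#include <stdio.h>

int main(void)
{
    double cache_bytes = 20.0 * 1024 * 1024;   /* the 20Mb local disk cache */

    double rate_if_bytes = 40.0e3;             /* "40K" read as 40,000 bytes/second */
    double rate_if_bits  = 40.0e3 / 8.0;       /* effective bytes/second if it means 40,000 bits/second */

    printf("cold start at 40 Kbytes/s: about %.0f minutes\n",
           cache_bytes / rate_if_bytes / 60.0);
    printf("cold start at 40 Kbits/s:  about %.0f minutes\n",
           cache_bytes / rate_if_bits / 60.0);
    return 0;
}

Once the local cache is warm, of course, almost none of that traffic recurs, which is the real argument above.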
pase@ogcvax.UUCP (02/23/87)
In article <loral.1379> ian@loral.UUCP (Ian Kaplan) writes:
>In article <5699@amdahl.UUCP> chuck@amdahl.UUCP (Charles Simmons) writes:
>>Seeing how two different people suggested that distributed memory is
>>inappropriate for multiple users running independent tasks, maybe someone
>>could tell me why. I find it hard to imagine a problem which is more
>>partitionable.
>
> Someone else in this newsgroup has already suggested that one problem
> that must be solved before a distributed memory parallel processor can
> be used to support multiple users is access to I/O resources. I agree
> with this, but meeting this requirement is not sufficient. Assuming
> that the objective is to do the sort of work that is done on the average
> UN*X system, a distributed memory parallel processor is a poor choice
> for supporting multiple users.
> [...]

Distributed memory networks have been used for multi-user systems for several years now - cf. Apollo networks. Some, at least, would argue they have been used successfully. However, machines like the iPSC were designed to do heavy computing, NOT a lot of resource sharing. The cube manager is too much of a bottleneck to be used as a resource server to the tower.

At FPS we used a 10-node Apollo network. Five nodes had disks, but performance was still poor. Our best guess was that the network, more than anything else, was the limiting factor: processors sat idle while they waited for data to come back over the net. The Apollo network used a ring, which is slow, but the point is that half of the nodes had local disks, all nodes had 2 Mbytes of memory, and performance was often bad. The iPSC has 512 Kbytes per node, a better network topology, and a bottleneck called the cube manager. Its I/O throughput is such that it might successfully support 4 to 8 processors as task servers, but the cube manager would be absolutely swamped with many more than that. As poorly as the Apollo performed, the iPSC would do worse. (To be perfectly fair to Apollo, our load was heavier than their system was really designed for.)

The hypercube is set up to do number crunching, with lots of operations per byte of I/O. If your application is heavily biased towards I/O, or towards sharing a particular resource, the iPSC will not perform as well. What it does, it does well; supporting lots of users just isn't one of the things it does.
--
Doug Pase  --  ...ucbvax!tektronix!ogcvax!pase  or  pase@Oregon-Grad
simoni@Shasta.UUCP (02/25/87)
In article <5713@amdahl.UUCP> chuck@amdahl.UUCP (Charles Simmons) writes: >Let's assume that each workstation has a 20Mb hard disk for cacheing >I/O to the network. It may take a few minutes to load a working set >of programs and data into the cache over a 40K baud ethernet link, >but after that, things should hum pretty well. Interesting related reference: "File Access Performance of Diskless Workstations" Edward D. Lazowska, John Zahorjan, David Cheriton, and Willy Zwaenepoel Deaprtment of Computer Science, Stanford University, June 1984 Technical Report No. STAN-CS-84-1010 Rich
mark@mimsy.UUCP (02/25/87)
In article <1330@Shasta.STANFORD.EDU> simoni@Shasta.UUCP (Richard Simoni) writes:
>"File Access Performance of Diskless Workstations" [etc.]

Also appeared in ACM Transactions on Computer Systems, August 1986.
--
Spoken: Mark Weiser    ARPA: mark@mimsy.umd.edu    Phone: +1-301-454-7817
After May 1, 1987: weiser@xerox.com