rick@cs.arizona.edu (Rick Schlichting) (05/27/91)
[Dr. David Kahaner is a numerical analyst visiting Japan for two years under the auspices of the Office of Naval Research-Asia (ONR/Asia). The following is the professional opinion of David Kahaner and in no way has the blessing of the US Government or any agency of it. All information is dated and of limited life time. This disclaimer should be noted on ANY attribution.]

[Copies of previous reports written by Kahaner can be obtained from host cs.arizona.edu using anonymous FTP.]

To: Distribution
From: David K. Kahaner, ONR Asia [kahaner@xroads.cc.u-tokyo.ac.jp]
Re: Int Symp on Shared Memory Multiproc (ISSMM), Tokyo, 2-4 April 1991
24 May 1991

ABSTRACT. The International Symposium on Shared Memory Multiprocessing (ISSMM), Tokyo, 2-4 April 1991 is summarized.

D. Notkin has been on a year's research sabbatical in Japan.

Professor David Notkin
Dept. of Computer Science & Engineering, FR-35
University of Washington
Seattle, WA 98195 USA
Tel: 1-206-685-3798, Fax: 1-206-543-2969
Email: notkin@cs.washington.edu

He attended the International Symposium on Shared Memory Multiprocessing (ISSMM), which was held in Tokyo on 2-4 April 1991. His summary of that meeting follows, with some additional comments by Kahaner.

ISSMM was sponsored by the Information Processing Society of Japan (IPSJ), with Professor Tadashi MASUDA of the University of Tokyo serving as general chairman and Dr. Norihisa SUZUKI serving as program chairman.

Dr. Norihisa SUZUKI
Director, IBM Tokyo Research Laboratory
5-19 Sanban-cho, Chiyoda-ku
Tokyo 102 JAPAN
Tel: 81-3-3288-8300
Email: nsuzuki@trlvm1.vnet.ibm.com

Both Japanese and US industry and academia were represented on the program committee. The proceedings were published by IPSJ, and a version of them is expected to be published by MIT Press, with Dr. Suzuki serving as editor. The current plan is to hold the second instance of the conference in northern California in late 1992.

The conference included presentations of 23 refereed papers. Of these, five were from Japan and the remaining 18 were from abroad. (The titles, authors, and affiliations of all the papers are given below. In addition, the abstracts of the five Japanese papers are given along with a few related comments.) The attendance was approximately 130, consisting of about 100 Japanese and 30 foreigners.

There was also a panel on the feasibility of large-scale shared memory multiprocessors. Thacker (DEC SRC) started off by making four observations: (1) SIMD machines work well for large, regular problems; (2) MIMD machines work well for less regular problems; (3) shared memory machines work well when they don't share and don't synchronize; and (4) programming large shared MIMD machines is much harder than programming SIMD machines.

Koike (NEC) commented that the history of processor design indicates that shared memory machines consisting of thousands of processors are not feasible in the near term. In addition, he recommended that we focus carefully on specific application domains and on the kind of parallelism they require in order to get the most from existing (and near-future) parallel machines. (Note that this is consistent with the discussion of the Cenju system, described below, in which a special-purpose machine was built that was seen, after the fact, to be more general-purpose than intended.)

Mr. Nobuhiko Koike
Manager, Computer System Research Laboratory
C&C Systems Research Laboratories
NEC Corp.
1-1, 4-chome, Miyazaki, Miyamae-ku
Kawasaki City, Kanagawa 213 Japan
Tel: (044) 856-2127, Fax: (044) 856-2231
Email: koike@csl.cl.nec.co.jp

Baskett (Silicon Graphics) argued that for many problems computational complexity grows faster than read/write complexity, and so large problem sizes need relatively less bandwidth than small problem sizes. He concluded that building large-scale machines is feasible if we intend to run large problem sizes on them.

Gifford (MIT) said that such machines are feasible, but that they are called multicomputers (not necessarily shared memory). For programming ease, he stated that it is critical to name entities uniformly, whether they are local (in shared name spaces) or remote (in nonshared name spaces); the performance consequences of this are acceptable, he suggested, if you don't in fact share very much in practice. The reason is that, in order to get machines on the order of 100 or more processors, we will have to manage non-uniform access times. Overall, Gifford saw the merging of two research streams: (1) the flow from uniform access time machines (like TOP-1) to non-uniform access time machines (like Dash and Alewife); and (2) the flow from multicomputers (like hypercubes) to high-performance multicomputers (like the J-Machine). The central issue in this merging is the programming model.

Murata (Waseda) said to look toward machines with hundreds of thousands of processors, based on wafer-scale technology. He observed that we should not forget Amdahl's Law when considering the relationship between increasing problem size and the increasing number of processors. He finished by stating that parallelizing compilers are not working well enough, and that the need for distributed memory and programming paradigms is clear.

Goodman (Wisconsin) concluded the panel by observing that shared memory machines are easy to program but hard to build, while non-shared memory machines are easy to build but hard to program. He believes that the trend is to merge the two by building non-shared memory machines with hardware primitives that support cache-coherent shared memory. The challenge, he believes, is to extend the shared memory programming paradigms, perhaps in modified form, to non-shared memory algorithms and synchronization approaches.
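As a point of reference for Murata's and Baskett's remarks, the standard back-of-envelope formulas (not taken from the talks themselves) are worth recalling. Amdahl's Law bounds the speedup on P processors by the serial fraction f of the work, and a dense n x n matrix multiply is the usual illustration of why arithmetic outgrows data traffic as problems get larger:

    S(P) = \frac{1}{f + (1 - f)/P} \;\le\; \frac{1}{f}

    \frac{2n^3\ \mathrm{flops}}{3n^2\ \mathrm{words}} = \frac{2n}{3}\ \mathrm{flops/word}

Even a 1% serial fraction caps the speedup at 100 no matter how many wafer-scale processors are added, while doubling the problem size n roughly halves the memory bandwidth needed per floating point operation.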
Papers:

Parallel Symbolic Computation on Shared Memory Multiprocessors. E.M. Clarke, S. Kimura, D.E. Long, S. Michaylov, S.A. Schwab, and J.P. Vidal (CMU).

Experimental Evaluation of Algorithmic Performance on Two Shared Memory Multiprocessors. A. Sivasubramaniam, G. Shah, J. Lee, U. Ramachandran, and H. Venkateswaran (Georgia Tech).

An Empirical Investigation of the Effectiveness and Limitations of Automatic Parallelization. J.P. Singh and J.L. Hennessy (Stanford).

Fine-Grain Loop Scheduling for MIMD Machines. C.J. Brownhill, K-C. Kim, and A. Nicolau (Irvine).

A Cache Coherence Mechanism for Scalable, Shared-Memory Multiprocessors. S.L. Scott (Wisconsin).

Fault-Tolerant Design for Multistage Routing Networks. A. DeHon, T. Knight, and H. Minsky (MIT).

Dynamic Pointer Allocation for Scalable Cache Coherence Directories. R. Simoni and M. Horowitz (Stanford).

Cenju: A Multiprocessor System with a Distributed Shared Memory Scheme for Modular Circuit Simulation. T. Nakata, N. Tanabe, N. Kajihara, S. Matsushita, H. Onozuka, Y. Asano, and N. Koike (NEC).

Cenju is an experimental multiprocessor system with a distributed shared memory scheme developed mainly for circuit simulation. The system is composed of 64 PEs divided into 8 clusters. In each cluster, 8 PEs are connected by a cluster bus; the cluster buses are in turn connected by a multistage network to form the whole system. Each PE consists of an MC68020 (20 MHz), an MC68882 (20 MHz), 4 MB of RAM, and a WTL1167 floating point processor (20 MHz). The system adopts a distributed shared memory scheme in which each PE holds a part of the whole global memory. The simulation algorithm used is hierarchical modular simulation, in which the circuit to be simulated is divided into subcircuits connected by an interconnection network. For the 64-processor system, speedups of 14.7 and 15.8 were attained for two DRAM circuits. Furthermore, by parallelizing the serial bottleneck, a speedup of 25.8 could be realized. In this article, we describe the simulation algorithm and the architecture of the system. We also give some preliminary evaluation of the memory scheme.

[Kahaner remarks that Cenju was reported on earlier (spice, 2 July 1990) as a machine for transient analysis of circuits, for which it was originally built. Recently, though, NEC researchers have been studying other applications, and they reported on an MHD computation at the annual parallel processing meeting in May 1991.]
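A rough Amdahl's-Law reading of the Cenju numbers (a consistency check only, not from the paper, and ignoring communication and load-imbalance effects): solving the speedup formula for the serial fraction f with P = 64 processors gives

    f = \frac{1/S - 1/P}{1 - 1/P}

so the reported speedups of 14.7 and 15.8 correspond to a serial fraction of roughly 5%, and the improved speedup of 25.8 to roughly 2.4%, which is consistent with the abstract's statement that the gain came from parallelizing the serial bottleneck.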
Latency Tolerance through Multithreading in Large-Scale Multiprocessors. K. Kurihara (IBM Japan), D. Chaiken, and A. Agarwal (MIT).

Overview and Status of the Stanford DASH Multiprocessor. D. Lenoski, J. Laudon, K. Gharachorloo, W-D. Weber, A. Gupta, and J.L. Hennessy (Stanford).

Restructuring a Parallel Simulation to Improve Cache Behavior in a Shared-Memory Multiprocessor: A First Experience. D.R. Cheriton, H.A. Goosen, and P. Machanick (Stanford).

A Replay Mechanism for Mostly Functional Parallel Programs. R.H. Halstead, Jr. (DEC CRL) and D.A. Kranz (MIT).

MUSTARD: A Multiprocessor UNIX for Embedded Real-Time Systems. S. Hiroya, T. Momoi, and K. Nihei (NEC).

MUSTARD is a portable multiprocessor UNIX for microprocessor-based embedded real-time systems. It is a two-layered OS consisting of a real-time kernel and a UNIX kernel, and it runs on a tightly coupled multiprocessor without a dedicated kernel processor. In addition, to simplify the structure of a fault-tolerant system, MUSTARD supports the addition and separation of a processor during system operation. The paper presents the features and implementation of MUSTARD, some performance measurements, the hardware constructed to evaluate it, and its user programming tools.

Abstracting Data-Representation and Partitioning-Scheduling in Parallel Programs. G.A. Alverson and D. Notkin (Washington).

An Analysis of Synchronization Mechanisms in Shared Memory Multiprocessors. P.J. Woest and J.R. Goodman (Wisconsin).

Throughput and Fairness Analysis of Prioritized Multiprocessor Bus Arbitration Protocols. M. Ishigaki (Nomura Research), H. Takagi (IBM TRL), Y. Takahashi, and T. Hasegawa (Kyoto).

Performance characteristics of bus arbitration protocols for multiprocessor computer systems are studied by a queueing-theoretic approach, as an alternative to the previous method based on generalized Petri nets. The bus utilization of each processor is calculated numerically for a fixed-priority, a cyclic-priority, a batching-priority, and a modified Futurebus protocol. Plotting utilizations against the rate of service requests reveals the fairness characteristics of these protocols. For instance, in the modified Futurebus protocol with statistically identical processors, the bus utilization is evenly divided among all processors under both light and heavy loads, while it is allotted unevenly in accordance with their priority order under medium loads.

Fine Grain Software Pipelining of Non-Vectorizable Nested Loops. K-C. Kim and A. Nicolau (Irvine).

A Dynamic Instruction Delivery Pipeline for Superscalar Architecture. T. Chiueh (Berkeley).

Experience with the Firefly Multiprocessor Workstation. S.S. Owicki (DEC SRC).

Design and Evaluation of Snoop-Cache-Based Multiprocessor, TOP-1. S. Shimizu, N. Oba, A. Moriwaki, and T. Nakada (IBM TRL).

TOP-1 is a snoop-cache-based multiprocessor workstation that was developed to evaluate multiprocessor architecture design choices as well as to conduct research on operating systems, compilers, and applications for multiprocessor workstations. It is a ten-way multiprocessor using the Intel 80386 and Weitek 1167, and is currently running a multiprocessor version of AIX, which was also developed at IBM's Tokyo Research Laboratory. The research interest focused on the design of an effective snoop cache system and a quantitative evaluation of its performance. One of the unique aspects of TOP-1's design is that the cache supports four different original snoop protocols, which may coexist in the system. To evaluate the actual performance, we implemented a hardware statistics monitor, which gathers statistical data on the hardware. This paper focuses mainly on a description of the TOP-1 memory system design with regard to the cache protocols, and on its evaluation by means of the hardware statistics monitor mentioned above. Besides its cache design, the TOP-1 memory system has three other unique architectural features: a high-speed bus-locking mechanism, two-way interleaved 64-bit buses supported by two snoop cache controllers per processor, and an effective arbitration mechanism that allows prioritized quasi-round-robin service with distributed control. These features are also described in detail.

[Kahaner comments that several researchers at IBM's TRL told him that there were no plans for commercialization, and that the project is very much for experimentation and education.]
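For readers less familiar with snoop caches, the following is a minimal sketch of a generic write-invalidate (MESI-style) protocol, the kind of per-line state machine a snoop cache controller implements. It is a textbook-style illustration only, not one of TOP-1's four protocols, whose details are not given here; the function names, the others_have_copy flag (standing in for the bus "shared" line), and the omission of writebacks are simplifying assumptions.

    #include <stdio.h>

    /* Generic write-invalidate snoop protocol (MESI-style).
     * Illustrative only -- not one of TOP-1's four protocols. */

    enum state { INVALID, SHARED, EXCLUSIVE, MODIFIED };

    static const char *name[] = { "INVALID", "SHARED", "EXCLUSIVE", "MODIFIED" };

    /* State change in the local cache when its own processor accesses the
     * line.  A write from INVALID or SHARED also places an invalidation
     * transaction on the bus (not shown). */
    enum state cpu_access(enum state s, int is_write, int others_have_copy)
    {
        if (is_write)
            return MODIFIED;                  /* gain exclusive ownership      */
        if (s == INVALID)                     /* read miss: fetch over the bus */
            return others_have_copy ? SHARED : EXCLUSIVE;
        return s;                             /* read hit: unchanged           */
    }

    /* State change when the controller snoops another processor's bus
     * transaction for the same line.  A MODIFIED line is written back
     * (not shown) before the state changes. */
    enum state snoop(enum state s, int remote_is_write)
    {
        if (s == INVALID)
            return INVALID;
        return remote_is_write ? INVALID : SHARED;
    }

    int main(void)
    {
        enum state s = INVALID;
        s = cpu_access(s, 0, 0);   /* local read, no other copies */
        printf("after local read:   %s\n", name[s]);
        s = cpu_access(s, 1, 0);   /* local write                 */
        printf("after local write:  %s\n", name[s]);
        s = snoop(s, 0);           /* another processor reads     */
        printf("after remote read:  %s\n", name[s]);
        s = snoop(s, 1);           /* another processor writes    */
        printf("after remote write: %s\n", name[s]);
        return 0;
    }

Compiled and run, the sketch prints the EXCLUSIVE, MODIFIED, SHARED, INVALID progression for a local read, a local write, a remote read, and a remote write.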
The Kyushu University Reconfigurable Parallel Processor--Cache Architecture and Cache Coherence Schemes. S. Mori, K. Murakami, E. Iwata, A. Fukuda, and S. Tomita (Kyushu).

The Kyushu University Reconfigurable Parallel Processor system is an MIMD-type multiprocessor consisting of 128 processing elements (PEs) interconnected by a full (128x128) crossbar network. The system employs a reconfigurable memory architecture, a kind of local/remote memory architecture, and encompasses both a shared-memory TCMP (tightly coupled multiprocessor) and a message-passing LCMP (loosely coupled multiprocessor). When the system is configured as a shared-memory TCMP, memory contention becomes an obstacle to performance. To relieve its effects, the system provides each PE with a private unified cache. Each PE may hold a cached copy of shared data whether it accesses local or remote memory, and therefore the multicache consistency, or inter-cache coherence, problem arises. The cache is a virtual-address direct-mapped cache, chosen to meet the hit-time and size requirements. The virtual-address cache implementation causes another consistency problem, the synonym problem, which we call the intra-cache coherence problem. This paper presents the cache coherence schemes that we have chosen for resolving these problems: i) a cacheability marking scheme, ii) a fast selective invalidation scheme, iii) a distributed limited-directory scheme, and iv) a dual-directory cache scheme. Cache coherence protocols and their trade-offs among several aspects are also discussed.

[Kahaner comments that the Kyushu work was described in an earlier report (parallel.904, 6 Nov 1990). There, he commented that given the resources available at Kyushu, a project like this might best be thought of as a mechanism for experience building and training. Another paper on this work was also presented at the annual parallel processing meeting in May 1991. One of the project's principal investigators, S. Tomita, has recently moved to Kyoto University.]

Formal Verification of the Gigamax Cache Consistency Protocol. K.L. McMillan (CMU) and J. Schwalbe (Encore).

-------------------------END OF REPORT------------------------------