[comp.research.japan] Kahaner Report: Int Symp on Shared Memory Multiproc, April 1991

rick@cs.arizona.edu (Rick Schlichting) (05/27/91)

  [Dr. David Kahaner is a numerical analyst visiting Japan for two years
   under the auspices of the Office of Naval Research-Asia (ONR/Asia).  
   The following is the professional opinion of David Kahaner and in no 
   way has the blessing of the US Government or any agency of it.  All 
   information is dated and of limited life time.  This disclaimer should 
   be noted on ANY attribution.]

  [Copies of previous reports written by Kahaner can be obtained from
   host cs.arizona.edu using anonymous FTP.]

To: Distribution
From: David K. Kahaner, ONR Asia [kahaner@xroads.cc.u-tokyo.ac.jp]
Re: Int Symp on Shared Memory Multiproc (ISSMM), Tokyo, 2-4 April 1991
24 May 1991

ABSTRACT. The International Symposium on Shared Memory Multiprocessing
(ISSMM), Tokyo, 2-4 April 1991 is summarized.

D. Notkin has been on a year's research sabbatical in Japan.
          Professor David Notkin
          Dept. of Computer Science & Engineering, FR-35
          University of Washington
          Seattle, WA  98195  USA
           Tel: 1-206-685-3798, Fax: 1-206-543-2969
           Email: notkin@cs.washington.edu
He attended the International Symposium on Shared Memory Multiprocessing 
(ISSMM), which was held in Tokyo on 2-4 April 1991. His summary of that 
meeting follows with some additional comments by Kahaner.  

ISSMM was sponsored by the Information Processing Society of Japan 
(IPSJ), with Professor Tadashi MASUDA from the University of Tokyo 
serving as general chairman and Dr. Norihisa SUZUKI serving as program 
chairman.  
        Dr. Norihisa SUZUKI
        Director, IBM Tokyo Research Laboratory
        5-19 Sanban-cho, Chiyoda-ku
        Tokyo 102 JAPAN
         Tel: 81-3-3288-8300
         Email: nsuzuki@trlvm1.vnet.ibm.com

Both Japanese and US industry and academia were represented on the program 
committee.  The proceedings were published by the Information Processing 
Society of Japan (IPSJ), and a version of them is expected to be 
published by MIT Press, with Dr. Suzuki serving as the editor.  The 
current plan is to have the second instance of the conference in northern 
California in late 1992.  

The conference included presentations of 23 refereed papers.  Of the 
papers, five were from Japan and the remaining 18 were from abroad.  (The 
titles, authors, and affiliations of all the papers are given below.  In 
addition, the abstracts of the five Japanese papers are given along with 
a few related comments.) The attendance was approximately 130, consisting 
of about 100 Japanese and 30 foreigners.  

There was also a panel on the feasibility of large-scale shared memory 
multiprocessors.  Thacker (DEC SRC) started off by making four 
observations: 
  (1) SIMD machines work well for large, regular problems; 
  (2) MIMD machines work well for less regular problems; 
  (3) Shared memory machines work well when they don't share and don't 
       synchronize; 
  (4) Programming large shared MIMD machines is much harder than 
       programming SIMD machines.  

Koike (NEC) commented that the history of processor design indicates that 
shared memory machines consisting of thousands of processors are not 
feasible in the near term.  In addition, he recommended that we focus 
carefully on specific application domains and on the kind of parallelism 
they require to get the most from existing (and near-future) parallel 
machines.  (Note that this is consistent with the discussion of the 
Cenju system, described below, in which a special-purpose machine was 
built that, after the fact, was viewed to be more general-purpose than 
intended.) 
	    Mr. Nobuhiko Koike
	    Manager, Computer System Research Laboratory
	    C&C Systems Research Laboratories
	    NEC Corp.
	    1-1, 4-chome, Miyazaki, Miyamae-ku
	    Kawasaki City, Kanagawa 213 Japan
	     Tel: (044) 856-2127, Fax: (044) 856-2231
 	     Email: koike@csl.cl.nec.co.jp

Baskett (Silicon Graphics) argued that for many problems, computational 
complexity is greater than read/write complexity, and so large problem 
sizes need relatively less bandwidth than small problem sizes.  So, he 
concluded that building large-scale machines is feasible if we intend to 
run large problem sizes on them.  
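
A standard illustration of this point (dense matrix multiplication; not
necessarily the example Baskett used) is that multiplying two n-by-n
matrices performs on the order of 2n^3 arithmetic operations while touching
only about 3n^2 words of data, so the bandwidth needed per operation
shrinks as the problem grows:

    \[
      \frac{\text{operations}}{\text{words moved}}
      \;\approx\; \frac{2n^{3}}{3n^{2}} \;=\; \frac{2n}{3} .
    \]

A machine with a fixed memory bandwidth per processor therefore looks
relatively better on larger problem sizes, which is Baskett's conclusion.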

Gifford (MIT) said that such machines are feasible, but that they are 
called multicomputers (not necessarily shared memory).  For programming 
ease, he stated that it was critical to name entities that are both local 
(in shared name spaces) and remote (in nonshared name spaces) uniformly; 
the performance consequences of this are acceptable, he suggested, if you 
don't in fact share very much in practice. The reason is that, in order 
to get machines on the order of 100 or more processors, we will have 
to manage non-uniform access times.  Overall, Gifford saw the merging of 
two research streams: (1) the flow of uniform access time machines (like 
TOP-1) to non-uniform access time machines (like DASH and Alewife); and 
(2) the flow of multicomputers (like hypercubes) to high-performance 
multicomputers (like J-Machine).  The central issue in this merging is 
the programming model.  
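
To make the uniform-naming idea concrete, the sketch below (in modern C,
with invented identifiers such as gaddr_t, remote_read, and MY_NODE; it
does not describe any machine mentioned at the panel) gives a single read
primitive covering both local and remote memory, so programs name
everything the same way even though access times differ:

    /* Illustrative sketch only: a "global address" that names local and
     * remote words uniformly.  All identifiers here are invented. */
    #include <stdint.h>

    typedef struct {
        uint16_t node;      /* which processor's memory holds the word */
        uint32_t offset;    /* offset within that node's memory        */
    } gaddr_t;

    extern int      MY_NODE;      /* this processor's identifier           */
    extern uint8_t *local_mem;    /* base of this processor's local memory */
    extern uint32_t remote_read(uint16_t node, uint32_t offset);
                                  /* assumed message-passing layer         */

    /* One read primitive for every name: callers cannot tell local from
     * remote, but the access times are non-uniform. */
    uint32_t g_read(gaddr_t a)
    {
        if (a.node == MY_NODE)
            return *(uint32_t *)(local_mem + a.offset);  /* fast path */
        return remote_read(a.node, a.offset);            /* slow path */
    }

Performance then depends, as Gifford suggested, on the slow path being
rare in practice.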

Murata (Waseda) said to look towards machines with hundreds of thousands 
of processors, based on wafer-scale technology.  He observed that we 
should not forget Amdahl's Law when considering the relationship between 
increasing problem size and the increasing number of processors.  He 
finished by stating that parallelizing compilers aren't working well 
enough, and that the need for distributed memory and programming 
paradigms is clear.  
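
For reference (the panel did not write it out), Amdahl's Law bounds the
speedup on P processors when a fraction f of the work is serial, and the
scaled-speedup variant shows why growing the problem with the machine
helps:

    \[
      S(P) \;=\; \frac{1}{\,f + \dfrac{1-f}{P}\,} \;\le\; \frac{1}{f},
      \qquad
      S_{\mathrm{scaled}}(P) \;=\; f + (1-f)\,P .
    \]

The first form says that even a few percent of serial work caps the
speedup; the second says the cap recedes if the parallel part of the
problem grows with P.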

Goodman (Wisconsin) concluded the panel by observing that shared memory 
machines are easy to program but hard to build, while non-shared memory 
machines are easy to build and hard to program.  He believes that the 
trend is to merge the two by building non-shared memory machines with 
hardware primitives that support cache-coherent shared memory.  He 
believes the challenge is to apply shared memory programming 
paradigms, perhaps extended in some ways, to non-shared memory algorithms 
and synchronization approaches.  
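
One concrete example of shared memory synchronization that coherent caches
make cheap is the test-and-test-and-set spin lock: waiters spin on a cached
copy of the flag and generate bus traffic only when the lock is released.
The sketch below, in modern C with C11 atomics, is a textbook construction
and is not taken from the Woest/Goodman paper:

    /* Test-and-test-and-set spin lock; a generic illustration only. */
    #include <stdatomic.h>
    #include <stdbool.h>

    typedef struct { atomic_bool locked; } spinlock_t;

    static void spin_lock(spinlock_t *l)
    {
        for (;;) {
            /* Spin read-only on the locally cached copy of the flag... */
            while (atomic_load_explicit(&l->locked, memory_order_relaxed))
                ;
            /* ...and attempt the invalidating atomic exchange only when
             * the flag has been observed clear. */
            if (!atomic_exchange_explicit(&l->locked, true,
                                          memory_order_acquire))
                return;
        }
    }

    static void spin_unlock(spinlock_t *l)
    {
        atomic_store_explicit(&l->locked, false, memory_order_release);
    }

The inner loop hits in each waiter's cache, so the coherence hardware
confines bus traffic to the moments when the lock actually changes hands.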

Papers:

Parallel Symbolic Computation on Shared Memory Multiprocessors.  E.M. 
Clarke, S. Kimura, D.E. Long, S. Michaylov, S.A. Schwab, and J.P.
Vidal (CMU).

Experimental Evaluation of Algorithmic Performance on Two Shared
Memory Multiprocessors.  A. Sivasubramaniam, G. Shah, J. Lee, U. 
Ramachandran, and H.  Venkateswaran (Georgia Tech).

An Empirical Investigation of the Effectiveness and Limitations of
Automatic Parallelization.  J.P. Singh and J.L. Hennessy (Stanford).

Fine-Grain Loop Scheduling for MIMD Machines.  C.J. Brownhill, K-C.
Kim, and A. Nicolau (Irvine).

A Cache Coherence Mechanism for Scalable, Shared-Memory
Multiprocessors.  S.L. Scott (Wisconsin).

Fault-Tolerant Design for Multistage Routing Networks.  A. DeHon, T.
Knight, and H. Minsky (MIT).

Dynamic Pointer Allocation for Scalable Cache Coherence Directories.
R. Simoni and M. Horowitz (Stanford).

Cenju: A Multiprocessor System with a Distributed Shared Memory Scheme
for Modular Circuit Simulation.  T. Nakata, N. Tanabe, N. Kajihara, S.
Matsushita, H. Onozuka, Y. Asano, and N. Koike (NEC).

        Cenju is an experimental multiprocessor system with a distributed 
        shared memory scheme developed mainly for circuit simulation.  
        The system is composed of 64 PEs which are divided into 8 
        clusters.  In each cluster, 8 PEs are connected by a cluster bus.  
        The cluster buses are in turn connected by a multistage network 
        to form the whole system.  Each PE consists of an MC68020 (20 MHz), 
        an MC68882 (20 MHz), 4 MB of RAM, and a WTL1167 floating-point 
        processor (20 MHz).  In this system, a distributed shared memory 
        scheme in which each PE contains a part of the whole global 
        memory is adopted.  The simulation algorithm used is hierarchical 
        modular simulation in which the circuit to be simulated is 
        divided into subcircuits connected by an interconnection network.  
        For the 64-processor system, speedups of 14.7 and 15.8 were 
        attained for two DRAM circuits.  Furthermore, by parallelizing 
        the serial bottleneck, a speedup of 25.8 could be realized.  In 
        this article, we describe the simulation algorithm and the 
        architecture of the system.  We also give some preliminary 
        evaluation of the memory scheme.  [Kahaner remarks that Cenju was 
        reported on (spice, 2 July 1990) as a machine for transient 
        analysis of circuits, for which it was originally built. Recently 
        though, NEC researchers have been studying other applications and 
        reported on an MHD computation at the annual parallel processing 
        meeting in May 1991.] 
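
If Amdahl's formula (quoted earlier in the panel summary) is applied
directly to these figures, which the paper itself does not do, the implied
serial fraction f = (1/S - 1/P)/(1 - 1/P) with P = 64 works out to

    \[
      S = 14.7 \;\Rightarrow\; f \approx 5.3\%, \qquad
      S = 25.8 \;\Rightarrow\; f \approx 2.4\%,
    \]

which is consistent with the authors' point that parallelizing the serial
bottleneck is what raised the speedup.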

Latency Tolerance through Multithreading in Large-Scale
Multiprocessors.  K. Kurihara (IBM Japan), D. Chaiken, and A. Agarwal
(MIT).

Overview and Status of the Stanford DASH Multiprocessor.  D. Lenoski,
J. Laudon, K. Gharachorloo, W-D. Weber, A. Gupta, and J.L. Hennessy
(Stanford). 

Restructuring a Parallel Simulation to Improve Cache Behavior in a
Shared-Memory Multiprocessor: A First Experience.  D.R. Cheriton, H.A.
Goosen, and P. Machanick (Stanford).

A Replay Mechanism for Mostly Functional Parallel Programs.  R.H.
Halstead, Jr. (DEC CRL) and D.A. Kranz (MIT).

MUSTARD: A Multiprocessor UNIX for Embedded Real-Time Systems.  S.
Hiroya, T. Momoi, and K. Nihei (NEC).

        MUSTARD is a portable multiprocessor UNIX for microprocessor 
        embedded real-time systems.  This UNIX is a two-layered OS 
        consisting of a real-time kernel and a UNIX kernel.  It runs 
        on a tightly coupled multiprocessor without a dedicated kernel 
        processor.  In addition, to simplify the structure of the fault-
        tolerant system, MUSTARD supports the addition/separation of a 
        processor during system operation.  The paper presents MUSTARD's 
        features and implementation, some performance measurements, the 
        hardware constructed to evaluate it, and its user programming 
        tools.  

Abstracting Data-Representation and Partitioning-Scheduling in
Parallel Programs.  G.A. Alverson and D. Notkin (Washington).

An Analysis of Synchronization Mechanisms in Shared Memory
Multiprocessors.  P.J. Woest and J.R. Goodman (Wisconsin).

Throughput and Fairness Analysis of Prioritized Multiprocessor Bus
Arbitration Protocols.  M. Ishigaki (Nomura Research), H. Takagi (IBM
TRL), Y. Takahashi, and T. Hasegawa (Kyoto).

        Performance characteristics of bus arbitration protocols for 
        multiprocessor computer systems are studied by a queueing-theoretic 
        approach as an alternative to the previous method based on 
        generalized Petri nets.  The bus utilization of each processor is 
        calculated numerically for a fixed priority, a cyclic priority, a 
        batching priority, and a modified Futurebus protocol.  Plotting 
        utilizations against the rate of service requests reveals the 
        fairness characteristics of these protocols.  For instance, in 
        the modified Futurebus protocol with statistically identical 
        processors, the bus utilization is divided evenly among all 
        processors under both light and heavy load conditions, while it is 
        allotted unevenly, in accordance with their priority order, under 
        medium load conditions.  
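
The fairness comparison is easy to reproduce in a toy model.  The sketch
below is a crude Monte Carlo simulation with invented parameters (four
processors, a 0.5 per-cycle request probability), not the paper's queueing
analysis, and it covers only the fixed-priority and cyclic (round-robin)
cases:

    /* Toy comparison of fixed-priority vs. cyclic bus arbitration.
     * Parameters are invented; this is not the paper's queueing model. */
    #include <stdio.h>
    #include <stdlib.h>

    #define NPROC   4
    #define CYCLES  1000000L
    #define P_REQ   0.5                /* per-cycle request probability */

    static int wants(void) { return (double)rand() / RAND_MAX < P_REQ; }

    int main(void)
    {
        int  pending[NPROC] = {0};
        long fixed[NPROC]   = {0};
        long cyclic[NPROC]  = {0};
        int  rr = 0;                   /* round-robin pointer */

        /* Fixed priority: the lowest-numbered pending processor wins. */
        for (long t = 0; t < CYCLES; t++) {
            for (int i = 0; i < NPROC; i++)
                if (!pending[i] && wants()) pending[i] = 1;
            for (int i = 0; i < NPROC; i++)
                if (pending[i]) { fixed[i]++; pending[i] = 0; break; }
        }

        /* Cyclic: the search starts just past the previous winner. */
        for (int i = 0; i < NPROC; i++) pending[i] = 0;
        for (long t = 0; t < CYCLES; t++) {
            for (int i = 0; i < NPROC; i++)
                if (!pending[i] && wants()) pending[i] = 1;
            for (int k = 0; k < NPROC; k++) {
                int i = (rr + k) % NPROC;
                if (pending[i]) {
                    cyclic[i]++; pending[i] = 0; rr = (i + 1) % NPROC;
                    break;
                }
            }
        }

        for (int i = 0; i < NPROC; i++)
            printf("processor %d: fixed %.3f   cyclic %.3f\n", i,
                   (double)fixed[i] / CYCLES, (double)cyclic[i] / CYCLES);
        return 0;
    }

At this (heavy) request rate the fixed-priority run starves the low-numbered
processors' neighbors while the cyclic run splits the bus almost evenly;
plotting such shares against the request rate is the kind of fairness
comparison the paper carries out analytically.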

Fine Grain Software Pipelining of Non-Vectorizable Nested Loops.  K-C.
Kim and A. Nicolau (Irvine).

A Dynamic Instruction Delivery Pipeline for Superscalar Architecture.
T. Chiueh (Berkeley).

Experience with the Firefly Multiprocessor Workstation.  S.S. Owicki
(DEC SRC).

Design and Evaluation of Snoop-Cache-Based Multiprocessor, TOP-1.  S.
Shimizu, N. Oba, A. Moriwaki, and T. Nakada (IBM TRL).

        TOP-1 is a snoop-cache-based multiprocessor workstation that was 
        developed to evaluate multiprocessor architecture design choices 
        as well as to conduct research on operating systems, compilers, 
        and applications for multiprocessor workstations.  It is a ten-
        way multiprocessor using the Intel 80386 and Weitek 1167, and is 
        currently running with a multiprocessor version of AIX, which was 
        also developed at IBM's Tokyo Research Laboratory.  The research 
        interest was focused on the design of an effective snoop cache 
        system and quantitative evaluation of its performance.  One of 
        the unique aspects of TOP-1's design is that the cache supports 
        four different original snoop protocols, which may coexist in the 
        system.  To evaluate the actual performance, we implemented a 
        hardware statistics monitor, which gathers statistical data on 
        the hardware.  This paper focuses mainly on a description of the 
        TOP-1 memory system design with regard to the cache protocols, 
        and its evaluation by means of the hardware statistics monitor 
        mentioned above.  Besides its cache design, the TOP-1 memory 
        system has three other unique architectural features: a high-
        speed bus-locking mechanism, two-way interleaved 64-bit buses 
        supported by two snoop cache controllers per processor, and an 
        effective arbitration mechanism to allow a prioritized quasi-
        round-robin service with distributed control.  These features are 
        also described in detail. [Kahaner comments that several 
        researchers at IBM's TRL told him that there were no plans for 
        commercialization, and the project is very much for 
        experimentation and education.] 
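
For readers unfamiliar with snoop caches, the skeleton below shows a
generic three-state (MSI) write-invalidate protocol as a single per-line
transition function.  It is a textbook sketch only; TOP-1's four protocols
are its own designs and are described in the paper:

    /* Generic MSI write-invalidate snoop protocol for one cache line.
     * A textbook sketch; not one of TOP-1's four protocols. */
    typedef enum { INVALID, SHARED, MODIFIED } line_state_t;

    typedef enum { CPU_READ, CPU_WRITE,     /* from the local processor */
                   BUS_READ, BUS_WRITE      /* snooped from the bus     */
    } event_t;

    typedef enum { BUS_NONE, BUS_RD, BUS_RDX, BUS_WB } bus_action_t;

    /* Returns the line's next state; *act is the bus transaction, if
     * any, that the cache controller must issue for the transition. */
    line_state_t msi_next(line_state_t s, event_t e, bus_action_t *act)
    {
        *act = BUS_NONE;
        switch (s) {
        case INVALID:
            if (e == CPU_READ)  { *act = BUS_RD;  return SHARED;   }
            if (e == CPU_WRITE) { *act = BUS_RDX; return MODIFIED; }
            return INVALID;              /* snooped traffic: nothing here */
        case SHARED:
            if (e == CPU_WRITE) { *act = BUS_RDX; return MODIFIED; }
            if (e == BUS_WRITE) return INVALID;  /* another cache claims it */
            return SHARED;
        case MODIFIED:
            if (e == BUS_READ)  { *act = BUS_WB; return SHARED;  }
            if (e == BUS_WRITE) { *act = BUS_WB; return INVALID; }
            return MODIFIED;
        }
        return s;
    }

How often transitions like these occur under real workloads is the kind of
quantitative question the paper studies with its hardware statistics
monitor.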
        

The Kyushu University Reconfigurable Parallel Processor--Cache
Architecture and Cache Coherence Schemes.  S. Mori, K. Murakami, E.
Iwata, A. Fukuda, and S. Tomita (Kyushu).

        The Kyushu University Reconfigurable Parallel Processor system is 
        an MIMD-type multiprocessor which consists of 128 processing 
        elements (PEs), interconnected by a full (128x128) crossbar 
        network.  The system employs a reconfigurable memory architecture, 
        a kind of local/remote memory architecture, and encompasses a 
        shared-memory TCMP (tightly coupled multiprocessor) and a 
        message-passing LCMP (loosely coupled multiprocessor).  When the 
        system is configured as a shared-memory TCMP, memory contention 
        becomes an obstacle to performance.  To relieve this contention, 
        the system provides each PE with a private unified cache.  Each 
        PE may hold a cached copy of shared data in its cache whether 
        it accesses local or remote memory, and therefore the 
        multicache consistency, or inter-cache coherence, problem arises.  
        The cache is a virtual-address direct-mapped cache to meet the 
        requirements for hit time and size.  The virtual-address 
        cache implementation causes another consistency problem, the 
        synonym problem, which we call the intra-cache coherence problem.  
        This paper presents the cache coherence schemes that we have 
        chosen for resolving these cache coherence problems: i) 
        cacheability marking scheme, ii) fast selective invalidation 
        scheme, iii) distributed limited-directory scheme, and iv) dual-
        directory cache scheme.  Cache coherence protocols and their 
        trade-offs among several aspects are also discussed.  [Kahaner 
        comments that the Kyushu work was described in the report 
        (parallel.904, 6 Nov 1990). There, he commented that given the 
        resources available at Kyushu, a project like this might be best 
        thought of as a mechanism for experience building and training.  
        Another paper on this was also presented at the annual parallel 
        processing meeting in May 1991. One of the project's principal 
        investigators, S. Tomita, has recently moved to Kyoto 
        University.] 
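
The synonym problem mentioned above is easy to show in a few lines.  In the
sketch below (cache and line sizes are invented; this is not the Kyushu
design), two virtual addresses that the operating system maps to the same
physical page index different lines of a virtually indexed, direct-mapped
cache, so a store through one mapping can leave a stale copy behind the
other:

    /* Why a virtual-address direct-mapped cache has a synonym
     * ("intra-cache coherence") problem.  Sizes and addresses invented. */
    #include <stdio.h>
    #include <stdint.h>

    #define CACHE_SIZE (64 * 1024)        /* 64 KB, virtually indexed */
    #define LINE_SIZE  32                 /* bytes per line           */
    #define NUM_LINES  (CACHE_SIZE / LINE_SIZE)

    static unsigned cache_index(uint32_t vaddr)
    {
        return (vaddr / LINE_SIZE) % NUM_LINES;  /* index uses virtual bits */
    }

    int main(void)
    {
        /* Two virtual pages assumed mapped to one physical page. */
        uint32_t synonym_a = 0x00010000;
        uint32_t synonym_b = 0x00238000;

        printf("index(a) = %u, index(b) = %u\n",
               cache_index(synonym_a), cache_index(synonym_b));
        /* The indices differ, so the same physical data can be cached in
         * two lines at once; that is the intra-cache inconsistency the
         * paper's coherence schemes are designed to resolve. */
        return 0;
    }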

Formal Verification of the Gigamax Cache Consistency Protocol.  K.L.
McMillan (CMU) and J. Schwalbe (Encore).

-------------------------END OF REPORT------------------------------