rick@cs.arizona.edu (Rick Schlichting) (05/27/91)
[Dr. David Kahaner is a numerical analyst visiting Japan for two years
under the auspices of the Office of Naval Research-Asia (ONR/Asia).
The following is the professional opinion of David Kahaner and in no
way has the blessing of the US Government or any agency of it. All
information is dated and of limited lifetime. This disclaimer should
be noted on ANY attribution.]
[Copies of previous reports written by Kahaner can be obtained from
host cs.arizona.edu using anonymous FTP.]
To: Distribution
From: David K. Kahaner, ONR Asia [kahaner@xroads.cc.u-tokyo.ac.jp]
Re: Int Symp on Shared Memory Multiproc (ISSMM), Tokyo, 2-4 April 1991
24 May 1991
ABSTRACT. The International Symposium on Shared Memory Multiprocessing
(ISSMM), Tokyo, 2-4 April 1991 is summarized.
D. Notkin has been on a year's research sabbatical in Japan.
Professor David Notkin
Dept. of Computer Science & Engineering, FR-35
University of Washington
Seattle, WA 98195 USA
Tel: 1-206-685-3798, Fax: 1-206-543-2969
Email: notkin@cs.washington.edu
He attended the International Symposium on Shared Memory Multiprocessing
(ISSMM), which was held in Tokyo on 2-4 April 1991. His summary of that
meeting follows with some additional comments by Kahaner.
ISSMM was sponsored by the Information Processing Society of Japan
(IPSJ), with Professor Tadashi MASUDA from the University of Tokyo
serving as general chairman and Dr. Norihisa SUZUKI serving as program
chairman.
Dr. Norihisa SUZUKI
Director, IBM Tokyo Research Laboratory
5-19 Sanban-cho, Chiyoda-ku
Tokyo 102 JAPAN
Tel: 81-3-3288-8300
Email: nsuzuki@trlvm1.vnet.ibm.com
Both Japanese and US industry and academia were represented on the program
committee. The proceedings were published by the Information Processing
Society of Japan (IPSJ), and a version of them is expected to be
published by MIT Press, with Dr. Suzuki serving as the editor. The
current plan is to have the second instance of the conference in northern
California in late 1992.
The conference included presentations of 23 refereed papers. Of the
papers, five were from Japan and the remaining 18 were from abroad. (The
titles, authors, and affiliations of all the papers are given below. In
addition, the abstracts of the five Japanese papers are given along with
a few related comments.) The attendance was approximately 130, consisting
of about 100 Japanese and 30 foreigners.
There was also a panel on the feasibility of large-scale shared memory
multiprocessors. Thacker (DEC SRC) started off by making four
observations:
(1) SIMD machines work well for large, regular problems;
(2) MIMD machines work well for less regular problems;
(3) Shared memory machines work well when they don't share and don't
synchronize;
(4) Programming large shared MIMD machines is much harder than
programming SIMD machines.
Koike (NEC) commented that the history of processor design indicates that
shared memory machines consisting of thousands of processors are not
feasible in the near term. In addition, he recommended that we focus
carefully on specific application domains and on the kind of parallelism
they require to get the most from existing (and near-future) parallel
machines. (Note that this is consistent with the discussion of the
Cenju system, described below, in which a special-purpose machine was
built and, after the fact, was seen to be more general-purpose than
intended.)
Mr. Nobuhiko Koike
Manager, Computer System Research Laboratory
C&C Systems Research Laboratories
NEC Corp.
1-1, 4-chome, Miyazaki, Miyamae-ku
Kawasaki City, Kanagawa 213 Japan
Tel: (044) 856-2127, Fax: (044) 856-2231
Email: koike@csl.cl.nec.co.jp
Baskett (Silicon Graphics) argued that for many problems, computational
complexity grows faster than read/write complexity, so large problem
sizes need relatively less bandwidth than small ones. He therefore
concluded that building large-scale machines is feasible if we intend to
run large problems on them.
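A rough way to see Baskett's argument (this illustration is ours, not from
his talk) is dense matrix multiplication: the arithmetic grows as roughly
2*n^3 operations while the data touched grows as roughly 3*n^2 words, so
the bandwidth needed per operation falls as the problem grows. A minimal
sketch in Python, with those operation counts as stated assumptions:

    # Illustrative only: operation and traffic counts for a dense n x n
    # matrix multiply, assuming ~2*n^3 floating-point operations and ~3*n^2
    # matrix elements moved between memory and the processors (no cache
    # reuse counted).
    def flops_per_word(n):
        flops = 2.0 * n ** 3
        words = 3.0 * n ** 2
        return flops / words

    for n in (100, 1000, 10000):
        print("n = %6d   operations per word moved ~ %8.1f"
              % (n, flops_per_word(n)))

The ratio grows linearly with n, which is the sense in which larger
problems demand relatively less memory bandwidth per unit of work.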
Gifford (MIT) said that such machines are feasible, but that they are
called multicomputers (not necessarily shared memory). For programming
ease, he stated that it was critical to name entities uniformly, whether
they are local (in shared name spaces) or remote (in nonshared name spaces);
the performance consequences of this are acceptable, he suggested, if you
don't in fact share very much in practice. The reason is that, in order
to get machines on the order of 100 or more processors, we will have
to manage non-uniform access times. Overall, Gifford saw the merging of
two research streams: (1) the flow from uniform access time machines (like
TOP-1) to non-uniform access time machines (like Dash and Alewife); and
(2) the flow from multicomputers (like hypercubes) to high-performance
multicomputers (like the J-Machine). The central issue in this merging is
the programming model.
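One minimal way to picture the uniform-naming idea (again, our sketch, not
Gifford's design) is an interface that uses the same read operation for
every datum and hides whether the named entity is local or remote; only
the access cost differs, which is tolerable if remote references are rare.
The node layout and latencies below are invented for the example:

    # Illustrative uniform name space: one read() call regardless of where
    # the datum lives; placement and access costs are made-up assumptions.
    LOCAL_COST, REMOTE_COST = 1, 100   # assumed access times, arbitrary units

    class UniformMemory:
        def __init__(self, node_id, placement, stores):
            self.node_id = node_id      # node this interface runs on
            self.placement = placement  # datum name -> owning node
            self.stores = stores        # owning node -> {name: value}
            self.cycles = 0

        def read(self, name):
            owner = self.placement[name]   # same naming for every datum
            self.cycles += LOCAL_COST if owner == self.node_id else REMOTE_COST
            return self.stores[owner][name]

    placement = {"x": 0, "y": 1}
    stores = {0: {"x": 10}, 1: {"y": 20}}
    node0 = UniformMemory(0, placement, stores)
    print(node0.read("x"), node0.read("y"), "cycles:", node0.cycles)  # 10 20 cycles: 101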
Murata (Waseda) said we should look toward machines with hundreds of thousands
of processors, based on wafer-scale technology. He observed that we
should not forget Amdahl's Law when considering the relationship between
increasing problem size and the increasing number of processors. He
finished by stating that parallelizing compilers aren't working well
enough, and that the need for distributed memory and programming
paradigms is clear.
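For reference, Amdahl's Law bounds the speedup of a program whose serial
fraction is f when run on N processors; the values of f below are purely
illustrative and are not Murata's numbers:

    # Amdahl's Law: best-case speedup on n processors when a fraction f of
    # the work is inherently serial. The serial fractions are illustrative.
    def amdahl_speedup(f, n):
        return 1.0 / (f + (1.0 - f) / n)

    for f in (0.01, 0.05, 0.10):
        bounds = [round(amdahl_speedup(f, n), 1) for n in (64, 1024, 100000)]
        print("serial fraction %.2f -> speedup on 64/1024/100000 PEs:" % f, bounds)

Unless the parallel portion grows along with the machine (that is, unless
the problem size scales up), the speedup is capped at 1/f no matter how
many processors are added, which is the caution Murata raised.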
Goodman (Wisconsin) concluded the panel by observing that shared memory
machines are easy to program but hard to build, while non-shared memory
machines are easy to build but hard to program. He believes that the
trend is to merge the two by building non-shared memory machines with
hardware primitives that support cache-coherent shared memory. He
believes the challenge is to carry the shared memory programming
paradigms, perhaps extended in some ways, over to non-shared memory
algorithms and synchronization approaches.
Papers:
Parallel Symbolic Computation on Shared Memory Multiprocessors. E.M.
Clarke, S. Kimura, D.E. Long, S. Michaylov, S.A. Schwab, and J.P.
Vidal (CMU).
Experimental Evaluation of Algorithmic Performance on Two Shared
Memory Multiprocessors. A. Sivasubramaniam, G. Shah, J. Lee, U.
Ramachandran, and H. Venkateswaran (Georgia Tech).
An Empirical Investigation of the Effectiveness and Limitations of
Automatic Parallelization. J.P. Singh and J.L. Hennessy (Stanford).
Fine-Grain Loop Scheduling for MIMD Machines. C.J. Brownhill, K-C.
Kim, and A. Nicolau (Irvine).
A Cache Coherence Mechanism for Scalable, Shared-Memory
Multiprocessors. S.L. Scott (Wisconsin).
Fault-Tolerant Design for Multistage Routing Networks. A. DeHon, T.
Knight, and H. Minsky (MIT).
Dynamic Pointer Allocation for Scalable Cache Coherence Directories.
R. Simoni and M. Horowitz (Stanford).
Cenju: A Multiprocessor System with a Distributed Shared Memory Scheme
for Modular Circuit Simulation. T. Nakata, N. Tanabe, N. Kajihara, S.
Matsushita, H. Onozuka, Y. Asano, and N. Koike (NEC).
Cenju is an experimental multiprocessor system with a distributed
shared memory scheme developed mainly for circuit simulation.
The system is composed of 64 PEs which are divided into 8
clusters. In each cluster, 8 PEs are connected by a cluster bus.
The cluster buses are in turn connected by a multistage network
to form the whole system. Each PE consists of an MC68020 (20MHz),
an MC68882 (20MHz), 4MB of RAM, and a WTL1167 floating-point
processor (20MHz). The system adopts a distributed shared memory
scheme in which each PE contains a part of the global memory. The
simulation algorithm used is hierarchical modular simulation, in
which the circuit to be simulated is divided into subcircuits
connected by an interconnection network. For the 64-processor
system, speedups of 14.7 and 15.8 were attained for two DRAM
circuits. Furthermore, by parallelizing the serial bottleneck, a
speedup of 25.8 could be realized. In
this article, we describe the simulation algorithm and the
architecture of the system. We also give some preliminary
evaluation of the memory scheme. [Kahaner remarks that Cenju was
reported on (spice, 2 July 1990) as a machine for transient
analysis of circuits, for which it was originally built. Recently
though, NEC researchers have been studying other applications and
reported on an MHD computation at the annual parallel processing
meeting in May 1991.]
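As a back-of-envelope check (the report gives only the speedup figures;
the fit below is ours and assumes the simple Amdahl model sketched
earlier), the reported 64-PE speedups imply serial fractions of a few
percent, and the improvement to 25.8 is consistent with roughly halving
that serial bottleneck:

    # Rough consistency check, not the authors' analysis: the serial
    # fraction f that a simple Amdahl-style model would need in order to
    # give the reported speedups on 64 PEs.
    def implied_serial_fraction(speedup, n=64):
        return (1.0 / speedup - 1.0 / n) / (1.0 - 1.0 / n)

    for s in (14.7, 15.8, 25.8):
        print("speedup %4.1f on 64 PEs -> implied serial fraction ~%.1f%%"
              % (s, 100.0 * implied_serial_fraction(s)))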
Latency Tolerance through Multithreading in Large-Scale
Multiprocessors. K. Kurihara (IBM Japan), D. Chaiken, and A. Agarwal
(MIT).
Overview and Status of the Stanford DASH Multiprocessor. D. Lenoski,
J. Laudon, K. Gharachorloo, W-D. Weber, A. Gupta, and J.L. Hennessy
(Stanford).
Restructuring a Parallel Simulation to Improve Cache Behavior in a
Shared-Memory Multiprocessor: A First Experience. D.R. Cheriton, H.A.
Goosen, and P. Machanick (Stanford).
A Replay Mechanism for Mostly Functional Parallel Programs. R.H.
Halstead, Jr. (DEC CRL) and D.A. Kranz (MIT).
MUSTARD: A Multiprocessor UNIX for Embedded Real-Time Systems. S.
Hiroya, T. Momoi, and K. Nihei (NEC).
MUSTARD is a portable multiprocessor UNIX for microprocessor-based
embedded real-time systems. It is a two-layered OS consisting of a
real-time kernel and a UNIX kernel, and it runs on a tightly coupled
multiprocessor without a dedicated kernel processor. In addition, to
simplify the structure of fault-tolerant systems, MUSTARD supports
the addition and separation of a processor during system operation.
The paper presents MUSTARD's features and implementation, some
performance measurements, the hardware constructed to evaluate it,
and its user programming tools.
Abstracting Data-Representation and Partitioning-Scheduling in
Parallel Programs. G.A. Alverson and D. Notkin (Washington).
An Analysis of Synchronization Mechanisms in Shared Memory
Multiprocessors. P.J. Woest and J.R. Goodman (Wisconsin).
Throughput and Fairness Analysis of Prioritized Multiprocessor Bus
Arbitration Protocols. M. Ishigaki (Nomura Research), H. Takagi (IBM
TRL), Y. Takahashi, and T. Hasegawa (Kyoto).
Performance characteristics of bus arbitration protocols for
multiprocessor computer systems are studied by a queueing-theoretic
approach, as an alternative to the previous method based on
generalized Petri nets. The bus utilization of each processor is
calculated numerically for a fixed priority, a cyclic priority, a
batching priority, and a modified Futurebus protocol. Plotting
utilizations against the rate of service requests reveals the
fairness characteristics of these protocols. For instance, in
the modified Futurebus protocol with statistically identical
processors, the bus utilization is evenly divided among all
processors under both light and heavy load conditions, while it is
allotted unevenly, in accordance with their priority order, under
medium load conditions.
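The kind of fairness effect being measured shows up even in a toy
cycle-by-cycle model of a fixed-priority bus; this caricature is ours, not
the paper's queueing analysis, and the request probability p is an assumed
parameter:

    import random

    # Toy model only: each cycle every processor requests the bus
    # independently with probability p, and a fixed-priority arbiter grants
    # the lowest-numbered requester.
    def fixed_priority_shares(n_procs=8, p=0.3, cycles=200000, seed=1):
        rng = random.Random(seed)
        grants = [0] * n_procs
        for _ in range(cycles):
            requesters = [i for i in range(n_procs) if rng.random() < p]
            if requesters:
                grants[min(requesters)] += 1   # lowest index always wins
        total = sum(grants)
        return [g / total for g in grants]

    for p in (0.05, 0.50, 0.95):
        print("request prob %.2f -> bus share per processor:" % p,
              ["%.2f" % s for s in fixed_priority_shares(p=p)])

Fixed priority grows steadily less fair as the load rises; the finding
reported above is that the modified Futurebus protocol behaves differently,
staying fair at both light and heavy load and becoming uneven only at
medium load.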
Fine Grain Software Pipelining of Non-Vectorizable Nested Loops. K-C.
Kim and A. Nicolau (Irvine).
A Dynamic Instruction Delivery Pipeline for Superscalar Architecture.
T. Chiueh (Berkeley).
Experience with the Firefly Multiprocessor Workstation. S.S. Owicki
(DEC SRC).
Design and Evaluation of Snoop-Cache-Based Multiprocessor, TOP-1. S.
Shimizu, N. Oba, A. Moriwaki, and T. Nakada (IBM TRL).
TOP-1 is a snoop-cache-based multiprocessor workstation that was
developed to evaluate multiprocessor architecture design choices
as well as to conduct research on operating systems, compilers,
and applications for multiprocessor workstations. It is a ten-
way multiprocessor using the Intel 80386 and Weitek 1167, and is
currently running with a multiprocessor version of AIX, which was
also developed at IBM's Tokyo Research Laboratory. The research
interest was focused on the design of an effective snoop cache
system and quantitative evaluation of its performance. One of
the unique aspects of TOP-1's design is that the cache supports
four different original snoop protocols, which may coexist in the
system. To evaluate the actual performance, we implemented a
hardware statistics monitor, which gathers statistical data on
the hardware. This paper focuses mainly on a description of the
TOP-1 memory system design with regard to the cache protocols,
and its evaluation by means of the hardware statistics monitor
mentioned above. Besides its cache design, the TOP-1 memory
system has three other unique architectural features: a high-
speed bus-locking mechanism, two-way interleaved 64-bit buses
supported by two snoop cache controllers per processor, and an
effective arbitration mechanism to allow a prioritized quasi-
round-robin service with distributed control. These features are
also described in detail. [Kahaner comments that several
researchers at IBM's TRL told him that there were no plans for
commercialization, and the project is very much for
experimentation and education.]
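For readers less familiar with snoop caches, the sketch below shows the
skeleton of a generic write-invalidate (MSI-style) protocol of the kind
such machines build on; it is only an illustration and is not any of
TOP-1's four protocols:

    # Generic write-invalidate snoop protocol over single-word "lines";
    # purely illustrative, and simpler than any of TOP-1's protocols.
    INVALID, SHARED, MODIFIED = "I", "S", "M"

    class SnoopCache:
        def __init__(self, bus):
            self.state = {}   # address -> coherence state
            self.data = {}    # address -> cached value
            self.bus = bus
            bus.attach(self)

        def read(self, addr):
            if self.state.get(addr, INVALID) == INVALID:
                self.data[addr] = self.bus.read(addr, requester=self)
                self.state[addr] = SHARED
            return self.data[addr]

        def write(self, addr, value):
            # Lines are one word, so a write overwrites the whole line and
            # no read-for-ownership is needed in this simplified sketch.
            if self.state.get(addr, INVALID) != MODIFIED:
                self.bus.invalidate(addr, requester=self)
                self.state[addr] = MODIFIED
            self.data[addr] = value

        def snoop_read(self, addr):        # another cache is reading addr
            if self.state.get(addr) == MODIFIED:
                self.bus.memory[addr] = self.data[addr]  # write back dirty data
                self.state[addr] = SHARED

        def snoop_invalidate(self, addr):  # another cache is writing addr
            self.state[addr] = INVALID

    class Bus:
        def __init__(self):
            self.caches, self.memory = [], {}

        def attach(self, cache):
            self.caches.append(cache)

        def read(self, addr, requester):
            for c in self.caches:
                if c is not requester:
                    c.snoop_read(addr)
            return self.memory.get(addr, 0)

        def invalidate(self, addr, requester):
            for c in self.caches:
                if c is not requester:
                    c.snoop_invalidate(addr)

    bus = Bus()
    p0, p1 = SnoopCache(bus), SnoopCache(bus)
    p0.write(0x10, 42)     # p0 holds 0x10 dirty (MODIFIED)
    print(p1.read(0x10))   # bus read forces p0 to write back; prints 42
    p1.write(0x10, 7)      # invalidates p0's copy
    print(p0.read(0x10))   # p0 refetches; prints 7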
The Kyushu University Reconfigurable Parallel Processor--Cache
Architecture and Cache Coherence Schemes. S. Mori, K. Murakami, E.
Iwata, A. Fukuda, and S. Tomita (Kyushu).
The Kyushu University Reconfigurable Parallel Processor system is
an MIMD-type multiprocessor which consists of 128 processing
elements (PEs), interconnected by a full (128x128) crossbar
network. The system employs a reconfigurable memory architecture,
a kind of local/remote memory architecture, and encompasses both a
shared-memory TCMP (tightly coupled multiprocessor) and a
message-passing LCMP (loosely coupled multiprocessor). When the
system is configured as a shared-memory TCMP, memory contention
becomes an obstacle to performance. To relieve this, the system
provides each PE with a private unified cache. Each PE may hold a
cached copy of shared data whether it accesses local or remote
memory, and therefore the multicache consistency, or inter-cache
coherence, problem arises. The cache is a virtual-address
direct-mapped cache to meet the requirements on hit time and size.
The virtual-address cache implementation causes another consistency
problem, the synonym problem, which we call the intra-cache
coherence problem.
This paper presents the cache coherence schemes that we have
chosen for resolving these cache coherence problems: i)
cacheability marking scheme, ii) fast selective invalidation
scheme, iii) distributed limited-directory scheme, and iv) dual-
directory cache scheme. Cache coherence protocols and their
trade-offs among several aspects are also discussed. [Kahaner
comments that the Kyushu work was described in the report
(parallel.904, 6 Nov 1990). There, he commented that given the
resources available at Kyushu, a project like this might be best
thought of as a mechanism for experience building and training.
Another paper on this was also presented at the annual parallel
processing meeting in May 1991. One of the project's principal
investigators, S. Tomita, has recently moved to Kyoto
University.]
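The synonym (intra-cache coherence) problem mentioned in the abstract is
easy to demonstrate with a toy virtually-indexed, virtually-tagged
direct-mapped cache: two virtual names for the same physical word can
occupy different cache lines, so a write through one synonym is invisible
through the other until a mechanism such as the dual-directory or
selective-invalidation schemes listed above intervenes. The mapping and
sizes below are arbitrary assumptions:

    # Toy direct-mapped, virtually-indexed and virtually-tagged write-back
    # cache illustrating the synonym problem; mappings and sizes are made up.
    NUM_LINES = 4

    class VirtualCache:
        def __init__(self, page_table, memory):
            self.lines = [None] * NUM_LINES   # each entry: (virtual tag, value)
            self.page_table, self.memory = page_table, memory

        def read(self, vaddr):
            idx = vaddr % NUM_LINES
            if self.lines[idx] and self.lines[idx][0] == vaddr:
                return self.lines[idx][1]     # hit on the virtual tag
            value = self.memory[self.page_table[vaddr]]
            self.lines[idx] = (vaddr, value)
            return value

        def write(self, vaddr, value):        # write-back: memory not updated yet
            self.lines[vaddr % NUM_LINES] = (vaddr, value)

    # Two virtual addresses (synonyms) map to the same physical word 0x100.
    page_table = {0x01: 0x100, 0x06: 0x100}
    memory = {0x100: "old"}
    cache = VirtualCache(page_table, memory)
    cache.write(0x01, "new")   # update via the first synonym
    print(cache.read(0x06))    # prints "old": the second synonym sees stale data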
Formal Verification of the Gigamax Cache Consistency Protocol. K.L.
McMillan (CMU) and J. Schwalbe (Encore).
-------------------------END OF REPORT------------------------------
--
=========================== MODERATOR ==============================
Steve Stevenson {steve,fpst}@hubcap.clemson.edu
Department of Computer Science, comp.parallel
Clemson University, Clemson, SC 29634-1906 (803)656-5880.mabell