rick@cs.arizona.edu (Rick Schlichting) (06/17/91)
(installment one)
[Dr. David Kahaner is a numerical analyst visiting Japan for two years
under the auspices of the Office of Naval Research-Asia (ONR/Asia).
The following is the professional opinion of David Kahaner and in no
way has the blessing of the US Government or any agency of it. All
information is dated and of limited lifetime. This disclaimer should
be noted on ANY attribution.]
[Copies of previous reports written by Kahaner can be obtained from
host cs.arizona.edu using anonymous FTP.]
To: Distribution
From: David K. Kahaner, ONR Asia [kahaner@xroads.cc.u-tokyo.ac.jp]
Re: Joint Symposium on Parallel Processing '91, Kobe Japan, 14-16 May 1991
Addendum
17 June 1991
Shortly after I distributed my report on JSPP.91, I received extended
English abstracts of six papers from Kyushu University. The titles of
these papers are given below. [I have also included the abstracts--rick.]
LIST OF TITLES:
(1) KRPP: Kyushu University Reconfigurable Parallel Processor
--- Design of Processing-Element ---
(2) IPC: Integrated Parallelizing Compiler
--- Network Synthesizer ---
(3) A Parallel Rendering Machine for High Speed Ray-Tracing
--- Instruction-Level Parallelism in Macropipeline Stages ---
(4) A Single-Chip Vector Processor Prototype Based on
Streaming/FIFO Architecture
--- Evaluation of Macro Operation, Vector-Scalar Cooperation, and
Terminating Vector Operations ---
(5) DSNS (Dynamically hazard resolved, Statically code scheduled,
Nonuniform Superscalar) Processor Prototype
--- Evaluation of the Architecture and
the Effect of Static Code Scheduling --
(6) Hyperscalar Processor Architecture
--- The Fifth Approach to Instruction-Level Parallel Processing ---
A Single-Chip Vector-Processor Prototype Based on Streaming/FIFO
Architecture - Evaluation of Macro Operation,
Vector-Scalar Cooperation and Terminating Vector Operations
Takashi Hashimoto, Keizou Okazaki, Tetsuo Hironaka, Kazuaki Murakami
(Interdisciplinary Graduate School of Engineering Sciences, Kyushu
University)
Shinji Tomita (Kyoto University)
E-mail: {hashimot,keizo,hironaka,murakami}@is.kyushu-u.ac.jp
A single-chip vector processor prototype based on the
streaming/FIFO architecture has been developed at Kyushu University.
The goal of the streaming/FIFO architecture is to achieve flexible
chaining ability and to support vector-scalar cooperative execution
by means of FIFO vector registers. In addition, a virtual-pipeline
mechanism, which is a derivative of pipeline-shared MIMD, is adopted
to achieve high pipeline utilization.
To increase the performance of a vector processor, the vectorization
ratio must be improved. To do so, macro operations and DO-loops
with conditionally executed statements, which are usually difficult
to vectorize, must be vectorized.
Vector operations such as summation of vector elements, inner
product, finding maximum/minimum values, and first-order recurrence
carry data dependences, which are essentially unsuitable for
vectorization. In conventional vector processors, these operations
are treated as macro operations with special extra hardware. On the
other hand, with the streaming/FIFO architecture, such operations
can be vectorized without special extra hardware, because of the
flexible chaining ability of the architecture.
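[Illustration, not from the paper: the C loops below show the kinds of
operations the abstract calls macro operations. Each one carries a
loop-carried dependence through an accumulator or through the previous
element, which is what makes straightforward vectorization difficult
and what conventional machines handle with dedicated hardware.]

    #include <stdio.h>
    #define N 8

    int main(void)
    {
        double a[N] = {1, 2, 3, 4, 5, 6, 7, 8};
        double b[N] = {8, 7, 6, 5, 4, 3, 2, 1};
        double r[N];

        /* Summation of vector elements: dependence through 'sum'. */
        double sum = 0.0;
        for (int i = 0; i < N; i++)
            sum += a[i];

        /* Inner product: the same dependence pattern. */
        double dot = 0.0;
        for (int i = 0; i < N; i++)
            dot += a[i] * b[i];

        /* Maximum value: dependence through 'max'. */
        double max = a[0];
        for (int i = 1; i < N; i++)
            if (a[i] > max)
                max = a[i];

        /* First-order recurrence: r[i] depends on r[i-1]. */
        r[0] = a[0];
        for (int i = 1; i < N; i++)
            r[i] = a[i] + b[i] * r[i - 1];

        printf("sum=%g dot=%g max=%g r[N-1]=%g\n", sum, dot, max, r[N - 1]);
        return 0;
    }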
DO-loops with conditionally executed statements can be classified
into: (i) completely vectorizable loops, (ii) partially vectorizable
loops, and (iii) unvectorizable loops, such as loops with jump-outs.
The prototype processor handles these loops by the following
methods.
(1) Completely vectorizable loops: handled by (i) vector-mask control,
(ii) scatter/gather, (iii) index vectors, and (iv) split-merge, a
specialized form of the scatter/gather method.
(2) Partially vectorizable loops: Vector-scalar cooperative
execution is applied. It resolves the data dependence between the
vectorized loop and the unvectorized loop and allows the two loops
to overlap.
(3) Unvectorizable loops: Vector operations are terminated when a
condition for branching out of the loop is detected.
This paper describes these mechanisms and reports simulation results.
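[Illustration, not from the paper: a rough C rendering of the three
loop classes. The array sizes and conditions are invented; the comments
indicate which of the mechanisms above would apply on a machine like
the prototype.]

    #include <stdio.h>
    #define N 8

    int main(void)
    {
        double a[N] = {1, -2, 3, -4, 5, -6, 7, -8};
        double b[N] = {2, 2, 2, 2, 2, 2, 2, 2};
        double c[N] = {0};
        double s = 0.0;
        int i, hit = -1;

        /* (1) Completely vectorizable: the IF only selects elements, so
           vector-mask control or scatter/gather can vectorize the loop. */
        for (i = 0; i < N; i++)
            if (a[i] > 0.0)
                c[i] = a[i] * b[i];

        /* (2) Partially vectorizable: the products a[i]*b[i] vectorize,
           but the running update of 's' is a scalar recurrence; the two
           parts can be overlapped by vector-scalar cooperative execution. */
        for (i = 0; i < N; i++)
            s = s * 0.5 + a[i] * b[i];

        /* (3) Unvectorizable (jump-out): the loop must stop as soon as the
           branch condition is detected, which is where terminating vector
           operations come in. */
        for (i = 0; i < N; i++)
            if (a[i] < -5.0) { hit = i; break; }

        printf("c[0]=%g s=%g hit=%d\n", c[0], s, hit);
        return 0;
    }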
A Parallel Rendering Machine for High Speed Ray-Tracing - Instruction-
Level Parallelism in the Macropipeline Stages
Seiji Murata, Oubong Gwun, Kazuaki Murakami (Interdisciplinary Graduate
School of Engineering Sciences, Kyushu University)
Shinji Tomita (Kyoto University)
E-mail: {murata,gwun,murakami}@is.kyushu-u.ac.jp
A parallel rendering machine for high-speed ray tracing has been
built at Kyushu University. The machine exploits multiple levels of
parallelism, as follows.
(1) Multiprocessing the image space by dividing it among processors.
(2) Macropipelining every ray within each processor.
(3) Instruction-level parallelism with VLIW architectures.
The ray-tracing task is usually composed of two subtasks:
intersection and shading. The overall processing speed is usually
bounded by the speed of the intersection subtask, so it is
important to speed up the intersection subtask.
With 3D grid subdivision, the intersection subtask can be divided
further into two subtasks: object search and intersection
calculation. The task of processing each ray can therefore be
divided into three subtasks: (i) object search, (ii) intersection
calculation, and (iii) shading calculation.
Ray processing is implemented as a three-stage macropipeline,
with each stage corresponding to a subtask: (i) the object-search
stage, (ii) the intersection-calculation stage, and (iii) the
shade-calculation stage. The stages are connected cyclically by FIFO
buffers and are executed concurrently.
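[Illustration, not from the paper: a minimal C sketch of three stages
coupled cyclically by FIFO buffers. The queue size, the number of rays,
and the round-robin servicing of the stages are my own simplifications;
on the real machine the three stages run concurrently.]

    #include <stdio.h>
    #define QSIZE 64
    #define NRAYS 16

    /* A tiny ring-buffer FIFO carrying ray identifiers. */
    typedef struct { int buf[QSIZE]; int head, tail; } fifo_t;
    static void push(fifo_t *q, int v) { q->buf[q->tail++ % QSIZE] = v; }
    static int  pop(fifo_t *q)         { return q->buf[q->head++ % QSIZE]; }
    static int  empty(const fifo_t *q) { return q->head == q->tail; }

    int main(void)
    {
        /* FIFOs between the stages; the last one closes the cycle so a
           shaded ray can re-enter object search as a secondary ray. */
        fifo_t to_search = {{0}}, to_isect = {{0}}, to_shade = {{0}};
        int bounces[NRAYS] = {0};
        int done = 0;

        for (int r = 0; r < NRAYS; r++)     /* prime the pipeline */
            push(&to_search, r);

        /* Round-robin servicing stands in for concurrent execution. */
        while (done < NRAYS) {
            if (!empty(&to_search)) push(&to_isect, pop(&to_search));
            if (!empty(&to_isect))  push(&to_shade, pop(&to_isect));
            if (!empty(&to_shade)) {
                int r = pop(&to_shade);
                if (++bounces[r] < 2)       /* pretend one secondary ray   */
                    push(&to_search, r);    /* the ray re-enters the cycle */
                else
                    done++;
            }
        }
        printf("all %d rays completed\n", done);
        return 0;
    }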
To achieve high macropipeline utilization, it is necessary to
balance the load of every stage. Load measurements showed that the
object-search stage and the intersection-calculation stage each
carry four times the load of the shade-calculation stage.
To balance the object-search and intersection-calculation
stages with the shade-calculation stage, we provide
instruction-level parallel processing for the former two stages, as
follows.
(1) Object-search stage: The main process is to search for a voxel
that contains primitives, using the 3DDDA algorithm. The algorithm
repeatedly updates index coefficients to step to a neighboring voxel
and accesses memory, over and over, until primitives are found.
Software pipelining can be applied to such a loop (see the sketch
after this list). This stage has four functional units: two ALUs and
two FPUs.
(2) Intersection-calculation stage: The main calculation is
three-dimensional floating-point arithmetic. This stage has
multiple operations of similar form which can be processed
simultaneously, so it is well suited to a VLIW architecture. This
stage has four functional units: one ALU and three FPUs.
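[Illustration, not from the paper: the C sketch below is one common way
to write a 3DDDA grid walk; the grid size, the ray, and the occupancy
test are invented. It shows the loop shape the object-search stage
software-pipelines: update one index coefficient, step to the
neighboring voxel, probe memory, and repeat until primitives are found.]

    #include <stdio.h>
    #include <math.h>
    #define G 8                            /* voxels per axis (assumed) */

    /* Returns 1 if the voxel holds primitives; a stand-in for the real
       memory access into the grid data structure. */
    static int occupied(int x, int y, int z)
    {
        return (x == 5 && y == 3 && z == 6);
    }

    int main(void)
    {
        double org[3] = {0.1, 0.2, 0.3};   /* ray origin (made up)     */
        double dir[3] = {0.8, 0.5, 0.9};   /* ray direction (made up)  */
        int    v[3], step[3];              /* current voxel, step sign */
        double tmax[3], tdelta[3];         /* 3DDDA coefficients       */

        for (int k = 0; k < 3; k++) {
            v[k]      = (int)floor(org[k]);          /* voxel size 1.0 */
            step[k]   = dir[k] >= 0.0 ? 1 : -1;
            tdelta[k] = fabs(1.0 / dir[k]);
            tmax[k]   = (dir[k] >= 0.0 ? (v[k] + 1 - org[k])
                                       : (v[k] - org[k])) / dir[k];
        }

        while (v[0] >= 0 && v[0] < G && v[1] >= 0 && v[1] < G &&
               v[2] >= 0 && v[2] < G) {
            if (occupied(v[0], v[1], v[2])) {
                printf("primitives found in voxel (%d,%d,%d)\n",
                       v[0], v[1], v[2]);
                return 0;
            }
            /* advance along the axis whose boundary is crossed first */
            int a = (tmax[0] < tmax[1]) ? (tmax[0] < tmax[2] ? 0 : 2)
                                        : (tmax[1] < tmax[2] ? 1 : 2);
            v[a]    += step[a];
            tmax[a] += tdelta[a];
        }
        printf("ray left the grid without hitting a primitive\n");
        return 0;
    }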
As a result, the performance with instruction-level parallel
processing and macropipelining is 5-9 times higher than the
performance with sequential processing.

rick@cs.arizona.edu (Rick Schlichting) (06/19/91)
(installment 2)
DSNS Processor Prototype - Evaluation of the Architecture and the Effect
of Static Code Scheduling
Akira Noudomi, Morihiro Kuga, Kazuaki Murakami (Interdisciplinary
Graduate School of Engineering Sciences, Kyushu University)
Tetsuya Hara (Mitsubishi Electric Co.)
Shinji Tomita (Kyoto University)
E-mail: {noudomi,kuga,murakami}@is.kyushu-u.ac.jp
DSNS is a superscalar processor architecture with the following
features.
(1) Hazards such as data dependences and conflicts over the
functional units are resolved dynamically.
(2) A superscalar processor needs code scheduling to extract much
parallelism. In the DSNS architecture, code scheduling is not
performed dynamically but statically, by the compiler.
(3) DSNS has nonuniform functional units.
A prototype processor based on the DSNS architecture and an
optimizing compiler, called the DSNS compiler, are being built at
Kyushu University. The DSNS compiler performs static code
scheduling, supporting both global and local code scheduling to make
the most of the DSNS processor's capacity for parallel execution.
Local code scheduling is based on list scheduling that takes
the functional units into account: their types, numbers, and
execution latencies. For global code scheduling, the DSNS compiler
employs percolation scheduling for global code motion, and loop
restructuring such as loop unrolling and software pipelining for
loop-structured programs.
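[Illustration, not from the paper: a small C list scheduler in the
spirit described above. The instruction mix, the two ALUs and one FPU,
the latencies, and the longest-latency-first priority are my own
assumptions, not the DSNS compiler's.]

    #include <stdio.h>
    #define NINSN 6
    #define NUNIT 3

    typedef struct {
        const char *name;
        int unit_type;       /* 0 = integer ALU, 1 = floating-point unit */
        int latency;         /* cycles until the result is available     */
        int ndep, dep[2];    /* indices of predecessor instructions      */
    } insn_t;

    int main(void)
    {
        /* A small dependence DAG (values invented for the example). */
        insn_t prog[NINSN] = {
            {"load a",  0, 2, 0, {0}},    {"load b",  0, 2, 0, {0}},
            {"fmul t",  1, 3, 2, {0, 1}}, {"fadd s",  1, 3, 1, {2}},
            {"add i",   0, 1, 0, {0}},    {"store s", 0, 2, 2, {3, 4}},
        };
        int unit_type[NUNIT] = {0, 0, 1};   /* two ALUs, one FPU       */
        int unit_free[NUNIT] = {0, 0, 0};   /* cycle each unit is free */
        int finish[NINSN], done[NINSN] = {0};

        for (int cycle = 0, left = NINSN; left > 0; cycle++) {
            for (int u = 0; u < NUNIT; u++) {
                if (unit_free[u] > cycle) continue;        /* unit busy */
                int best = -1;
                for (int i = 0; i < NINSN; i++) {
                    if (done[i] || prog[i].unit_type != unit_type[u])
                        continue;
                    int ready = 1;          /* all predecessors finished? */
                    for (int d = 0; d < prog[i].ndep; d++)
                        if (!done[prog[i].dep[d]] ||
                            finish[prog[i].dep[d]] > cycle)
                            ready = 0;
                    if (!ready)
                        continue;
                    /* priority: longest latency first (one common choice) */
                    if (best < 0 || prog[i].latency > prog[best].latency)
                        best = i;
                }
                if (best >= 0) {
                    done[best]   = 1;
                    finish[best] = cycle + prog[best].latency;
                    unit_free[u] = cycle + 1;     /* fully pipelined units */
                    left--;
                    printf("cycle %d: unit %d issues %s\n",
                           cycle, u, prog[best].name);
                }
            }
        }
        return 0;
    }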
This paper outlines the DSNS architecture and the static code
scheduling supported by the DSNS compiler, and reports simulation
results on the performance of the architecture and the scheduling.
The simulation confirms two facts about the relation between the
degree of superscalar execution (i.e., the instruction-supply width)
and the effect of static code scheduling. The first is that a high
degree of superscalar execution is useless without advanced static
code scheduling; the second is that the effect of static code
scheduling is most evident at a high degree of superscalar
execution. From these facts the paper concludes that the degree of
superscalar execution chosen for the prototype processor, four, is
reasonable. The effectiveness of local code scheduling and of the
loop-restructuring techniques such as loop unrolling and software
pipelining is also confirmed.
Hyperscalar Processor Architecture - The Fifth Approach to Instruction-Level
Parallel Processing
Kazuaki Murakami (Interdisciplinary Graduate School of Engineering
Sciences, Kyushu University)
E-mail: murakami@is.kyushu-u.ac.jp
A new processor architecture, called the hyperscalar processor
architecture, is proposed and discussed. It encompasses the
superscalar, VLIW, and vector architectures.
The hyperscalar processor architecture has the following major
features:
(1) the instruction size and the instruction-fetch bandwidth are
the same as those of superscalar processors,
(2) a VLIW instruction can be self-created by loading several
short instructions into instruction registers, each associated with
a functional unit,
(3) the self-created VLIW program can be in the form of either
vectorized loops or software-pipelined loops.
This paper presents the principles of operation and examples of
vectorized loops and software-pipelined loops.
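[Illustration, not from the paper: the abstract gives no code, so the C
fragment below is only my attempt to make features (2) and (3)
concrete. Short instructions are first loaded into per-unit instruction
registers; the resulting very long instruction is then reissued every
cycle, giving a software-pipelined element-wise multiply. The opcodes
and the three-unit pipeline are invented.]

    #include <stdio.h>
    #define N 8
    #define NUNITS 3

    /* Short instructions that can be loaded into an instruction register;
       the encoding is invented purely for this illustration. */
    enum op { NOP, LOAD_AB, MUL, STORE_C };

    int main(void)
    {
        double a[N] = {1, 2, 3, 4, 5, 6, 7, 8};
        double b[N] = {2, 2, 2, 2, 2, 2, 2, 2};
        double c[N];

        /* Self-creation step: load one short instruction into the
           instruction register (IR) of each functional unit. */
        enum op ir[NUNITS];
        ir[0] = LOAD_AB;        /* unit 0: load the next a[i], b[i] */
        ir[1] = MUL;            /* unit 1: multiply the loaded pair */
        ir[2] = STORE_C;        /* unit 2: store the last product   */

        double ra = 0, rb = 0, rp = 0;   /* registers between the units */

        /* Once loaded, the IR set behaves as one VLIW instruction that is
           reissued every cycle: a software-pipelined c[i] = a[i] * b[i].
           Units work on staggered elements, hence the prologue/epilogue
           guards; statements are ordered so each unit reads the values
           produced in the previous cycle. */
        for (int cycle = 0; cycle < N + 2; cycle++) {
            if (ir[2] == STORE_C && cycle >= 2) c[cycle - 2] = rp;
            if (ir[1] == MUL     && cycle >= 1) rp = ra * rb;
            if (ir[0] == LOAD_AB && cycle < N)  { ra = a[cycle]; rb = b[cycle]; }
        }

        for (int i = 0; i < N; i++)
            printf("c[%d] = %g\n", i, c[i]);
        return 0;
    }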
An Approach to Realizing a Reconfigurable Interconnection Network Using
Field Programmable Gate Arrays
Toshinori Sueyoshi, Itsujiro Arita (Kyushu Institute of Technology)
Kouhei Hano (Kyocera Inc.)
E-mail: sueyoshi@ai.kyutech.ac.jp
We present a new reconfigurable interconnection network that
exploits the reconfigurability of FPGAs (Field Programmable Gate
Arrays), a kind of programmable logic LSI. The proposed network is
reconfigured for the desired connections by programming
configuration data into each FPGA, so that it can directly
implement, rather than merely simulate, both static networks such as
mesh and hypercube networks and dynamic networks such as baseline
and omega networks. Consequently, the connections best suited to
the interprocess communication or memory-reference patterns of the
application programs running on the reconfigurable multiprocessor
can be configured adaptively by programming.
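[Illustration, not from the paper: for readers unfamiliar with the
topologies named above, the C sketch below prints the links of a small
hypercube (a static network) and the per-stage shuffle wiring of an
omega network (a dynamic one). The 8-port size is arbitrary, and
nothing here reflects the authors' FPGA configuration format; the point
is only that both patterns are fixed wiring tables that configuration
data could realize directly.]

    #include <stdio.h>
    #define N    8        /* number of network ports (8 = 2^3) */
    #define DIM  3        /* log2(N)                           */

    /* Perfect-shuffle permutation used between omega-network stages:
       rotate the DIM-bit address left by one position. */
    static int shuffle(int i)
    {
        return ((i << 1) | (i >> (DIM - 1))) & (N - 1);
    }

    int main(void)
    {
        /* Static example: hypercube links.  Node i is wired to the DIM
           nodes whose addresses differ from i in exactly one bit. */
        printf("hypercube links:\n");
        for (int i = 0; i < N; i++) {
            printf("  node %d ->", i);
            for (int d = 0; d < DIM; d++)
                printf(" %d", i ^ (1 << d));
            printf("\n");
        }

        /* Dynamic example: omega-network wiring.  Each of the DIM stages
           applies the perfect shuffle and feeds ports in pairs into 2x2
           switches that can pass straight or crossed. */
        printf("omega network, shuffle wiring of one stage:\n");
        for (int i = 0; i < N; i++)
            printf("  port %d -> switch %d input %d\n",
                   i, shuffle(i) / 2, shuffle(i) % 2);
        return 0;
    }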
IPC: Integrated Parallelizing Compiler - Network Synthesizer
Hiroki Akaboshi, Kazuaki Murakami, Akira Fukuda (Interdisciplinary
Graduate School of Engineering Sciences, Kyushu University)
Shinji Tomita (Kyoto University)
E-mail: {akaboshi,murakami,fukuda}@is.kyushu-u.ac.jp
The authors have been developing the `Integrated Parallelizing
Compiler (IPC) System.' IPC has the following features: (i)
multilingualism, (ii) retargetability (or multitargetability), and
(iii) multilevel parallelism.
One of the target machines of IPC is KRPP (the Kyushu University
Reconfigurable Parallel Processor). KRPP is an MIMD parallel
processor which consists of 128 PEs (Processing Elements) connected
by a reconfigurable 128*128 crossbar network. The crossbar network
provides an operation mode, called preset mode, for programming
inter-PE communication topologies into it. The program for inter-PE
communication is a sequence of inter-PE switching patterns. To map
inter-task communication topologies onto a program for inter-PE
communication, network synthesis must be done at compile time.
Thus IPC includes a network synthesizer.
The network synthesizer comprises the following four basic steps:
(1) Pattern generation: Generate a set of switching patterns that
satisfy the required conditions, using an algorithm that finds a
minimal edge coloring of a bipartite graph.
(2) Sequence generation: Generate a sequence of switching patterns
by scheduling the inter-PE communications onto switching patterns.
The scheduling algorithm is based on list scheduling.
(3) Communication Gantt Chart (CGC) splitting: Split the CGC to
optimize individual switching patterns.
(4) Back end: Output the network-control information, stored as a
program for inter-PE communications.
This paper presents a network synthesis algorithm which generates a
set of switching patterns and a sequence of switching patterns for
reconfigurable and programmable network architectures.
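[Illustration, not from the paper: the pattern-generation step above
uses a minimal edge coloring of a bipartite graph. As a simpler
stand-in, the C sketch below greedily packs a set of requested PE-to-PE
transfers into switching patterns in which no source and no destination
port is used twice. The transfer list and the 8-PE size are invented,
and the greedy packing is not guaranteed to be minimal.]

    #include <stdio.h>
    #define NPE  8        /* PEs on the crossbar (KRPP itself has 128) */
    #define NREQ 7

    int main(void)
    {
        /* Requested inter-PE transfers (src, dst): an invented example of
           an inter-task communication topology given to the synthesizer. */
        int src[NREQ] = {0, 0, 1, 2, 3, 3, 5};
        int dst[NREQ] = {1, 2, 2, 4, 4, 5, 0};
        int pattern[NREQ], assigned[NREQ] = {0};
        int npat = 0, left = NREQ;

        /* Greedy packing: open a new switching pattern and put into it
           every still-unassigned transfer whose source and destination
           ports are free in that pattern.  (Edge coloring of the bipartite
           src/dst graph would give the minimum number of patterns.) */
        while (left > 0) {
            int src_used[NPE] = {0}, dst_used[NPE] = {0};
            for (int r = 0; r < NREQ; r++) {
                if (assigned[r] || src_used[src[r]] || dst_used[dst[r]])
                    continue;
                pattern[r] = npat;
                assigned[r] = 1;
                src_used[src[r]] = dst_used[dst[r]] = 1;
                left--;
            }
            npat++;
        }

        printf("%d switching patterns:\n", npat);
        for (int p = 0; p < npat; p++) {
            printf("  pattern %d:", p);
            for (int r = 0; r < NREQ; r++)
                if (pattern[r] == p)
                    printf("  %d->%d", src[r], dst[r]);
            printf("\n");
        }
        return 0;
    }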
KRPP: Kyushu University Reconfigurable Parallel Processor
Naoya Tokunaga, Shinichiro Mori, Kazuaki Murakami, Akira Fukuda
(Interdisciplinary Graduate School of Engineering Sciences, Kyushu
University)
Tomoo Ueno (Kyushu Nippon Electric Co.)
Eiji Iwata (Sony Co.)
Koji Kai (Matsushita Electric Ind. Co.)
Shinji Tomita (Kyoto University)
E-mail: {tokunaga,mori,murakami,fukuda}@is.kyushu-u.ac.jp
The Kyushu University Reconfigurable Parallel Processor (KRPP)
has been developed at Kyushu University. KRPP is an MIMD-type
multiprocessor which consists of 128 Processing-Elements (PEs) fully
connected by a 128*128 crossbar network. Each PE consists of four
components: Processor Unit (PU), Memory Unit (MU), Message
Communication Unit (MCU), and Back-Plane Bus Interface (BPBIF).
To construct a high-performance multiprocessor system and to offer
an experimental parallel processing environment, KRPP employs the
following reconfigurable architectures.
(1) Reconfigurable network architecture: The reconfigurable network
should be able to accommodate arbitrary topologies for inter-PE
(i.e., processor-memory and/or processor-processor) paths. KRPP
adopts a crossbar network for this purpose. In addition, KRPP equips
the crossbar network with control memories for programming its
topologies, as the IBM GF11 does.
(2) Reconfigurable memory architecture: There are two paradigms of
multiprocessor configuration, the shared-memory TCMP (Tightly
Coupled MultiProcessor) and the message-passing LCMP (Loosely
Coupled MultiProcessor). KRPP provides both of these paradigms
effectively, although it is built with a distributed-memory
organization.
In addition to this reconfigurability of network and memory, KRPP
provides high-level hardware support for both caching and inter-PE
communication, as follows:
(3) Cache architecture: To reduce effective access time and network
traffic, KRPP provides each PE with a private virtual-address cache,
and it allows a cache to hold copies of data from remote memory as
well as from local memory. To ensure cache coherence, KRPP provides
the following four cache-coherence schemes: (i) a cacheability-
marking scheme, (ii) a fast selective-invalidation scheme, (iii) a
distributed limited-directory scheme, and (iv) a dual-directory
cache scheme.
(4) Communication architecture: To realize both TCMP and LCMP, it
is necessary to offer two different communication methods, shared-
memory access and message passing. To implement both on the single
crossbar network, a hierarchical communication protocol is
established; through this protocol, the MCU offers both
communication methods.
This paper is a status report on the implementation of the real
machine. It first presents an overview of KRPP, then describes the
design philosophy of KRPP, and finally introduces more detailed
specifications of the hardware components of a PE.
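[Illustration, not from the paper: a minimal C sketch of the preset
idea in item (1), with sizes reduced and data structures invented. The
crossbar keeps a small control memory of precomputed switching
patterns, and programming the topology means selecting which stored
pattern drives the switch settings.]

    #include <stdio.h>
    #define NPE  8        /* 8 PEs here; the real crossbar is 128x128 */
    #define NPAT 2        /* entries in the crossbar control memory   */

    /* One switching pattern: for each output port, the input port that
       drives it. */
    typedef struct { int input_of[NPE]; } pattern_t;

    int main(void)
    {
        /* Control memory preloaded with two topologies: a ring (PE i sends
           to PE i+1) and a pairwise exchange (PE i sends to PE i^1). */
        pattern_t cmem[NPAT];
        for (int i = 0; i < NPE; i++) {
            cmem[0].input_of[(i + 1) % NPE] = i;   /* ring     */
            cmem[1].input_of[i ^ 1]         = i;   /* exchange */
        }

        /* Preset mode: stepping through the control memory reprograms the
           whole crossbar topology in a single operation per step. */
        for (int p = 0; p < NPAT; p++) {
            printf("pattern %d:", p);
            for (int out = 0; out < NPE; out++)
                printf("  PE%d->PE%d", cmem[p].input_of[out], out);
            printf("\n");
        }
        return 0;
    }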
--
=========================== MODERATOR ==============================
Steve Stevenson {steve,fpst}@hubcap.clemson.edu
Department of Computer Science, comp.parallel
Clemson University, Clemson, SC 29634-1906 (803)656-5880.mabell