rick@cs.arizona.edu (Rick Schlichting) (06/17/91)
[Dr. David Kahaner is a numerical analyst visiting Japan for two years under the auspices of the Office of Naval Research-Asia (ONR/Asia). The following is the professional opinion of David Kahaner and in no way has the blessing of the US Government or any agency of it. All information is dated and of limited lifetime. This disclaimer should be noted on ANY attribution.]

[Copies of previous reports written by Kahaner can be obtained from host cs.arizona.edu using anonymous FTP.]

To: Distribution
From: David K. Kahaner, ONR Asia [kahaner@xroads.cc.u-tokyo.ac.jp]
Re: Joint Symposium on Parallel Processing '91, Kobe, Japan, 14-16 May 1991
    Addendum 17 June 1991

Shortly after I distributed my report on JSPP.91, I received extended English abstracts of six papers from Kyushu University. The titles of these papers are given below. [I have also included the abstracts--rick.]

LIST OF TITLES:

(1) KRPP: Kyushu University Reconfigurable Parallel Processor --- Design of Processing-Element ---
(2) IPC: Integrated Parallelizing Compiler --- Network Synthesizer ---
(3) A Parallel Rendering Machine for High Speed Ray-Tracing --- Instruction-Level Parallelism in Macropipeline Stages ---
(4) A Single-Chip Vector Processor Prototype Based on Streaming/FIFO Architecture --- Evaluation of Macro Operation, Vector-Scalar Cooperation, and Terminating Vector Operations ---
(5) DSNS (Dynamically hazard resolved, Statically code scheduled, Nonuniform Superscalar) Processor Prototype --- Evaluation of the Architecture and the Effect of Static Code Scheduling ---
(6) Hyperscalar Processor Architecture --- The Fifth Approach to Instruction-Level Parallel Processing ---

A Single-Chip Vector Processor Prototype Based on Streaming/FIFO Architecture - Evaluation of Macro Operation, Vector-Scalar Cooperation and Terminating Vector Operations

Takashi Hashimoto, Keizou Okazaki, Tetsuo Hironaka, Kazuaki Murakami
(Interdisciplinary Graduate School of Engineering Sciences, Kyushu University)
Shinji Tomita
(Kyoto University)
E-mail: {hashimot,keizo,hironaka,murakami}@is.kyushu-u.ac.jp

A single-chip vector processor prototype based on the streaming/FIFO architecture has been developed at Kyushu University. The goal of the streaming/FIFO architecture is to achieve flexible chaining and to support vector-scalar cooperative execution by means of FIFO vector registers. In addition, a virtual-pipeline mechanism, a derivative of pipeline-shared MIMD, is adopted to achieve high pipeline utilization.

To increase the performance of a vector processor, the vectorization ratio must be improved. To this end, macro operations and DO-loops with conditionally executed statements, which are usually difficult to vectorize, must be vectorized. Vector operations such as summation of vector elements, inner product, finding maximum/minimum values, and first-order recurrence carry data dependencies, which makes them essentially unsuitable for vectorization. Conventional vector processors treat these operations as macro operations with special extra hardware. With the streaming/FIFO architecture, in contrast, such operations can be vectorized without extra hardware, thanks to the flexible chaining ability of the architecture.

DO-loops with conditionally executed statements can be classified into: (i) completely vectorizable loops, (ii) partially vectorizable loops, and (iii) unvectorizable loops, such as loops with jump-outs. The prototype processor handles these loops with the following methods.
(1) Completely vectorizable loops: (i) vector-mask control, (ii) scatter-gather, (iii) index-vector, and (iv) split-merge, a specialized form of the scatter-gather method.
(2) Partially vectorizable loops: Vector-scalar cooperative execution is applied. It resolves the data dependency between a vectorized loop and an unvectorized loop and allows the two loops to overlap.
(3) Unvectorizable loops: Vector operations are terminated when a condition for branching out of the loop is detected.
This paper describes these mechanisms and reports simulation results.

A Parallel Rendering Machine for High Speed Ray-Tracing - Instruction-Level Parallelism in the Macropipeline Stages

Seiji Murata, Oubong Gwun, Kazuaki Murakami
(Interdisciplinary Graduate School of Engineering Sciences, Kyushu University)
Shinji Tomita (Kyoto University)
E-mail: {murata,gwun,murakami}@is.kyushu-u.ac.jp

A parallel rendering machine for high-speed ray tracing has been built at Kyushu University. The machine exploits parallelism at multiple levels:
(1) Multiprocessing, by dividing the image space among processors.
(2) Macropipelining of every ray within each processor.
(3) Instruction-level parallelism, with VLIW architectures.

A ray-tracing task is usually composed of two subtasks: intersection and shading. The overall processing speed is usually bounded by the speed of the intersection subtask, so it is important to speed up that subtask. With 3D grid subdivision, the intersection subtask can be divided further into two subtasks: object search and intersection calculation. The task for processing each ray thus divides into three subtasks: (i) object search, (ii) intersection calculation, and (iii) shading calculation. Ray processing is implemented as a three-stage macropipeline, each stage corresponding to a subtask: (i) the object-search stage, (ii) the intersection-calculation stage, and (iii) the shade-calculation stage. The stages are connected cyclically by FIFO buffers and execute concurrently. To achieve high macropipeline utilization, the load of every stage must be balanced. Load measurements showed that the object-search and intersection-calculation stages each carry four times the load of the shade-calculation stage.
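The three-stage macropipeline, with stages connected by FIFO buffers and advancing concurrently, might be sketched as follows. This is a minimal Python model; the stage functions are illustrative stand-ins, not the paper's actual algorithms.

```python
from collections import deque

# Minimal sketch of a three-stage macropipeline. Each stage reads work
# items from its input FIFO and writes results to the next stage's FIFO;
# the stage bodies below are placeholders for the real computations.

def object_search(ray):
    # stand-in for the 3DDDA voxel walk: return a candidate object
    return ("obj", ray)

def intersection_calc(item):
    # stand-in for the ray-object intersection arithmetic
    return ("hit", item[1])

def shade_calc(item):
    # stand-in for the shading computation
    return ("pixel", item[1])

def run_macropipeline(rays):
    stages = [object_search, intersection_calc, shade_calc]
    # One FIFO in front of each stage, plus one for final output.
    fifos = [deque(rays)] + [deque() for _ in stages]
    # Step-by-step simulation: every stage that has input advances one
    # item per step, so the three stages overlap in time.
    while any(fifos[:-1]):
        for i in reversed(range(len(stages))):
            if fifos[i]:
                fifos[i + 1].append(stages[i](fifos[i].popleft()))
    return list(fifos[-1])

pixels = run_macropipeline(range(4))
print(pixels)  # four ("pixel", ray) results, in ray order
```

In a model like this, a stage with a heavier body becomes the bottleneck of every step, which is why the abstract's load-balancing of the stages matters.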
To balance the object-search and intersection-calculation stages with the shade-calculation stage, instruction-level parallel processing is provided for the former two stages, as follows.
(1) Object-search stage: Its main process is searching for a voxel that contains primitives, using the 3DDDA algorithm. The algorithm repeatedly updates index coefficients to advance to a neighboring voxel and performs a memory access, iterating until primitives are found. Software pipelining can be applied to such a loop. This stage has four functional units: two ALUs and two FPUs.
(2) Intersection-calculation stage: The main calculation is three-dimensional floating-point arithmetic. This stage has multiple operations of similar form that can be processed simultaneously, so it is well suited to a VLIW architecture. This stage has four functional units: one ALU and three FPUs.
As a result, performance with instruction-level parallel processing and macropipelining is 5-9 times higher than with sequential processing.

DSNS Processor Prototype - Evaluation of the Architecture and the Effect of Static Code Scheduling

Akira Noudomi, Morihiro Kuga, Kazuaki Murakami
(Interdisciplinary Graduate School of Engineering Sciences, Kyushu University)
Tetsuya Hara (Mitsubishi Electric Co.)
Shinji Tomita (Kyoto University)
E-mail: {noudomi,kuga,murakami}@is.kyushu-u.ac.jp

DSNS is a superscalar processor architecture with the following features.
(1) Hazards such as data dependencies and functional-unit conflicts are resolved dynamically.
(2) A superscalar processor needs code scheduling to extract much parallelism. In the DSNS architecture, code scheduling is performed not dynamically but statically, by the compiler.
(3) DSNS has nonuniform functional units.
A prototype processor based on the DSNS architecture and an optimizing compiler, called the DSNS compiler, are being built at Kyushu University. The DSNS compiler is capable of static code scheduling.
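Static code scheduling onto nonuniform functional units, of the kind a compiler like this performs, can be illustrated with a greedy list scheduler. This is a hedged sketch: the instructions, dependencies, latencies, and unit mix below are invented for illustration and are not taken from the prototype.

```python
# Minimal list-scheduling sketch for nonuniform functional units.
# All instruction names, latencies, and the unit mix are illustrative
# assumptions, not the DSNS prototype's actual instruction set.

INSTRS = {
    # name: (required unit, latency in cycles, predecessors)
    "load1": ("mem", 2, []),
    "load2": ("mem", 2, []),
    "fmul":  ("fpu", 3, ["load1", "load2"]),
    "add":   ("alu", 1, ["load1"]),
    "store": ("mem", 2, ["fmul", "add"]),
}
UNITS = {"mem": 1, "alu": 1, "fpu": 1}  # nonuniform: one unit of each kind

def list_schedule(instrs, units):
    finish = {}      # name -> cycle when its result is available
    schedule = {}    # name -> issue cycle
    cycle = 0
    while len(finish) < len(instrs):
        free = dict(units)  # unit availability for this cycle
        for name, (unit, lat, preds) in instrs.items():
            ready = name not in schedule and all(
                p in finish and finish[p] <= cycle for p in preds)
            if ready and free.get(unit, 0) > 0:
                schedule[name] = cycle
                finish[name] = cycle + lat
                free[unit] -= 1
        cycle += 1
    return schedule

sched = list_schedule(INSTRS, UNITS)
print(sched)
```

The scheduler issues each instruction at the earliest cycle where its predecessors have finished and a unit of the required type is free; with only one memory unit, the two loads serialize, which is the kind of unit-conflict the abstract says DSNS resolves dynamically while the compiler arranges the code around it.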
The DSNS compiler supports both global and local code scheduling to make the most of the DSNS processor's capacity for parallel execution. Local code scheduling is based on list scheduling that takes the functional units into account: their types, numbers, and execution latencies. For global code scheduling, the DSNS compiler employs percolation scheduling for global code motion, and loop restructurings such as loop unrolling and software pipelining for loop-structured programs.

This paper outlines the DSNS architecture and the static code scheduling supported by the DSNS compiler, and reports simulation results on the performance of the architecture and the scheduling. The simulation confirms two facts about the relation between the degree of superscalar (i.e., the degree of instruction supply) and the effect of static code scheduling: first, a high degree of superscalar is useless without advanced static code scheduling; second, the effect of static code scheduling is clear at a high degree of superscalar. From these facts the paper concludes that the degree of superscalar chosen for the prototype processor, four, is reasonable. The effectiveness of local code scheduling and of the loop-restructuring techniques, such as loop unrolling and software pipelining, is also confirmed.

Hyperscalar Processor Architecture - The Fifth Approach to Instruction-Level Parallel Processing

Kazuaki Murakami
(Interdisciplinary Graduate School of Engineering Sciences, Kyushu University)
E-mail: murakami@is.kyushu-u.ac.jp

A new processor architecture, called the hyperscalar processor architecture, is proposed and discussed. The hyperscalar processor architecture encompasses superscalar, VLIW, and vector architectures.
The hyperscalar processor architecture has the following major features: (1) the instruction size and the instruction-fetch bandwidth are the same as those of superscalar processors; (2) a VLIW instruction can be self-created by loading several short instructions into instruction registers, each associated with a functional unit; (3) the self-created VLIW program can take the form of either vectorized loops or software-pipelined loops. This paper presents the principles of operation and gives examples of vectorized loops and software-pipelined loops.

An Approach to Realizing a Reconfigurable Interconnection Network Using Field Programmable Gate Arrays

Toshinori Sueyoshi, Itsujiro Arita (Kyushu Institute of Technology)
Kouhei Hano (Kyocera Inc.)
E-mail: sueyoshi@ai.kyutech.ac.jp

We present a new reconfigurable interconnection network that exploits the reconfigurability of the FPGA (Field Programmable Gate Array), a kind of programmable-logic LSI. Reconfiguration to the desired connections is performed by programming configuration data into each FPGA, so the network can directly implement, without simulation, both static networks such as mesh and hypercube networks and dynamic networks such as baseline and omega networks. Consequently, the optimum connections for the interprocess-communication or memory-reference patterns of application programs running on the reconfigurable multiprocessor can be configured adaptively by programming.

IPC: Integrated Parallelizing Compiler - Network Synthesizer

Hiroki Akaboshi, Kazuaki Murakami, Akira Fukuda
(Interdisciplinary Graduate School of Engineering Sciences, Kyushu University)
Shinji Tomita (Kyoto University)
E-mail: {akaboshi,murakami,fukuda}@is.kyushu-u.ac.jp

The authors have been developing the Integrated Parallelizing Compiler (IPC) system.
IPC has the following features: (i) multilingualism, (ii) retargetability (or multitargetability), and (iii) multilevel parallelism. One of the target machines of IPC is KRPP (Kyushu University Reconfigurable Parallel Processor). KRPP is an MIMD parallel processor consisting of 128 PEs (Processing Elements) connected by a reconfigurable 128*128 crossbar network. The crossbar network provides an operation mode, called preset mode, for programming inter-PE communication topologies into it. A program for inter-PE communication is a sequence of inter-PE switching patterns. To map inter-task communication topologies into such a program, network synthesis must be done at compile time; IPC therefore includes a network synthesizer.

The network synthesizer comprises four basic steps:
(1) Pattern Generation: Generate a set of switching patterns that satisfy certain conditions, using an algorithm that finds a minimal edge coloring of a bipartite graph.
(2) Sequence Generation: Generate a sequence of switching patterns by scheduling the inter-PE communications onto the switching patterns. This scheduling algorithm is based on list scheduling.
(3) Split Communication Gantt Chart (CGC): Split the CGC to optimize individual switching patterns.
(4) Back End: Output the network control information, stored as a program for inter-PE communications.
This paper presents a network synthesis algorithm that generates a set of switching patterns and a sequence of switching patterns for reconfigurable, programmable network architectures.

KRPP: Kyushu University Reconfigurable Parallel Processor

Naoya Tokunaga, Shinichiro Mori, Kazuaki Murakami, Akira Fukuda
(Interdisciplinary Graduate School of Engineering Sciences, Kyushu University)
Tomoo Ueno (Kyushu Nippon Electric Co.)
Eiji Iwata (Sony Co.)
Koji Kai (Matsushita Electric Ind. Co.)
Shinji Tomita (Kyoto University)
E-mail: {tokunaga,mori,murakami,fukuda}@is.kyushu-u.ac.jp

The Kyushu University Reconfigurable Parallel Processor (KRPP) has been developed at Kyushu University. KRPP is an MIMD-type multiprocessor consisting of 128 Processing Elements (PEs) fully connected by a 128*128 crossbar network. Each PE consists of four components: Processor Unit (PU), Memory Unit (MU), Message Communication Unit (MCU), and Back-Plane Bus Interface (BPBIF). To construct a high-performance multiprocessor system and to offer an experimental parallel-processing environment, KRPP employs the following reconfigurable architectures.
(1) Reconfigurable network architecture: The reconfigurable network should be able to accommodate arbitrary topologies for inter-PE (i.e., processor-memory and/or processor-processor) paths. KRPP adopts the crossbar network for this purpose and, as the IBM GF11 does, equips the crossbar network with control memories for programming its topologies.
(2) Reconfigurable memory architecture: There are two paradigms of multiprocessor configuration, the shared-memory TCMP (Tightly Coupled MultiProcessor) and the message-passing LCMP (Loosely Coupled MultiProcessor). KRPP provides both paradigms effectively, although it is built as a distributed-memory organization.
In addition to this reconfigurability of network and memory, KRPP provides high-level hardware support for both caching and inter-PE communication, as follows.
(3) Cache architecture: To reduce effective access time and network traffic, KRPP provides each PE with a private virtual-address cache, which is allowed to hold copies of data from remote memory as well as from local memory. To ensure cache coherence, KRPP provides four cache-coherence schemes: (i) a cacheability-marking scheme, (ii) a fast selective-invalidation scheme, (iii) a distributed limited-directory scheme, and (iv) a dual-directory cache scheme.
(4) Communication architecture: To realize both TCMP and LCMP, two different communication methods must be offered: shared-memory access and message passing. To implement both on the single crossbar network, a hierarchical communication protocol is established; with this protocol, the MCU offers both communication methods.
This paper is a status report on the real-machine implementation. It first presents an overview of KRPP, then describes the design philosophy of KRPP, and finally gives more detailed specifications of the hardware components of a PE.
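The preset-mode idea, in which control memories hold a programmed sequence of switching patterns for the crossbar, might be sketched as follows. This is a toy 4-PE model under stated assumptions: the class name and the pattern representation are inventions for illustration, and the real KRPP crossbar is 128*128.

```python
# Toy model of a crossbar with "preset mode": the control memory holds a
# programmed sequence of switching patterns (each a one-to-one
# source -> destination map), and the network steps through them.
# Sizes, names, and patterns here are illustrative assumptions.

N = 4  # PEs in this toy example (KRPP itself has 128)

def ring_pattern(shift):
    # switching pattern connecting PE i to PE (i + shift) mod N
    return {src: (src + shift) % N for src in range(N)}

class PresetCrossbar:
    def __init__(self, patterns):
        self.control_memory = patterns  # programmed ahead of execution
        self.step = 0

    def route(self, messages):
        # deliver one message per source PE under the current pattern,
        # then advance to the next pattern in the control memory
        pattern = self.control_memory[self.step % len(self.control_memory)]
        self.step += 1
        return {pattern[src]: msg for src, msg in messages.items()}

xbar = PresetCrossbar([ring_pattern(1), ring_pattern(2)])
out1 = xbar.route({i: f"m{i}" for i in range(N)})  # neighbor shift by 1
out2 = xbar.route({i: f"m{i}" for i in range(N)})  # neighbor shift by 2
print(out1, out2)
```

Because each pattern is a permutation, every routing step is conflict-free by construction; choosing the pattern sequence to match a program's communication needs is exactly the job the IPC network synthesizer abstract assigns to compile time.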