[comp.research.japan] Kahaner Report: JSPP '91

rick@cs.arizona.edu (Rick Schlichting) (06/17/91)

  [Dr. David Kahaner is a numerical analyst visiting Japan for two years
   under the auspices of the Office of Naval Research-Asia (ONR/Asia).  
   The following is the professional opinion of David Kahaner and in no 
   way has the blessing of the US Government or any agency of it.  All 
   information is dated and of limited lifetime.  This disclaimer should 
   be noted on ANY attribution.]

  [Copies of previous reports written by Kahaner can be obtained from
   host cs.arizona.edu using anonymous FTP.]

To: Distribution
From: David K. Kahaner, ONR Asia [kahaner@xroads.cc.u-tokyo.ac.jp]
Re: Joint Symposium on Parallel Processing '91, Kobe Japan, 14-16 May 1991
       Addendum
17 June 1991 

Shortly after I distributed my report on JSPP '91, I received extended 
English abstracts of six papers from Kyushu University.  The titles of 
these papers are given below.  [I have also included the abstracts--rick.]


LIST OF TITLES:
(1) KRPP: Kyushu University Reconfigurable Parallel Processor
    --- Design of Processing-Element ---
(2) IPC: Integrated Parallelizing Compiler
    --- Network Synthesizer ---
(3) A Parallel Rendering Machine for High Speed Ray-Tracing
    --- Instruction-Level Parallelism in Macropipeline Stages ---
(4) A Single-Chip Vector Processor Prototype Based on
    Streaming/FIFO Architecture
    --- Evaluation of Macro Operation, Vector-Scalar Cooperation, and
        Terminating Vector Operations ---
(5) DSNS (Dynamically hazard resolved, Statically code scheduled,
    Nonuniform Superscalar) Processor Prototype
    --- Evaluation of the Architecture and
        the Effect of Static Code Scheduling ---
(6) Hyperscalar Processor Architecture
    --- The Fifth Approach to Instruction-Level Parallel Processing ---


A Single-Chip Vector Processor Prototype Based on Streaming/FIFO
          Architecture - Evaluation of Macro Operation, 
          Vector-Scalar Cooperation and Terminating Vector Operations
     Takashi Hashimoto, Keizou Okazaki, Tetsuo Hironaka, Kazuaki Murakami
     (Interdisciplinary Graduate School of Engineering Sciences, Kyushu
     University)
     Shinji Tomita (Kyoto University)
        E-mail: {hashimot,keizo,hironaka,murakami}@is.kyushu-u.ac.jp
       A single-chip vector processor prototype based on the 
     streaming/FIFO architecture has been developed at Kyushu University.  
       The goal of the streaming/FIFO architecture is to achieve flexible 
     chaining ability and to support vector-scalar cooperative execution 
     by means of FIFO vector registers.  In addition, a virtual-pipeline 
     mechanism, which is a derivative of pipeline-shared MIMD, is adopted 
     to achieve high pipeline utilization.  
       To increase the performance of a vector processor, the 
     vectorization ratio must be improved.  This requires vectorizing 
     macro operations and DO-loops with conditionally executed 
     statements, both of which are usually difficult to vectorize.  
       Vector operations such as summation of vector elements, inner 
     product, finding maximum/minimum values, and first-order recurrences 
     carry data dependences, which are essentially unsuitable for 
     vectorization.  In conventional vector processors, these operations 
     are treated as macro operations requiring special extra hardware.  
     With the streaming/FIFO architecture, by contrast, such operations 
     can be vectorized without special extra hardware, thanks to the 
     flexible chaining ability of the architecture.  
       DO-loops with conditionally executed statements can be classified 
     into: (i) completely vectorizable loops, (ii) partially vectorizable 
     loops, and (iii) unvectorizable loops, such as loops with jumps out.  
     The prototype processor handles these loops by the following 
     methods (illustrated in C after this abstract).  
      (1) Completely vectorizable loops:  handled by (i) vector-mask 
     control, (ii) scatter-gather, (iii) index vectors, and (iv) 
     split-merge, a specialized form of the scatter-gather method.  
      (2) Partially vectorizable loops:  vector-scalar cooperative 
     execution is applied.  It resolves the data dependence between the 
     vectorized part and the unvectorized part of a loop and allows the 
     two parts to overlap.  
      (3) Unvectorizable loops:  vector operations are terminated when a 
     condition for branching out of the loop is detected.  
      This paper describes these mechanisms and reports simulation results.  
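
  [Illustration, not from the paper: the three loop classes above,
   written as C loops.  The function names and data are invented for
   the example.]

     /* (1) Completely vectorizable: the condition only masks which
      *     elements are updated, so vector-mask control (or the
      *     split-merge method) applies directly.                      */
     void class1(float *a, const float *b, int n)
     {
         for (int i = 0; i < n; i++)
             if (b[i] > 0.0f)
                 a[i] = 2.0f * b[i];
     }

     /* (2) Partially vectorizable: the first statement is vectorizable,
      *     but the second may carry dependences through idx[], so it
      *     stays scalar.  Vector-scalar cooperative execution lets the
      *     vector part and the scalar part of the loop overlap.       */
     void class2(float *a, const float *b, float *c,
                 const int *idx, int n)
     {
         for (int i = 0; i < n; i++) {
             a[i] = b[i] * b[i];        /* vectorizable part           */
             c[idx[i]] += a[i];         /* idx may repeat: scalar part */
         }
     }

     /* (3) Unvectorizable (jump out of the loop): vector operations are
      *     started anyway and terminated as soon as the exit condition
      *     is detected.                                               */
     int class3(const float *a, int n)
     {
         for (int i = 0; i < n; i++)
             if (a[i] < 0.0f)
                 return i;              /* early exit                  */
         return -1;
     }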

A Parallel Rendering Machine for High Speed Ray-Tracing - Instruction-
            Level Parallelism in the Macropipeline Stages 
     Seiji Murata, Oubong Gwun, Kazuaki Murakami (Interdisciplinary Graduate
     School of Engineering Sciences, Kyushu University)
     Shinji Tomita (Kyoto University)
          E-mail: {murata,gwun,murakami}@is.kyushu-u.ac.jp
        A parallel rendering machine for high-speed ray tracing has been 
     built at Kyushu University.  The machine exploits multiple levels 
     of parallelism, as follows.  
      (1) Multiprocessing: the image space is divided among processors.
      (2) Macropipelining: every ray is pipelined within each processor.
      (3) Instruction-level parallelism: VLIW architectures are used.
        The ray-tracing task is usually composed of two subtasks: 
     intersection and shading.  The overall processing speed is usually 
     bounded by the speed of the intersection subtask, so it is 
     important to speed up the intersection subtask.  
        With 3D grid subdivision, the intersection subtask can be 
     divided further into two subtasks: object search and intersection 
     calculation.  The task of processing each ray thus divides into 
     three subtasks: (i) object search, (ii) intersection calculation, 
     and (iii) shading calculation.  
        Ray processing is implemented as a three-stage macropipeline, 
     each stage corresponding to a subtask: (i) the object-search stage, 
     (ii) the intersection-calculation stage, and (iii) the 
     shade-calculation stage.  The stages are connected cyclically by 
     FIFO buffers and execute concurrently.  
        To achieve high macropipeline utilization, the load of every 
     stage must be balanced.  Load measurements showed that the 
     object-search stage and the intersection-calculation stage each 
     carry about four times the load of the shade-calculation stage.  
     To balance these two stages with the shade-calculation stage, we 
     have provided instruction-level parallel processing for them, as 
     follows.  
      (1) Object-search stage:  its main job is to search, with the 
     3DDDA algorithm, for a voxel that contains primitives.  The 
     algorithm repeatedly updates an index coefficient to step to a 
     neighboring voxel and performs a memory access, over and over until 
     primitives are found; software pipelining can be applied to such a 
     loop (a sketch follows this abstract).  This stage has four 
     functional units: two ALUs and two FPUs.  
      (2) Intersection-calculation stage:  the main calculation is 
     three-dimensional floating-point arithmetic.  This stage has 
     multiple operations of similar form that can be processed 
     simultaneously, so it is well suited to a VLIW architecture.  It 
     has four functional units: one ALU and three FPUs.  
       As a result, the performance with instruction-level parallel 
     processing and macropipelining is 5-9 times higher than the 
     performance with sequential processing.  
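
  [Illustration, not from the paper: a common formulation of the 3DDDA
   voxel walk, in C.  The grid layout and all names are invented; the
   paper may differ in detail.]

     #include <math.h>

     typedef struct { float ox, oy, oz, dx, dy, dz; } Ray;

     /* Step from voxel to voxel along the ray (origin assumed inside
      * the grid) until a voxel containing primitives is found.  Each
      * iteration is one index update plus one memory access -- the
      * tight loop that software pipelining targets.  Returns the
      * linear voxel index, or -1 for a miss.                          */
     int find_voxel(const Ray *r, const unsigned char *occupied,
                    int nx, int ny, int nz)
     {
         int ix = (int)r->ox, iy = (int)r->oy, iz = (int)r->oz;
         int sx = r->dx >= 0 ? 1 : -1;           /* step directions   */
         int sy = r->dy >= 0 ? 1 : -1;
         int sz = r->dz >= 0 ? 1 : -1;
         float tdx = r->dx != 0 ? fabsf(1.0f / r->dx) : 1e30f;
         float tdy = r->dy != 0 ? fabsf(1.0f / r->dy) : 1e30f;
         float tdz = r->dz != 0 ? fabsf(1.0f / r->dz) : 1e30f;
         /* Parametric distance to the first boundary on each axis.   */
         float tmx = r->dx != 0 ? ((ix + (sx > 0)) - r->ox) / r->dx : 1e30f;
         float tmy = r->dy != 0 ? ((iy + (sy > 0)) - r->oy) / r->dy : 1e30f;
         float tmz = r->dz != 0 ? ((iz + (sz > 0)) - r->oz) / r->dz : 1e30f;

         while (ix >= 0 && ix < nx && iy >= 0 && iy < ny &&
                iz >= 0 && iz < nz) {
             int v = (iz * ny + iy) * nx + ix;
             if (occupied[v])                    /* memory access     */
                 return v;                       /* primitives found  */
             if (tmx <= tmy && tmx <= tmz) { ix += sx; tmx += tdx; }
             else if (tmy <= tmz)          { iy += sy; tmy += tdy; }
             else                          { iz += sz; tmz += tdz; }
         }
         return -1;                              /* ray left the grid */
     }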

DSNS Processor Prototype - Evaluation of the Architecture and the Effect 
                 of Static Code Scheduling 
     Akira Noudomi, Morihiro Kuga, Kazuaki Murakami (Interdisciplinary
     Graduate School of Engineering Sciences, Kyushu University)
     Tetsuya Hara (Mitsubishi Electric Co.)
     Shinji Tomita (Kyoto University)
          E-mail:  {noudomi,kuga,murakami}@is.kyushu-u.ac.jp
       DSNS is a superscalar processor architecture with the following
     features.
      (1) Hazards, such as data dependences and conflicts over the 
     functional units, are resolved dynamically.  
      (2) A superscalar processor needs code scheduling to extract much 
     parallelism.  In the DSNS architecture, code scheduling is 
     performed not dynamically but statically, by the compiler.  
      (3) DSNS has nonuniform functional units.  
        A prototype processor based on the DSNS architecture and an 
     optimizing compiler, called the DSNS compiler, are being built at 
     Kyushu University.  The DSNS compiler performs static code 
     scheduling, supporting both global and local code scheduling to 
     make the most of the DSNS processor's capacity for parallel 
     execution.  
        Local code scheduling is based on list scheduling that takes 
     account of the functional units: their types, numbers, and 
     execution latencies.  For global code scheduling, the DSNS 
     compiler employs percolation scheduling for global code motion, 
     and loop restructuring, such as loop unrolling and software 
     pipelining (an unrolling sketch follows this abstract), for 
     loop-structured programs.  
       This paper outlines the DSNS architecture and the static code 
     scheduling supported by the DSNS compiler, and reports simulation 
     results on the performance of the architecture and the scheduling.  
     The simulations confirm two facts about the relation between the 
     degree of superscalar (i.e., the instruction-issue width) and the 
     effect of static code scheduling.  One is that a high degree of 
     superscalar is useless without advanced static code scheduling.  
     The other is that the effect of static code scheduling is most 
     evident at a high degree of superscalar.  From these facts the 
     paper concludes that the degree of superscalar chosen for the 
     prototype processor, four, is reasonable.  The effectiveness of 
     local code scheduling and of the loop-restructuring techniques, 
     loop unrolling and software pipelining, is also confirmed.  
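
  [Illustration, not from the paper: the flavor of the loop
   restructuring described above, shown at the C source level on an
   invented saxpy-style loop.  The DSNS compiler itself schedules
   machine code, not C.]

     /* Original loop: one multiply-add per iteration, so a 4-way
      * superscalar finds little to issue in parallel.               */
     void saxpy(float *y, const float *x, float a, int n)
     {
         for (int i = 0; i < n; i++)
             y[i] = a * x[i] + y[i];
     }

     /* After 4-way unrolling: four independent multiply-adds per
      * iteration, which local list scheduling can spread across the
      * nonuniform functional units to fill the issue width of four. */
     void saxpy_unrolled(float *y, const float *x, float a, int n)
     {
         int i;
         for (i = 0; i + 3 < n; i += 4) {
             y[i]     = a * x[i]     + y[i];
             y[i + 1] = a * x[i + 1] + y[i + 1];
             y[i + 2] = a * x[i + 2] + y[i + 2];
             y[i + 3] = a * x[i + 3] + y[i + 3];
         }
         for (; i < n; i++)                 /* leftover iterations  */
             y[i] = a * x[i] + y[i];
     }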

Hyperscalar Processor Architecture - The Fifth Approach to Instruction-Level
                 Parallel Processing
     Kazuaki Murakami (Interdisciplinary Graduate School of Engineering
     Sciences, Kyushu University)
          E-mail: murakami@is.kyushu-u.ac.jp
        A new processor architecture, called hyperscalar processor 
     architecture, is proposed and discussed.  Hyperscalar processor 
     architecture encompasses superscalar, VLIW, and vector 
     architectures.  
        Hyperscalar processor architecture has the following major 
     features: 
      (1) the instruction size and the instruction-fetch bandwidth are 
     the same as those of superscalar processors; 
      (2) a VLIW instruction can be self-created by loading several 
     short instructions into instruction registers, each associated with 
     a functional unit; 
      (3) the self-created VLIW program can be in the form of either 
     vectorized loops or software-pipelined loops.  
        This paper presents the principles of operation and examples of 
     vectorized loops and software-pipelined loops.  
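
  [Illustration: the abstract gives no implementation detail, so the
   following C toy is only one reading of feature (2), with all names
   and sizes invented.  Short instructions first load per-functional-
   unit instruction registers; thereafter every loop iteration behaves
   as if the loaded registers were issued together as one VLIW word.]

     #include <stdio.h>

     #define NFU 3                  /* invented: 3 functional units */
     enum op { NOP, LD, MULADD, ST };

     int main(void)
     {
         /* Phase 1: narrow, superscalar-width fetch fills one
          * instruction register per functional unit.               */
         enum op ir[NFU] = { LD, MULADD, ST };

         /* Phase 2: the loop issues all instruction registers each
          * iteration (sequentially here; concurrently in hardware),
          * computing y[i] = a*x[i] + y[i] as a "vectorized loop".   */
         float a = 2.0f, x[8], y[8], xr = 0.0f, acc = 0.0f;
         for (int i = 0; i < 8; i++) { x[i] = (float)i; y[i] = 1.0f; }
         for (int i = 0; i < 8; i++) {
             for (int fu = 0; fu < NFU; fu++) {
                 switch (ir[fu]) {
                 case LD:     xr   = x[i];          break;
                 case MULADD: acc  = a * xr + y[i]; break;
                 case ST:     y[i] = acc;           break;
                 default:                           break;
                 }
             }
         }
         for (int i = 0; i < 8; i++)
             printf("%g ", y[i]);   /* prints 1 3 5 7 9 11 13 15    */
         printf("\n");
         return 0;
     }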

An Approach to Realizing a Reconfigurable Interconnection Network Using 
                 Field Programmable Gate Arrays 
     Toshinori Sueyoshi, Itsujiro Arita (Kyushu Institute of Technology) 
     Kouhei Hano (Kyocera Inc.) 
      E-mail: sueyoshi@ai.kyutech.ac.jp
        We present a new reconfigurable interconnection network that 
     utilizes the reconfigurability of the FPGA (Field Programmable 
     Gate Array), a kind of programmable-logic LSI.  Reconfiguration to 
     the desired connections is performed by programming configuration 
     data into each FPGA, so the proposed network can directly 
     implement, rather than merely simulate, both static networks, such 
     as mesh and hypercube networks, and dynamic networks, such as 
     baseline and omega networks.  Consequently, the optimum 
     connections for the interprocess-communication or memory-reference 
     patterns of application programs executing on the reconfigurable 
     multiprocessor can be configured adaptively by programming.  
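
  [Illustration, not from the paper: one of the dynamic networks
   mentioned above, the omega network, reduced to code.  Each stage is
   a perfect shuffle of the links followed by 2x2 exchange switches,
   set here by the destination-tag rule.  Port counts and names are
   invented.]

     #include <stdio.h>

     #define NBITS 3
     #define N (1 << NBITS)              /* 8 ports, 3 stages        */

     /* Perfect-shuffle wiring: rotate the port number left one bit. */
     static int shuffle(int p)
     {
         return ((p << 1) | (p >> (NBITS - 1))) & (N - 1);
     }

     /* Route one message, printing the port it occupies after each
      * stage.  At stage k the 2x2 switch is set straight or crossed
      * so the low bit matches destination bit NBITS-1-k.            */
     static void route(int src, int dst)
     {
         int p = src;
         printf("src %d ->", src);
         for (int k = 0; k < NBITS; k++) {
             p = shuffle(p);                        /* fixed wiring   */
             int bit = (dst >> (NBITS - 1 - k)) & 1;
             p = (p & ~1) | bit;                    /* switch setting */
             printf(" %d", p);
         }
         printf("  (dst %d)\n", dst);
     }

     int main(void)
     {
         route(5, 2);           /* prints: src 5 -> 2 5 2  (dst 2)   */
         route(0, 7);
         return 0;
     }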

IPC: Integrated Parallelizing Compiler - Network Synthesizer
     Hiroki Akaboshi, Kazuaki Murakami, Akira Fukuda (Interdisciplinary
     Graduate School of Engineering Sciences, Kyushu University)
     Shinji Tomita (Kyoto University)
          E-mail: {akaboshi,murakami,fukuda}@is.kyushu-u.ac.jp
        The authors have been developing the `Integrated Parallelizing
     Compiler (IPC) System.'  IPC has the following features: (i)
     multilingualism, (ii) retargetability (or multitargetability), and
     (iii) multilevel parallelism.
        One of the target machines of IPC is KRPP (Kyushu University 
     Reconfigurable Parallel Processor).  KRPP is an MIMD parallel 
     processor consisting of 128 PEs (Processing Elements) connected by 
     a reconfigurable 128*128 crossbar network.  The crossbar network 
     provides an operation mode, called preset mode, for programming 
     inter-PE communication topologies into it.  The program for 
     inter-PE communications is a sequence of inter-PE switching 
     patterns.  To map inter-task communication topologies into a 
     program for inter-PE communications, network synthesis must be 
     done at compile time; thus IPC has a network synthesizer.  
          The network synthesizer comprises four basic steps, as 
     follows (a greedy sketch of step (1) follows this abstract).
      (1) Pattern generation:  generate a set of switching patterns 
     satisfying certain conditions, using an algorithm that finds a 
     minimal edge coloring of a bipartite graph.  
      (2) Sequence generation:  generate a sequence of switching 
     patterns by scheduling the inter-PE communications onto switching 
     patterns.  This scheduling algorithm is based on list scheduling.  
      (3) Communication Gantt chart (CGC) splitting:  split the CGC to 
     optimize individual switching patterns.  
      (4) Back end:  output the network-control information, stored as 
     a program for inter-PE communications.  
       This paper presents a network synthesis algorithm that generates 
     a set of switching patterns and a sequence of switching patterns 
     for reconfigurable, programmable network architectures.  
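
  [Illustration, not from the paper: pattern generation amounts to
   decomposing the bipartite demand graph into matchings, since one
   crossbar pattern may use each source and each destination at most
   once.  The paper uses a minimal edge-coloring algorithm; the C
   sketch below substitutes a simple greedy decomposition, and the
   request list and all names are invented.]

     #include <stdio.h>

     #define NPE 4

     typedef struct { int src, dst; } Req;

     int main(void)
     {
         Req req[] = { {0,1}, {0,2}, {1,2}, {2,3}, {3,1}, {1,0} };
         int nreq = 6, done[6] = {0}, left = nreq, pat = 0;

         while (left > 0) {
             int sbusy[NPE] = {0}, dbusy[NPE] = {0};
             printf("pattern %d:", pat++);
             for (int i = 0; i < nreq; i++) {
                 /* Skip transfers that conflict with this pattern. */
                 if (done[i] || sbusy[req[i].src] || dbusy[req[i].dst])
                     continue;
                 sbusy[req[i].src] = dbusy[req[i].dst] = 1;
                 done[i] = 1;
                 left--;
                 printf("  %d->%d", req[i].src, req[i].dst);
             }
             printf("\n");      /* one crossbar switching pattern   */
         }
         return 0;
     }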


KRPP: Kyushu University Reconfigurable Parallel Processor
     Naoya Tokunaga, Shinichiro Mori, Kazuaki Murakami, Akira Fukuda
     (Interdisciplinary Graduate School of Engineering Sciences, Kyushu
     University)
     Tomoo Ueno (Kyushu Nippon Electric Co.)
     Eiji Iwata (Sony Co.)
     Koji Kai (Matsushita Electric Ind. Co.)
     Shinji Tomita (Kyoto University)
          E-mail: {tokunaga,mori,murakami,fukuda}@is.kyushu-u.ac.jp
        The Kyushu University Reconfigurable Parallel Processor (KRPP) 
     has been developed at Kyushu University.  KRPP is an MIMD-type 
     multiprocessor which consists of 128 Processing-Elements (PEs) fully 
     connected by a 128*128 crossbar network.  Each PE consists of four 
     components: Processor Unit (PU), Memory Unit (MU), Message 
     Communication Unit (MCU), and Back-Plane Bus Interface (BPBIF).  
       To construct a high-performance multiprocessor system and to 
     offer an experimental parallel-processing environment, KRPP 
     employs the following reconfigurable architectures.  
      (1) Reconfigurable network architecture:  the reconfigurable 
     network must be able to accommodate arbitrary topologies for 
     inter-PE (i.e., processor-memory and/or processor-processor) 
     paths.  KRPP adopts the crossbar network for this purpose and, as 
     the IBM GF11 does, equips the crossbar with control memories for 
     programming its topologies (a toy sketch of such a control memory 
     follows this abstract).  
      (2) Reconfigurable memory architecture:  there are two paradigms 
     of multiprocessor configuration, the shared-memory TCMP (Tightly 
     Coupled MultiProcessor) and the message-passing LCMP (Loosely 
     Coupled MultiProcessor).  KRPP provides both paradigms 
     effectively, although it is built with a distributed-memory 
     organization.  
       In addition to this reconfigurability of network and memory, 
     KRPP provides high-level hardware support for both caching and 
     inter-PE communication, as follows: 
      (3) Cache architecture:  to reduce effective access time and 
     network traffic, KRPP provides each PE with a private 
     virtual-address cache and allows the cache to hold copies of data 
     from remote memory as well as from local memory.  To ensure cache 
     coherence, KRPP provides the following four cache-coherence 
     schemes: (i) a cacheability-marking scheme, (ii) a fast 
     selective-invalidation scheme, (iii) a distributed 
     limited-directory scheme, and (iv) a dual-directory cache scheme.  
      (4) Communication architecture:  to realize both TCMP and LCMP, 
     two different communication methods must be offered: shared-memory 
     access and message passing.  To implement both on the single 
     crossbar network, a hierarchical communication protocol is 
     established; through this protocol, the MCU offers both 
     communication methods.  
      This paper is a status report on the implementation of the real 
     machine.  It first presents an overview of KRPP, then describes 
     the design philosophy of KRPP, and finally introduces more 
     detailed specifications of the hardware components of the PE.
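
  [Illustration, not from the paper: a toy C model of a crossbar
   control memory of the kind mentioned under (1).  Each entry is one
   switching pattern naming, for every output port, the input port
   driving it; stepping through the entries replays a preset
   communication program.  Sizes, formats, and names are invented.]

     #include <stdio.h>

     #define NPE   8   /* 128 in the real machine; 8 keeps this short */
     #define DEPTH 2   /* number of preset switching patterns         */

     /* control_mem[t][out] = input port connected to 'out' at step t. */
     static const int control_mem[DEPTH][NPE] = {
         { 1, 2, 3, 4, 5, 6, 7, 0 },   /* ring: PE i sends to PE i-1   */
         { 4, 5, 6, 7, 0, 1, 2, 3 },   /* hypercube dimension: i <-> i^4 */
     };

     /* Apply one switching pattern: deliver each input's message to
      * the output port it drives in that pattern.                    */
     static void apply_pattern(int t, const int *in, int *out)
     {
         for (int o = 0; o < NPE; o++)
             out[o] = in[control_mem[t][o]];
     }

     int main(void)
     {
         int in[NPE] = { 10, 11, 12, 13, 14, 15, 16, 17 }, out[NPE];
         for (int t = 0; t < DEPTH; t++) {
             apply_pattern(t, in, out);
             printf("pattern %d:", t);
             for (int p = 0; p < NPE; p++)
                 printf(" PE%d<-%d", p, out[p]);
             printf("\n");
         }
         return 0;
     }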