[comp.research.japan] Kahaner Report: Comments on ETL Parallel Processing by M. Rosing

rick@cs.arizona.edu (Rick Schlichting) (01/19/91)

  [This duplicates a previous article in comp.research.japan, but
   is provided for completeness. -- Rick]

  [Dr. David Kahaner is a numerical analyst visiting Japan for two years
   under the auspices of the Office of Naval Research-Asia (ONR/Asia).  
   The following is the professional opinion of David Kahaner and in no 
   way has the blessing of the US Government or any agency of it.  All 
   information is dated and of limited life time.  This disclaimer should 
   be noted on ANY attribution.]

  [Copies of previous reports written by Kahaner can be obtained from
   host cs.arizona.edu using anonymous FTP.]

To: Distribution
From: David K. Kahaner ONR Asia [kahaner@xroads.cc.u-tokyo.ac.jp]
Re: Comments on ETL's parallel processing projects from M. Rosing. 
18 Jan 1991

ABSTRACT. Matt Rosing (U Colorado) reports on work during
summer 1990 in the Computer Architecture Division of the
Electrotechnical Lab in Tsukuba, Japan.


Rosing's comments provide welcome detail to other reports on this subject 
that I have distributed, including the following.
       "etl"            2 July     1990
       "dataflow"      16 August   1990
       "parallel.904"   6 November 1990 
       "ricks"         17 January  1991


Matt Rosing
Department of Computer Science
Campus Box 430
Boulder, Colorado 80309
(ROSING@BOULDER.COLORADO.EDU)

Last summer I had the opportunity to spend two months at the 
Electrotechnical Lab in Tsukuba, Japan. This was an NSF-sponsored trip for 
American graduate students to spend time in Japan learning about Japanese 
research and culture. [See below for more details. (DKK)]. I worked in 
the Computer Architecture Division and spent most of my time learning 
about the SIGMA-1, EM-4, and CODA projects.  I have seen a few reports, 
from Kahaner, Schlichting, and others, describing these projects and 
would like to add a little more detail.  

The SIGMA-1 machine is a 128-node, instruction-level data flow machine. The 
project was started in 1982, and the machine has measured rates of 170 
MFLOPS. I can't really add any more information beyond what Kahaner has 
already described. If you have questions, the best person to ask is 
Satoshi Sekiguchi (sekiguti@etl.go.jp).  

The project that I spent the most time learning about was the EM-4. The 
EM-4 is an 80-node, distributed-memory, coarse-grain data flow machine 
which is a prototype for a 1000-node version. By coarse grain, I mean that 
blocks of von Neumann, register-based code are connected by data flow arcs. 
This mixed model provides the advantages of data flow (flexible, dynamic 
scheduling) at the upper level of a program and the advantages of von 
Neumann machines (static register scheduling) at the lower levels. Since 
many parallel programs have this kind of structure, the EM-4 appears to be 
a promising architecture.  
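
To make the hybrid model concrete, here is a minimal sketch of macro-level 
data flow scheduling: a block of ordinary sequential code fires only after 
all of its input tokens have arrived, and the code inside a block then runs 
straight through as on a conventional processor. This is Python used purely 
for illustration; the block names and token counts below are my own 
invention, not anything from the EM-4 software.

    # Minimal macro-dataflow scheduler sketch (illustrative only).
    from collections import deque

    class Block:
        def __init__(self, name, n_inputs, body):
            self.name = name
            self.n_inputs = n_inputs   # tokens needed before the block fires
            self.inputs = {}           # tokens that have arrived, by slot
            self.body = body           # the sequential (von Neumann) part

    ready = deque()

    def send_token(block, slot, value):
        """Deliver one token; enable the block when its last token arrives."""
        block.inputs[slot] = value
        if len(block.inputs) == block.n_inputs:
            ready.append(block)

    def run():
        while ready:
            blk = ready.popleft()
            blk.body(blk.inputs)       # sequential code runs to completion

    # Example: two producer blocks feed one consumer block.
    consumer = Block("add", 2, lambda ins: print("sum =", ins[0] + ins[1]))
    producer0 = Block("p0", 0, lambda ins: send_token(consumer, 0, 3))
    producer1 = Block("p1", 0, lambda ins: send_token(consumer, 1, 4))
    ready.extend([producer0, producer1])
    run()                              # prints: sum = 7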

The low-level, register-based part of this architecture is RISC based.  The 
RISC pipeline is integrated with the data flow loop, so the two parts work 
quite well together (see the experiment below). All of the hardware for one 
node (including communications) is built into a single gate array chip. The 
clock cycle is 80 ns.  Although there is no floating point hardware in this 
prototype, there will be in the next version.  

The other part of the architecture is the communications network. Each 
processor is one node in an omega network. The network is designed to 
support data flow computation and is therefore built for very fine-grained 
communication.  A word can be sent or received in a single instruction. 
Messages propagate through the network at one word per node per clock cycle 
(80 ns). The really nice feature of this is that the user does not need to 
worry about packing words into contiguous memory buffers before sending a 
message.  Messages are typed by both destination memory address and 
category.  Example categories are "data message" or "create process." The 
user may define these categories.  

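As a rough illustration of what such a single-word, typed message looks 
like, here is a small sketch in Python. Only the category names come from 
the description above; the field layout and the handler are my own 
assumptions, not the actual EM-4 packet format.

    # Sketch of a fine-grain, single-word message (field layout assumed).
    from collections import namedtuple

    # Each message carries its destination node, a destination memory
    # address, a category tag, and exactly one word of payload, so the
    # sender never has to pack a contiguous buffer.
    Packet = namedtuple("Packet",
                        ["dest_node", "dest_addr", "category", "word"])

    def handle(packet, memory, spawn):
        """Dispatch on the message category at the receiving node."""
        if packet.category == "data":
            memory[packet.dest_addr] = packet.word   # plain data delivery
        elif packet.category == "create_process":
            spawn(packet.dest_addr, packet.word)     # start a new activation

    # Example delivery into a toy 16-word memory.
    mem = [0] * 16
    handle(Packet(dest_node=3, dest_addr=5, category="data", word=42),
           mem, spawn=lambda addr, arg: None)
    print(mem[5])   # 42
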
I wrote a program to illustrate some of these features. The test was a 
smoothing algorithm that worked on a vector. It is an iterative algorithm 
in which the next value of each element of the vector is a function of its 
previous value and the values of its two neighbors. This generates 
communication patterns similar to those of many numerical problems (PDEs, 
etc.). In a sense, the test I wrote is not something ideally suited to data 
flow machines; it is more of a data parallel algorithm, better suited to a 
SIMD machine like the CM-2.  
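
To make the algorithm concrete, a single sequential sweep of this kind of 
smoothing might look as follows (Python for illustration; the equal 
three-point weighting is my assumption, not the formula actually used on 
the EM-4).

    # One smoothing sweep over a vector: each interior element becomes a
    # function of itself and its two neighbors (weights are assumed).
    def smooth(v):
        new = v[:]                       # endpoints are left unchanged
        for i in range(1, len(v) - 1):
            new[i] = (v[i - 1] + v[i] + v[i + 1]) / 3.0
        return new

    x = [0.0, 0.0, 9.0, 0.0, 0.0]
    for _ in range(3):                   # a few iterations of the sweep
        x = smooth(x)
    print(x)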

The program consisted of placing a "process" (or whatever the equivalent 
idea is for data flow machines) on each node which iteratively (10000 
iterations) received the values from its two neighboring processors, added 
them together, and sent the results back to the two neighboring processors. 
So messages are one word long and there are lots of messages. The result of 
the test was that each iteration took 36 clock ticks. (The EM-4 has a 
wonderful timing facility which counts clock ticks.) At 80 ns per tick that 
is 2880 ns per iteration, and since two words were sent per iteration, this 
corresponds to 1440 ns/word, or 360 ns/byte for 4-byte words. (This figure 
also includes the integer add.) It compares quite favorably to 390 ns/byte 
on an iPSC/2 with messages long enough to overcome message start-up costs 
(greater than 20K bytes).  
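
The structure of that per-node process, simulated in ordinary Python on one 
machine, looks roughly like the sketch below. This is illustrative only: 
the real program was written against the EM-4's own send/receive 
instructions, one word per message, and the ring wraparound and iteration 
count here are my own choices for the demo.

    # Simulated per-node smoothing processes exchanging one-word messages.
    N, STEPS = 8, 10                  # 8 simulated nodes, a short demo run
    value = [float(i) for i in range(N)]
    mailbox = [[] for _ in range(N)]  # incoming one-word messages per node

    for _ in range(STEPS):
        # Each node sends its current value to both neighbors (one word each).
        for i in range(N):
            mailbox[(i - 1) % N].append(value[i])
            mailbox[(i + 1) % N].append(value[i])
        # Each node receives the two neighbor values and adds them together.
        for i in range(N):
            a, b = mailbox[i]
            value[i] = a + b          # values grow in this toy version; the
            mailbox[i] = []           # point is the communication pattern

    print(value)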

The best person to ask questions about the EM-4 is Shuichi Sakai 
(sakai@etl.go.jp).  

The final project is called CODA and is still in the paper stage. CODA is a 
distributed memory machine that is designed for real-time applications. 
This machine looks much less like a data flow machine than the EM-4.  The 
interesting aspects of CODA are in the communications hardware. First of 
all, message priorities have been added to support real-time applications. 
Within the communications hardware, messages with higher priority are 
routed before lower-priority messages.  
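
The routing policy itself is simple to sketch: pending messages on an 
output link are ordered by priority, and the highest-priority one goes out 
first. The Python below illustrates only that policy (in CODA the 
arbitration is done in hardware); the priority values and payloads are 
made up.

    # Priority-first selection of outgoing messages (illustrative only).
    import heapq

    outgoing = []                          # pending messages on one link
    def enqueue(priority, payload):        # lower number = higher priority
        heapq.heappush(outgoing, (priority, payload))

    def route_one():
        """Forward the highest-priority pending message first."""
        return heapq.heappop(outgoing)[1] if outgoing else None

    enqueue(5, "bulk data word")
    enqueue(1, "real-time control word")
    print(route_one())   # the real-time word goes out before the bulk data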

To me, the more interesting aspect of CODA is in the synchronization and 
communication hardware. Each memory location and register contains a 
full/empty bit which is used to implement produce/consume semantics on all 
memory and register accesses.  In conjunction with this, a send or receive 
instruction can target any memory element or register on any processor. 
Message packets that read or write memory or registers are inserted into 
the RISC instruction pipeline along with instructions from the local 
processor. The result should be very fine-grained communication with very 
little overhead.  
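
A software model of one such cell may help. The sketch below implements 
produce/consume semantics on a single value: a read blocks until the cell 
is full and leaves it empty, and a write blocks until the cell is empty and 
leaves it full. In CODA this would be a single hardware bit on every 
register and memory word; the Python threading version here is only an 
illustration of the semantics.

    # Software model of one memory cell with a full/empty bit.
    import threading

    class Cell:
        def __init__(self):
            self._value = None
            self._full = False
            self._cv = threading.Condition()

        def write(self, value):
            """Produce: wait until the cell is empty, then fill it."""
            with self._cv:
                while self._full:
                    self._cv.wait()
                self._value, self._full = value, True
                self._cv.notify_all()

        def read(self):
            """Consume: wait until the cell is full, then empty it."""
            with self._cv:
                while not self._full:
                    self._cv.wait()
                self._full = False
                self._cv.notify_all()
                return self._value

    c = Cell()
    t = threading.Thread(target=lambda: print("got", c.read()))
    t.start()
    c.write(7)        # the reader blocks until this value is produced
    t.join()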

In order to keep the processors busy, there are several instruction streams 
supported by the hardware. A process that blocks on a memory or register 
access can be swapped out and another instruction stream swapped in within 
a few clock cycles.  

These three constructs (message priorities, register-level synchronization, 
and multiple instruction stream support) should greatly improve 
communication on distributed memory multiprocessors. If you have questions 
you should ask Kenji Toda (toda@etl.go.jp).  

My stay at ETL was supported by the NSF Japan Summer Institute program. I 
strongly recommend that any PhD student in science or engineering who is 
interested in Japan apply for next year's program. Yes, this is an 
advertisement, but the program was well worth it.  

NOTE.

   If you would like to apply for support to conduct research and/or 
study in Japan (either graduate student, post-doc, or faculty 
appointments), please contact NSF's Japan Program, at Room 1214, NSF, 
Washington, D.C.  20550, or by e-mail to NSFJinfo@nsf.gov (Internet) 
or NSFJinfo@nsf (BitNet).  The telephone number is (202) 357-9558.  

   For more general information request the new Japan Program 
announcement (NSF 90-144) from NSF's Publications Unit, Room 232, NSF, 
Washington, D.C. 20550, or by e-mail to pubs@nsf.gov (Internet) or 
pubs@nsf (BitNet).  

   NSF also provides, free to libraries and reference collections, a copy 
of a catalogue of Japanese government laboratories.  

   The catalogue gives two-page descriptions of the organization and 
major research activities of each of 110 laboratories in Japan which are 
run by the national government, public corporations, and non-profit 
organizations.  The catalogue will be of most use to U.S. scientists and 
engineers attempting to find Japanese research partners, whether for 
research collaboration or for visits to Japan.  NSF would like to place 
the catalogue in collections readily accessible to scientists and 
engineers at U.S. Ph.D.-granting institutions. Contact the Japan Program, 
at the above address.  

Citation:
   Research Development Corporation of Japan (JRDC).  "National 
   Laboratories and Research Public Corporations in Japan."  200 pp.  
   [Tokyo,] Japan, 1990.  Revision of the first edition, 1987, and 
   supplement (Part II), 1988.  

Another useful NSF contact is Dr. Douglas McNeal (DMCNEAL@NSF.GOV).
------------------END OF REPORT------------------------------------------