[comp.parallel] NSF Summer Institute Visit to ETL

rosing@molson.Colorado.EDU (Matt Rosing) (01/21/91)

Last summer I had the opportunity to spend two months at the Electrotechnical
Lab in Tsukuba, Japan. This was an NSF-sponsored trip for American graduate
students to spend time in Japan learning about its research and culture. I
worked in the Computer Architecture Division and spent most of my time learning
about the SIGMA-1, EM-4, and CODA projects. I have seen a few reports, from
Kahaner and others, describing these projects and would like to add a little
more detail.

The SIGMA-1 machine is a 128-node, instruction-level data flow machine. The
project was started in 1981 (?). It has measured rates of 170 MFLOPS. I can't
really add any more information beyond what Kahaner has described. If you have
questions, the best person to ask is Satoshi Sekiguchi (sekiguti@etl.go.jp).

The project that I spent the most time learning about was the EM-4. The EM-4 is
an 80-node, distributed memory, coarse-grain data flow machine that is a
prototype for a 1000-node version. By coarse grain, I mean that blocks of
von Neumann, register-based code are connected by data flow arcs. This mixed
model provides the advantages of data flow (flexible, dynamic scheduling) at
the upper level of a program and the advantages of von Neumann machines (static
register scheduling) at the lower levels. Since many parallel programs have
this kind of structure, it appears to be a promising approach.
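
To make the coarse-grain idea concrete, here is a rough C sketch of how I
picture a block firing. None of these names come from the EM-4 (the real
machine does this in hardware); it is just my model of the idea:

    /* A coarse-grain data flow "block": straight-line, register-scheduled
       code that fires once all of its input tokens have arrived.  All of
       the names here are hypothetical.                                    */
    struct block {
        int inputs_missing;       /* how many tokens are still outstanding */
        int operand[2];           /* token values collected so far         */
        void (*body)(int *op);    /* the sequential code for this block    */
    };

    /* Deliver one token; when the last one arrives, the block fires and
       runs its sequential body, which in turn sends tokens to successors. */
    void deliver_token(struct block *b, int slot, int value)
    {
        b->operand[slot] = value;
        if (--b->inputs_missing == 0)
            b->body(b->operand);
    }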

The low-level, register-based part of this architecture is RISC based. The
RISC pipeline is integrated with the data flow loop, so the two parts work
quite well together (see the experiment below). All of the hardware for one
node (including communications) is built into a single gate array chip. The
clock cycle is 80 ns. Although there is no floating point in this prototype,
there will be in the next version.

The other part of the architecture is the communications network. Each
processor is one node in an Omega network. The communications network is
designed to support data flow computations and is therefore built for very
fine-grained communication. A word can be sent or received in a single
instruction. Messages propagate through the network at one word per node per
clock cycle (80 ns). The really nice feature of this is that the user does not
need to pack words into contiguous memory buffers before sending a message.
Messages are typed by both destination memory address and category. Example
categories are "data message" or "create process." The user may define these
categories.
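
In programming terms, a message is essentially one word plus its routing
information. The layout below is my own invention, not the EM-4's actual
packet format, but it shows what the typing means:

    /* Hypothetical one-word message: the payload is typed by destination
       memory address and by a user-definable category.                   */
    enum category { DATA_MESSAGE, CREATE_PROCESS };  /* example categories */

    struct em4_message {
        int           dest_node;   /* which processor in the Omega network */
        unsigned long dest_addr;   /* destination memory address           */
        enum category cat;         /* user-definable message type          */
        int           word;        /* the single word of payload           */
    };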

I wrote a program to illustrate some of these features. The test was a
smoothing algorithm that worked on a vector. It is an iterative algorithm in
which the next value of each element of the vector is a function of the
previous value and the two neighboring values. This generates a communication
pattern similar to that of many numerical problems (PDEs, etc.). In a sense,
the test I wrote is not ideally suited to data flow machines; it is really a
data parallel algorithm better suited to a SIMD machine like the CM-2.
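
For reference, the sequential version of the test is just the following (the
exact combining function I used is not important, and the EM-4 prototype has
no floating point, so integers are shown):

    /* Sequential sketch of the smoothing test: the next value of each
       element is a function of the old value and its two neighbors.    */
    void smooth(int *v, int *tmp, int n, int iters)
    {
        int i, it;
        for (it = 0; it < iters; it++) {
            for (i = 1; i < n - 1; i++)
                tmp[i] = v[i-1] + v[i] + v[i+1];   /* combine neighbors */
            for (i = 1; i < n - 1; i++)
                v[i] = tmp[i];
        }
    }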

The program consisted of placing a "process" (or whatever the equivalent idea
is for data flow machines) on each node which, for 10000 iterations, received
the values from the two neighboring processors, added them together, and sent
the result to the two neighboring processors. So messages are one word long
and there are lots of them. The result of the test was that each iteration
took 36 clock ticks. (The EM-4 has a wonderful timing facility which counts
clock ticks.) At 80 ns/tick that is 2880 ns per iteration; since two words are
sent per iteration, this corresponds to 1440 ns/word, or 360 ns/byte for
4-byte words. (This also includes the integer add.) It compares quite
favorably to 390 ns/byte on an iPSC/2 with messages long enough to overcome
message start-up costs (greater than 20k bytes).
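
To be concrete, each node's process was essentially the loop below. The
send_word/recv_word primitives are made up (EM-4 code is really a data flow
graph fragment, not calls like these), but the structure is the same:

    /* Hypothetical one-word message primitives standing in for the EM-4's
       single-instruction sends and receives.                              */
    extern void send_word(int node, int word);
    extern int  recv_word(int node);

    /* Roughly what each node did.  On the EM-4 the body is a data flow
       block that fires when both neighbor values have arrived.           */
    void node_process(int left, int right, int my_value, int iters)
    {
        int it;
        for (it = 0; it < iters; it++) {   /* 10000 iterations in the test */
            send_word(left,  my_value);    /* one-word messages            */
            send_word(right, my_value);
            my_value = recv_word(left) + recv_word(right); /* integer add  */
        }
    }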

The best person to ask questions about the EM-4 is Shuichi Sakai
(sakai@etl.go.jp).

The final project is called CODA and is still in the paper stage. CODA is a
distributed memory machine that is designed for real-time applications. This
machine looks much less like a data flow machine than the EM-4 does. The
interesting aspects of CODA are in the communications hardware. First of all,
message priorities have been added to support real-time applications. Within
the communications hardware, higher-priority messages are routed before
lower-priority messages.
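
My mental model of the priority routing is nothing more than this: among the
messages waiting at a switch, the highest-priority one goes first. The code
below is only that model, not CODA's actual hardware:

    /* Pick the next message to forward from those waiting at a switch. */
    struct msg { int priority; int dest_node; int word; };

    int pick_next(const struct msg *waiting, int n)
    {
        int i, best = 0;
        for (i = 1; i < n; i++)
            if (waiting[i].priority > waiting[best].priority)
                best = i;
        return best;    /* index of the highest-priority message */
    }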

To me, the more interesting aspect of CODA is in the synchronization and
communication hardware. Each memory location and register contains full/empty
bits which are used to implement produce/consume semantics on all memory and
register accesses. In conjunction with this, a send or receive instruction can
target any memory element or register on any processor. Message packets to
read or write memory or registers are inserted into the RISC instruction
pipeline along with instructions from the local processor. The result should
be very fine-grained communication with very little overhead.
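
The produce/consume semantics are the usual full/empty-bit ones. The C below
is only a software model of what the hardware does on every memory location
and register (the hardware tests and sets the bit atomically; the busy-wait
here is just to show the semantics):

    /* One synchronized cell: a consumer waits until it is full, a
       producer waits until it is empty.                              */
    struct cell { volatile int full; volatile int value; };

    void produce(struct cell *c, int v)
    {
        while (c->full)         /* wait until the cell has been consumed */
            ;
        c->value = v;
        c->full  = 1;           /* mark full: the consumer may read      */
    }

    int consume(struct cell *c)
    {
        int v;
        while (!c->full)        /* wait until a producer fills the cell  */
            ;
        v = c->value;
        c->full = 0;            /* mark empty: a producer may write      */
        return v;
    }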

In order to keep the processors busy, the hardware supports several
instruction streams. A process that blocks on a memory or register access can
be swapped out and another instruction stream swapped in within a few clock
cycles.

These three constructs (message priorities, register-level synchronization,
and multiple instruction stream support) should greatly improve communications
on distributed memory multiprocessors. If you have questions you should ask
Kenji Toda (toda@etl.go.jp).

My stay at ETL was supported by the NSF Japan Summer Institute program. I
strongly recommend that any PhD students in science or engineering who are
interested in Japan apply for next year's program. Yes, this is an
advertisement, but the program was well worth it.