rosing@molson.Colorado.EDU (Matt Rosing) (01/19/91)
Last summer I had the opportunity to spend two months at the Electrotechnical Lab in Tsukuba, Japan. This was an NSF-sponsored trip for American grad students to spend time in Japan learning about Japanese research and culture. I worked in the Computer Architecture Division and spent most of my time learning about the SIGMA-1, EM-4, and CODA projects. I have seen a few reports, from Kahaner and others, describing these projects and would like to add a little more detail.

The SIGMA-1 is a 128-node, instruction-level data flow machine. The project was started in 1981 (?). It has measured rates of 170 MFLOPS. I can't really add any more information than what Kahaner has described. If you have questions, the best person to ask is Satoshi Sekiguchi (sekiguti@etl.go.jp).

The project that I spent the most time learning about was the EM-4. The EM-4 is an 80-node, distributed-memory, coarse-grain data flow machine which is a prototype for a 1000-node version. By coarse grain, I mean that blocks of von Neumann, register-based code are connected by data flow arcs. This mixed model provides the advantage of data flow (flexible, dynamic scheduling) at the upper level of a program and the advantages of von Neumann machines (static register scheduling) at the lower levels. Since many parallel programs have this kind of structure, it appears to be a promising architecture.

The low-level, register-based part of the architecture is RISC based. The RISC pipeline is integrated with the data flow loop, so the two parts work quite well together (see the experiment below). All of the hardware for one node (including communications) is built into a single gate-array chip. The clock cycle is 80 ns. Although there is no floating point in this prototype, there will be in the next version.

The other part of the architecture is the communications network. Each processor is one node in an omega network. The network is designed to support data flow computations and is therefore built for very fine-grain communications. A word can be sent or received in a single instruction, and messages propagate through the network at one word per node per clock cycle (80 ns). The really nice feature of this is that the user does not need to worry about packing words into contiguous memory buffers before sending a message. Messages are typed by both destination memory address and category; example categories are "data message" or "create process," and the user may define these categories.

I wrote a program to illustrate some of these features. The test was a smoothing algorithm which worked on a vector: an iterative algorithm in which the next value of each element is a function of its previous value and the two neighboring values. This generates communications similar to those in many numerical problems (PDEs, etc.). In a sense, the test I wrote was not something ideally suited to data flow machines; it is really a data-parallel algorithm better suited to a SIMD machine like the CM-2. The program consisted of placing a "process" (or whatever the equivalent idea is for data flow machines) on each node which iteratively (10,000 iterations) received the values from the two neighboring processors, added them together, and sent the results to the two neighboring processors. So messages are one word long and there are lots of messages. The result of the test was that it took 36 clock ticks to perform each iteration. (The EM-4 has a wonderful timing facility which counts clock ticks.)
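To make the structure of the test concrete, here is a rough sequential C sketch of the computation. This is my own reconstruction, not EM-4 code: on the real machine each element of the vector lives in a process on its own node, every neighbor reference below is a one-word receive, and every updated value goes out as two one-word sends. The exact smoothing formula and the ring (wrap-around) boundary treatment are assumptions on my part; the post above only says the new value is a function of the old value and the two neighbors.

    /* Sequential sketch of the smoothing test (my reconstruction,
     * not EM-4 code).  On the EM-4, each element of v[] is held by a
     * process on its own node, and every neighbor reference below
     * arrives as a one-word message in a single instruction. */
    #include <stdio.h>

    #define NODES 80        /* one element per node */
    #define ITERS 10000

    int main(void)
    {
        unsigned v[NODES], next[NODES];
        int i, iter;

        for (i = 0; i < NODES; i++)       /* arbitrary initial data */
            v[i] = (unsigned) i;

        for (iter = 0; iter < ITERS; iter++) {
            for (i = 0; i < NODES; i++) {
                /* left and right would each arrive as a one-word
                 * message from the neighboring node */
                unsigned left  = v[(i + NODES - 1) % NODES];
                unsigned right = v[(i + 1) % NODES];
                /* the integer add mentioned in the timing below;
                 * the exact smoothing formula is an assumption */
                next[i] = left + right;
                /* next[i] would now be sent, one word to each
                 * neighbor, for use in the next iteration */
            }
            for (i = 0; i < NODES; i++)
                v[i] = next[i];
        }

        printf("v[0] after %d iterations: %u\n", ITERS, v[0]);
        return 0;
    }

The only point of the sketch is the communication pattern: one word in from each neighbor, an add, and one word out to each neighbor, every iteration.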
During each iteration two words were sent, so at 36 ticks of 80 ns each (2880 ns per iteration) this corresponds to 1440 ns/word, or 360 ns/byte; this figure also includes the integer add. It compares quite favorably to 390 ns/byte on an iPSC/2, and that is with messages long enough to overcome message start-up costs (greater than 20K bytes). The best person to ask questions about the EM-4 is Shuichi Sakai (sakai@etl.go.jp).

The final project is called CODA and is still in the paper stage. CODA is a distributed-memory machine designed for real-time applications, and it looks much less like a data flow machine than the EM-4 does. The interesting aspects of CODA are in the communications hardware. First of all, message priorities have been added to support real-time applications: within the communications hardware, messages with higher priority are routed before lower-priority messages. To me, the more interesting aspect of CODA is in the synchronization and communication hardware. Each memory location and register contains full/empty bits which are used to implement produce/consume semantics on all memory and register accesses (a small sketch of what these semantics mean is at the end of this post). In conjunction with this, a send or receive instruction can target any memory element or register on any processor, and message packets to read or write memory or registers are inserted into the RISC instruction pipeline along with instructions from the local processor. The result should be very fine-grained communication with very little overhead. To keep the processors busy, the hardware supports several instruction streams: a process that blocks on a memory or register access can be swapped out and another instruction stream swapped in within a few clock cycles. These three constructs (message priorities, register-level synchronization, and multiple instruction stream support) should greatly improve communications on distributed-memory multiprocessors. If you have questions you should ask Kenji Toda (toda@etl.go.jp).

My stay at ETL was supported by the NSF Japan Summer Institute program. I strongly recommend that any PhD student in science or engineering who is interested in Japan apply for next year's program. Yes, this is an advertisement, but the program was well worth it.
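Finally, since full/empty bits may be unfamiliar, here is the small sketch promised above of what produce/consume semantics mean. It is my own illustration in plain C, not CODA's design or instruction set: on CODA the bit is attached to every memory location and register in hardware, and a blocked access causes the instruction stream to be swapped out, whereas here a blocked access is simply reported.

    /* Produce/consume semantics via a full/empty bit (my own
     * illustration, not CODA hardware or its instruction set). */
    #include <stdio.h>

    struct cell {
        int full;       /* the full/empty bit */
        int value;
    };

    /* A write may only complete when the cell is empty; it fills it.
     * On CODA a failing access would block and the hardware would
     * swap in another instruction stream; here we just report it. */
    static int produce(struct cell *c, int v)
    {
        if (c->full)
            return 0;           /* would block */
        c->value = v;
        c->full = 1;
        return 1;
    }

    /* A read may only complete when the cell is full; it empties it. */
    static int consume(struct cell *c, int *out)
    {
        if (!c->full)
            return 0;           /* would block */
        *out = c->value;
        c->full = 0;
        return 1;
    }

    int main(void)
    {
        struct cell c = { 0, 0 };
        int v;

        if (!consume(&c, &v))
            printf("consume before produce: would block\n");
        if (produce(&c, 42) && consume(&c, &v))
            printf("consumed %d\n", v);
        return 0;
    }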