rick@cs.arizona.edu (Rick Schlichting) (06/02/91)
[Dr. David Kahaner is a numerical analyst visiting Japan for two years under the auspices of the Office of Naval Research Asia (ONR/Asia). The following is the professional opinion of David Kahaner and in no way has the blessing of the US Government or any agency of it. All information is dated and of limited lifetime. This disclaimer should be noted on ANY attribution.]

[Copies of previous reports written by Kahaner can be obtained from host cs.arizona.edu using anonymous FTP.]

To: Distribution
From: David K. Kahaner, ONR Asia [kahaner@xroads.cc.u-tokyo.ac.jp]
Re: Parallel processing research in Japan, supplement.
30 May 1991

ABSTRACT. Parallel processing research in Japan (mostly associated with the dataflow model) and database research are summarized, based on visits to various labs and on attendance at the IEEE Data Engineering Conference in Kobe (April 10-12, 1991).

INTRODUCTION.
  Professor Rishiyur Nikhil
  MIT Laboratory for Computer Science
  545 Technology Square
  Cambridge, MA 02139, USA
  Tel: (617)-253-0237, Fax: (617)-253-6652
  Email: nikhil@lcs.mit.edu
spent two weeks in Japan (April 1991), visiting five research labs and attending the IEEE Data Engineering Conference in Kobe. What follows are Nikhil's observations, along with comments of my own where these are relevant.

PARALLEL PROCESSING (ETL, ICOT, Terada's Osaka lab)

Nikhil writes as follows. The projects at these labs focus primarily on parallel processing architectures and languages, the area closest to my own research. I continue to be highly impressed with the machine-building capabilities of our Japanese colleagues, but I think that, with the exception of the ICOT researchers, many of them are still very weak in software. It is quite breathtaking to see how quickly new research machines are designed and built, and by such small teams---I wish we could do as well in the US. However, once built, their machines do not seem to be evaluated thoroughly: good languages, compilers, and run-time systems are not developed, and (consequently, perhaps) very few applications are written. The new designs, therefore, do not benefit from any deep lessons learned from previous designs. Another consequence, in my opinion, of the "hardware-centric" nature of the machine builders is that certain functions are built into hardware that I would expect to be done in software (such as resource allocation decisions and load monitoring in the ETL machines).

In my opinion, ETL's EM-4 and proposed EM-5 are the most exciting machines in Japan (and the world). The reason is this: as first elucidated in [Arv87], a large, general-purpose parallel machine must be able to perform multi-threading efficiently at a fine granularity, because this is the only way to deal effectively with the long inter-node latencies of large parallel machines. Von Neumann processors are very bad at this, while dataflow architectures have always excelled at it. However, previous dataflow architectures (including MIT's TTDA and Monsoon, and ETL's earlier Sigma-1) were weak in single-thread performance and control over scheduling, two areas that are the forte of von Neumann processors. Recently, new architectures have been proposed to obtain the best of both worlds: our *T architecture at MIT, and the EM-4 and EM-5 in Japan. I believe that these machines are the first truly viable parallel MIMD machines. EM-4 [Sak90, Yam89] is a medium-sized machine (80 nodes), but does not have any floating point arithmetic.
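To make this hybrid execution model concrete, here is a minimal sketch in Python. It is an illustration only, not ETL's or MIT's actual design, and every name in it is invented for the example: whole blocks of instructions are enabled dataflow-style, when all of their input tokens have arrived, while the code inside each block runs sequentially, as on a von Neumann machine.

    from collections import deque

    # Toy two-level scheduler: dataflow firing among blocks (threads/SCBs),
    # conventional sequential execution within each block. Illustration
    # only; not ETL's or MIT's actual design.
    class Block:
        def __init__(self, name, n_inputs, body, successors):
            self.name = name
            self.n_inputs = n_inputs      # tokens needed before firing
            self.inputs = []              # operand tokens received so far
            self.body = body              # sequential code: fn(inputs) -> value
            self.successors = successors  # blocks that receive our result

    def send(block, value, ready):
        block.inputs.append(value)
        if len(block.inputs) == block.n_inputs:  # dataflow firing rule
            ready.append(block)

    def run(entry_blocks, initial_tokens):
        ready = deque()
        for b, v in zip(entry_blocks, initial_tokens):
            send(b, v, ready)
        while ready:
            b = ready.popleft()
            result = b.body(b.inputs)     # von Neumann execution inside a block
            for s in b.successors:
                send(s, result, ready)

    # Example graph: (x+1) and (x*2) evaluated in independent blocks,
    # then combined by a two-input block.
    combine = Block("combine", 2, lambda ins: print("sum =", sum(ins)), [])
    inc = Block("inc", 1, lambda ins: ins[0] + 1, [combine])
    dbl = Block("dbl", 1, lambda ins: ins[0] * 2, [combine])
    run([inc, dbl], [10, 10])             # prints: sum = 31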
Hardware aside, the chief problem is the lack of any good programming language or compiler. The EM-4 is currently programmed in DFC ("dataflow C"), a very simple subset of C with single-assignment semantics. Perhaps this situation will change in the future: the ETL researchers said that they have just hired a compiler expert, but they still do not expect a good programming environment for some years. I also have my doubts regarding their choice of C as the programming language for the EM-4 and EM-5.

Kahaner writes. Dataflow research at ETL has a long history, including the Sigma-1, the EM-4, and the proposed EM-5. The EM-4 was designed to have 1024 processors; a prototype with 80 processors is running, and I am told that if the budget is maintained the full system will be built. See the earlier reports (etl, 2 July 1990, and parallel.904, 6 Nov 1990). My interpretation of the ETL research direction is that their evolving designs are moving away from a pure dataflow model. At the same time, their interest in numerical applications, which used to be ambivalent, seems to have increased. Nikhil agrees that the ETL group is now more explicit about this, but feels that they were always interested in general-purpose computing, including scientific applications. Perhaps in the atmosphere of the 1980s, when there was so much emphasis in Japan on knowledge processing, they may have stressed the symbolic aspects, but in technical discussions they usually compared their machines to vector and other supercomputers, never to "symbolic supercomputers" such as Lisp machines or ICOT's machines. In other words, they may always have considered the CRAYs, NEC's and Fujitsu's supercomputers, and the Connection Machine to be their real competition. It is interesting to note that the Connection Machine was also initially portrayed as a supercomputer for AI; the reality today is that it is mostly used for scientific supercomputing.

Sigma-1 was pure dataflow, similar to MIT's Tagged-Token Dataflow Architecture. The EM-4 is based on what the ETL group calls the strongly connected arc model. Their description follows [Sak91]:

"In a dataflow graph, arcs are categorized into two types: normal arcs and strongly connected arcs. A dataflow subgraph whose nodes are connected by strongly connected arcs is called a strongly connected block (SCB). There are two firing rules. One is that a node on a dataflow graph is firable when all the input arcs have their own tokens (a normal data-driven rule). The other is that after each SCB fires, all the processing elements which will execute a node in the block should execute nodes in the block exclusively. ... In the EM-4, each SCB is executed in a single PE and tokens do not flow but are stored in a local register file. This property enables fast register execution of a dataflow graph, realizes an advanced-control pipeline, and offers flexible resource management facilities."

The designers also wrote in 1989 that "the dataflow concept can be applied not only to numerical computations involved in scientific and technological applications but also to symbolic manipulations involved in knowledge information processing. The application field of the EM-4 is now focused on the latter." The EM-4 was not originally designed to have floating point support, but I was told that this, too, was a budgetary issue. For the EM-5, the objectives are as follows [Sak91]:

"..to develop a feasible parallel supercomputer including more than 16,384 processors for general use, e.g., for numerical computation, symbolic computation, and large scale simulations. The target performance is more than 1.3 TFLOPS, i.e., 1.3*10^(12) FLOPS (double precision), and 655 GIPS. Unlike the EM-4, the EM-5 is not a dataflow machine in any sense. It exploits side-effects and it treats location-oriented computation" (see note below). "In addition, the EM-5 is a 64-bit machine while the EM-4 is a 32-bit machine."

The EM-5 will be based on a "layered activation model", a further generalization of the strongly connected arc model of the EM-4. The machine will be highly pipelined, with a 25ns clock and a 25ns pipeline pitch; this is half the pitch of the EM-4, largely because of the use of RISC technology. Each of the up to 16,384 processors (called EMC-G) is a 64-bit RISC with global addressing and no embedded network switch. Similarly, the floating point unit will not be within the processor chip but separate, like a co-processor, because of pin and chip-area limitations. At the present time the designers have not decided on the topology of the interconnection network. Peak performance of the floating point unit will be 80 MFLOPS, with a maximum transfer rate of 335 MB/sec; note that 16,384 nodes at 80 MFLOPS each give just over 1.3 TFLOPS, consistent with the stated target. The EMC-G will be built as a CMOS standard-cell chip with 391 pins and 100K gates, using 1.0 micron rules. The logical design of this processor will be completed in 1991, and the gate design of the EMC-G in 1992. A full 16,384-node system will be designed in 1993, and a prototype is planned to be operational by March 1994.

With regard to languages, new work will emphasize DFC-II, which Nikhil described. DFC-II will have sequential description and parallel execution, and is not a pure functional language: it can break the single-assignment rule, and programs can contain global variables. The group is also planning to implement several other languages, such as Id and Fortran, and some object-oriented model is also being considered.

In Japan, at least, the ETL research group is considered to have some of the best (most creative, energetic, visionary) staff among all the non-university research labs. Readers may be interested to know that Dr. Shuichi Sakai of ETL (the chief designer of the EM-4) is visiting the dataflow group at MIT for one year, as of April 1, 1991. He will be assisting the group in the design of the new *T machine, which Nikhil mentions above [Nik91]. *T is based on Nikhil's previous work on the P-RISC architecture [Nik89] and is a synthesis of the dataflow and von Neumann architectures (Nikhil says that one should think of it as a step beyond EM-4-like machines). The group plans to build this machine in collaboration with Motorola, in a three-year project that will follow the current MIT-Motorola project to build the Monsoon dataflow machine.

Concerning the remark that the EM-5 will NOT be a dataflow machine: I passed it on to Nikhil, who was also quite surprised. He comments that the EM-5 is not fundamentally different from the EM-4. In both of those machines, as well as in MIT's P-RISC and *T, the execution model is a HYBRID of the dataflow and von Neumann models. In MIT's terminology, a program is a dataflow graph in which each node is a "thread"; ETL's equivalent of MIT's "thread" is the SCB, or strongly connected block. Dataflow execution is used to trigger and schedule threads, just as in previous dataflow machines.
In MIT's *T, this scheduling happens in the Start Coprocessor; in ETL's machines, it happens in the FMU (Fetch and Matching Unit). Within a thread, instructions are scheduled using a conventional program counter, as in von Neumann machines; in MIT's *T this happens in the Data Coprocessor, and in ETL's machines in the EXU (Execution Unit). In both the EM-4 and the EM-5 the processor is organized as an IBU (Input Buffer Unit), followed by the FMU, followed by the EXU, and the overall execution strategy is the same in both machines. The EM-5 and EM-4 differ in smaller details---newer chip technology, a separate memory for packet buffers, a finer-pitch pipeline, a direct instruction pointer in packets, a floating point unit, a 64-bit architecture---but the fundamental organization is the same.

Nikhil also asked Sakai about the statements in [Sak91]. Sakai says that what he meant was "...the EM-5 is not a dataflow machine in SOME sense," and faults his command of English for the error. With respect to the second sentence, "It exploits side-effects and it treats location-oriented computation," Nikhil is not sure what the authors meant. He explains that dataflow ARCHITECTURES have never prohibited side-effects or enforced single-assignment semantics; it is only dataflow LANGUAGES that take this position on side-effects. Dataflow architectures merely provide support for this discipline without enforcing it, and they are equally appropriate for other languages, such as Fortran or C.

After visiting ICOT, Nikhil remarks: I got a sense of complementary strengths relative to ETL. ICOT researchers seem to be very sophisticated with respect to parallel languages, compilers, and runtime systems; their parallel machines, on the other hand, were not that exciting. I do not think that anyone can claim any longer that the KL1 language used extensively at ICOT is a logic programming language (ICOT researchers themselves are quite frank about this). The main remaining vestige of logic programming (albeit a very important one) is the "logic variable", which is used for asynchronous communication. Logic variables in KL1 are very similar (perhaps identical) to the "I-structure variables" of Id, the programming language developed at MIT over the last six years.
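The essential behavior of such a variable can be sketched in a few lines of Python (an analogy only, not KL1's or Id's actual implementation; the class name is invented): a reader blocks until the variable's single producer writes it, and a second write is an error.

    import threading

    # Write-once synchronizing cell, in the spirit of a logic variable or
    # I-structure slot. Illustration only; names are invented.
    class IVar:
        def __init__(self):
            self._filled = threading.Event()
            self._lock = threading.Lock()
            self._value = None

        def write(self, value):
            with self._lock:              # make the one-shot check atomic
                if self._filled.is_set():
                    raise RuntimeError("single-assignment violation")
                self._value = value
                self._filled.set()        # wake any deferred readers

        def read(self):
            self._filled.wait()           # block until the producer writes
            return self._value

    # Producer and consumer synchronize through the variable itself.
    v = IVar()
    threading.Thread(target=lambda: v.write(42)).start()
    print(v.read())                       # prints 42, waiting if necessary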
Regardless of whether we label KL1 a logic programming language or not, it is certainly a very interesting and expressive language, and is perhaps the largest and most heavily used parallel symbolic processing language in existence anywhere. Because of the sheer volume of applications that people are writing in KL1 and running on ICOT's parallel machines (we saw 5 demos from a very impressive suite), I think that ICOT researchers are certainly as experienced and sophisticated as anyone in the world about parallel implementations of symbolic processing: compilation, resource allocation, scheduling, garbage collection, etc.

ICOT's machines are not as exciting as ETL's. The original PSIs (130 KLIPS) were heavily horizontally microcoded sequential machines, and one must wonder whether they will go the way of Lisp machines, i.e., be made obsolete by improving compiler technology on modern RISC machines. The PSIs were not originally conceived of as nodes of a parallel machine; thus ICOT's two Multi-PSIs, which are networks of PSIs (2D grid topology), are just short-term prototypes for experimentation. ICOT researchers want to put one of the two Multi-PSIs on the Internet for open access, but they are having trouble convincing MITI to allow this. ICOT's real parallel targets are the PIM machines, the first of which (a PIM/p) had just been delivered to ICOT during our visit (it was not yet up and running). ICOT's machines are built by various industrial partners, of course with heavy participation in the design by ICOT researchers. There are 5 different PIM architectures (different node architectures, different network architectures) with 5 different industrial partners. I was surprised by this, because it will lead to serious portability problems for the software. On the positive side, I suppose they will gain a lot of experience on a variety of architectures, and on portability, and can learn from the best of each. From what little I know about the PIM architectures, they do not seem to be as exciting as ETL's EM-4 and EM-5.

The NEC C&C Systems Research Lab has also been involved with ICOT in the Fifth Generation project. NEC's CHI machine (300 KLIPS), a single-user microcoded machine for logic programming, predates and outperformed ICOT's PSI machine. However, like the PSI machine and Lisp machines, I expect that this type of machine will become obsolete as compiling technology on RISC machines improves. NEC has also started work on an implementation of ICOT's A'UM programming language.

Kahaner notes that additional technical details of ICOT research are given in the report (icot-sci, 17 May 1991). A number of Japanese researchers have remarked that one of the most important aspects of the ICOT project is that it gives many young Japanese researchers the opportunity to meet informally (outside their individual corporate or university environments), assisting the networking that is so prevalent in Japanese science.

Nikhil continues... Prof. Terada's dataflow laboratory at Osaka University is remarkable in the degree to which it collaborates with industry.

  Professor Hiroaki Terada
  Department of Information Systems Engineering
  Faculty of Engineering, Osaka University
  Yamadaoka 2-1, Suita, Osaka, Japan 565
  Tel: +81-6-877-5111, Fax: +81-6-875-0506

They are one of the notable exceptions to my prior image of research at Japanese universities: starved of funds from the education ministry and generally not very exciting. Prof. Terada has close collaborations with Mitsubishi, Sharp, Matsushita, and Sanyo. They developed a TTL dataflow machine, the Q-p, in 1983-86 (2-4 MOPS); they now have the Q-pv, a multi-chip VLSI version (20 MFLOPS); and they are planning to integrate this further in the Q-v1, a single-chip version (50 MFLOPS). A unique aspect of their technology approach is the asynchronous, self-timed design; they have consciously avoided clock-synchronous circuits. The architecture of all these machines is a ring, similar to the Manchester dataflow machine.
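For readers unfamiliar with that style of machine, the basic mechanism of a token-matching dataflow ring can be sketched as follows (a generic textbook-style illustration, not the Q-p's actual design; the instruction names and token format are invented): operand tokens circulate past a matching store, an instruction fires once both of its operands have arrived, and result tokens re-enter the ring.

    from collections import deque

    # program: instruction id -> (two-operand operation, destinations).
    # Generic illustration of a Manchester-style ring; names are invented.
    program = {
        "add": (lambda a, b: a + b, ["mul"]),
        "mul": (lambda a, b: a * b, []),
    }

    def run_ring(tokens):
        ring = deque(tokens)          # a token is (instr id, port, value)
        waiting = {}                  # matching store: instr id -> first operand
        while ring:
            instr, port, value = ring.popleft()
            if instr not in waiting:
                waiting[instr] = (port, value)   # wait for the partner token
                continue
            port2, value2 = waiting.pop(instr)   # partner found: fire
            a, b = (value, value2) if port == 0 else (value2, value)
            op, dests = program[instr]
            result = op(a, b)
            if not dests:
                print(instr, "->", result)       # final result leaves the ring
            for d in dests:
                ring.append((d, 0, result))      # results re-enter the ring

    # (2 + 3) feeds port 0 of "mul"; 10 arrives on its port 1.
    run_ring([("add", 0, 2), ("add", 1, 3), ("mul", 1, 10)])   # mul -> 50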
Like the ETL project, this project again seems weak in software, with the result that no significant applications have been written, which in turn means that the hardware design is difficult to evaluate. Prof. Nishikawa, a member of the group, is leading a project to develop AESOP, a program development environment for these dataflow machines.

  Professor Hiroaki Nishikawa
  Department of Information Systems Engineering
  Faculty of Engineering, Osaka University
  Yamadaoka 2-1, Suita, Osaka, Japan 565
  Tel: +81-6-877-5111 ext. 5018, Fax: +81-6-875-0506
  Telex: 5286-227 FEOUJ J
  Email: nisikawa@oueln0.ouele.osaka-u.ac.jp

It appears that he is aiming for some kind of visual programming style, in which one draws dataflow graphs on a screen and chooses a mapping of these graphs onto the physical rings of the dataflow machine. Personally, I am not very impressed with the visual programming languages I have seen to date: they are too complicated and inflexible.

7TH IEEE CONFERENCE ON DATA ENGINEERING, KOBE, APRIL 10-12, 1991

The conference has become very large, with at least 3 parallel sessions at all times and over 700 pages in the proceedings, so it was impossible for anyone to get a complete overview. Object-oriented databases (OODBs) were the dominant topic. There were papers on:
- notions of consistency
- declarative (or associative-access) query languages
- storage and indexing
- models of views
- models of time
- user interfaces
- etc.

It is disappointing that there is still very little work on developing a simple and clear semantics for OODBs. Whereas the relational model was a single, simple model agreed upon by a large community of researchers, today each OODB seems to come with its own unique model, often described imprecisely or with arcane formalism. Consequently, there is very little basis for objective comparisons of OODBs with each other or with relational DBs.

There were several papers on parallel implementations, including:
- parallel transitive closure and join algorithms
- scheduling on shared-memory and shared-nothing machines
- data distribution strategies
- dataflow implementation
- etc.

Most parallel implementations are on stock parallel hardware. Parallel database machines per se seem to have fallen out of vogue---only one such machine was described (FDS-R2 at Univ. of Tokyo; Kitsuregawa et al.).

Kahaner notes that Professor M. Kitsuregawa, of the U-Tokyo Institute of Industrial Science, has published several papers on SDC, the "Super Database Computer", for example [Kit91]. He also notes that the National Science Foundation (NSF), in cooperation with other agencies, has funded the Japanese Technology Evaluation Center (JTEC) at Loyola College in Maryland to assess the status and trends of Japanese research and development in selected technologies. In March 1991 a JTEC team headed by Professor Gio Wiederhold (Stanford University) [gio@earth.stanford.edu] visited Japan to evaluate Japanese database technology, and the team presented a workshop on its preliminary results at the NSF on 30 April 1991. A comprehensive report is currently in preparation.

Genomic databases generated a lot of excitement---the panel on this topic drew a huge audience. Genomic databases will contain huge volumes of data with unique requirements: inaccurate information, incomplete information, and retrieval using approximate matching and sophisticated inference. Many people seem to view genomic databases as the new frontier and driving force in DB research, a beautiful application with lots of exciting research problems (and lots of funding?) for the DB community.

Deductive databases (the marriage of logic programming languages and databases) seem to be generating less interest than they did some years ago. There were a few papers on query optimization.
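In connection with the transitive-closure and deductive-database topics above, the standard fixpoint technique in this area, semi-naive evaluation, can be sketched in a few lines (a generic textbook version, not any particular paper's algorithm): each round joins only the newly derived pairs against the base relation, stopping when nothing new appears.

    # Semi-naive transitive closure of a binary relation. Generic
    # illustration; not taken from any paper at the conference.
    def transitive_closure(edges):
        closure = set(edges)
        delta = set(edges)             # pairs derived in the last round
        while delta:
            # join the frontier with the base edges: (a,b),(b,c) => (a,c)
            new = {(a, c) for (a, b) in delta for (b2, c) in edges if b == b2}
            delta = new - closure      # keep only genuinely new pairs
            closure |= delta
        return closure

    print(sorted(transitive_closure({(1, 2), (2, 3), (3, 4)})))
    # [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)]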
The remaining papers reported steady, if unspectacular, progress on a variety of topics:
- distributed DBMSs (optimization, voting protocols)
- concurrency control (in high-contention DBs, parallel DBs)
- indexing and query languages for temporal databases (time attributes, versions)
- indexing and query languages for spatial databases (e.g., geographic maps)
- incomplete information (formal models, approximate answers to queries)
- heterogeneous databases (transaction protocols, serializability)
- efficient post-failure restart algorithms
- simultaneous optimization for multiple queries

REFERENCES

[Arv87] Arvind and R. A. Iannucci, "Two Fundamental Issues in Multiprocessing," Proc. DFVLR Conf. on Parallel Processing in Science and Engineering, Bonn-Bad Godesberg, W. Germany, June 25-29, 1987, Springer-Verlag LNCS 295.

[Kit91] S. Hirano, M. Harada, M. Nakamura, Y. Aiba, K. Suzuki, M. Kitsuregawa, M. Takagi, and W. Yang, "Multiple Processing Module Control on SDC, the Super Database Computer," Proc. Japan Soc. Parallel Processing, Kobe, Japan, May 14-16, 1991, pp. 53-60.

[Nik89] R. S. Nikhil and Arvind, "Can dataflow subsume von Neumann computing?" Proc. 16th Intl. Symp. on Computer Architecture, Jerusalem, Israel, May 29-31, 1989, pp. 262-272.

[Nik91] R. S. Nikhil, G. M. Papadopoulos, and Arvind, "*T: A Killer Micro for a Brave New World," CSG Memo 325, MIT Laboratory for Computer Science, 545 Technology Square, Cambridge, MA 02139, USA, January 1991.

[Sak90] S. Sakai, Y. Yamaguchi, K. Hiraki, and T. Yuba, "An Architecture of a Dataflow Single Chip Processor," Proc. 16th Annual Intl. Symp. on Computer Architecture, Jerusalem, Israel, May 28-June 1, 1989, pp. 46-53.

[Sak91] S. Sakai, Y. Kodama, and Y. Yamaguchi, "Architectural Design of a Parallel Supercomputer EM-5," Proc. Japan Soc. Parallel Processing, Kobe, Japan, May 14-16, 1991, pp. 149-156.

[Yam89] Y. Yamaguchi, S. Sakai, K. Hiraki, Y. Kodama, and T. Yuba, "An Architectural Design of a Highly Parallel Dataflow Machine," Proc. Information Processing 89, San Francisco, Aug. 28-Sep. 1, 1989, pp. 1155-1160.

-------------------END OF REPORT-------------------------------------
--
=========================== MODERATOR ==============================
Steve Stevenson                            {steve,fpst}@hubcap.clemson.edu
Department of Computer Science, comp.parallel
Clemson University, Clemson, SC 29634-1906                 (803)656-5880.mabell