[comp.parallel] Kahaner Report: Parallel processing research in Japan, supplement

rick@cs.arizona.edu (Rick Schlichting) (06/02/91)

  [Dr. David Kahaner is a numerical analyst visiting Japan for two years
   under the auspices of the Office of Naval Research-Asia (ONR/Asia).  
   The following is the professional opinion of David Kahaner and in no 
   way has the blessing of the US Government or any agency of it.  All 
   information is dated and of limited life time.  This disclaimer should 
   be noted on ANY attribution.]

  [Copies of previous reports written by Kahaner can be obtained from
   host cs.arizona.edu using anonymous FTP.]

To: Distribution:
From: David K. Kahaner, ONR Asia [kahaner@xroads.cc.u-tokyo.ac.jp]
Re: Parallel processing research in Japan, supplement.
30 May 1991


ABSTRACT. Parallel processing research (mostly associated with the
dataflow model) and database research in Japan, based on visits to
various labs and attendance at the IEEE Data Engineering Conference in
Kobe (April 10-12, 1991), is summarized.

INTRODUCTION. 
     Professor Rishiyur Nikhil
     MIT Laboratory for Computer Science
     545 Technology Square
     Cambridge, MA 02139, USA
       Tel: (617)-253-0237, Fax: (617)-253-6652
       Email: nikhil@lcs.mit.edu
spent two weeks in Japan (April 1991), visiting five research labs and
attending the IEEE Data Engineering Conference in Kobe.  What follows
are Nikhil's observations, along with comments of my own when these are
relevant.

PARALLEL PROCESSING (ETL, ICOT, Terada's Osaka lab)
Nikhil writes as follows. 
The projects at these labs are primarily focused on parallel processing 
architectures and languages, and are closest to my own research work.  I 
continue to be very highly impressed with the machine-building 
capabilities of our Japanese colleagues, but I think that with the 
exception of the ICOT researchers, many of them are still very weak in 
software.  It is quite breathtaking to see how quickly new research 
machines are designed and built, and by such small teams--- I wish we 
could do as well, in the US.  However, once built, their machines do not 
seem to be evaluated thoroughly--- good languages, compilers and run time 
systems are not developed and (consequently, perhaps) very few 
applications are written.  The new designs, therefore, do not benefit 
from any deep lessons learned from previous designs.  Also, in my 
opinion, another consequence of the "hardware-centric" nature of the 
machine builders is that certain functions are built into hardware that I 
would expect ought to be done in software (such as resource allocation 
decisions and load monitoring in the ETL machines).  

In my opinion, ETL's EM-4 and proposed EM-5 are the most exciting 
machines in Japan (and the world).  The reason is this: as first 
elucidated in [Arv87], a large, general purpose parallel machine must be 
able to perform multi-threading efficiently at a fine granularity, 
because this is the only way to deal effectively with the long inter-node 
latencies of large, parallel machines.  Von Neumann processors are very 
bad at this, and dataflow architectures have always excelled at this.  
However, previous dataflow architectures (including MIT's TTDA and 
Monsoon, and ETL's previous Sigma-1) were weak in single-thread 
performance and control over scheduling, two areas that are the forte of 
von Neumann processors.  Recently, new architectures have been proposed 
to obtain the best of both worlds: our *T architecture at MIT, and the 
EM-4 and EM-5 in Japan.  I believe that these machines are the first 
truly viable parallel MIMD machines.  
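
The latency argument above can be made concrete with a small model.  The 
sketch below is only an illustration: the formula is a common simplified 
multithreading model and is an assumption here, not something taken from 
[Arv87], and the function and parameter names are hypothetical.  Each 
thread computes for run_cycles, then waits latency_cycles on a remote 
access, and switching to another ready thread costs switch_cycles.

```python
def utilization(n_threads, run_cycles, latency_cycles, switch_cycles=0):
    """Fraction of processor time spent on useful work (simplified model)."""
    # Upper bound: with enough threads, only the switch cost is lost.
    saturated = run_cycles / (run_cycles + switch_cycles)
    # Below saturation, each extra thread fills part of the latency gap.
    linear = n_threads * run_cycles / (
        run_cycles + switch_cycles + latency_cycles
    )
    return min(saturated, linear)


# One thread, 10 cycles of work per 90-cycle remote access: mostly idle.
single = utilization(1, 10, 90)
# Ten threads with free switching can cover the same latency.
many_free = utilization(10, 10, 90)
# The same ten threads with a 10-cycle switch cost lose half the machine.
many_costly = utilization(10, 10, 90, switch_cycles=10)
```

Under these assumptions, hiding a long latency needs many threads, and 
the benefit survives only if switching between them is nearly free, 
which is the fine-grain multithreading requirement described above.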

EM-4 [Sak90, Yam89] is a medium sized machine (80 nodes), but does not have 
any floating point arithmetic.  However the chief problem is the lack of 
any good programming language or compiler.  It is currently programmed in 
DFC ("dataflow C"), a very simple subset of C with single assignment 
semantics.  Perhaps this situation will change in the future: the ETL 
researchers said that they have just hired a compiler expert, but they 
still do not expect a good programming environment for some years.  I 
also have my doubts regarding their choice of C as the programming 
language for the EM-4 and EM-5.

Kahaner writes..
Dataflow research at ETL has a long history, including the Sigma-1, EM-4, 
and the proposed EM-5. The EM-4 was designed to have 1024 processors. A 
prototype with 80 processors is running and I am told that if the budget 
is maintained then the full system will be built. See the reports (etl, 2 
July 1990 and parallel.904, 6 Nov 1990).  

My interpretation of the ETL research direction is that their evolving 
designs are moving away from a pure data-flow model. At the same time, 
interest in numerical applications, which was ambivalent, seems to have 
increased. Nikhil agrees that the ETL group is now more explicit about 
this, but feels that they were always interested in general purpose 
computing, including scientific applications. Perhaps in the atmosphere 
of the 80's when there was so much emphasis in Japan on knowledge 
processing, they may have emphasized symbolic aspects, but in technical 
discussions, they usually compared their machines to vector and other 
supercomputers, and never to "symbolic supercomputers" such as Lisp 
machines or ICOT's machines.  In other words, they may have always 
considered Crays, NECs, Fujitsus, and Connection Machines to be their 
real competition.  It is interesting to note that the Connection Machine 
was also initially portrayed as a supercomputer for AI; the reality today 
is that it is mostly used for scientific supercomputing.  

Sigma-1 was pure data-flow, similar to MIT's Tagged Token Dataflow 
Architecture.  The EM-4 is based on what the ETL group called a strongly 
connected arc model. Their description of that follows [Sak91].  "In a 
dataflow graph, arcs are categorized into two types: normal arcs and 
strongly connected arcs. A dataflow subgraph whose nodes are connected by 
strongly connected arcs is called a strongly connected block (SCB).  
There are two firing rules.  One is that a node on a dataflow graph is 
firable when all the input arcs have their own tokens (a normal data-
driven rule). The other is that after each SCB fires, all the processing 
elements which will execute a node in the block should execute nodes in 
the block exclusively....In the EM-4, each SCB is executed in a single PE 
and tokens do not flow but are stored in a local register file.  This 
property enables fast-register execution of a dataflow graph, realizes an 
advanced-control pipeline, and offers flexible resource management 
facilities." The designers also wrote in 1989 that "the dataflow concept 
can be applied not only to numerical computations involved in scientific 
and technological applications but also to symbolic manipulations 
involved in knowledge information processing. The application field of 
the EM-4 is now focused on the latter."  EM-4 was not originally designed 
to have floating point support, but I was told that this was also a 
budgetary issue.  
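
As an illustration of the normal data-driven firing rule quoted above, 
here is a minimal sketch in Python.  All names (Node, run, the example 
graph) are hypothetical and do not come from [Sak91].  The sketch covers 
only the first rule: a node fires when every input arc holds a token.  
The SCB exclusivity rule and register-file storage are not modeled.

```python
from collections import deque


class Node:
    """One dataflow node with a token slot per input arc."""

    def __init__(self, name, op, n_inputs):
        self.name = name
        self.op = op                      # function applied to the tokens
        self.inputs = [None] * n_inputs   # None means "no token yet"
        self.dests = []                   # (destination node, input port)

    def firable(self):
        # Normal data-driven rule: all input arcs must hold a token.
        return all(tok is not None for tok in self.inputs)


def run(initial_tokens):
    """Deliver (node, port, value) tokens until none remain."""
    results = {}
    pending = deque(initial_tokens)
    while pending:
        node, port, value = pending.popleft()
        node.inputs[port] = value
        if node.firable():
            out = node.op(*node.inputs)
            node.inputs = [None] * len(node.inputs)   # consume the tokens
            results[node.name] = out
            for dest, dport in node.dests:
                pending.append((dest, dport, out))
    return results


# Example graph for (a + b) * (a - b) with a = 7, b = 3.
add = Node("add", lambda x, y: x + y, 2)
sub = Node("sub", lambda x, y: x - y, 2)
mul = Node("mul", lambda x, y: x * y, 2)
add.dests.append((mul, 0))
sub.dests.append((mul, 1))

out = run([(add, 0, 7), (sub, 0, 7), (add, 1, 3), (sub, 1, 3)])
```

Note that "mul" fires only after both "add" and "sub" have produced 
their results, regardless of the order in which tokens arrive.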

For the EM-5, its objectives are as follows [Sak91].
"..to develop a feasible parallel supercomputer including more than 
16,384 processors for general use, e.g., for numerical computation, 
symbolic computation, and large scale simulations. The target performance 
is more than 1.3 TFLOPS, i.e. 1.3*10^(12) FLOPS (double precision) and 
655 GIPS. Unlike the EM-4, the EM-5 is not a dataflow machine in any 
sense. It exploits side-effects and it treats location-oriented 
computation" (see note below). "In addition the EM-5 is a 64-bit machine 
while the EM-4 is a 32-bit machine." The EM-5 will be based on a "layered 
activation model", a further generalization of the strongly connected 
arc model of the EM-4.  

The machine will be highly pipelined, with a 25ns clock and 25ns pipeline 
pitch. This is half the pitch of the EM-4, largely because of the use of 
RISC technology.  Each of the up to 16,384 processors (called EMC-G) is 
64-bit, RISC, with global addressing and no embedded network switch.  
Similarly the floating point unit will not be within the processor chip, 
but separate, like a co-processor, because of limitations of pins and 
space on the chip.  At the present time the designers have not decided on 
the topology of the interconnection network. Peak performance of the 
floating point unit will be 80MFLOPS with maximum transfer rate of 
335MB/sec. The EMC-G will be built in a CMOS standard-cell chip with 391 
pins and 100K gates, using 1.0 micron rules.  This processor will have 
its logical design completed in 1991, and the gate design of the EMC-G 
will be completed in 1992. A full 16,384 node system will be designed in 
1993 and a prototype is planned to be operational by March 1994.  
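
The quoted system targets are consistent with the per-processor figures 
given above, as the following back-of-envelope check shows.  It uses only 
numbers from the text; the one assumption, flagged in the code, is that 
each processor issues one instruction per pipeline pitch.

```python
# Figures taken from the text above.
PROCESSORS = 16_384
FPU_PEAK_MFLOPS = 80
PIPELINE_PITCH_NS = 25

# Peak floating point rate of the full system, in TFLOPS.
peak_tflops = PROCESSORS * FPU_PEAK_MFLOPS / 1e6

# ASSUMPTION: one instruction issued per pipeline pitch.
# 1e9 ns/s divided by the pitch gives instr/s; dividing by 1e6 gives MIPS.
mips_per_processor = 1e3 / PIPELINE_PITCH_NS
peak_gips = PROCESSORS * mips_per_processor / 1e3

print(f"{peak_tflops:.2f} TFLOPS, {peak_gips:.0f} GIPS")
# prints "1.31 TFLOPS, 655 GIPS"
```

Both values match the "more than 1.3 TFLOPS" and "655 GIPS" targets 
quoted from [Sak91].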

With regard to languages, new work will emphasize DFC-II as Nikhil 
explained. This will have sequential description and parallel execution, 
and is not a pure functional language. DFC-II can break the 
single-assignment rule, and programs can contain global variables. The group is 
also planning to implement several other languages, such as Id and 
Fortran. Finally some object oriented model is also being considered.  

In Japan at least, the ETL research group is considered to have some of 
the best (most creative, energetic, visionary, etc.) staff among all the 
non-university research labs.  

Readers may be interested to know that Dr. Shuichi Sakai of ETL (the 
chief designer of EM-4) is now visiting the dataflow group at MIT for one 
year, as of April 1, 1991.  He will be assisting the group in the design 
of the new *T machine, which Nikhil mentions above [Nik91].  *T is based 
on Nikhil's previous work on the P-RISC architecture [Nik89], and is a 
synthesis of dataflow and von Neumann architectures (Nikhil says that one 
should think of it as a step beyond EM-4-like machines).  The group plans 
to build this machine in collaboration with Motorola, in a 3 year project 
that will follow the current MIT-Motorola project to build the Monsoon 
dataflow machine.  

Concerning the remarks that the EM-5 will NOT be a dataflow machine, I 
passed them on to Nikhil who was also quite surprised. He comments that 
the EM-5 is not fundamentally different from the EM-4.  In both those 
machines, as well as in MIT's P-RISC and *T, the execution model is a 
HYBRID of dataflow and von Neumann models.  In MIT's terminology, a 
program is a dataflow graph where each node is a "thread".  ETL's 
equivalent of MIT's "thread" is the SCB, or Strongly Connected Block.  

Dataflow execution is used to trigger and schedule threads, just as in 
previous dataflow machines.  In MIT's *T, this scheduling happens in the 
Start Coprocessor; in ETL's machines, it happens in the FMU (Fetch and 
Matching Unit).  

Within a thread, instructions are scheduled using a conventional program 
counter, as in von Neumann machines.  In MIT's *T, this happens in the 
Data CoProcessor; in ETL's machines, it happens in the EXU (Execution 
Unit).  

In both the EM-4 and EM-5 the processor is organized as an IBU (Input 
Buffer Unit) followed by an FMU (Fetch and Matching Unit) followed by 
an EXU (Execution Unit).  The overall execution strategy is the same in 
both machines.  
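
The division of labor described above can be sketched as follows.  This 
is a hypothetical illustration, not the design of either the ETL 
machines or *T: the names Thread, deliver and execute are invented here.  
The deliver function plays the role of the matching stage (dataflow 
decides WHEN a thread may start), and execute plays the role of the 
execution stage (a program counter sequences instructions INSIDE the 
thread).

```python
class Thread:
    """A thread (SCB in ETL's terms): inputs plus a straight-line body."""

    def __init__(self, name, n_inputs, instructions):
        self.name = name
        self.n_inputs = n_inputs
        self.instructions = instructions  # functions acting on a register dict
        self.arrived = {}                 # input port -> value


def deliver(thread, port, value, ready_queue):
    """Matching step: enqueue the thread once all inputs are present."""
    thread.arrived[port] = value
    if len(thread.arrived) == thread.n_inputs:
        ready_queue.append(thread)


def execute(thread):
    """Execution step: run the body to completion under a program counter."""
    regs = dict(thread.arrived)   # inputs land in local registers
    pc = 0
    while pc < len(thread.instructions):
        thread.instructions[pc](regs)
        pc += 1
    return regs


def add_inputs(regs):
    regs["sum"] = regs[0] + regs[1]


def double_sum(regs):
    regs["out"] = regs["sum"] * 2


t = Thread("t", 2, [add_inputs, double_sum])
ready = []
deliver(t, 0, 3, ready)
started_early = len(ready)        # still 0: only one input has arrived
deliver(t, 1, 4, ready)
result = execute(ready.pop())["out"]
```

Intermediate values such as "sum" stay in the thread's local registers 
and never travel as tokens, which is the point of grouping instructions 
into threads.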

The EM-5 and EM-4 differ in smaller details: EM-5 has newer chip 
technology, a separate memory for packet buffers, a finer pitch pipeline, 
a direct instruction pointer in packets, a floating point unit, a 64 bit 
arch, etc., but the fundamental organization is the same.  

Nikhil also asked Sakai about the statements in [Sak91]. Sakai claims
that what he meant was 
    "... the EM-5 is not a dataflow machine in SOME sense."
and faults his poor command of English for this error.  With respect to 
the second sentence: "It exploits side-effects and it treats location-
oriented computation", Nikhil is not sure what the authors meant by this. 
He explains that 
  Dataflow architectures have never prohibited side-effects or enforced 
single-assignment semantics.  It is only dataflow languages that take 
this position on side-effects.  Dataflow architectures merely provided 
support for this, while not enforcing it.  Dataflow architectures are 
equally appropriate for other languages, such as Fortran or C.  


After visiting ICOT, Nikhil remarks that...
I got a sense of complementary strengths relative to ETL.  ICOT 
researchers seem to be very sophisticated with respect to parallel 
languages, compilers and runtime systems; the parallel machines, on the 
other hand, were not that exciting.  

I do not think that anyone can claim any longer that the KL1 language 
used extensively at ICOT is a logic programming language (ICOT 
researchers themselves are quite frank about this).  The main remaining 
vestige of logic programming (albeit a very important one) is the "logic 
variable" which is used for asynchronous communication.  Logic variables 
in KL1 are very similar (perhaps identical) to "I-structure variables" in 
Id, the programming language developed at MIT over the last 6 years.  
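
The shared behavior of these two kinds of variable can be sketched as a 
write-once cell whose readers are deferred until a value is bound.  The 
class below is a hypothetical illustration in Python, not KL1 or Id 
code.  Treating a second write as an error is a simplification chosen 
for this sketch; it is not a claim about how KL1 handles a variable 
that is already bound.

```python
class WriteOnceVar:
    """Write-once cell; reads before the write are deferred, not failed."""

    _EMPTY = object()   # sentinel so that None can be a legal value

    def __init__(self):
        self._value = WriteOnceVar._EMPTY
        self._deferred = []          # consumers waiting for the value

    def read(self, consumer):
        """Pass the value to consumer now if bound, otherwise defer it."""
        if self._value is WriteOnceVar._EMPTY:
            self._deferred.append(consumer)
        else:
            consumer(self._value)

    def write(self, value):
        """Bind the value once and release all deferred consumers."""
        if self._value is not WriteOnceVar._EMPTY:
            raise RuntimeError("variable already bound")
        self._value = value
        waiting, self._deferred = self._deferred, []
        for consumer in waiting:
            consumer(value)


seen = []
x = WriteOnceVar()
x.read(seen.append)      # consumer arrives first and is deferred
x.write(42)              # producer binds; the deferred consumer now runs
x.read(seen.append)      # a later consumer sees the value immediately
```

Producer and consumer need no agreement on timing: whichever arrives 
first, the consumer ends up with the bound value, which is what makes 
such variables usable for asynchronous communication.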

Regardless of whether we label KL1 as a logic programming language or 
not, it is certainly a very interesting and expressive language, and is 
perhaps the largest and most heavily used parallel symbolic processing 
language in existence anywhere.  Because of the sheer volume of 
applications that people are writing in KL1 and running on ICOT's 
parallel machines (we saw 5 demos from a very impressive suite of demos), 
I think that ICOT researchers are certainly as experienced and 
sophisticated as anyone in the world about parallel implementations of 
symbolic processing: compilation, resource allocation, scheduling, 
garbage collection, etc.  

ICOT's machines are not as exciting as ETL's.  The original PSIs (130 
KLIPS) were heavily horizontally microcoded sequential machines, and one 
must wonder whether they will not go the way of Lisp machines, i.e., made 
obsolete by improving compiling technology on modern RISC machines.  PSIs 
were not originally conceived of as nodes of a parallel machine.  Thus, 
ICOT's two multi-PSIs, which are networks of PSIs (2D grid topology), are 
just short term prototypes for experimentation.  ICOT researchers want to 
put one of the two Multi-PSIs on the Internet for open access, but they 
are having trouble convincing MITI to allow this.  

ICOT's real parallel targets are the PIM machines, the first of which (a 
PIM/p) had just been delivered to ICOT during our visit (it was not yet 
up and running).  ICOT's machines are built by various industrial 
partners, of course with heavy participation in the design by ICOT 
researchers.  There are 5 different PIM architectures (different node 
architectures, different network architectures) with 5 different 
industrial partners.  I was surprised by this because this will lead to 
serious portability problems for the software.  On the positive side, I 
suppose they will gain a lot of experience on a variety of architectures, 
and on portability, and can learn from the best of each!  From what 
little I know about the PIM architectures, they do not seem to be as 
exciting as ETL's EM-4 and EM-5 machines.  

The NEC C&C Systems Research Lab has also been involved with ICOT in the 
Fifth Generation project.  NEC's CHI machine (300 KLIPS), a single user 
microcoded machine for logic programming, predates and outperformed 
ICOT's PSI machine.  However, like the PSI machine and Lisp machines, I 
expect that this type of machine will become obsolete as compiling 
technology on RISC machines improves.  NEC has also started work on an 
implementation of ICOT's A'UM programming language.  

Kahaner notes that additional technical details of ICOT research are given 
in the report (icot-sci, 17 May 1991). A number of Japanese researchers 
have remarked that one of the most important aspects of the ICOT project 
is that it gives many young Japanese researchers the opportunity to meet 
together informally (outside their individual corporate or University 
environment) and assists in the networking that is so prevalent in 
Japanese science.  

Nikhil continues...
Prof. Terada's dataflow laboratory at Osaka University is remarkable in 
the degree to which they collaborate with industry.  
                Professor Hiroaki Terada
                Department of Information Systems Engineering,
                Faculty of Engineering,
                Osaka University,
                Yamadaoka, 2-1 Suita, Osaka, Japan 565
                  Tel: +81(Japan)-6-877-5111, Fax: +81 6 875 0506
They are one of the notable exceptions to my prior image of research at 
Japanese universities: starved of funds from the education ministry and 
generally not very exciting.  Prof. Terada has close collaborations with 
Mitsubishi, Sharp, Matsushita and Sanyo.  They developed a TTL dataflow 
machine Q-p in 1983-86 (2-4 MOPS); they now have Q-pv, a multi-chip VLSI 
version (20 MFLOPS), and they are planning to integrate this further in 
Q-v1, a single-chip version (50 MFLOPS).  A unique aspect of their 
technology approach is that they have an asynchronous, self-timed design; 
they have consciously avoided clock-synchronous circuits.  

The architecture of all these machines is a ring, similar to the 
Manchester dataflow machine.  Like the ETL project, this project again 
seems weak in software, with the result that no significant applications 
are written, which in turn means that the hardware design is difficult to 
evaluate.  Prof. Nishikawa, a member of the group, is leading a project 
to develop AESOP, a program development environment for these dataflow 
machines.  
        Professor Hiroaki Nishikawa 
        Department of Information Systems Engineering 
        Faculty of Engineering
        Osaka University
        Yamadaoka, 2-1 Suita, Osaka, Japan 565
         Tel: +81 (6) 877 5111 ext. 5018,
         Fax: +81 (6) 875 0506 Telex: 5286-227 FEOUJ J 
         Email: nisikawa@oueln0.ouele.osaka-u.ac.jp 
It appears that he is aiming for some kind of a visual programming style, 
where one draws dataflow graphs on a screen and chooses a mapping of 
these graphs onto the physical rings of the dataflow machine.  
Personally, I am not very impressed with the visual programming languages 
that I have seen to date: they are too complicated and inflexible.  


7TH IEEE CONFERENCE ON DATA ENGINEERING, KOBE, APRIL 10-12, 1991

The conference has become very large, with at least 3 parallel sessions 
at all times and over 700 pages in the proceedings, so it was impossible 
for anyone to get a complete overview.  

Object Oriented Databases (OODBs) was the dominant topic.  There were
papers on:
- notions of consistency
- declarative (or associative access) query languages
- storage and indexing
- models of views
- models of time
- user interfaces
- etc.
Unfortunately, it is disappointing that there is still very little work 
on developing a simple and clear semantics for OODBs.  Whereas the 
relational model had a single, simple model that was agreed upon by a 
large community of researchers, today each OODB seems to come with its 
own, unique model, often described imprecisely or with arcane formalism.  
Consequently, there is very little basis available for objective 
comparisons of OODBs with each other or with relational DBs.  

There were several papers on parallel implementations, including:
- Parallel transitive closure and join algorithms
- Scheduling on shared-memory and shared-nothing machines
- Data distribution strategies
- dataflow implementation
- etc.
Most parallel implementations are on stock parallel hardware.  Parallel 
database machines per se seem to have fallen out of vogue---only one such 
machine was described (FDS-R2 at Univ. of Tokyo; Kitsuregawa et al.).  

Kahaner notes that Professor M. Kitsuregawa, from the U-Tokyo Institute 
of Industrial Science, has published several papers on SDC, the "Super 
Database Computer", for example [Kit91]. He also notes that the National 
Science Foundation (NSF), in cooperation with other agencies, has funded 
the Japanese Technology Evaluation Center (JTEC) at Loyola College in 
Maryland to assess the status and trends of Japanese research and 
development in selected technologies. In March 1991, a JTEC team headed 
by Professor Gio Wiederhold (Stanford University) 
[gio@earth.stanford.edu], visited Japan to evaluate Japanese database 
technology, and the team presented a workshop on their preliminary 
results 30 April 1991 at the NSF. A comprehensive report is currently in 
preparation.  

Genomic databases generated a lot of excitement--- the panel on this 
topic drew a huge audience.  Genomic databases will contain huge volumes 
of data with unique requirements: inaccurate information, incomplete 
information, retrieval using approximate matching and sophisticated 
inference.  Many people seem to view genomic databases as the new 
frontier and driving force in DB research, a beautiful application with 
lots of exciting research problems (and lots of funding?) for the DB 
community.  

Deductive databases (the marriage of logic programming languages and 
databases) seem to be generating less interest than they were some years 
ago.  There were a few papers on query optimization.  

The remaining papers reported steady, if unspectacular, progress on a 
variety of topics: 
- Distributed DBMSs (optimization, voting protocols)
- Concurrency control (in high contention DBs, parallel DBs)
- Indexing and query languages for temporal databases (time attributes, 
     versions) 
- Indexing and query languages for spatial databases (e.g. geographic maps)
- Incomplete information (formal models, approximate answers to queries)
- Heterogeneous databases (transaction protocols, serializability)
- Efficient post-failure restart algorithms
- Simultaneous optimization for multiple queries

REFERENCES

[Arv87] Two Fundamental Issues in Multiprocessing,
        Arvind and R.A.Iannucci,
        Proc. DFVLR - Conf. on Parallel Processing in Science and Engineering,
            Bonn-Bad Godesberg, W. Germany, June 25-29, 1987,
        Springer-Verlag LNCS 295

[Kit91] Multiple Processing Module Control on SDC, the Super Database 
        Computer, 
        S.Hirano, M.Harada, M.Nakamura, Y.Aiba, K.Suzuki, M.Kitsuregawa, 
        M.Takagi, and W.Yang,
        Proc Japan Soc Parallel Proc, Kobe Japan, May 14-16, 1991, pp 53-
        60.  

[Nik89] Can dataflow subsume von Neumann computing?
        R.S.Nikhil and Arvind
        Proc 16th Intl Symp on Computer Architecture, Jerusalem, Israel,
        May 29-31, 1989, pp 262-272.

[Nik91] *T: a Killer Micro for a Brave New World,
        R.S.Nikhil, G.M.Papadopoulos and Arvind
        CSG Memo 325, MIT Laboratory for Computer Science
        545 Technology Square, Cambridge, MA 02139, USA
        January 1991

[Sak90] An Architecture of a Dataflow Single Chip Processor,
        S. Sakai, Y. Yamaguchi, K. Hiraki and T. Yuba
        Proc. 16th Annual International Symposium on Computer Architecture,
            Jerusalem, Israel, May 28-June 1, 1989, pp 46-53

[Sak91] Architectural Design of a Parallel Supercomputer EM-5, 
        S. Sakai, Y. Kodama, Y. Yamaguchi
        Proc Japan Soc Parallel Proc, Kobe Japan, May 14-16, 1991, pp 
        149-156.

[Yam89] An Architectural Design of a Highly Parallel Dataflow Machine
        Y. Yamaguchi, S. Sakai, K. Hiraki, Y. Kodama and T. Yuba
        Proc. Information Processing 89, Aug 28-Sep 1, San Francisco,
        pp 1155-1160.

-------------------END OF REPORT-------------------------------------

