rick@cs.arizona.edu (Rick Schlichting) (11/06/90)
[Dr. David Kahaner is a numerical analyst visiting Japan for two years
under the auspices of the Office of Naval Research-Far East (ONRFE).
The following is the professional opinion of David Kahaner and in no
way has the blessing of the US Government or any agency of it. All
information is dated and of limited life time. This disclaimer should
be noted on ANY attribution.]
[Copies of previous reports written by Kahaner can be obtained from
host cs.arizona.edu using anonymous FTP.]
To: Distribution
From: David Kahaner ONRFE [kahaner@xroads.cc.u-tokyo.ac.jp]
      H.T. Kung CMU [ht.kung@cs.cmu.edu]
Re: Aspects of Parallel Computing Research in Japan---NEC & Fujitsu.
Date: 6 Nov 1990
ABSTRACT. Some aspects of parallel computing research in Japan are
analyzed, based on the authors' visits to a number of Japanese
universities and industrial laboratories in October 1990. This portion
of the report deals with supercomputing and parallel computing at NEC
and Fujitsu.
PART 2.
The following outline describes the topics that are discussed in the
various parts of this report.
PART 1 OUTLINE------------------------------------------------------------
INTRODUCTION
SUMMARY
RECOMMENDATIONS
PART 2 (this part) OUTLINE--------------------------------------------------
FUJITSU OVERVIEW
Company profile and computer R&D activities
VP2000 series supercomputer organization and performance
PARALLEL PROCESSING ACTIVITIES
SP (Logic Simulation Engine)
AP1000 (Cellular Array Processor)
RP (Routing Processor)
ATM (Asynchronous Transfer Mode) Switch
MISCELLANEOUS FUJITSU ACTIVITIES
Neurocomputing
HEMT
NEC
SX-3 series supercomputer organization and performance
Benchmark data for SX-3, VP2000, and Cray.
Comments
MISCELLANEOUS NEC PARALLEL PROCESSING ACTIVITIES
PART 3 OUTLINE------------------------------------------------------------
HITACHI CENTRAL RESEARCH LABORATORY
HDTV
PARALLEL AND VECTOR PROCESSING
Hyper crossbar parallel processor, H2P
Parallel Inference Machine, PIM/C
Josephson-Junctions
Molecular Dynamics
JAPAN ELECTRONICS SHOW, 1990
HDTV
Flat Panel Displays
MATSUSHITA ELECTRIC
Company profile and computer R&D activities
ADENA Parallel Processor
MISCELLANEOUS ACTIVITIES
HDTV
Comments about Japanese industry
PART 4 OUTLINE-----------------------------------------------------------
KYUSHU UNIVERSITY
Profile of Information Science Department
Reconfigurable Parallel Processor
Superscalar Processor
FIFO Vector Processor
Comments
ELECTROTECHNICAL LABORATORY
Sigma-1 Dataflow Computer and EM-4
Dataflow Comments
CODA Multiprocessor
NEW INFORMATION PROCESSING TECHNOLOGY
Summary
Comments
UNIVERSITY OF TSUKUBA
PAX
SANYO ELECTRIC
Company profile and computer R&D activities
HDTV
END OF OUTLINE----------------------------------------------------------
FUJITSU OVERVIEW.
Currently about a $16 billion US corporation (based on 158 Yen/$), with
sales and income growing about 10%/year. As with most Japanese
companies, Fujitsu includes many subsidiaries (Fujitsu Laboratories,
Fujitsu Business Systems, Fujitsu America, etc.) and affiliates, and
has about 115,000 employees, about 50,000 in Fujitsu proper, the
remainder in associated companies. R&D expenses are about 12% of sales
and have been increasing more rapidly than sales growth. Corporate
sales are divided as follows.
Computers 66%
Communications 16
Electronic devices 14
Other 4
The most important factor in sales growth was the rapid growth in
overseas (outside Japan) sales, now accounting for about one fourth of
the total.
The company states that major strategic objectives are to strengthen
activities in information management and to further globalize the
company. Recently they purchased 80% of British-based ICL
(International Computers Ltd). Global research and development,
including software development, is mentioned as a specific goal. The
company develops and markets a wide range of computers and related
peripherals such as disk subsystems, ranging from a 32-bit workstation
with built-in CD-ROM and secretary-friendly video and sound, the
FM-Towns (apparently available only in Japan), to a large scale
supercomputer, the VP2000 series, whose deliveries began in spring
1990. A vast range of semiconductor devices, memories, etc., and other
new technologies are sold outside the company and also used in Fujitsu
specific products. For example, Sun SPARC chips were originally
purchased directly from Fujitsu. The company is also very active in
important areas of switching and telecommunication technologies
related to HDTV, digital switching systems, etc. Fujitsu is also
researching high compression rate encoding for visual telephones and
TV conferencing, as well as encoding methods for HDTV and variable
rate encoding methods for future packet communications.
The main research arm of Fujitsu is Fujitsu Laboratories, a subsidiary
corporation that operates two laboratories, one in Kawasaki and the
other in Atsugi, both in suburban Tokyo. Total employment is about
1500. The Atsugi lab, established in 1983, is responsible for research
in areas of electron devices, electronic systems, and advanced
materials. The Kawasaki lab, established in the mid 1960s, is on the
grounds of some other Fujitsu facilities, so that the total working
population there is over 12,000. The Kawasaki lab concentrates on
information processing, communication, space, and personal systems.
The overall educational background of the laboratories is interesting.
Electronics 48%
Physics 19
Computer Science 10
Chemistry 10
Mechanical Engineering 5
All others 8
This is certainly one reason for the wealth of activities in hardware
relative to software. Half of the staff have Master's degrees; only
10% hold doctorates.
As mentioned above, Fujitsu is working hard to be a global
corporation. That means both R&D and manufacturing outside of Japan.
For example, Fujitsu signed a five year joint research agreement in
October 1989 with the Australian National University in Canberra.
Subjects include advanced computers, both large scale supercomputers
and more exotic parallel computers, and computer vision using the
visual mechanism of insects. Another global research project is with
the German software company Aris, to develop software for automatic
translation of Japanese technical materials and documents into German.
When complete, the system will contain a dictionary, syntax for
generating German, and appropriate development tools for both the
dictionary and the syntax. Various natural language processing and
voice recognition systems are also under study, as is a real-time
fingerprint sensor system using holography, and an on-line handwritten
input system claimed to be able to correctly recognize Kanji, Katakana
and Hiragana Japanese characters. Unfortunately we had no opportunity
to see any of these last projects.
Fujitsu computers are heavily used in the mainframe world. The
company's efforts in large scale supercomputers are interesting. More
than 100 orders have been received for computers in the VP2000 series.
The most powerful model, the VP2600, has a maximum performance of
about 5 gigaflops. According to Fujitsu, at least one VP2000 has been
installed at Kodak headquarters in Rochester, NY. What follows is a
brief summary of Fujitsu VP2000 series supercomputers. Fujitsu offers
four models in this series, as follows.
VP2100 /10, /20 (peak performance 0.5 GFLOPS)
VP2200 /10, /20 (peak performance 1.0 GFLOPS), /40 (peak 2.0 GFLOPS)
VP2400 /10, /20 (peak performance 2.0 GFLOPS), /40 (peak 5.0 GFLOPS)
VP2600 /10, /20 (peak performance 5.0 GFLOPS)
Models designated /10 have one scalar and one vector arithmetic unit.
Models designated /20 have two scalar and one vector arithmetic units.
Models designated /40 have four scalar and two vector arithmetic
units. The /10 and /20 systems are uniprocessors; the /40 is a
multiprocessor. The nomenclature is mildly confusing, as the
designation /x0 corresponds to the number of scalar rather than vector
units, even though the latter determine peak performance.
Fujitsu is deeply interested in multiprocessing; one indication has
been their MITI-sponsored research jointly with NEC and Hitachi,
called informally the HPP project, involving four VP2600s each
operating as a uniprocessor attached to a very large shared buffer
memory. Fujitsu claims that such a large multiprocessor was developed
mainly to demonstrate their success with room temperature HEMT devices
(see below) as the communications drivers between the computers and
memory. Nevertheless, using this, an NEC researcher was able to solve
a very large system of 32K linear equations in less than 11 hours. For
more details see Kahaner's report of 21 June 1990, "japgovt". Fujitsu
is probably experimenting with a /40 multiprocessor for the VP2600,
but has not released any public information about this. Without a /40
for the VP2600, Fujitsu's VP2000 series peak performance (however
unrelated to actual performance) will fall short of current
competition from NEC as well as new machines from Cray, and perhaps
others. In the meantime though, the VP2000 series comes in a variety
of colors, including Elegance Red, Future White, and Florence Green.
Peak performance of the /10 and /20 models in any line is the same, as
it is determined entirely by vector processing. Peak performance can
easily be computed once the machine cycle time and the maximum
possible number of simultaneous floating point operations are known.
For example, the VP2400/40 and VP2600 each have cycle times of 3.2
nanoseconds. To achieve the advertised 5.0 GFLOPS peak implies 16
simultaneous floating point operations. For the VP2400/40 this
requires eight per vector unit, while the VP2600/20 requires all
sixteen simultaneous operations from its single vector unit. Each of
Fujitsu's vector units is described as having two arithmetic pipes,
but in reality they are more complicated. Each pipe is capable of
simultaneously performing both an addition and a multiplication. In
addition, the pipes effectively deliver twice (VP2400/40) or four
times (VP2600/20) as much data. Thus each pipe on the VP2600/20 can
produce the results of four floating point additions and four floating
point multiplications per cycle. This is similar to the "superword"
concept on the ill-fated Cyber 205. Of course, if a calculation is
dyadic, that is, does not involve both a multiplication and an
addition, then the peak performance will be reduced by 50%.
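To make the peak-rate arithmetic above concrete, the following small
Fortran fragment (our own sketch, not Fujitsu code) evaluates peak
rate = simultaneous operations per cycle divided by cycle time, using
the VP2600 figures just quoted, including the 50% dyadic penalty.
C     A small sketch (ours, not Fujitsu's) of the peak-rate arithmetic:
C     peak rate = simultaneous operations per cycle / cycle time.
      PROGRAM PEAK
      REAL CYCLE, OPS, GF
C     VP2600: 3.2 nanosecond cycle, 16 simultaneous operations
      CYCLE = 3.2E-9
      OPS = 16.0
      GF = OPS / CYCLE / 1.0E9
      WRITE (*,*) 'VP2600 peak, GFLOPS:        ', GF
C     A dyadic stream uses only the add (or multiply) half of each
C     pipe, so the usable peak is halved.
      WRITE (*,*) 'VP2600 dyadic peak, GFLOPS: ', GF / 2.0
      END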
By studying the performance of VP2000 machines on typical job streams
it has been observed that when the scalar unit is 100% in use, the
vector unit is about 50% to 75% busy. Thus the addition of a second
scalar unit can significantly increase throughput, and this was
presumably Fujitsu's reason for adding it. However, for any single
user problem it might not be possible to keep the vector unit
constantly busy. Thus the most practical environment for such a setup
would be a computing center or other multi user job shop, where
several user jobs can be run simultaneously. Kyoto University, a
typical busy university computing center, will be getting a VP2600/10
soon. We asked why only one scalar processor. Although the university
made a very strong case for two scalar processors, the Ministry of
Education decided (on budgetary, or other, grounds) to support only
the one scalar processor system. However it is an easy field upgrade
to add the second scalar unit. The choice of a VP2600/10 rather than a
VP2400/40 was a matter of policy; Kyoto has always tried to purchase
the fastest machine available. It is also possible that they would
like to upgrade eventually to a multiprocessor 2600 when this is
available.
As is the case with most of today's vector supercomputers, data to and
from the vector arithmetic units need to pass through vector
registers. In the VP2600 these registers have a capacity of 128KB (64
elements times 256 registers times eight byte data) but can be
concatenated in various ways, for example as 2048 elements times 8
registers of eight byte data instead. Thus the organization of the
registers is very flexible. To get data between memory and the vector
registers Fujitsu provides only two load/store pipelines. This could
be a bottleneck, although the register flexibility may alleviate it to
a certain extent. Memory to register bandwidth has been criticized in
the VP2000 series, but at least one new benchmark, given below,
suggests that Fujitsu has been making efforts to deal with this. The
computation of interest is that of multiplying large matrices A=B*C,
each of which is 4096 by 4096, with real 64 bit floating point
components. The source program is written in 100% standard Fortran but
is organized to take advantage of the two pipe structure of the VP2000
architecture in a very clear way. The essential segment of the source
program consists of first zeroing the target array.
      DO 4000 J=1,4096
      DO 4000 I=1,2048
      A(I,J)=0.0
      A(I+2048,J)=0.0
 4000 CONTINUE
Then the actual multiplication is as follows.
      DO 5000 L=0,1
      DO 5000 J=1,4096
      DO 5000 K=1,4096,4
      DO 5000 II=1,2048
      I=II+2048*L
      A(I,J)=A(I,J)+B(I,K)*C(K,J)+B(I,K+1)*C(K+1,J)
     *       +B(I,K+2)*C(K+2,J)+B(I,K+3)*C(K+3,J)
 5000 CONTINUE
In this case the matrices are large enough that there is significant
memory to register to memory traffic. Nevertheless, Fujitsu's
FORT77/VP compiler is able to vectorize this effectively and generate
4.8 GFLOPS, 96% of peak performance.
One comment is worth making here. At the InfoJapan 90 meeting a
lecture was presented by Nobuo Uchida, from the Mainframe division of
Fujitsu, on the architecture of the VP2000 series computers. We found
it particularly interesting that his paper made no mention of the /40
series in the VP2000 lineup. The English product announcement about
the /40 had been distributed shortly before the meeting, and the
Japanese announcement was available weeks before that. Because the /40
is a multiprocessor, it represents a most important addition to their
product line.
The characteristics and properties of new advanced computers are of
real interest to the research community, especially those who travel
long distances to hear about them. Perhaps there was a manuscript
revision that we did not notice. Nevertheless, it was disappointing
that this new system was not included in his discussion. Perhaps it is
related to Fujitsu's silence about a VP2600 multiprocessor.
FUJITSU'S ACTIVITIES IN PARALLEL PROCESSING.
In our recent visit to Fujitsu Laboratories, we visited the following
three parallel processing projects.
(1) SP (Logic Simulation Engine). This is a special purpose 64
processor event driven parallel computer designed to test the logic
design of VLSI chips before they are built. It is claimed that it has
larger capacity than any other simulator and that simulation times are
about 30 times faster than using Fujitsu's 780 mainframe. Testing a
1M-gate chip takes about 4 hours on the SP, and this is 1000 times
faster than the 780. The SP is implemented in TTL, with gate arrays
for the ECC implementation. (Fujitsu can currently build 200K gate,
331-pin arrays.) Ten SP machines have been built, and two are in use
by Amdahl in the U.S. The others are for internal use. Fujitsu claims
that, partly due to its use of event driven simulation, SP is 100
times faster than the IBM Yorktown Simulation Engine, and feels that
the SP is a successful effort. (NEC Corp also has logic simulators,
Hal II and TDHal.) It seems that most computer companies in Japan have
developed their own special purpose parallel engines for logic
simulation for their internal use.
(2) AP1000, renamed from the older CAP (Cellular Array Processor).
This is composed of up to 1024 cells or processors. Each cell is
composed of a SPARC chip (for ease of software development), a Weitek
floating point unit and gate array router running at 25MHz, and 16MB
of memory. Cells can communicate using wormhole routing in a two
dimensional mesh with 25MB/sec channels. The standard structured
buffer pool technique is used to avoid deadlocks. The network also
supports row and column broadcasting. The router and SPARC connection
is 40 MBytes/sec. Since the connection is also shared by the CPU
cache, the actual available bandwidth is still under evaluation. In
addition, a special frame buffer can read out from each cell so that
image data can be partitioned among cells efficiently. Maximum
performance is 12.5MFLOPS/cell, and 12.8GFLOPS for a fully configured
1024 cell system. The AP1000 has good (but not spectacular)
communication and good numerical performance potential. Fujitsu
expects that it will typically be connected to a Sun-4 as a host via a
VME bus. This project has been going on for a number of years under
the old name CAP. (CAP is also the name of the Cellular Array
Processor developed by Mitsubishi Electric for satellite image
processing. As far as we know there is no relation between these
projects.) A team of about 10 people has been at work on the AP1000
for two years. The new AP1000 system is much more powerful, primarily
because of the use of SPARC chips and Weitek floating-point chips. In
contrast, the old system used Intel 80186 chips. Present plans are to
begin production this fall with installation of 7 or 8 machines in
spring of 1991. Of these, most are to be 64-cell systems and one is to
be a 512-cell or 1024-cell system. (A 1,024-cell system is scheduled
to be built in April 1991.) Currently a 16-cell system is running.
The 64-cell system, with about 800 MFLOPS peak performance, should
cost the company about $300K U.S. We were shown a straightforward
ray-tracing example, which is a perfect candidate for data
parallelism. The system currently has a home-made run-time system, and
no parallelizing compiler for either C or Fortran. We were told that
in addition to scientific computing, visualization, and CAD, one
potential application was design rule checking, but in that case it
isn't clear why floating point is necessary. The Australian National
University will get a 128-node AP1000 system and will help with
software development and evaluation. (Contact: Prof. M. McRobbie
[mam@arp.anu.edu.au]). As with the earlier CAP project, Fujitsu has a
nice color sales brochure about the AP1000, but this is still
considered an experimental machine. Probably its most important uses
will be internal to Fujitsu, similar to the SP model. We feel that the
project is probably a few years behind similar work at leading
research places in the U.S., primarily because of the differences in
software and interprocessor communications capabilities. Two contacts
for this project are given below.
Mitsui Ishii [mishi@flab.fujitsu.co.jp]
Hiroyuki Sato [hsat@flab.fujitsu.co.jp]
Fujitsu Laboratories
1015 Kamikodanaka
Nakahara-ku
Kawasaki 211, Japan
Tel: (044) 777-1111, -2327
(3) RP (Routing Processor). This is a special-purpose SIMD machine to
implement maze routing. A performance goal is to route large (e.g.,
100K-gate) gate arrays in approximately one hour. To implement the
machine, bit-serial PEs (Processing Elements) are used. A 4K-PE system
is operational. We saw a successful demonstration of the system doing
a difficult switch box routing. Since in maze routing only PEs on the
wave front are active at a given time, the system will typically
multiplex four "logical" PEs onto each "physical" PE to ensure
efficient utilization of physical PEs (the sketch below illustrates
the wavefront idea). Approximately 5 people have been working on the
RP project for two years. They are currently building a 16K-PE RP. A
challenge of using special purpose CAD engines such as the RP is their
graceful integration with the rest of the CAD system. Also, it is not
clear how the RP can take advantage of hierarchical information
available in a design. Fujitsu researchers are looking at these
issues.
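For readers unfamiliar with maze routing, the following Fortran
fragment is a minimal, sequential sketch of the classic Lee-style
wavefront expansion that machines like the RP parallelize. It is our
own illustration, not Fujitsu code, and the grid encoding is an
assumption; an engine such as the RP would assign one logical PE per
grid cell and expand every frontier cell in a single step, and the
backtrace that actually extracts the wire path is omitted.
C     Our own sketch of Lee-style maze routing on an N x N grid, not
C     Fujitsu code.  Encoding (assumed): -1 = blocked, 0 = free,
C     K > 0 = cell first reached on wavefront K.
      SUBROUTINE WAVE(GRID, N, IS, JS, IT, JT, FOUND)
      INTEGER N, GRID(N,N), IS, JS, IT, JT, I, J, K
      LOGICAL FOUND, GREW
      GRID(IS,JS) = 1
      FOUND = .FALSE.
      DO 30 K = 1, N*N
         GREW = .FALSE.
C        Only cells on the current front K do any work; on a SIMD
C        engine like the RP these are the only busy PEs, which is
C        why logical PEs are multiplexed onto physical ones.
         DO 20 J = 1, N
            DO 10 I = 1, N
               IF (GRID(I,J) .NE. K) GOTO 10
               IF (I.GT.1) THEN
                  IF (GRID(I-1,J).EQ.0) GRID(I-1,J) = K+1
               ENDIF
               IF (I.LT.N) THEN
                  IF (GRID(I+1,J).EQ.0) GRID(I+1,J) = K+1
               ENDIF
               IF (J.GT.1) THEN
                  IF (GRID(I,J-1).EQ.0) GRID(I,J-1) = K+1
               ENDIF
               IF (J.LT.N) THEN
                  IF (GRID(I,J+1).EQ.0) GRID(I,J+1) = K+1
               ENDIF
               GREW = .TRUE.
 10         CONTINUE
 20      CONTINUE
C        Stop when the target is labeled, or the front dies out.
         IF (GRID(IT,JT) .GT. 0) THEN
            FOUND = .TRUE.
            RETURN
         ENDIF
         IF (.NOT. GREW) RETURN
 30   CONTINUE
      RETURN
      END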
ATM Switch. In addition to the three parallel processing projects
described above, we also visited a major project on the development of
an ATM (Asynchronous Transfer Mode) switch. The basic idea is that
data is divided into cells, which are 53-byte packets, and then
transmitted along the transmission path without synchronization. The
application area here is ISDN and HDTV. Such a switching system will
be able to handle multi-media communication of voice, data, video,
etc. Fujitsu has been working on this project for several years, and
claims to have prototyped the world's first ATM switch. Built out of a
special IC using a BI-CMOS RAM and logic gate array, the current
system is a 16 by 16 switch, of three stages with two 8 by 8 crossbar
switches per stage. Each port is 78 MHz and 16 bits wide, allowing for
1.2 Gbits/second per port. The 16 by 16 switch, housed in one cabinet,
can therefore handle 128 channels of 150 Mbits/second each. There is a
128-cell buffer at each output port of every crossbar. Switch routing
is based on the destination tag, corresponding to the virtual circuit
identifier (VCI) number. Cell sequencing is maintained, but cells may
be lost if there is congestion. Presently, two 16 by 16 prototypes
have been built and are being used to evaluate cell loss
characteristics. Eventually a SONET interface will be installed, but
this is not supported yet. Instead a proprietary interface is being
used during the testing phase of the project.
In parallel processing the company's research effort emphasizes
special-purpose machines such as SP and RP more than we would expect
from a U.S. company. The best research projects, such as the ATM
switch, SP, and RP, are completely driven by development needs. The
strongest efforts seem to be related to switching and CAD related
issues. Projects with more of a basic research flavor, such as the
AP1000, do not seem to be as advanced compared to work in the U.S.
MISCELLANEOUS FUJITSU COMPUTING ACTIVITIES.
Neurocomputing. The usual metric here is the number of changes to the
weight matrix that are possible each second. The earliest research in
neurocomputing used traditional computers to simulate the architecture
of a neural network. The next step is to implement some aspects of the
network in hardware. By using special purpose digital signal processor
chips Fujitsu has demonstrated more than 500 million connection
changes per second. A longer range goal is to use biological elements
as part of the architecture, but we have seen no substantial results
yet. Associated with neuro computers are various forms of inference
engines that are often implemented with robot applications in mind.
Fujitsu has also been working in these areas with particular emphasis
on robot vision. This again relies on special purpose hardware. They
have also used fuzzy logic to study driverless vehicles and obstacle
avoidance. They have developed the Idaten color image processing
system, which can be used to distinguish objects moving at different
speeds, and so, for example, to do real time scanning of a runner,
determine speed and stride, and then estimate the time to the finish
line. This particular research has applications in many other areas
and should be followed. Another neural net research project has been
joint with Nikko Securities, to investigate how well neural nets can
predict the buy/sell times for stock transactions and to rate
convertible bonds by looking at various financial indices.
Takashi Kimoto
Computer Based Systems Lab
Fujitsu Laboratories, Kawasaki
1015 Kamikodanaka, Nakahara-ku,
Kawasaki 211, Japan
HEMT and other electronic devices. Fujitsu's work here includes an
8-bit Josephson digital signal processor and room temperature HEMTs
(High Electron Mobility Transistors). Fujitsu developed the HEMT in
1980. At liquid nitrogen temperatures, -196C, electrons move about 200
times as fast as they do in silicon. As part of the government
sponsored "high speed computing" project Fujitsu has now developed a
4K-bit static RAM that operates at room temperature with a 500
picosecond clock (the fastest memory operations yet reported), and a
4.1K-gate gate array. Further developments have resulted in a chip
with 3335 HEMTs with 490ps data propagation time. Fujitsu claims that
they will use this in a new version of a supercomputer they will soon
build. Presently, several prototype system components at the LSI level
have been built. These are a 1.1K-gate bus driver, a 3.3K-gate random
number generator (1.6GHz), and an 8-bit digital-to-analog converter
(1.2GHz). This technology, which is almost completely proprietary to
Fujitsu, may be significantly useful in future computing systems.
However, since the HPP project is over, it will not be easy for
Fujitsu to build these kinds of experimental supercomputers unless
they can be supported by some new government programs.
Our overall host for this visit was
Mr. Shigeru Sato
Board Director
Fujitsu Laboratories
1015 Kamikodanaka
Nakahara-ku
Kawasaki 211, Japan
Tel: (044) 777-1111
Mr. Sato spent many years in one of Fujitsu's development "works"
before moving to the laboratory. We were impressed with his basic
grasp of technical issues and understanding of the role that research
plays in the development cycle. We asked him if the efforts of other
Japanese companies (such as NEC) to establish research laboratories
outside of Japan had any parallel at Fujitsu. He explained that
Fujitsu had several active research collaborations, including at the
Australian National University, mentioned above, and it was also
looking into the possibility of having closer contacts with some U.S.
universities such as Carnegie Mellon, in Pittsburgh. Although he was
remarkably frank with us, we didn't have time to discuss strategic
issues with Sato. We did ask about the success of technology transfer,
and he suggested that one reason for its success is that researchers
define the research project with development groups before the project
actually begins.
Two days after this initial visit, Kung with T. W. Kang (General
Manager, Systems Group of Intel Japan) went back to visit the Fujitsu
Laboratories again for a meeting with their researchers. The purpose
of the meeting was to discuss application areas for iWarp-like
distributed memory parallel machines. We identified several potential
areas and had some lively discussions. It was generally felt that some
CAD areas and neural net learning can make the best use of parallel
machines. In the CAD area, we predicted that the expected speed up
ratio due to parallel processing will be 100,000 for logic simulation,
1,000 for test-pattern generation and for placement and routing, and
100 for design rule check and circuit simulation. The fruitful
discussion meeting was organized by:
Fumiyasu Hirose, Senior Researcher
Artificial Intelligence Laboratory
Fujitsu Laboratories LTD.
1015, Kamikodanaka
Nakahara-Ku
Kawasaki 211
Tel: (044) 754-2663 FAX: (044) 754-2580
Email: hirose@yugao.stars.flab.fujitsu.co.jp
NEC.
Kahaner visited this factory in March 90 and reported on the SX-3 at
that time. At that time the only running system had one processor.
Now, several one processor machines are being tested prior to shipment
and a two processor system has been set up and is being debugged.
Chief designer Watanabe stated that a one processor system, depending
upon peripheral options, would cost in the neighborhood of $10 million
U.S. He claimed that the 4 processor system will be up in a few
months, and we have heard estimates that it will cost roughly $25
million. Peak performance of a uniprocessor system is 5.5 GFLOPS,
based on a cycle time of 2.9 nanoseconds and 16 simultaneous
operations (16/2.9 = 5.5). The vector unit in such a system consists
of one, two, or four sets of vector pipelines. Each vector pipeline
set consists of two add/shift and two multiply/logical functional
pipelines. Each of the functional pipelines can be operated
simultaneously; thus the arithmetic processor in a uniprocessor system
with four vector pipeline sets can execute up to 16 floating point
operations per machine cycle. To get near peak performance all 16
pipes must be kept busy.
Data are fed to and from the arithmetic pipes through vector
registers, with a maximum capacity of 144KB. It is unlikely that an
SX-3 system would be purchased without all four pipe sets in each
processor. The four processor system is thus capable of 22 GFLOPS
peak, although this assumes that all the data can be kept in the
vector registers. To the extent that data must be brought from main
memory to the registers, performance may degrade. The bandwidth
between memory and the registers depends on the memory hardware
technology, and on how the data is arranged in the memory banks, but
serious applications must keep data in registers to get good
performance. Further, 22 GFLOPS requires 64 simultaneous operations,
and this will mean that different operations have to occur
simultaneously. Also, unless the user program can be divided up into
simultaneous, independent tasks that use the same data in the vector
registers, arrays will have to be quite long to absorb the startup
penalty of being parceled out to several processors. The most
effective environment for such multiprocessors is a busy multiuser
computer center, similar to that for other large multiprocessors. Most
computer centers will charge a penalty for single users who want to
grab all four processors. Yoshihara also discussed some aspects of
this in benchmark calculations earlier this year; see Kahaner's
distribution of 1 May 1990, "yosh".
At least three or four uniprocessor systems have been sold, in Europe.
We were not told about sales of two or four processor systems. Users
can write Fortran without any special directives. NEC provides an
automatic parallelizing and vectorizing compiler option. We had no
opportunity to test this. Watanabe showed us results of running 100 by
100 LINPACK (all Fortran) giving performance on the SX-3 Model 13
(uniprocessor) and several other supercomputers as follows. He also
showed some corresponding figures for a 1000 by 1000 linear system and
for 1024 by 1024 matrix multiplication, given below. The last two
columns correspond to what Dongarra calls "best effort": there are no
restrictions on the method used or its implementation. Matrix
multiplication runs almost at theoretical peak speed. The large linear
system runs at slightly less than 70% of peak, while on the Cray the
same calculation runs at just above 80%. The differences are probably
associated with bandwidth from memory to the vector registers.
Nevertheless, at 3.8 GFLOPS the SX-3 is 80% faster than the Cray.
                           Ax=b        Ax=b         A=B*C
                           LINPACK     Best Effort  Matrix Mult
                           100 x 100   1000 x 1000  1024 x 1024
                           Fortran
SX-3/14                    216 MFLOPS  3.8 GFLOPS   5.1 GFLOPS
Fujitsu VP2600             147         2.9          4.8 (4096 x 4096)
Hitachi S-820/80           107
Cray Y-MP8 (8 processors)  275         2.1
Cray Y-MP1 (1 processor)    90
Cray X-MP4                             0.8
(Note: the VP2600 model was not specified for the Ax=b figures, and
was the /10 for A=B*C, but both the 2600/10 and /20 have the same peak
performance, 5 GFLOPS.) To the best of our knowledge, the figures for
the NEC and Fujitsu machines are new. We asked Watanabe if the SX-3
four processor performance would scale up, and he only exclaimed "God
knows".
NEC's chip technology is very good. Using ECL, they have crammed
20,000 gates with 70 picosecond switching time onto one chip. We think
that this is better than in the U.S. A 1,200-pin multi-chip package
can hold 100 such chips and dissipate 3 kilowatts. Packaging, carrier,
and cooling technology is about as good as in the U.S.
NEC claims that they have taken extra care to design in error testing
capability and that about 30% of their chip area is associated with
diagnostic functions. (This is certainly different from some U.S.
manufacturers.) The memory system uses 20ns 256Kbit SRAMs. A memory
card can hold 32 MBytes; thus a memory cabinet with 32 memory cards
has 1 GByte.
Two peripherals are worth noting. NEC makes a cartridge tape unit (IBM
compatible tapes), fully automated, with 1.2 terabyte capacity. NEC
also makes a disk array made of eight byte-interleaved disks. Used as
a single disk drive, the disk array has a 5.5 gigabyte capacity. The
burst transfer rate is 19.6 MBytes/sec, whereas the sustained transfer
rate is 15.7 MBytes/sec.
NEC has begun publication of a newsletter about the SX-3, SX World.
Interested readers can obtain a copy by writing NEC, 1st Product
Planning Department, EDP Product Planning Division, 7-1 Shiba 5-chome,
Minato-ku, Tokyo 108-01, Japan. In it their view of supercomputing is
stated explicitly: "the actual performance of a supercomputer is
determined by its scalar performance...NEC's approach to supercomputer
architecture is clear. Our first priority is to provide high-speed
single processor systems which have vector processing functions and
are driven by the fastest technologies, while giving due consideration
to ease of programming and ease of use; we also seek to provide shared
memory multiprocessor systems to further improve performance."
The SX-3 looks like an exciting machine that is on a par with the best
currently available U.S. products. There is a new U.S. supercomputer
from Cray Research nearly ready to be released, as well as perhaps
models from Cray Computer Corporation and others, but we have no
concrete information about their performance. In its four processor
version, the SX-3 might be the fastest large scale supercomputer, but
this will be entirely dependent on the application and the skill of
the compiler writers. Fujii and Tamura ("Capability of Current
Supercomputers for Computational Fluid Dynamics", Inst of Space and
Astronautical Sci, Yoshinodai 3-1-3, Sagamihara, Kanagawa, 229 Japan)
note that "Basically the speed of the computations simply depend on
when the machines were introduced into the market. Newer machines show
better performance, and companies selling older machines are going to
introduce new machines."
NEC develops software expertise by in-house training; they have a
"college" for their employees. For example, Watanabe is in charge of
courses related to machine design. They also have a long history of
vector computing experience, as NEC mainframes have had vector pipes
for many years. They do not have experience in large scale
multiprocessors as far as we know, except through the HPP project,
which was never commercialized. To develop software, NEC relies on 30
or so of its subsidiaries in various parts of Japan, so software is
often developed in a distributed manner. Watanabe told us that NEC did
not have any plans to develop a smaller general purpose
multiprocessor, as they felt that the market would not support the
volume that would be required for profitability.
Watanabe has moved from the SX-3 factory to the corporate headquarters
as a strategic product planner. The latter is one of the largest
buildings in central Tokyo; it is shaped exactly like the U.S. space
shuttle except for a huge gaping hole in its center to reduce wind
loading. It is said to be the "world's smartest building."
Watanabe is an illustration of the remark made earlier about senior
research people moving into other corporate functions.
Dr. Tadashi Watanabe
Assistant General Manager
EDP Product Planning Division
NEC Corporation
7-1, Shiba 5-chome
Minato-ku, Tokyo 108-01
Tel: (03) 798-6830 (Direct), (03) 454-1111
Fax: (03) 798-6838
As far as innovative architectures are concerned, the SX-3 does not
seem to represent a substantial leap from state-of-the-art
supercomputers. Researchers in parallel computing are not excited by
shared memory machines, which they feel cannot scale up to make the
kind of quantum increases in computing speed that they are seeking.
But as an engine for solving complicated scientific and engineering
problems, a factor of two, or even a few percent improvement,
translates into real money and new science. What is significant is how
far NEC has come in relatively few years. Now NEC has state-of-the-art
capabilities in all aspects of supercomputers, except perhaps in some
software applications. They do not concede any area and make every
effort to build everything themselves. Customers seem quite loyal in
Japan. Software compatibility with existing systems, and personal
relationships between vendor and customer, are important here, perhaps
taking the edge off price differences or delivery dates.
It is difficult to accurately judge how the SX-3 will compare with the
new U.S. supercomputers that should be delivered within the next year,
but it is clear that it should be at least competitive with them. It
would be very useful for western researchers to have an opportunity to
study, test, and use this computer. We have not had any chance to run
on the SX-3, although potential customers have had a few of their
important programs benchmarked. One system, probably a two or four
processor version, will be installed in the NEC HNSX facility in
Houston. We do not know about access to that; however, in the past the
most impressive learning has occurred when supercomputers were "on
site". Real benchmarking can only occur when a computer is used day
in, day out, and all aspects of its capabilities, problems and
reliability are uncovered. If it isn't practical to get an SX-3 into a
major U.S. laboratory, we should consider the possibility of sending
computational scientists to Japan for several months or even a year,
in order to thoroughly evaluate the machine. NEC should be interested
in these efforts too.
MISCELLANEOUS NEC PARALLEL PROCESSING ACTIVITIES.
Other than the SX-3 series supercomputer, NEC has been involved in at
least four other parallel processing activities. These are: (1) FMP
(Fingerprint Machine Computer) with 28 processors, (2) VSP-4 (Video
Signal Processor) with 128 processors, (3) HAL (Parallel Processing
Logic Simulator) with 64 processors, and (4) CENJU (Parallel
Processing Circuit Simulator) with 64 processors. FMP is a commercial
product and about 180 FMP systems have been shipped worldwide. The VSP
effort has influenced the NEC Visualink-1000, which is commercially
available. HAL has been used in designing the NEC SX series and other
general purpose computers since 1985. CENJU is an experimental machine
being used for the design of DRAMs; Kahaner reported on CENJU in a 2
July 1990 report, "spice".
-----------END OF PART 2-------------------------------------------------
To: Distribution
From: David Kahaner ONRFE [kahaner@xroads.cc.u-tokyo.ac.jp]
H.T. Kung CMU [ht.kung@cs.cmu.edu]
Re: Aspects of Parallel Computing Research in Japan---Hitachi,
Matsushita, and Japan Electronics Show 1990.
Date: 6 Nov 1990
ABSTRACT. Some aspects of parallel computing research in Japan are
analyzed, based on the authors' visits to a number of Japanese universities
and industrial laboratories in October 1990. This portion of the report
deals with parallel computing at Hitachi and Matsushita, and some
observations about the Japan Electronics Show 1990.
PART 3.
HITACHI CENTRAL RESEARCH LABORATORY.
Kahaner has already written about Hitachi generally, and about some
aspects of the CRL activities, see 21 Sept 1990 article "hitachi", so
this report focuses only on those aspects of the visit that provided new
insights.
One important reason for our visit was for Kung to inquire about the
possibility of using the CMU-Intel iWarp parallel processing system in
HDTV applications. We had a meeting with Senior Chief Researcher
Fukinuki (Telephone: (423) 23-1111, x2009), who leads the technical
part of Hitachi's HDTV program. Fukinuki is well known in the West and
he showed us a paper from David Sarnoff Research Center in which his
ideas were made the basis of a main part of their program. From the
perspective of computation the needs are enormous. Requirements are for
10**5 to 10**10 integer operations per signal sample. At 28.6 MHz this
requires at least (28.6*10**6)*(10**5)=2860 GigaOPS, and at 100 MHz at
least 10 TeraOPS. Three dimensional processing will be even more
demanding. Because of this high processing rate ASICs will be needed,
in Fukinuki's opinion. Programmable processors such as iWarp or DSP
(digital signal processor) will be too slow and too expensive. Also,
division is not needed, and all multiplications are by fixed constants
(associated with filter coefficients) so special ROM for table lookup
can be used. Hitachi is planning to build a special pipeline of ASIC
blocks with 28.6 MBytes/sec or 100 MBytes/sec bandwidth between
connecting blocks. On the other hand, Fukinuki told us that he felt a
fast general purpose parallel processor such as iWarp might be useful in
those areas where real-time processing was not needed or processing rate
needn't be that high, for example document preparation (e.g., image
processing), robotics, or video phone (only requiring 64Kbit/second).
Overall, this was a sobering meeting with a Japanese scientist who had
clearly mastered his subject.
HITACHI PARALLEL PROCESSING ACTIVITIES.
We had very brief opportunities to visit three parallel processing
projects. Below, we sketch the main ideas we were able to cull from our
short visits.
(1) Hyper crossbar parallel processor, H2P.
This is a MIMD architecture with an unconventional interconnection
network. The processors are first of all thought of as lying on a
hypercube. The new ingredient is that all the processors on any plane
parallel to a coordinate axis, i.e., on a cube face, are connected by a
crossbar network. Thus only 2 "hops" are needed at most to connect any
two processors. Hitachi is clearly trying to exploit their outstanding
hardware capability in the design of the required crossbars and network
routers. At the moment this is pure research, a paper computer. They
have studied hyper crossbar structures that have the minimum number of
switches for given numbers of processors. Hitachi is planning to build
the crossbar and router chips within a year. H2P parallel systems of
more than 1024 processors are envisioned.
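To make the two-hop claim concrete, here is a small Fortran function
of our own (not Hitachi's design), modeling the network as a 3-D grid
of processors labeled (X,Y,Z) with one crossbar spanning every
axis-parallel plane: two processors that share a coordinate lie on a
common crossbar, and any other pair can route through an intermediate
node that shares a plane with each endpoint.
C     Our own model of the 2-hop property, not Hitachi's design: a
C     3-D grid with one crossbar per axis-parallel plane.  Nodes
C     sharing any coordinate sit on a common crossbar (1 hop); any
C     other pair can route via the node (X1,Y2,Z2) (2 hops).
      INTEGER FUNCTION HOPS(X1, Y1, Z1, X2, Y2, Z2)
      INTEGER X1, Y1, Z1, X2, Y2, Z2
      IF (X1.EQ.X2 .AND. Y1.EQ.Y2 .AND. Z1.EQ.Z2) THEN
         HOPS = 0
      ELSE IF (X1.EQ.X2 .OR. Y1.EQ.Y2 .OR. Z1.EQ.Z2) THEN
         HOPS = 1
      ELSE
         HOPS = 2
      ENDIF
      RETURN
      END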
(2) Parallel Inference Machine, PIM/C.
The "C" denotes their working language. This is a fairly conventional
architecture, except for hardware support for typing. It is designed in
the form of eight processors in a cluster, with one cluster fitting in a
standard rack. The processors are on a bus using conventional snoopy
caching. Currently there are two clusters built, and plans are that 32
clusters will be complete within a year. An interesting aspect of
the PIM/C architecture is that it has hardware support for load
balancing. Another interesting point is that it uses a very impressive
Hitachi mainframe pin array package, with 50 mil spacing. The contact is
Dr. Mamoru Sugie
Central Research Lab.
Hitachi, Ltd.
Higashi-Koigakubo, Kokubunji,
Tokyo 185, Japan
Tel: +81-423-23-1111 x3810
Email: sugie%crl.hitachi.co.jp
(3) Josephson-Junctions.
This work has been going on for about ten years, as part of the MITI HPP
project, which ended this year. Activities involved not only Hitachi,
but also Fujitsu and NEC. Currently about 10 people are working on J-J.
With the MITI project over Hitachi will certainly scale back their
efforts, but will not terminate them. They have prototyped a 1 billion
instructions per second superconducting microprocessor (with about 2K
gates, in a 7mm by 7mm chip), and a 1KB RAM chip (also 7mm by 7mm).
Currently switching time is about 10ps. Also they have discovered a new
transistor design, which they believe previous efforts, such as
IBM's abandoned J-J effort, did not have. The researchers told us that
practical application of this technology may be 10 to 20 years away.
(4) Molecular Dynamics
This is really an application of high speed computing needs. Dr. Shigeo
Ihara in Hitachi's 7th Dept (ULSI Research Center) showed us his work on
modeling the surface of Si(100). His research is rather different from
conventional molecular dynamics models as it emphasizes the quantum
mechanical model for computing the forces. Thus computing the forces is
very expensive. His integration scheme is conventional, even somewhat
old-fashioned, but the force calculation is the key time sink here. He
claims that his model requires about 100 hours on a Hitachi S-810,
which for this problem is about three times as fast as on a Cray 1. Even
then he is only able to move around about 100 particles. His results
indicate the existence of an interstitial dimer, not predicted before,
recessed from the surface, rather than vacancies as has traditionally
been believed. However, he also acknowledged that the integration step size
may still be too large and the results might be contaminated with
numerical error. When we asked if it was possible to run this on a
faster computer he explained that Hitachi would soon announce a faster
supercomputer.
JAPAN ELECTRONICS SHOW, 1990.
We spent one free afternoon here and so were only able to get a general
impression. This is surely one of the world's largest such shows, with
tens of thousands of square meters of exhibits in nine large buildings.
Two hangar sized buildings were associated with consumer electronics,
and all the rest were displays of very specialized parts and component
technology. Not surprisingly, the consumer electronics buildings were
mildly disappointing after seeing vast sections of Tokyo loaded with
electronics stores. Also the exhibitors were not interested in
displaying all their wares. This particular show was very clearly
focused on HDTV, High Definition TeleVision, or HVTV (High Vision TV) as
it is known here, with several hundred systems set up for display. So
many different companies were exhibiting that we wondered why there was
so much emphasis when there are almost no commercial systems,
videotapes, or television broadcasts available, and no likelihood of
any for at least a few years. Serious broadcasting doesn't seem to be
any closer than
1995, and some of the people we spoke to suggested that year 2000 was
more likely for widespread household use. Further the price of current
HDTV systems (to the extent that it can be estimated) is very high, and
unless it can be knocked down by a factor of ten there will be little
consumer interest. But, as all who have seen demonstrations will attest,
the systems are visually impressive and might even now be of interest in
some specialized commercial situations where exceptional graphics will be
important. We can also imagine that bars will buy HDTV for sports fans.
MITI created a Hi-Vision Promotion Center (HVC) in 1988. This is a
corporation whose members include all major HDTV manufacturers in Japan.
According to their literature, "The Center promotes wider use of
Hi-Vision technology in industry and by government organizations through
identification, research, and analysis of problems existing in such
public services as museums, medicine and education, and industrial areas
(including theaters and amusements)." Currently, the government
television network NHK is providing some HVTV time each week on an
experimental basis, and some advertisements for commercial systems are
beginning to appear in the papers.
For HDTV to be of practical interest for transmission of live
programs, or for storing HDTV pictures on either laser disks or
CD-ROMs, substantial research needs to be done. The main problem is
that there is simply too much data. A typical HDTV picture contains
about 1000 x 600 pixels in each of three colors, and frames come 30
times each second. Even with the best data compression algorithms now
available a huge amount of data needs to be processed.
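To put the quoted figures in perspective, a small Fortran fragment
(our own arithmetic, assuming one byte per color sample) shows the
uncompressed data rate:
C     Raw-rate arithmetic for the figures quoted above; one byte per
C     color sample is our assumption.
      PROGRAM RATE
      REAL PIXELS, BPS
      PIXELS = 1000.0 * 600.0
C     three colors, 30 frames per second
      BPS = PIXELS * 3.0 * 30.0
      WRITE (*,*) 'raw rate, MBytes/sec: ', BPS / 1.0E6
      END
At over 50 MBytes/second before any compression, the need for both
aggressive compression and fast hardware is clear.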
This looks like a natural application for specialized parallel
processing hardware and software, but to be
practical the hardware must be inexpensive enough to be placed in every
set. As we learned during our visits to Sanyo, Fujitsu, and Hitachi,
there is active research going on in both real time data compression,
and development of such specialized hardware. There is obviously a
connection between success in this technology and success in other
information processing activities. The Japanese companies are all more
or less at the same point because they have been meeting in committees
to establish standards, and this naturally leads to some sharing of
information. Each company seems to have some unique
characteristics, although in a global sense they all seem to be pretty
much alike. HDTV is another example of persistence in research; the
U.S. gave up years ago, although there is still active research in
Europe. The Japanese see the underlying technology here as a key one.
As a specific example of the application of these ideas, the importance
of ASIC, and the "last ten percent", we note that NEC has developed a
hardware data compression system for color photographs using cosine
transformations. The original color transmission system was developed
earlier in Israel and the U.K., but used software for the compression and
decompression cycle. NEC has now built this using special purpose
hardware. It is still much too slow for HDTV though.
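For reference, the kernel that such compression hardware accelerates
is the discrete cosine transform. The following Fortran routine is our
own direct 8-point DCT sketch, not NEC's implementation; real coders
apply the transform to 8 by 8 blocks in two dimensions and use fast
factorizations rather than this direct form.
C     Our own direct 8-point DCT sketch; production coders transform
C     8 x 8 blocks in two dimensions with fast factorizations.
      SUBROUTINE DCT8(X, Y)
      REAL X(8), Y(8), PI, S, C
      INTEGER K, N
      PI = 3.1415927
      DO 20 K = 0, 7
         S = 0.0
         DO 10 N = 0, 7
            S = S + X(N+1) * COS(PI * REAL((2*N+1)*K) / 16.0)
 10      CONTINUE
C        Orthonormalizing factor for the DC term
         C = 1.0
         IF (K .EQ. 0) C = 1.0 / SQRT(2.0)
         Y(K+1) = 0.5 * C * S
 20   CONTINUE
      RETURN
      END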
The Electronics show also gave us an opportunity to see not only many
examples of HDTV, but also packaging, keyboards, liquid crystal
projectors, and new flat panel displays. The Japanese have been
performing research in flat panel technology for decades, initially for
TV display, and well before there was even a glimmer of interest in light, laptop
computers. The flat panel screens have been adapted to TV use too, as
can be attested to by Japan Air Lines first class passengers who get a
three inch color set on a stalk attached to their armrest. As another
example, Japan Broadcasting Corp. (NHK) has developed a 33-inch plasma
color display panel with a thickness of 6 mm weighing about 6kg. The
flat displays also have obvious applications in vehicles, which merges
very nicely with the growth of automobile navigation systems.
MATSUSHITA ELECTRIC.
This company is not well known in the west, even though its product
names Panasonic, National, JVC, and Technics are. It is best known for its
outstanding manufacturing capabilities (they even do manufacturing for
IBM). Matsushita, founded in 1918, had sales last year of almost
$40Billion U.S., and employs about 200,000. Sales, income, and net
income have been growing at nearly 10 percent annually. The company's
main growth areas are in communication and industrial equipment. Audio
equipment, electronic components, semiconductors, batteries, and kitchen
equipment have also grown but not quite as fast. They have identified
six target areas for the future: information/communication, factory
automation, semiconductors, audiovisual, automotive electronics, and
housing/building products. This includes, specifically, HDTV, where they
admit a huge investment will be needed to keep pace with the rapidly
changing technology.
As with many other large Japanese companies Matsushita hopes to become
more global, and targets 1994 as the year when the ratio of
internationally produced goods to total overseas business will be 50%.
This year the first American President was appointed at Matsushita
Electric Corp of America. In a similar way they hope to localize their
R&D activities. One example is the Panasonic Advanced TV-Video Labs, in
New Jersey. Also, as with other companies Matshushita really means many
subsidiaries; in this case 117 companies in 38 countries.
Corporate sales breakdown is as follows.
Video equipment 27%
Communication and
industrial equipment 23
Audio equipment 9
Home appliances 13
Electronic components 13
Batteries & kitchen 5
Other 10
Kahaner reported on a visit to a National (Matsushita) factory, see
30 July 1990 file "flexible".
The company was one of the first to incorporate fuzzy logic into their
consumer products. Whatever one may think about the content of this
technology, the public is enthusiastic about buying products described
in this way. In addition to video cameras, Matsushita also markets fuzzy
washing machines, vacuum cleaners, refrigerators, and air conditioners.
The company owns a majority share in the Boulder, Colorado workstation
maker, Solbourne Computer, and has begun to market the workstation. On
our visit to Matsushita we asked why yet another Unix workstation, and
were told that the company feels its performance is better than
comparably priced Suns, and that it can be successful with this product
if it is priced very competitively.
Corporate R&D is divided into seven organizations and their
suborganizations. We have annotated those labs that have major computer
related research activities.
Kansai
Tokyo
Information Equipment Research Laboratory
Computer related activities include computer systems architecture,
operating systems, compilers, natural language processing, machine
translation, multimedia database systems, distributed parallel
processing knowledge based and expert systems, development tools,
image processing, communications systems such as optical, B-ISDN,
satellite, networks, data storage equipment and printing equipment.
Tokyo Information and Communications Development Center
Audio Video Research Center
Image Technology Research Laboratory
Acoustic Research Laboratory
Display Technology Research Laboratory
Magnetic Recording Research Laboratory
Materials and Devices Research Laboratory
Computer related activities include HDTV research, and basic
technology research in areas of video signal generation,
processing, recording, display, transmission, compression, as well
as display devices.
High Definition Television Development Center
Semiconductor Research Center
VLSI Technology Research Laboratory
VLSI Devices Research Laboratory
Opto-Electronics Research Laboratory
Living Systems Research Center
Living Environmental Systems Research Laboratory
Electrochemical Materials Research Laboratory
Lighting Research Laboratory
Central Research Laboratories
Computer related activities in the area of intelligent mechanisms,
human brain, natural systems, user friendly interfaces, multistage
reasoning, fuzzy logic, neural networks, multimedia and hypermedia.
Matsushita does not break out the number of employees engaged in
research, but R&D expenditures (currently about $2.5Billion U.S.) are
about 6% of sales and have been increasing at a higher rate.
Confusingly, subsidiary companies have laboratories of their own. For
example, Matsushita Electronics Corp has seven laboratories.
Our visit was to the Central Research Labs in Osaka and focused on
parallel computing and graphics applications. Frankly, we are not sure
how these research projects fit into the list of topics above, as they
seem more naturally associated with some other laboratories.
Unlike many other Japanese companies, which have prominent statues of
their founders, Matsushita Central Research Laboratory has statues of
great scientists from Japan and other countries, including Marconi,
Ohm, and Edison, in its courtyard. On the other hand, the dress code
requiring everyone to wear overalls was abolished only recently, and we
were also treated to the company marching song, played like Muzak,
during our visit. Some of the Central Research Lab buildings (such as
the Kadoma Building) are old and have an informal, cozy feeling, with
an atmosphere like many American labs. This was the site of the oldest
company lab, and several of the buildings date back to before WWII.
Our overall host for this visit was:
Mr. Teiji Nishizawa, Manager Computer Architecture
Kansai Information and Communications Research Laboratory
Matsushita Electric Industrial Company Ltd.
1006 Kadoma, Kadoma-shi
Osaka 571 Japan
Tel: (06) 908-1291, Fax: (06) 903-0370
Email: NISHIZ@SY2.ISI.MEI.CO.JP
ADENA (Alternating Direction Editing Nexus Array).
ADENA was developed by
Prof. Tatsuo Nogi
Department of Applied Mathematics and Physics
Kyoto University
Yoshida Honmatchi, Sakyo-ku
Kyoto 606 Japan
Tel: (075) 753-7531 x5871, Fax: (075) 761-2437
Email: NOGI@KUAMP.KYOTO-U.AC.JP
starting with work about ten years ago. The Matsushita group, while
extremely knowledgeable about ADENA's hardware and system software, was
less familiar with how it was to be used, and in fact we did not see
ADENA operating while we were visiting Matsushita.
Our hosts for this part of the Matsushita visit were
Dr. Hiroshi Kadota, Senior Staff Researcher
Matsushita Electric Industrial Company Ltd.
3-15 Yagumo-Nakamachi, Moriguchi
Osaka 570 Japan
Tel: (06) 909-1121, Fax: (06) 906-3851.
From our visit it was not clear exactly what Matsushita's basic
interest in the machine was: was it only to get their feet wet in the
parallel processing area, or to really develop and market a parallel
computer for solving problems? However, Kahaner has subsequently had an
opportunity to see Nogi's laboratory in Kyoto and discuss ADENA with
him in detail. Nogi claims that some Matsushita staff understand ADENA
very well, as they are involved in not only the hardware but also the
software development. Also, at least two of his former students are now
working on the project at Matsushita. From those visits and examination
of the technical papers the following summary is provided.
At least three versions of a parallel processing computer called ADENA
have been described by Nogi. The first was in 1980. Matsushita's version
appears to be similar to what Nogi calls ADENA II. Basically it is a 256
node processor array that is attached to a host workstation. The
current ADENAs are hosted by a Solbourne workstation via a VME bus.
Sixteen processors fit on one board. The interconnection network is
called a multi-layer crossbar, with maximum data transfer of
5.1Gbytes/second (each processor has about 20Mbytes/second input and
output capability). This network shares one feature of the hyper-
crossbar network described above (Hitachi) in that communication between
any two processors takes at most two hops, but in other ways is quite
different. Nogi calls it a "skew" network and we describe it in some
detail below.
Each ADENA processor is a custom RISC. ADENA is organized to support
numerical solution of partial differential equations using ADI
(Alternating Direction Implicit) iteration schemes. Peak performance is
2.5 GFLOPS (per-processor peak is 10 MFLOPS), but Nogi feels that about
1 GFLOPS is a more reasonable estimate. In fact, he has benchmarked
"real" computational fluid dynamics applications at a few hundred
MFLOPS. A special language, ADETRAN, resembling the Fortran extensions
used on other multiprocessors, has also been developed.
Solving partial differential equations in three space dimensions and
time has been one of the most important practical problems facing
computational scientists, and is a ferociously active research area.
Typically, integration is done at a discrete set of time points, with
the computation at each time requiring the solution of a three
dimensional potential equation, for which a prototype is
Uxx+Uyy+Uzz = f(x,y,z)
plus associated boundary conditions. The most common approach is to
replace the differential equations with differences resulting in a large
system of linear equations, whose solution u(i,j,k) on a mesh
approximates U at the points (ih,jh,kh), where h is the mesh spacing.
The matrix of the linear system is large and the equations are usually
solved by iteration. In 1955 Peaceman and Rachford described one method
to efficiently perform this iteration, which they called ADI, an
approach that is now known as "operator splitting". In simple ADI each
iteration
is composed of three sub-parts. First, one treats the Uyy+Uzz terms as
known and solves the discretized equations associated with Uxx="known",
then solves the Uyy="known", etc. (We are ignoring issues of
acceleration to simplify the description.) This approach is potentially
very efficient because at a fixed j and k solving for the numbers
u(1,j,k), u(2,j,k),..., u(n,j,k) is easy; the system is tridiagonal.
Furthermore, for different j and k the tridiagonal systems are
independent and can be solved in parallel as long as all the data are
available to each parallel solver. Solving Uyy="known" also requires the
solution of a set of independent tridiagonal systems, etc. Thus in a
parallel implementation each processor solves one tridiagonal system.
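To make the structure concrete, here is a minimal serial sketch in
Python of one x-direction ADI sweep (our own illustration, not ADENA
code; ADENA is programmed in ADETRAN, all names below are ours, and
boundary conditions and acceleration parameters are ignored, as in the
description above). On ADENA each (j,k) line would be assigned to its
own processor.

  import numpy as np

  def thomas(a, b, c, d):
      # Solve one tridiagonal system; a, b, c are the sub-, main, and
      # super-diagonals, d is the right-hand side.
      n = len(d)
      cp, dp = np.empty(n), np.empty(n)
      cp[0], dp[0] = c[0] / b[0], d[0] / b[0]
      for i in range(1, n):
          m = b[i] - a[i] * cp[i - 1]
          cp[i] = c[i] / m
          dp[i] = (d[i] - a[i] * dp[i - 1]) / m
      x = np.empty(n)
      x[-1] = dp[-1]
      for i in range(n - 2, -1, -1):
          x[i] = dp[i] - cp[i] * x[i + 1]
      return x

  def adi_x_sweep(u, f, h):
      # Treat the Uyy+Uzz terms as known and solve, for every fixed
      # (j,k), the tridiagonal system in x.  The systems are mutually
      # independent, which is the source of the parallelism.
      n = u.shape[0]
      a = np.ones(n); b = np.full(n, -2.0); c = np.ones(n)
      new = u.copy()
      for j in range(1, n - 1):
          for k in range(1, n - 1):
              rhs = (h * h * f[:, j, k]
                     - (u[:, j - 1, k] - 2 * u[:, j, k] + u[:, j + 1, k])
                     - (u[:, j, k - 1] - 2 * u[:, j, k] + u[:, j, k + 1]))
              new[:, j, k] = thomas(a, b, c, rhs)
      return new

  # Tiny driver on an 8^3 grid with f = 1 and a zero initial guess.
  n, h = 8, 1.0 / 7
  u1 = adi_x_sweep(np.zeros((n, n, n)), np.ones((n, n, n)), h)

The y and z sweeps are the same computation with the roles of the
indices rotated.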
The key point in any parallel implementation is that for efficient
computation it is necessary for data computed in one processor to be
quickly available to one or more of the others; thus between-processor
data communication is a crucial aspect of parallel processing. The
crossbar network is one solution to this problem; every processor is
connected directly to every other, allowing data to be transferred
between any two processors in one unit of time, or "hop". But large
crossbars are expensive and difficult to build; the number of
connections grows as the square of the number of processors. A thrust
in much of today's parallel processing research is to design a
compromise network, one that is not too costly but still efficient. For
example, a two dimensional (torus) mesh network of k**2 processors has
only about 2*k**2 connections, but communication between two processors
can take as many as k hops. Of course, a good algorithm will not require
data from far-away processors and thus can be efficient on compromise
networks. QCDPAX and the Fujitsu AP1000 use torus networks.
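The arithmetic behind this trade-off fits in a few lines (our own
illustration; the numbers are generic, not those of any particular
machine).

  # Wiring cost vs. worst-case distance for p = k*k processors.
  def crossbar_links(p):
      return p * (p - 1) // 2   # every pair of processors wired directly

  def torus_links(k):
      return 2 * k * k          # each node owns one +x and one +y link

  def torus_max_hops(k):
      return 2 * (k // 2)       # wraparound halves the worst-case path

  for k in (4, 16, 32):
      p = k * k
      print(p, crossbar_links(p), torus_links(k), torus_max_hops(k))

For 256 processors the crossbar needs 32,640 pairwise links against the
torus's 512; that is the quadratic cost referred to above, and the
price of avoiding it is a worst-case path of about k hops instead of
one.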
In the ADI example, processor (j,k) which solves the tridiagonal system
for fixed j and k, only needs data from adjacent processors, those
associated with j-1, j+1, k-1, and k+1. But when solving the next set of
equations Uyy="known" the same processor appears to need data from a
processor on the same row, but not adjacent. In the ADENA organization,
a set of data from processor (i,j) can be sent to processor (j,k). What
this means is that when Uyy="known" is to be solved, the user can
visualize that the network of processors has "flipped" so that only
adjacent processors need to be accessed.
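One way to picture this flipping in ordinary code (our own illustration
of the idea only; the skew network itself does this in hardware):

  import numpy as np

  # Simulated solution array: u[i, j, k] ~ U(ih, jh, kh).
  n = 4
  u = np.arange(n ** 3, dtype=float).reshape(n, n, n)

  x_view = u                      # x-sweep: solve along axis 0
  y_view = u.transpose(1, 2, 0)   # "flip": now axis 0 is the y index
  z_view = u.transpose(2, 0, 1)   # "flip" again: axis 0 is the z index

  # Locally these are just index permutations; on ADENA the skew
  # network performs the corresponding re-distribution of data between
  # processors, so each sweep again sees only "adjacent" neighbors.
  assert y_view[2, 3, 1] == u[1, 2, 3]
  assert z_view[3, 1, 2] == u[1, 2, 3]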
The actual network consists of 16 planes. On each plane there are 16
buses in the row direction and 16 buses in the column direction. A
32-word FIFO queue is provided at each cross point of these buses. At
the ends of the buses, Send/Receive Controller elements are provided
which can send/receive group data to/from the addressed FIFO and
automatically synchronize the operations.
The most exciting thing about ADENA is that it is not a hypothetical
machine; it is actually up and running. At Kyoto University, Kahaner
watched the system in action. While he and Nogi were working, several
other "real" users were also accessing the machine from elsewhere on the
campus. Nogi claims that some physicists and engineers in different
departments are doing useful work, primarily CFD. In fact, when Prof.
C.T. Kelley, (North Carolina State University) visited the laboratory a
month earlier he also noted that ADENA seemed to be in use and that "the
computer appeared to be closer to a production model than a prototype."
We also noted, as did Kelley, that the current bottleneck seems to be
communication with the host via the VME bus. Nogi's users are writing
programs in ADETRAN. We looked at some of these programs and they
appeared perfectly straightforward, much more so than the description
above would suggest. Nogi claims that the language is solid and that
there is even a user's manual, unfortunately only in Japanese. He has
already written several fundamental routines, not only ADI but FFT and
some others. He also claimed that it was easy to break up problems that
need more mesh cells than a 16 by 16 grid would provide, but we haven't
looked at that issue in detail.
An interesting question about ADENA is its possible commercial
availability in the near future. So far three copies of the machine
have been made. Matsushita recently made a product announcement, but
while we were visiting the lab, we were told that it was a mistake and
had been retracted. ADENA is the result of more than 10 years of
research, and the originator has solid intuition for numerical
techniques. We were told that the 256 processor (2.5 GFLOPS) ADENA will
be sold for about $1Million U.S. It is not really possible to evaluate
such a system without spending considerable time working with it on a
day to day basis, but given its current state we feel that it would be
very appropriate for an outside researcher to spend some time at Kyoto
trying ADENA. Nogi explained that such researchers are welcome (in small
numbers) but that he is very busy.
A number of English language reports are available about ADENA. Two of
the most recent and accessible are as follows.
"Processing Element Design for a Parallel Computer", K Kaneko, M
Nakajima, Y Nakakura, J Nishikawa, I Okabayashi, H Kadota, IEEE Micro,
August 1990, pp26-38.
"ADENA Computer III", T Nogi, Mem. Fac. Eng., Kyoto U, Vol 51, No. 2, 1989,
pp135-152.
MISCELLANEOUS MATSUSHITA ACTIVITIES.
Matsushita is also hard at work on HDTV. They showed us one lab filled
with HDTV related equipment. One experiment involves storing images on
an optical disk (12" diameter) and studying how fast these can be
brought up on the display. Currently they are able to store 600 images
per disk, about 20 seconds worth of imaging. Recording and replay rates
are 18 Mbits/second, much too slow for real time applications unless
sophisticated image compression techniques are used. Video and audio
are stored on the same disk, but at this point the key problems are
still quantity of data, and transmission rates.
We also looked at some interesting parallel computers devoted to
graphics. We saw photo-realistic image generation for office and home
furniture, and hardware and software systems to support real-time,
interactive use. The Matsushita graphics group has been doing
everything, from hardware to software to applications. This is typical
of the Japanese "don't give up any part of the technology" approach.
At dinner we had an opportunity for some frank discussions about
Japanese industrial practices, such as the status of women scientists,
and the willingness of Japanese companies to hire Western researchers.
Kung and Kahaner have both noticed the lack of women in research
environments, and their almost total exclusion from more senior
positions. This is related to Japanese custom, as many men still repeat
the adage that "most women like to get married and stay home".
Nevertheless, with a population predicted to peak in absolute terms
early next century, women represent a critical resource in Japanese
society. Both government and industry recognize this and have policies
encouraging women, but we will have to wait to see if any real changes
occur. Concerning Western researchers, it is also quite clear that
Japanese industry is very happy to employ and sponsor these people, at
least on a short term basis. When we asked, though, what chances a
Westerner had, even one who was willing to make a long term commitment
to a Japanese company, of working into a manager position, we were told
"that would be very difficult". Perhaps things are better at Japanese
subsidiaries in the west.
---------------END OF PART 3--------------------------------------------
PART 4.
KYUSHU UNIVERSITY.
Kyushu University is in the city of Fukuoka, the largest city on
Kyushu, Japan's southernmost main island. Kyushu is the closest part of
Japan to mainland Asia (Korea) and was the route for Kublai Khan's
unsuccessful invasion attempt in the 13th century; his fleet was
destroyed by a storm, dubbed the heavenly wind, or kamikaze. Fukuoka is
about an hour and a quarter by air from Tokyo. Our host for this visit
was
Prof. Shinji Tomita
Department of Information Systems
Interdisciplinary Graduate School of Engineering Sciences
Kyushu University
6-1 Kasuga-Koen, Kasuga-shi, Fukuoka 816 Japan
Tel: (92) 573-9611 Ext. 411
Email: tomita@is.kyushu-u.ac.jp
Professor Tomita was previously with Kyoto University, where Kung first
met him during a 1982 visit to Japan sponsored by IBM. Tomita explained
to us that the Information Science Department is composed of seven
labs: Information Recognition, Information Transmission, Information
Organization, Computational Linguistics, Information Models,
Information Retrieval, and Device Physics. These labs are also
associated with the engineering, math, and physics departments. (By
lab, we mean a professor and his associated research assistants and
students.) Tomita's lab is Information Organization. We spent most of
our time hearing about its activities, which are described briefly
below.
(1) Reconfigurable parallel processor. The effort here is to develop a
testbed for research on parallel computer architecture, operating
systems, and parallel programming languages. The hardware system
consists of processing elements (PEs) and a crossbar network that can
be reconfigured to fit the communication patterns of different
applications. Each PE, consisting of a SPARC processor, a home-made
MMU, and Weitek floating-point chips, is a complete processor
supporting virtual memory and a cache. Each PE has a peak performance
of 10 MIPS and 1.6 MFLOPS, and has 8 MBytes of local memory. The system
is intended to support all sorts of usage models, including tightly
coupled (shared memory) computation models and loosely coupled
(distributed memory) computation models. A thrust of this effort is
therefore in the operating systems area. They are planning to build a
128 by 128 crossbar network supporting both static and dynamic routing.
The system clock is a modest 16.6 MHz. The 128 by 128 crossbar will
need 32 15"x20" boards. Currently they have built a subset of the
crossbar. Hardware construction is limited by available funds, and the
128-processor system will take three years to complete. The following
reference gives more details.
"The Kyushu University Reconfigurable Parallel Processor - Design
Philosophy and Architecture", Info. Proc. 89, Proc. of IFIP 11th World
Computer Congress, San Francisco USA (Aug 1989), G.X. Ritter (ed),
Elsevier Science Publishers B.V. (North Holland), pp. 995-1000.
(2) Superscalar processor. In this kind of machine the instruction word
is often quite long and can contain several instructions that can be
decoded and executed in parallel by multiple instruction pipelines.
Performance gains in such a system depend crucially on the run-time
method of resolving data and control dependencies and on the
capabilities of the compiler; thus there is a symbiosis between
hardware and software support, and this research project is studying
both the architecture and compiler development. The hardware supports
four simultaneous instruction issues, eager execution of predicted
program branches, and shadow registers to recover when a branch
prediction is incorrect.
(3) A vector processor based on a streaming/FIFO architecture. The goal
of this project is to do something different from conventional vector
supercomputers, which use vector registers to feed the arithmetic
pipes. The researchers here propose to use a set of FIFOs instead of
vector registers.
Since the FIFOs can be made much larger than registers, the proposed
approach has some potential for sustaining much higher throughput in
the arithmetic pipes by using chaining. However, to make chaining easy,
virtual ALU and load/store pipelines are needed, so this is a project
involving very challenging issues, with real-world implications. The
researchers promise a "blueprint" of the architecture by April 1991.
(4) Special purpose machine for high-speed ray tracing. This project
studies the parallelism available at different processing levels of a
ray tracing computation.
Kyushu is one of a few Japanese universities where research is
addressing mainstream computer systems issues. In the U.S., there are
probably no more than ten universities able to do similar kinds of
research. Professor Tomita and his two junior project members all have
systems-building experience. One, Dr. Akira Fukuda, a graduate of Kyoto
University, worked at NTT; the other worked three years on mainframes
at Fujitsu. We believe that this kind of industrial expertise is
unusual at Japanese universities. The faculty members and Ph.D.
students we talked to seemed capable. However, these projects have
ambitious goals, and their resources are limited. The entire group,
including undergraduates, is about 20 people, and funds are also very
tight. It is hard to predict whether the four systems, or even any one
of them, will be sufficiently finished in time to support the planned
research. But even if their research goals are not completely
accomplished, they will have gained valuable experience for real
systems of the future.
We also had the opportunity to meet Professor Masaaki Shimasaki, who
has recently moved to Kyushu U. from the Computer Center of Kyoto
University.
Prof. Masaaki Shimasaki
Computer Center, Kyushu University
Fukuoka 812 Japan
Tel: (092) 641-1101, ext 2507, Fax: (092) 631-3196
Email: simasaki@sun4.cc.kyushu-u.ac.jp
In the past Professor Shimasaki worked on finite element methods for
various kinds of mixed boundary value problems. More recently he has
been studying the performance analysis of vector supercomputers and the
techniques used in vectorizing and parallelizing compilers. In
particular he has applied Hockney's model to the NEC SX-2 and Fujitsu
Facom VP-400 supercomputers. (Hockney proposes that the total time t
for a vector operation of length n can be estimated by
t = (n + nhalf)/rinf, where rinf is the peak speed and nhalf is the
vector length at which half the maximum speed is obtained.) Shimasaki's
results match observed data extremely well. He is going to apply this
technique to newer systems, and we will be anxious to see the results.
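A few lines of code make the model concrete (the parameter values below
are illustrative, not Shimasaki's measurements):

  def hockney_time(n, r_inf, n_half):
      # Estimated time for a length-n vector operation.
      return (n + n_half) / r_inf

  def effective_rate(n, r_inf, n_half):
      return n / hockney_time(n, r_inf, n_half)

  r_inf, n_half = 1000.0, 200.0   # MFLOPS and elements, made-up values
  for n in (50, 200, 1000, 10000):
      print(n, round(effective_rate(n, r_inf, n_half)))

The printout shows the familiar behavior: the rate is exactly half of
peak at n = nhalf and approaches rinf only for vectors much longer than
nhalf.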
ELECTROTECHNICAL LABORATORY.
Kahaner wrote about ETL earlier (see 2 July 1990 file "etl"), so here
we summarize only our latest impressions, based on Kung's recent visit.
The main interest in this visit was the Sigma-1 dataflow computer and
its follow-on, the EM-4. To review, Sigma-1 now has an operational
128-PE system, in 32 clusters each composed of 4 processors. A single
processor can compute at 3.3 MFLOPS (32 bit arithmetic) and 5 MIPS.
Each processor requires two boards, one for the processor and one for
memory. Connections between processors and between clusters are each
100 MBytes/second. Applications developed on this machine have not been
very significant yet. They demonstrated a trapezoidal integration of
sin(x) with 30K mesh points, for which the calculation rate is 170
MFLOPS. It might be interesting to try an adaptive integration, which
could exhibit the run-time capability of a dataflow architecture; they
said that they would try this.
ETL researchers claim that Sigma-1 is the first and likely the last
pure dataflow machine. The follow-up project, EM-4, suggests that
traditional optimization techniques are being used to improve the
performance of dataflow architectures. (We saw a similar effort at
Kyushu University.) The new aspects of these dataflow machines are not
much different from those of any advanced high-performance machines. It
is very clear that distinguishing dataflow architectures is no longer
an interesting issue. However, Japanese researchers working in the area
are making every effort to emphasize that they are still working on
dataflow architectures.
It is worthwhile to repeat some of the essential issues here. Every
calculation can be thought of as being described by a set of tasks.
Some tasks can be done in parallel, others sequentially. Most tasks
need data that will be computed in another task. Tasks may be large,
such as a subroutine, or as small as an arithmetic assignment
statement. It is relatively easy to generate large tasks, but then the
amount of parallelism is limited. A task graph (or dataflow graph)
indicates which tasks need to be done first, how much time each takes,
where data goes, etc. In principle, using this graph one can determine
the absolute lower bound on the execution time for the problem. The
important problem for any parallel processor is to allocate a set of
tasks having different execution times and precedence constraints onto
a number of processors. In practice, tasks cannot be matched perfectly
to processors, and there are overheads and other delays. Further, the
execution time for large tasks depends on how their subtasks are broken
up. Thus the actual execution time will always be greater than the
lower bound. In "real" dataflow, the tasks are low level. If a dataflow
computer can organize processors to execute tasks exactly as they are
presented in the task graph, the possibility exists for a computation
to be done in almost the minimum possible time. The difficulty with
pure dataflow computers has been that the various overheads are
tremendous; these include the difficulty of controlling the sequence of
execution, memory overhead because of contention for data, and
communication overhead. There is a great deal of dataflow work going on
both in Japan and in the West, but as we have pointed out above,
current research seems to involve compromising the pure dataflow
concept to bring it back to practical realization. The EM-4 project is
one example; another is the Harray project at Waseda University, in
which large tasks are done using more conventional control flow and,
within these tasks, computations are done using dataflow. The problem
of allocating tasks to processors has been studied for many years and
is known to be a very intractable (strongly NP-hard) scheduling
problem, so various approximate algorithms are used. One of these has
been shown to be near optimal by H. Kasahara, also of Waseda
University.
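To show the flavor of such approximate algorithms, here is the classic
greedy "list scheduling" heuristic in a few lines (a sketch of the
general technique on a toy example of our own; this is not Kasahara's
algorithm, and all names are ours):

  import heapq

  def list_schedule(times, preds, n_proc):
      # Repeatedly give the longest ready task to the processor that
      # becomes free first, honoring precedence constraints.
      done_at = {}                               # task -> completion time
      free = [(0.0, p) for p in range(n_proc)]   # (time free, processor)
      heapq.heapify(free)
      remaining = dict(times)
      while remaining:
          ready = [t for t in remaining
                   if all(p in done_at for p in preds.get(t, ()))]
          task = max(ready, key=lambda t: remaining[t])
          start, proc = heapq.heappop(free)
          start = max([start] + [done_at[p] for p in preds.get(task, ())])
          done_at[task] = start + remaining.pop(task)
          heapq.heappush(free, (done_at[task], proc))
      return max(done_at.values())               # schedule length

  # Four tasks, diamond-shaped precedence, two processors.
  times = {"a": 2.0, "b": 3.0, "c": 1.0, "d": 2.0}
  preds = {"c": ("a",), "d": ("b", "c")}
  print(list_schedule(times, preds, 2))          # prints 5.0

Heuristics like this carry no optimality guarantee in general, which is
why near-optimality results such as Kasahara's are interesting.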
Kung was given a briefing on ETL's CODA multiprocessor project. The
goal of the project is to study scalable prioritized multi-stage
networks which have a predictable delay for communication. These kinds
of networks are important for sensor fusion in real-time applications
such as process control. A novel idea of "priority forwarding" is
proposed, so that the part of a packet that contains its priority
information is never blocked. This guarantees a predictable
communication delay for packets with the highest priority.
Our overall host for this visit to ETL was:
Toshio Shimada
Chief Scientist, Computer Architecture Section
Computer Science Division
Electrotechnical Laboratory
1-1-4 Umezono, Tsukuba, Ibaraki 305
Tel: 0298-54-5443, Fax: 0298-58-5882
Email: shimada@etl.go.jp
NEW INFORMATION PROCESSING TECHNOLOGY.
This is the follow-on to MITI's Future Information Technology Project,
which began in 1986; some parts ended this year, others end in 1992.
The New Information Processing Technology is MITI's new initiative for
the 1990s. Kahaner reported on aspects of this earlier, see 3 July 1990
file "highspd" and 26 June 1990 file "nipt". Recent additional
information was provided by Mr. T. Yuba of ETL. The best information we
have is that this new follow-on MITI project is still not officially
decided. For the past two years, specialists from Japanese government,
academic, and industrial organizations in fields such as mathematics,
physiology, psychology, and computer science have organized three
subcommittees and six working groups in order to make a comprehensive
study to define and set project goals. The working groups meet about
once a month and have produced many preliminary reports. A final report
is due soon. The new project deals with the following fundamental
issues.
(1) The capabilities of traditional (Turing) computers have increased
dramatically, but there are still many kinds of information processing
that are easy for living organisms but on which conventional computers
perform poorly.
(2) In the latter areas, the work of the "fifth generation project" has
focused on inference, language understanding, and other logical
processing.
(3) Other areas, such as pattern recognition, intuitive information
processing, and the autonomous and cooperative control of systems
having many degrees of freedom, seem less suitable for sequential
processing.
(4) Physiology, cognitive psychology, and other brain research have
produced a great deal of insight into how the brain learns and
processes information.
(5) Technologies such as optical and molecular devices are being
developed that may make very large scale parallel processing possible.
While not yet officially set, the project will probably focus on the
following two kinds of research.
(1) Basic principles of very highly parallel and highly distributed
information processing, learning, optical technology, and other new
devices.
(2) Three dimensional information, visual and auditory recognition and
understanding, and autonomous and cooperative functions as seen in
living organisms.
Thus there will be research on something related to "soft logic"
supported by massively parallel processors. The goal is to handle
ambiguous or incomplete information using a new set of information
processing methods. These include, but are not limited to, neural nets,
and also include the idea of intelligent databases. The project will
probably be of the same scale as the 5th Generation Computer Project,
and follow the same organization and setting as ICOT. The project
planners have expressed a strong interest in international cooperation.
One exciting possibility discussed by Kung is to establish a research
facility containing massively parallel hardware of at least 1 million
programmable processors.
This could be an international testbed for applications in massively
parallel processing. The contact on this subject is:
Mr. Toshitsugu Yuba
Director, Intelligent Systems Division
Electrotechnical Laboratory
1-1-4 Umezono, Tsukuba, Ibaraki 305
Tel: (0298) 54-5412
A project to build a reliable computer with a million or more
processors is the kind of basic research thrust that a great nation
could feel very proud about embarking on. There would be difficult
problems in designing and building it. But the challenges and the
opportunities would draw the best research minds like a powerful
magnet. It is impossible to say what will really come out of this, but
every scientist should be excited about the possibilities.
UNIVERSITY OF TSUKUBA.
Kung made a short visit to the University of Tsukuba after his visit to
ETL. The purpose of this visit was to see the 14 GFLOPS, 488-processor
MIMD QCDPAX machine, designed by the University of Tsukuba and
manufactured by Anritsu Corporation. Kahaner reported on this machine
earlier, see 12 April 1990 file "pax". The machine has started to
produce interesting results in physics; one paper reporting these
results was just presented at a recent physics conference in the U.S.
According to Professor Hoshino, the next generation machine will be 100
GFLOPS and will probably be built by physicists. It is quite an
achievement to have built a machine of this scale by any standard. This
project is an interesting and successful example of collaboration
between physicists and computer scientists. Contacts are:
Professor Tsutomu Hoshino
Institute of Engineering Mechanics
University of Tsukuba
Tsukuba-shi, Ibaraki-ken
Tel: (0298) 53-5255, Fax: (0298) 53-5207
Email: hoshino@kz.tsukuba.ac.jp
Professor Yoshio Oyanagi
Institute of Information Sciences
University of Tsukuba
Tennodai 1-1-1, Tsukuba 305
Tel: +81 298-53-5518, Fax: +81 298-53-5206
Email: oranagi@is.tsukuba.ac.jp
SANYO ELECTRIC CO.
We had a brief visit to Sanyo's Osaka R&D facility to discuss the
possibility of using the CMU-Intel iWarp in HDTV applications, and we
were given a briefing on Sanyo's research activities. Our host for this
visit was
Mr. Yasuhiro Ishii, Senior Manager
Sanyo Electric Co. Ltd.
Information & Communication Systems Research Center
Optoelectronics Dept.
180 Ohmori, Anpachi-Cho, Anpachi-Gun, Gifu, Japan
Tel: (0584) 64-3996, Fax: (0584) 64-4754
Sanyo is primarily a consumer products corporation, but it has also
made significant advances in amorphous silicon and is very proud of its
research in amorphous silicon solar cells. The R&D organization works
with a budget of about $500 Million U.S., divided roughly as follows.
R&D Administrative Hq.
Tsukuba Research Center 100 people (Basic research)
Functional Materials Res. Center 200 (Fundamental res.)
Semiconductor Res. Center 200 "
ULSI Research Center 200 "
Control and Systems Res. Center 200 "
Product Engineering Laboratory 200 (Applied research)
Audio-Video Research Center 200 "
Information and Communication System Res. Center 200 "
The research staff we met were associated with the last three groups.
Most of the work is centered in Osaka, except for the basic research in
Tsukuba, where the most interesting computer applications have to do
with intelligent systems such as robots, neurocomputers, and
biocomputers, and the Information and Communication System Research
Center, which is in Nagoya.
The latter works on parallel processing for display and image
processing, AI, expert systems, natural language processing, optical
disks, digital communications, and research in reliability for
functional and electromechanical components. Our comments here are not
about the research in general but only about the specific interactions
we had. The HDTV research group we met was quite different from the
roughly comparable groups we visited elsewhere in that the scientists
(and managers) did not speak much English; we were accompanied by Mr.
T. W. Kang of Intel Japan, who provided translation into Japanese, and
this was absolutely necessary. The major interest here was how to
compress HDTV images in order to write them on a CD-ROM. This is the
same problem that was raised at Hitachi and Matsushita; much better
compression algorithms are needed. Sanyo is hoping for compression
ratios of 150 times. This is an ideal application for parallel
processing. It currently takes about eight hours to compress an image,
and of course Sanyo would like to do it in real time to prepare for
future writeable CD technology. About 1.7 TeraFLOPS of computation is
involved, and only parallel machines can deal with this in any
practical way. Special-purpose parallel hardware cannot really do the
job, because it lacks the flexibility needed to implement high-quality
compression algorithms. New programmable parallel systems such as iWarp
can potentially provide the required power and flexibility.
---------------END OF PART 4-----------------------------------------------
---------------END OF REPORT-----------------------------------------------