rick@cs.arizona.edu (Rick Schlichting) (06/04/91)
[Dr. David Kahaner is a numerical analyst visiting Japan for two years
under the auspices of the Office of Naval Research-Asia (ONR/Asia).
The following is the professional opinion of David Kahaner and in no
way has the blessing of the US Government or any agency of it. All
information is dated and of limited life time. This disclaimer should
be noted on ANY attribution.]
[Copies of previous reports written by Kahaner can be obtained from
host cs.arizona.edu using anonymous FTP.]
To: Distribution
From: David K. Kahaner, ONR Asia [kahaner@xroads.cc.u-tokyo.ac.jp]
Re: Cyclic Pipelined Computer and ERATO
3 June 1991
ABSTRACT. A brief description of ERATO and one of its projects, QMFL
(Quantum Magneto Flux Logic), with particular emphasis on the Cyclic
Pipelined Computer (CPC) is given. CPC is a shared pipelined memory,
single processor, multiple instruction stream architecture, originally
designed to be compatible with Josephson junction devices. This ERATO
project ends this year.
This report is jointly authored by myself and
Paul Spee
Computer Architecture Group
Research Development Corporation of Japan
1-280 Higashi-Koigakubo
Kokubunji, Tokyo 185, Japan
Email: spee@jrdc.go.jp
ERATO.
The Exploratory Research for Advanced Technology (ERATO) Projects were
started in 1981 by the Research Development Corporation of Japan (JRDC).
JRDC is set up by Japanese law under the administration of the Science
and Technology Agency (STA), which is a ministerial agency reporting
directly to the prime minister's office; see my report (japgovt.udt, 30
July 1990). ERATO's objective is to conduct interesting basic research.
Essentially it is an experiment in the management of R&D in which mostly
young researchers from industry, government, and universities gather and
conduct multidisciplinary research on high risk projects. A great deal
has already been written about ERATO; see, for example, [ENGEL90] and
[GIBOR90]. In this report we focus on one particular project, but for
completeness we first present a thumbnail sketch of the general program.
There are about a dozen ERATO projects at any time; the total budget is
around $30 million US, so individual projects are supported at roughly
$2-5 million
per year. The staff also varies, but may be as large as about 20
researchers during the most active phase of a project.
One of the most unusual things about ERATO is that all the projects are
of fixed duration, five years. Although the program does not allow for
extensions, promising activities might be continued by other
organizations. To emphasize the temporary nature each project rents
whatever office and lab space it needs at a university, corporation or a
research institute.
ERATO focuses on young researchers; the average age is just over 31.
They are given good facilities and good salaries. A JRDC study
showed that starting salaries exceeded those of 75% of US PhD chemists in
industry, and that salaries of ERATO researchers with three or more years
of experience exceed those of 90% of US PhD chemists in industry. A key
ingredient of each ERATO project is its Director. The ideal Director is
charismatic and dynamic, eminent in his field, and capable of attracting
and inspiring his co-workers. Once found, the Director is more or less
free to recruit and organize the team as he sees fit. In fact the
projects are informally referred to by the Director's name, e.g., "the
Goto Project". Eiichi Goto, who directs the QMFL project, typifies this
profile. Goto, who retired from the University of Tokyo in April 1991 and
is now at the University of Kanagawa, invented the Parametron about 30
years ago. He is an extremely extroverted person,
and still bristles with new ideas. In fact one of the younger scientists
complained to me that Goto has so many ideas that it was difficult to
keep up with his thinking. The proceedings of the latest project
symposium (titles attached) list Goto as a coauthor on all but one of the
papers, including one on a new type of refrigerator.
About half of the ERATO researchers are seconded from industry; a few are
from universities or national labs. The remainder are hired as
individuals. Most of these are Japanese but about 10% are foreign. The
seconding system preserves the researcher's seniority and benefits
because ERATO reimburses the company for the researcher. The non-Japanese
researchers give the projects a definite international flavor. Several
of them speak little or no Japanese, and papers in the symposium
mentioned in the preceding paragraph are almost entirely in English,
although most of this was done as a preparation for presentations in the
US in August.
Patents for ERATO projects are jointly owned by the inventors and JRDC.
Researchers share legal expenses for patents they own with JRDC, but they
may also assign ownership of the patent to JRDC. Company researchers may
assign patent ownership to their company. Through 1988, 415 patent
applications had been filed in Japan and 82 outside Japan. Over the same
period the 338 ERATO researchers wrote almost 1400 papers, more than one
third of which were published or presented outside of Japan. Each year
there is an ERATO symposium held in Tokyo. In each of four afternoon
sessions, researchers from four different projects present the progress
in their respective programs. Individual projects can also have symposia,
although these are more informal.
A foreign researcher has in principle a one year contract which may be
renewed. In fact, the ERATO budget explicitly allows for foreign
researchers to stay for the full length of a project, five years.
Through 1989, 27 foreign researchers had participated, but only a few
remained the full five years. (Perhaps there is some concern among these
young non-Japanese researchers about the incremental benefit of staying
all five years. Employment opportunities exist within Japanese
corporations, but upward mobility is questionable.) A few foreign
companies have also sent researchers including Allelix (Canada), Celltech
(UK), Intel (US), and 3M (US). Some formal recruiting occurs but most of
the foreign researchers apply because of word of mouth recruiting. In
1989 there were 5 researchers from the US. Foreign researchers receive
the same base salary as Japanese but they also receive moving expenses, a
housing allowance, and some provision for Japanese language training.
Researchers must locate their own housing; there are no special housing
facilities because the ERATO projects are widely dispersed.
QUANTUM MAGNETO FLUX LOGIC PROJECT (GOTO-QMFL PROJECT)
This project began in 1986 and is directed by Professor E. Goto, recently
retired as Professor of Information Science at Tokyo University. Goto is
famous for patenting, in the 1950s, the Parametron, which uses resonating
circuits in which the phase of the current stores information. In fact,
the first Japanese computers were based on the Parametron, for example
the Hitachi HIPAC-xxx (P = Parametron). Hitachi eventually changed to
transistor technology, however (Hitachi HITAC-xxx).
In 1983 Goto proposed a Parametron-like element using Josephson
junctions. The binary states of the element are the two locations of
magnetic flux. This idea is a natural step in Josephson technology in
which devices use a single quantum of flux. In 1982 IBM's Josephson
program was abandoned; several Japanese companies have continued their
research and have been reporting steady progress. See for example the
comments about Hitachi in my report (parallel.903, 6 Nov 1990). The
current Goto-QMFL project is divided into three groups.
Fundamental Property
Magnetic Shielding
Computer Architecture
The first group within the project is working on a new Josephson device
called QFP [HIOE91] in which the unit of information is not represented
by voltage but by magnetic flux. The second group is researching a
helium liquefying process and magnetic shielding. The third group is
researching a new type of architecture called the Cyclic Pipeline
Computer (CPC) [SHIMIZU89]. Furthermore, software for this highly
pipelined parallel computer is being developed. The three groups
illustrate the temporary nature of ERATO projects. When I first went to
visit the Computer Architecture group, it was housed in an ordinary
office building in central Tokyo. Last fall the group moved to the
Hitachi Central Research Lab in suburban Tokyo. The Fundamental Property
Group is also at Hitachi and the Magnetic Shielding Group is at ULVAC.
The overall project's aims are (1) to demonstrate that QFP devices can
operate in the range of 10GHz, (2) to demonstrate the capability of
removing magnetic flux from superconductors, and (3) to develop a
computer architecture suitable for a QFP computer.
The Fundamental Properties group has six to seven persons, and the
Magnetic Shielding and Architecture groups each have about 4 people,
excluding secretaries. A discussion of the Fundamental Property and
Magnetic Shielding groups, which are essentially associated with
building Josephson devices, was given in a recent JTECH report, "The
Japanese Exploratory Research For Advanced Technology (ERATO) Program",
Dec 1988, in the chapter by Dr. John Rowell, "Goto Quantum Magneto Flux
Logic Project" [ROWELL88]. The Architecture group was outside Rowell's
area of expertise and was only mentioned in his report. His summary with
respect to the Josephson technology is that the project is "plowing new
ground (or old ground with new devices), and it will be most interesting
to see the magnitude of its impact in ten years' time."
A second JTECH study in 1989, "High Temperature Superconductivity in
Japan" also has a short summary of the Goto project written by M.
Dresselhaus, again only focusing on the Josephson aspects and concluding
that "this technology benefits from very high speeds and extremely small
power consumption, and is being examined for a variety of digital
applications including next generation computers." The potential for high
performance using Josephson devices comes from this combination of very
high clock speeds (tens of gigahertz), and low power (10^(-9) watts per
gate). Another advantage of the QFP device is the flux transfer
characteristics, and it has just been reported that a prototype of three
dimensional integration was proven by stacking two chips together and by
observing signal transfer between these chips, [HOSOYA91]. The hope, of
course, is to replace the silicon with Josephson devices to build a three
dimensional package which is a computer in a one-cm cube.
The Computer Architecture Group investigates new architectures to take
advantage of specific features of Josephson devices. The main difference
between Josephson devices and conventional devices is that Josephson
devices act as a latch. Because there is no delay caused by the latches
between the pipeline stages in a pipelined computer, the processor may be
deeply pipelined. In pipelining, multiple instructions in a computer are
overlapped in execution. Each instruction is broken into parts, called
stages. Pipelining is a key implementation technique used to make
today's fast CPUs. The figure below shows a simple (and ideal) example of
pipelining.
I1. |-IF--|-ID--|-OF--|-EX--|-WB--|
I2.       |-----|-----|-----|xxxxx|-----|
I3.             |-----|-----|-----|xxxxx|-----|
I4.                   |-----|-----|-----|xxxxx|-----|
I5.                         |-----|-----|-----|xxxxx|-----|
I6.                               |-----|-----|-----|xxxxx|-----|
In this figure six instructions (I1 to I6) execute in sequence, each
passing through five stages. The stage denoted with x's represents the
actual execution (EX), as opposed to instruction fetch (IF), instruction
decode (ID), etc.
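The overlap shown in the figure can be reproduced with a few lines of
Python. This is only a sketch of the ideal case, assuming one instruction
enters the pipeline per cycle and nothing ever stalls; the function names
are our own and the stage labels are those of the figure.

```python
STAGES = ("IF", "ID", "OF", "EX", "WB")  # stage labels as in the figure


def ideal_pipeline_cycles(n_instructions, n_stages=len(STAGES)):
    """Cycles needed when one instruction enters per cycle and none stalls."""
    if n_instructions <= 0:
        return 0
    return n_stages + n_instructions - 1


def unpipelined_cycles(n_instructions, n_stages=len(STAGES)):
    """Cycles needed when each instruction finishes before the next starts."""
    return max(n_instructions, 0) * n_stages


def stage_in_cycle(instruction, cycle, n_stages=len(STAGES)):
    """Stage index occupied by a (0-based) instruction in a (0-based) cycle,
    or None if the instruction is not in the pipeline during that cycle."""
    stage = cycle - instruction
    return stage if 0 <= stage < n_stages else None


if __name__ == "__main__":
    n = 6
    for i in range(n):
        row = ["  "] * ideal_pipeline_cycles(n)
        for c in range(len(row)):
            s = stage_in_cycle(i, c)
            if s is not None:
                row[c] = STAGES[s]
        print(f"I{i + 1}. " + " ".join(row))
    print(ideal_pipeline_cycles(n), "cycles pipelined,",
          unpipelined_cycles(n), "cycles unpipelined")
```

For the six instructions and five stages of the figure this gives 10
cycles, against 30 if the instructions did not overlap.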
In a super-pipelined computer each stage is divided into smaller pipeline
segments, as in the figure below, which is also idealized.
I1. |-----|-----|-----|xxxxx|-----|
I2.   |-----|-----|-----|xxxxx|-----|
I3.     |-----|-----|-----|xxxxx|-----|
I4.       |-----|-----|-----|xxxxx|-----|
I5.         |-----|-----|-----|xxxxx|-----|
I6.           |-----|-----|-----|xxxxx|-----|
I7.             |-----|-----|-----|xxxxx|-----|
I8.               |-----|-----|-----|xxxxx|-----|
I9.                 |-----|-----|-----|xxxxx|-----|
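The gain from super-pipelining can be estimated in the same idealized
way. In the sketch below the degree (the number of segments each stage is
split into) is a parameter of our own choosing, not a figure from the CPC
design, and time is measured in units of the original stage time.

```python
from fractions import Fraction


def superpipelined_time(n_instructions, n_stages, degree):
    """Ideal completion time, in original stage times, when every stage is
    split into `degree` segments and one instruction issues per segment time."""
    if degree < 1:
        raise ValueError("degree must be at least 1")
    if n_instructions <= 0:
        return Fraction(0)
    # The first instruction still traverses all stages; each later one
    # finishes one segment time (1/degree of a stage time) after its predecessor.
    return Fraction(n_stages) + Fraction(n_instructions - 1, degree)


def ideal_speedup(n_instructions, n_stages, degree):
    """Ideal speedup of super-pipelining over plain pipelining (degree 1)."""
    plain = superpipelined_time(n_instructions, n_stages, 1)
    return plain / superpipelined_time(n_instructions, n_stages, degree)
```

With nine instructions, five stages and an assumed degree of 3, the ideal
time drops from 13 stage times to 23/3, about 7.7. The estimate ignores
the latch overhead added by each extra segment.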
Pipelining and super-pipelining permit higher potential performance. The
main impediments to achieving this are
(1) The extra overhead associated with a large number of segments.
Circuitry, called latches, is needed between the segments.
(2) A situation that prevents the next instruction in the instruction
stream from executing during its clock cycle. This could be a
hardware resource conflict, a data conflict when an instruction
depends on the results of an unfinished instruction, or a control
problem when the program counter is changed because of a branch
instruction [JOUPPI89].
(3) The memory system. Hennessy and Patterson (Computer Architecture: A
Quantitative Approach, Morgan Kaufmann Publ, 1990) claim that the
"biggest impact of pipelining on the machine resources is in the
memory system". Highly pipelined processors require a much higher
memory bandwidth than non-pipelined processors because instructions
and data are fetched from and stored to memory at a much higher rate.
Concerning (1).
As mentioned above, one of the distinct characteristics of Josephson
logic is that each basic logic device acts as its own latch, and, in
principle this permits a very large number of segments with little
overhead.
Concerning (2).
The CPC has two main characteristics: a pipelined memory, and a fixed
number of instruction streams which share the functional units and main
memory. Only hardware that can be considered part of the context of a
particular instruction stream is duplicated. This hardware includes the
program counter, processor status, registers, etc. By alternating the
instruction streams in a cyclic manner, distinct virtual processors are
created. In effect, the CPC implements a multiple instruction multiple
data (MIMD) computer. The figure below illustrates this idea with three
distinct instruction streams in a pipelined computer. An analogous figure
could be given for a super-pipelined CPC.
Instruction stream A
A1. |-----|-----|-----|xxxxx|-----|
A2.                   |-----|-----|-----|xxxxx|-----|
A3.                                     |-----|-----|-----|xxxxx|-----|
Instruction stream B
B1.       |-----|-----|-----|xxxxx|-----|
B2.                         |-----|-----|-----|xxxxx|-----|
B3.                                           |-----|-----|-----|xxxxx|-----|
Instruction stream C
C1.             |-----|-----|-----|xxxxx|-----|
C2.                               |-----|-----|-----|xxxxx|-----|
C3.                                                 |-----|-----|-----|xxxxx|-----|
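The cyclic issue order amounts to a round-robin over the instruction
streams. The following Python sketch is our own illustration, not the CPC
hardware logic; the stream names and function names are hypothetical. It
also reports how many cycles separate two successive instructions of the
same stream. Under the assumption that the number of streams is at least
the number of pipeline stages, an instruction has left the pipeline
before its successor from the same stream enters, so the two cannot
conflict.

```python
from itertools import cycle, islice


def cyclic_issue(stream_names, n_slots):
    """Issue one instruction per cycle, visiting the streams round-robin.
    Returns a list of (cycle, stream, index_within_stream)."""
    issued = {name: 0 for name in stream_names}
    schedule = []
    for slot, name in enumerate(islice(cycle(stream_names), n_slots)):
        issued[name] += 1
        schedule.append((slot, name, issued[name]))
    return schedule


def issue_gaps(schedule, stream):
    """Cycles between successive issues of one stream."""
    slots = [slot for slot, name, _ in schedule if name == stream]
    return [later - earlier for earlier, later in zip(slots, slots[1:])]


def overlaps_own_predecessor(n_streams, n_stages):
    """True if an instruction enters the pipeline while the previous
    instruction of the same stream is still in it."""
    return n_streams < n_stages
```

With three streams and a five-stage pipeline the gap is 3 cycles, so some
overlap within a stream remains; the sketch only shows how the gap grows
with the number of streams.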
Concerning (3).
If the performance of the CPU can be increased by pipelining, then why
not increase the performance of the memory, that is, its access rate, by
pipelining as well? If a memory access can be divided into successive
independent operations, for example decode column, decode row, access
cell, output data, such operations could be executed in parallel, thus
pipelining memory. In Josephson computers, the main memory is to be
built with the same Josephson logic devices as those used in the
processor. For such a computer, both the processor and the main memory
would be naturally pipelined with the same pipeline pitch.
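A pipelined memory of this kind can be modeled as a shift register of
in-flight requests. The sketch below is a toy model under our own
assumptions: four stages named after the operations listed above, one new
request accepted per cycle, and a plain Python list standing in for the
memory cells.

```python
from collections import deque

STAGES = ("decode column", "decode row", "access cell", "output data")


def pipelined_memory(addresses, cells, n_stages=len(STAGES)):
    """Return (cycle, address, value) for every access, in completion order.
    Cycles are counted from 1; one request enters the pipeline per cycle."""
    pending = deque(addresses)
    pipe = [None] * n_stages          # pipe[i] holds the request in stage i
    completed = []
    cycle = 0
    while pending or any(req is not None for req in pipe[:-1]):
        cycle += 1
        # Every request moves one stage on; a new one enters stage 0.
        pipe = [pending.popleft() if pending else None] + pipe[:-1]
        if pipe[-1] is not None:      # finishes the last stage this cycle
            completed.append((cycle, pipe[-1], cells[pipe[-1]]))
    return completed


def unpipelined_cycles(n_accesses, n_stages=len(STAGES)):
    """Cycles needed if each access must finish before the next begins."""
    return n_accesses * n_stages
```

In this model three accesses complete in cycles 4, 5 and 6, against 12
cycles if each access had to finish before the next began.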
Memory is often a bottleneck in many high-performance computer systems.
By increasing the machine-level parallelism, the number of memory
accesses (instruction fetch, operand fetch, operand store) increases,
making further demands on the design of efficient memory systems. High
performance computers often use techniques such as n-way low-order
interleaving (the low-order address bits select one of n memory modules)
and n-bank memory, where the high-order bits specify the bank and the
low-order bits are offsets into the bank. Low-order interleaving is
especially efficient
for array and vector processors where memory is often addressed
sequentially (access to vector), while n-bank memory is used in a shared
memory multiprocessor where processors and memory modules are connected
through an interconnection network.
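The two addressing schemes differ only in which address bits select the
memory module. A minimal sketch, with function names of our own and word
addresses assumed:

```python
def low_order_interleave(address, n_modules):
    """Low-order bits select the module; the rest is the offset inside it."""
    return address % n_modules, address // n_modules


def high_order_bank(address, bank_size):
    """High-order bits select the bank; low-order bits are the offset."""
    return address // bank_size, address % bank_size


def modules_touched(addresses, mapping, parameter):
    """Distinct modules (or banks) hit by a sequence of addresses."""
    return sorted({mapping(a, parameter)[0] for a in addresses})
```

Eight consecutive addresses spread over all four modules under 4-way
low-order interleaving, but fall into a single bank when the banks hold
eight words each, which is why the former suits sequential vector access.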
The pipelined memory of the CPC has the advantage that it does not suffer
from performance degradation caused by memory access conflicts. Neither
does the CPC require an interconnection network which may suffer either
from path conflicts or memory access conflicts [PFISTER85].
Current high-performance computers require cache memory which can keep up
with the memory access rate. When the processor requests data which is
not in the cache, a cache miss occurs and the data must be fetched from
memory. For super-pipelined and superscalar computers, a cache miss can
easily cause an overhead of a factor of ten. (In a superscalar machine,
the hardware can issue a small number, two to four, of independent
instructions in a single clock cycle.) In the CPC, the pipeline pitch of
the main memory is the same as the pipeline pitch of the processor. CPC
does not currently implement a cache, but the group is still researching
this question.
On the other hand, one disadvantage of a CPC is that the random memory
access pattern of different instruction streams decreases locality of
memory reference, but this is not a problem if a cache is not used. The
architecture group feels that CPC can be very well suited for random
memory access patterns such as neural network simulations.
CPC STATUS AND PROSPECTS
The work of the computer architecture group has been overshadowed by the
attention drawn to the hardware. The architecture group has been
designing a computer architecture which is specifically suited for
implementation on a machine with Josephson devices that are used both for
the main processor as well as for the memory. The inherent rapid
switching capability of Josephson devices means that it might be
profitable to rethink some fundamental assumptions about the relationship
of memory to processing. To most effectively implement their ideas it is
necessary to have Josephson technology in place, but all other aspects of
the research are essentially independent of it. In other words, using
basic assumptions about this technology the group can design and simulate
using silicon integrated circuits (ICs). Furthermore the group feels that
it would be reasonable to use CPC even without Josephson technology.
But in a fully Josephson computer the CPC approach is claimed to allow
clock speeds of 10 GHz, with corresponding increases in processing
speed. For example, on a Josephson CPC, matrix multiplication is
predicted to execute at a peak of 20 GFlop on a processor equipped with
one floating point adder and one floating point multiplier when two
matrix operands can be fetched from memory in parallel. Fast Fourier
Transform (FFT) performance depends on the number of arithmetic units and
the number of instruction operands that can be fetched in parallel, but a
peak performance of 50 GFlop is predicted if 5 operands can be fetched in
parallel and there are 3 floating point adders and 2 floating point
multipliers.
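These peak figures follow from simple arithmetic if each floating point
unit is assumed to deliver one result per clock cycle, which is our
reading of the prediction rather than a statement from the group:

```python
def peak_gflops(clock_ghz, n_adders, n_multipliers):
    """Peak rate in GFlop per second, assuming every floating point unit
    completes one operation per clock cycle (an assumption of this sketch)."""
    return clock_ghz * (n_adders + n_multipliers)
```

Here peak_gflops(10, 1, 1) gives 20 for the matrix multiplication case
and peak_gflops(10, 3, 2) gives 50 for the FFT case, matching the numbers
quoted above.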
Several versions of CPC have been designed and at least one has been
built, FLATS-2, using silicon (ECL) ICs rather than Josephson junction
technology. FLATS-2 is a CPC with two virtual processors that share ten
pipeline stages. The machine cycle time is 65 ns, which equals the
memory cycle time. The memory transfer rate is 117 MB/sec for instructions
and data. FLATS-2 consists of 26 logic boards, each of which contains
between 200 and 400 IC chips, connected by a backplane board and by front
flat cables, mounted on an air cooled rack chassis (57 x 62 x 37 cm),
which is then packed into a cubic box along with power supplies.
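As a consistency check on the FLATS-2 numbers, the cycle time can be
converted to a clock rate and to the number of bytes moved per cycle. The
report does not say whether MB means 10^6 or 2^20 bytes, so the sketch
below, which is our own arithmetic, evaluates both readings.

```python
def clock_mhz(cycle_ns):
    """Clock rate in MHz for a given cycle time in nanoseconds."""
    return 1000.0 / cycle_ns


def bytes_per_cycle(rate_mb_per_s, cycle_ns, bytes_per_mb):
    """Bytes transferred in one machine cycle at the given transfer rate."""
    return rate_mb_per_s * bytes_per_mb * cycle_ns * 1e-9


if __name__ == "__main__":
    print(f"clock: {clock_mhz(65):.1f} MHz")
    for label, size in (("10^6", 10**6), ("2^20", 2**20)):
        rate = bytes_per_cycle(117, 65, size)
        print(f"MB = {label} bytes: {rate:.2f} bytes per cycle")
```

A 65 ns cycle corresponds to about 15.4 MHz, and either reading of MB
gives a little under 8 bytes per cycle.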
FLATS-2 is running. In addition to the operating system [SPEE90], a
Fortran language based on Jordan's Force with parallel constructs is
available. The architecture group has run simulations on various matrix
computations based on DGEFA and DGESL from LINPACK, conjugate gradient,
FFTs and Livermore loops. The results are interesting but it is still too
early to tell if this technique can really be applied without Josephson
devices. Further, there are some scientists who feel that traditional
methods will be equally efficient.
But what is important about this research is that it presents an almost
orthogonal view of how to design very high performance computers. Almost
without exception today, researchers feel that the future is highly
parallel, that is, large numbers of processors each with their own
memory. The CPC approach instead uses a shared pipelined memory and a
single processor with multiple instruction streams. Of course to be most
practical it may have
to await Josephson technology. Nevertheless, as a research activity it
has demonstrated several extremely innovative approaches and should be
followed closely. Furthermore, there is a chance that new ECL devices
could be built that have the ability to function as their own latches, an
important characteristic of Josephson devices. Goto told me that he had
recently devised such new devices and that Hitachi was sufficiently
excited about their potential to involve several others on their research
staff in a more thorough study of their costs and benefits. Finally it
was reported late last year that members of the Goto project had
successfully fabricated a new chip, 2.5 mm square, on which four QFP
devices were set. When cooled in a liquid-helium environment (-269 C) all
of the single devices had a clock frequency of 16 GHz corresponding to a
measured switching speed of 15 picoseconds. The linewidth of the
manufactured device is 5 microns, but when 0.5 micron VLSI technology is
applied it is believed that the speed can be increased by about a factor
of ten.
For additional information about CPC, contact
Dr. Yasuo Wada
Technical Manager,
Quantum Magneto Flux Logic Project,
Bassin Shinobazu 202,
2-1-42 Ikenohata, Taito-ku,
Tokyo 110, Japan
References:
[GIBOR90]
A. Gibor, "The ERATO Program", ONRFE Scientific Information Bulletin,
Vol 15 #3, pp 27-30, 1990.
[ENGEL90]
Alan Engel, "Opportunities for Foreign Researchers in Japan: ERATO",
in Japanese Information in Science, Technology and Commerce, ed Monch,
Wattenberg, Brockdorff, Krempien, Walravens, IOS Press, pp 553-558, 1990.
[HIOE91]
W. Hioe and E. Goto, "Quantum Flux Parametron", World Scientific,
Singapore (1991).
[HOSOYA91]
M. Hosoya, W. Hioe, J. Casas, R. Kamikawai, Y. Harada, Y. Wada, H.
Nakane, R. Suda and E. Goto, to be published in IEEE Trans. Appl.
Superconductivity.
[ICHIKAWA87]
Shuichi Ichikawa, "A Study on the Cyclic Pipeline Computer: FLATS2",
Tokyo University, February 1987.
[JOUPPI89]
Norman P. Jouppi, "The Nonuniform Distribution of Instruction-Level
and Machine Parallelism and Its Effect on Performance", IEEE
Transactions on Computers, December 1989, Vol. 38, No. 12.
[LOE86]
K. F. Loe and E. Goto, "DC Flux Parametron - A New Approach to
Josephson Junction Logic", World Scientific, Singapore, 1986.
[PFISTER85]
G. Pfister and V. Norton, "Hot-spot contention and combining in
multistage interconnection networks", ACM Transactions on Computer
Systems, October 1985, Vol. 3, No. 4.
[ROWELL88]
J. Rowell et al., "JTECH Panel Report on the Japanese Exploratory
Research for Advanced Technology (ERATO) Program", Science Applications
International Corporation, McLean VA (1988).
[SHIMIZU89]
Kentaro Shimizu, Eiichi Goto, and Shuichi Ichikawa, "CPC (Cyclic
Pipeline Computer) - An Architecture Suited for Josephson and
Pipelined Machines", IEEE Transactions on Computers, June 1989, pp.
825-832.
[SPEE90]
Paul Spee, Mitsuhisa Sato, Norihiro Fukazawa, and Eiichi Goto, "The
Design and Implementation of the CPX kernel", Proceedings of the 7th
Riken Symposium on Josephson Electronics, Wako-shi, March 23rd, 1990,
pp. 10-20.
-------------------------------------------------------------------------
Papers Presented at the Eighth RIKEN Symposium on Josephson Electronics
March 15, 1991 (RIKEN, Wako-shi)
1. Multiple Instruction Streams in a Highly Pipelined Processor
M. Sato (Research Devel Corp of Japan) (in English)
2. Evaluation of the Continuation Bit in the Cyclic Pipeline Computer
P. Spee (Research Devel Corp of Japan) (in English)
3. Evaluation of FLATS2 Instruction Set Architecture
S. Ichikawa (Research Devel Corp of Japan) (in English)
4. Design and Evaluation of High Efficiency Pulse-Tube Refrigerator
M. Kasuya (Research Devel Corp of Japan) (in English)
5. Detection and Sweeping of Trapped Flux Quanta in Superconducting
Films
Q. Geng (Research Devel Corp of Japan) (in English)
6. High TC Oxide Superconductor Magnetic Shield and SQUID Measurement
H. Ohta (Riken) (in Japanese)
7. Results of the Josephson Computer Project in MITI
S. Takada (ETL) (in Japanese)
8. Cryoelectronics at UC Berkeley
E. Fand (UC Berkeley) (in English)
9. Prototype Model of Three Dimensional QFP Circuits
M. Hosoya (Research Devel Corp of Japan) (in English)
10. Design of D-Gate Logic Circuit
W. Hioe (Research Devel Corp of Japan) (in English)
11. High Speed QFP Testing
J. Casas (Research Devel Corp of Japan) (in English)
12. A Fast A/D Converter Using QFP
Y. Harada (Research Devel Corp of Japan) (in Japanese)
13. Design and Evaluation of QFP 3D Packaging Aligner
T. Tajima (Hitachi) (in Japanese)
-----------------END OF REPORT-------------------------------------------