[comp.research.japan] Kahaner Report: Cyclic Pipelined Computer and ERATO

rick@cs.arizona.edu (Rick Schlichting) (06/04/91)
  [Dr. David Kahaner is a numerical analyst visiting Japan for two years
   under the auspices of the Office of Naval Research-Asia (ONR/Asia).  
   The following is the professional opinion of David Kahaner and in no 
   way has the blessing of the US Government or any agency of it.  All 
   information is dated and of limited life time.  This disclaimer should 
   be noted on ANY attribution.]

  [Copies of previous reports written by Kahaner can be obtained from
   host cs.arizona.edu using anonymous FTP.]

To: Distribution
From: David K. Kahaner, ONR Asia [kahaner@xroads.cc.u-tokyo.ac.jp]
Re: Cyclic Pipelined Computer and ERATO
3 June 1991

ABSTRACT. A brief description of ERATO and one of its projects, QMFL
(Quantum Magneto Flux Logic), with particular emphasis on the Cyclic
Pipelined Computer (CPC) is given. CPC is a shared pipelined memory,
single processor, multiple instruction stream architecture, originally
designed to be compatible with Josephson junction devices. This ERATO
project ends this year.

This report is jointly authored by myself and
                Paul Spee
                Computer Architecture Group
                Research Development Corporation of Japan
                1-280 Higashi-Koigakubo
                Kokubunji, Tokyo 185, Japan
                 Email: spee@jrdc.go.jp

ERATO.
The Exploratory Research for Advanced Technology (ERATO) Projects were 
started in 1981 by the Research Development Corporation of Japan (JRDC).  
JRDC is established under Japanese law and administered by the Science 
and Technology Agency (STA), a ministerial agency reporting directly to 
the prime minister's office; see my report (japgovt.udt, 30 July 1990).  
ERATO's objective is to conduct interesting basic research.  
Essentially it is an experiment in the management of R&D in which mostly 
young researchers from industry, government, and universities gather and 
conduct multidisciplinary research on high risk projects. A great deal 
has already been written about ERATO; see for example [ENGEL90] and 
[GIBOR90]. In this report we want to focus on one particular project; 
nevertheless, for completeness, we present a thumbnail sketch of the 
general program.  

There are about a dozen ERATO projects at any time; the total budget is 
around $30 million US, so support for a single project runs about $2-5 
million per year.  The staff also varies, but may be as large as about 20 
researchers during the most active phase of a project.  

One of the most unusual things about ERATO is that all the projects are 
of fixed duration, five years. Although the program does not allow for 
extensions, promising activities might be continued by other 
organizations.  To emphasize the temporary nature each project rents 
whatever office and lab space it needs at a university, corporation or a 
research institute.  

ERATO focuses on young researchers; the average age is just slightly more 
than 31. They are given good facilities and good salaries.  A JRDC study 
showed that starting salaries exceeded those of 75% of US PhD chemists in 
industry, and that salaries of ERATO researchers with three or more years 
of experience exceed those of 90% of US PhD chemists in industry. A key 
ingredient of each ERATO project is its Director. The ideal Director is 
charismatic, has a dynamic personality, is eminent in his field, and is 
capable of attracting and inspiring his co-workers. Once found, the 
Director is more or less free to recruit and organize the team as he sees 
fit. In fact the projects are informally referred to by the Director's 
name, e.g., "the Goto Project". Eiichi Goto, who directs the QMFL 
project, typifies this profile. Goto, who retired from the University of 
Tokyo in April 1991 and is now at the University of Kanagawa, invented 
the Parametron about 30 years ago. He is an extremely extroverted person, 
and still bristles with new ideas. In fact one of the younger scientists 
complained to me that Goto has so many ideas that it was difficult to 
keep up with his thinking. The proceedings of the latest project 
symposium (titles attached) list Goto as a coauthor on all but one of the 
papers, including one on a new type of refrigerator.  

About half of the ERATO researchers are seconded from industry; a few are 
from universities or national labs. The remainder are hired as 
individuals. Most of these are Japanese but about 10% are foreign. The 
seconding system preserves the researcher's seniority and benefits 
because ERATO reimburses the company for the researcher. The non-Japanese 
researchers give the projects a definite international flavor.  Several 
of them speak little or no Japanese, and the papers in the symposium 
mentioned in the preceding paragraph are almost entirely in English, 
although most of this was done as preparation for presentations in the 
US in August.  

Patents for ERATO projects are jointly owned by the inventors and JRDC.  
Researchers share legal expenses for patents they own with JRDC, but they 
may also assign ownership of the patent to JRDC. Company researchers may 
assign patent ownership to their company. Through 1988, 415 patent 
applications had been filed in Japan and 82 outside Japan. Over the same 
period the 338 ERATO researchers wrote almost 1400 papers, of which more 
than one third were published or presented outside of Japan.  Each year 
there is an ERATO symposium held in Tokyo. In each of four afternoon 
sessions, researchers from four different projects present the progress 
in their respective programs. Individual projects can also have symposia, 
although these are more informal.  

A foreign researcher has, in principle, a one year contract which may be 
renewed. In fact, the ERATO budget explicitly allows for foreign 
researchers to stay for the full length of a project, five years. 
Through 1989, 27 foreign researchers had participated, but only a few 
remained the full five years.  (Perhaps there is some concern among these 
young non-Japanese researchers about the incremental benefit of staying 
all five years. Employment opportunities exist within Japanese 
corporations, but upward mobility is questionable.) A few foreign 
companies have also sent researchers including Allelix (Canada), Celltech 
(UK), Intel (US), and 3M (US). Some formal recruiting occurs but most of 
the foreign researchers apply because of word of mouth recruiting. In 
1989 there were 5 researchers from the US.  Foreign researchers receive 
the same base salary as Japanese but they also receive moving expenses, a 
housing allowance, and some provision for Japanese language training.  
Researchers must locate their own housing; there are no special housing 
facilities because the ERATO projects are widely dispersed.  


QUANTUM MAGNETO FLUX LOGIC PROJECT (GOTO-QMFL PROJECT)
This project began in 1986 and is directed by Professor E. Goto, recently 
retired as Professor of Information Science at Tokyo University.  Goto is 
famous for patenting, in the 1950's, the Parametron, which uses 
resonating circuits in which current phase is used to store information.  
In fact, the first Japanese computers were based on the Parametron, for 
example the Hitachi HIPAC-xxx (P = Parametron). However, Hitachi 
eventually changed to transistor technology (Hitachi HITAC-xxx).  

In 1983 Goto proposed a Parametron-like element using Josephson 
junctions. The binary states of the element are the two locations of 
magnetic flux. This idea is a natural step in Josephson technology in 
which devices use a single quantum of flux. In 1982 IBM's Josephson 
program was abandoned; several Japanese companies have continued their 
research and have been reporting steady progress. See for example the 
comments about Hitachi in my report (parallel.903, 6 Nov 1990). The 
current Goto-QMFL project is divided into three groups.  
     Fundamental Property
     Magnetic Shielding
     Computer Architecture
The first group within the project is working on a new Josephson device 
called QFP [HIOE91], in which the unit of information is not represented 
by voltage but by magnetic flux.  The second group is researching a 
helium liquefying process and magnetic shielding.  The third group is 
researching a new type of architecture called the Cyclic Pipeline 
Computer (CPC) [SHIMIZU89]. Furthermore, software for this highly 
pipelined parallel computer is being developed.  The three groups 
illustrate the temporary nature of ERATO projects. When I first went to 
visit the Computer Architecture group, it was housed in an ordinary 
office building in central Tokyo.  Last fall the group moved to the 
Hitachi Central Research Lab in suburban Tokyo. The Fundamental Property 
Group is also at Hitachi and the Magnetic Shielding Group is at ULVAC.  

The overall project's aims are (1) to demonstrate that QFP devices can 
operate in the range of 10GHz, (2) to demonstrate the capability of 
removing magnetic flux from superconductors, and (3) to develop a 
computer architecture suitable for a QFP computer.  

The Fundamental Property group has six to seven persons, and the
Magnetic Shielding and Architecture groups each have about 4 people,
excluding secretaries.  A discussion of the Fundamental Property and
Magnetic Shielding groups, which are essentially associated with
building Josephson devices, was given in a recent JTECH report, "The
Japanese Exploratory Research For Advanced Technology (ERATO) Program",
Dec 1988, in the chapter by Dr. John Rowell, "Goto Quantum Magneto Flux
Logic Project" [ROWELL88]. The Architecture group was outside Rowell's 
area of expertise and was only mentioned in his report.  
His summary with respect to the Josephson technology is that the project 
is "plowing new ground (or old ground with new devices), and it will be 
most interesting to see the magnitude of its impact in ten years' time."  
A second JTECH study in 1989, "High Temperature Superconductivity in 
Japan" also has a short summary of the Goto project written by M.  
Dresselhaus, again only focusing on the Josephson aspects and concluding 
that "this technology benefits from very high speeds and extremely small 
power consumption, and is being examined for a variety of digital 
applications including next generation computers." The potential for high 
performance using Josephson devices comes from this combination of very 
high clock speeds (tens of gigahertz) and low power (10^(-9) watts per 
gate).  Another advantage of the QFP device is its flux transfer 
characteristics: it has just been reported that a prototype of three 
dimensional integration was demonstrated by stacking two chips and 
observing signal transfer between them [HOSOYA91]. The hope, of 
course, is to replace silicon with Josephson devices and build a three 
dimensional package that is a computer in a one-cm cube.  

The Computer Architecture Group investigates new architectures to take 
advantage of specific features of Josephson devices. The main difference 
between Josephson devices and conventional devices is that each Josephson 
device acts as a latch. Because separate latches, and the delay they 
add, are not needed between the stages of a pipelined computer, the 
processor may be deeply pipelined.  In pipelining, multiple instructions 
in a computer are 
overlapped in execution.  Each instruction is broken into parts, called 
stages.  Pipelining is a key implementation technique used to make 
today's fast CPUs. The figure below shows a simple (and ideal) example of 
pipelining.  


 I1.     |-IF--|-ID--|-OF--|-EX--|-WB--|
 I2.           |-----|-----|-----|xxxxx|-----|
 I3.                 |-----|-----|-----|xxxxx|-----|
 I4.                       |-----|-----|-----|xxxxx|-----|
 I5.                             |-----|-----|-----|xxxxx|-----|
 I6.                                   |-----|-----|-----|xxxxx|-----|
 
In this figure six instructions execute in sequence. The stage of each 
instruction denoted with x's represents the actual execution (EX), as 
opposed to instruction fetch (IF), decode (ID), etc.  
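
The timing in the figure can be checked with a short calculation. The 
sketch below is only an illustration: it assumes an ideal pipeline with 
no stalls and one instruction issued per cycle, and the function names 
are hypothetical, not taken from any of the cited papers.

```python
# Ideal pipeline timing: a sketch assuming no stalls and one issue per cycle.
STAGES = ["IF", "ID", "OF", "EX", "WB"]  # stage labels taken from the figure


def stage_schedule(n_instructions, stages=STAGES):
    """Return {instruction index: {stage: cycle}}, cycles counted from 0."""
    return {
        i: {stage: i + s for s, stage in enumerate(stages)}
        for i in range(n_instructions)
    }


def pipelined_cycles(n_instructions, n_stages=len(STAGES)):
    """Cycles needed when successive instructions overlap perfectly."""
    if n_instructions == 0:
        return 0
    return n_stages + n_instructions - 1


def unpipelined_cycles(n_instructions, n_stages=len(STAGES)):
    """Cycles needed if each instruction finishes before the next starts."""
    return n_stages * n_instructions


if __name__ == "__main__":
    n = 6  # the figure shows I1 through I6
    print(pipelined_cycles(n), unpipelined_cycles(n))
```

For the six instructions of the figure this gives 10 cycles with 
overlap against 30 without, which is why the ideal speedup approaches 
the number of stages as the instruction count grows.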

In a super-pipelined computer each stage is divided into smaller pipeline 
segments, as in the figure below, which is also idealized.  

 I1.     |-----|-----|-----|xxxxx|-----|
 I2.      |-----|-----|-----|xxxxx|-----|
 I3.       |-----|-----|-----|xxxxx|-----|
 I4.        |-----|-----|-----|xxxxx|-----|
 I5.         |-----|-----|-----|xxxxx|-----|
 I6.          |-----|-----|-----|xxxxx|-----|
 I7.           |-----|-----|-----|xxxxx|-----|
 I8.            |-----|-----|-----|xxxxx|-----|
 I9.             |-----|-----|-----|xxxxx|-----|
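
The effect of dividing each stage into smaller segments can be estimated 
in the same idealized way. In the sketch below, which is an assumption 
for illustration and not a model from the report, every stage is split 
into the same number of equal segments, a new instruction may enter at 
every segment boundary, and time is measured in units of one original 
stage. The segment count of 6 used in the example is only an 
illustrative choice.

```python
from fractions import Fraction


def superpipelined_time(n_instructions, n_stages=5, segments_per_stage=1):
    """Total time, in units of one original stage time, for an ideal
    pipeline whose stages are each split into equal segments."""
    # Each instruction passes through n_stages * segments_per_stage
    # segments, and one instruction enters per segment time.
    minor_cycles = n_stages * segments_per_stage + n_instructions - 1
    return Fraction(minor_cycles, segments_per_stage)


if __name__ == "__main__":
    print(superpipelined_time(9))                        # plain pipeline
    print(superpipelined_time(9, segments_per_stage=6))  # super-pipelined
```

With nine instructions the plain five-stage pipeline needs 13 stage 
times, while the version with six segments per stage needs 19/3, a 
little over six stage times.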

Pipelining and super-pipelining permit higher potential performance. The 
main impediments to achieving this are 
 (1) The extra overhead associated with a large number of segments.  
     Circuitry, called latches, is needed between the segments. 
 (2) A situation that prevents the next instruction in the instruction 
     stream from executing during its clock cycle. This could be a 
     hardware resource conflict, a data conflict when an instruction 
     depends on the results of an unfinished instruction, or a control 
     problem when the program counter is changed because of a branch 
     instruction [JOUPPI89].  
 (3) The memory system. Hennessy and Patterson (Computer Architecture: A 
     Quantitative Approach, Morgan Kaufmann Publ, 1990) claim that the 
     "biggest impact of pipelining on the machine resources is in the 
     memory system". Highly pipelined processors require a much higher 
     memory bandwidth than non-pipelined processors because instructions 
     and data are fetched from and stored to memory at a much higher rate.  


Concerning (1).
As mentioned above, one of the distinct characteristics of Josephson 
logic is that each basic logic device acts as its own latch, and, in 
principle this permits a very large number of segments with little 
overhead.  

Concerning (2).
The CPC has two main characteristics: pipelined memory and a fixed number 
of instruction streams which share the functional units and main memory.  
Only the hardware which can be considered part of the context of the 
particular instruction stream is duplicated.  This hardware includes the 
program counter, processor status, registers, etc.  By alternating the 
instruction streams in a cyclic manner, distinct virtual processors are 
created.  In effect, the CPC implements a multiple instruction multiple 
data (MIMD) computer.  The figure below illustrates this idea with three 
distinct instruction streams in a pipelined computer. An analogous figure 
could be given for a super-pipelined CPC.  

               Instruction stream A
 A1.     |-----|-----|-----|xxxxx|-----|
 A2.           |-----|-----|-----|xxxxx|-----|
 A3.                 |-----|-----|-----|xxxxx|-----|

               Instruction stream B
 B1.      |-----|-----|-----|xxxxx|-----|
 B2.            |-----|-----|-----|xxxxx|-----|
 B3.                  |-----|-----|-----|xxxxx|-----|

               Instruction stream C
 C1.       |-----|-----|-----|xxxxx|-----|
 C2.             |-----|-----|-----|xxxxx|-----|
 C3.                   |-----|-----|-----|xxxxx|-----|
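
The cyclic alternation shown in the figure above can be sketched as a 
round-robin issue schedule. This is a simplified model under stated 
assumptions, not the FLATS-2 or CPC hardware design: one instruction is 
issued per cycle, streams are numbered from 0, and result_latency is a 
hypothetical parameter giving the number of cycles after issue at which 
an instruction's result becomes usable.

```python
def cyclic_issue(n_streams, n_slots):
    """Yield (cycle, stream, instruction number) for round-robin issue."""
    for cycle in range(n_slots):
        yield cycle, cycle % n_streams, cycle // n_streams


def stall_cycles(n_streams, result_latency):
    """Stall cycles for an instruction that depends on the previous
    instruction of the same stream.

    Consecutive instructions of one stream are issued n_streams cycles
    apart, so the dependency is already satisfied whenever
    n_streams >= result_latency.
    """
    return max(0, result_latency - n_streams)


if __name__ == "__main__":
    for cycle, stream, number in cyclic_issue(3, 6):
        print(cycle, "ABC"[stream], number + 1)
    print(stall_cycles(3, 5), stall_cycles(5, 5))
```

In this model the dependencies inside one stream are hidden by the 
instructions of the other streams, which is one way to read the claim 
that the cyclic scheme addresses impediment (2).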

Concerning (3).
If the performance of the CPU can be increased by pipelining, then why 
not increase the performance of the memory, that is, its access rate, by 
pipelining as well?  If a memory access can be divided into successive 
independent operations, for example decode column, decode row, access 
cell, output data, then these operations can be overlapped for successive 
accesses, thus pipelining memory.  In Josephson computers, the main 
memory is to be 
built with the same Josephson logic devices as those used in the 
processor. For such a computer, both the processor and the main memory 
would be naturally pipelined with the same pipeline pitch.  
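
A pipelined memory of this kind can be sketched as a small simulation. 
The four stage names follow the example in the text; everything else 
(the function name, the list used as memory, one new access admitted per 
cycle) is an assumption made for illustration.

```python
MEMORY_STAGES = ["decode column", "decode row", "access cell", "output data"]


def simulate_pipelined_memory(addresses, memory, stages=MEMORY_STAGES):
    """Return a list of (cycle, address, value), one entry per access,
    where cycle is the cycle in which the data is output."""
    pipe = [None] * len(stages)   # pipe[s] holds the address in stage s
    pending = list(addresses)
    completed = []
    cycle = 0
    while pending or any(a is not None for a in pipe):
        # Move every access one stage forward and admit a new one.
        incoming = pending.pop(0) if pending else None
        pipe = [incoming] + pipe[:-1]
        # The access now in the last stage delivers its data this cycle.
        if pipe[-1] is not None:
            completed.append((cycle, pipe[-1], memory[pipe[-1]]))
        cycle += 1
    return completed


if __name__ == "__main__":
    data = [10, 11, 12, 13]
    for entry in simulate_pipelined_memory([3, 1, 2], data):
        print(entry)
```

The first access completes after four cycles, so latency is unchanged, 
but after that one access completes every cycle regardless of which 
addresses are requested.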

Memory is a bottleneck in many high-performance computer systems.  
Increasing the machine-level parallelism increases the number of memory 
accesses (instruction fetch, operand fetch, operand store), 
making further demands on the design of efficient memory systems.  High 
performance computers often use techniques such as n-way low-order 
interleaving (the low order address bits select one of n memory modules) 
and n-bank memory, where the high order bits specify the bank and the 
low order bits are offsets into the bank. Low-order interleaving is 
especially efficient for array and vector processors where memory is 
often addressed sequentially (access to a vector), while n-bank memory 
is used in a shared memory multiprocessor where processors and memory 
modules are connected through an interconnection network.  
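
The two conventional address mappings can be written down directly. The 
sketch below uses integer division and remainder, which coincide with 
selecting low or high order bits when the module count and bank size are 
powers of two; the function names are hypothetical.

```python
def low_order_interleave(address, n_modules):
    """Low order bits pick the module; the rest is the offset within it."""
    return address % n_modules, address // n_modules


def high_order_bank(address, bank_size):
    """High order bits pick the bank; low order bits are the offset."""
    return address // bank_size, address % bank_size


if __name__ == "__main__":
    for address in range(8):
        print(address,
              low_order_interleave(address, 4),
              high_order_bank(address, 4))
```

Sequential addresses 0 to 7 fall in modules 0, 1, 2, 3, 0, 1, 2, 3 under 
low-order interleaving, so a vector access spreads over all modules, 
while under the banked mapping the same addresses fall in banks 0, 0, 0, 
0, 1, 1, 1, 1.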

The pipelined memory of the CPC has the advantage that it does not suffer 
from performance degradation caused by memory access conflicts.  Neither 
does the CPC require an interconnection network which may suffer either 
from path conflicts or memory access conflicts [PFISTER85].  

Current high-performance computers require cache memory which can keep up 
with the memory access rate.  When the processor requests data which is 
not in the cache, a cache miss occurs and the data must be fetched from 
memory.  For super-pipelined and superscalar computers, a cache miss can 
easily cause an overhead of a factor of ten. (In a superscalar machine, 
the hardware can issue a small number, two to four, of independent 
instructions in a single clock cycle.) In the CPC, the pipeline pitch of 
the main memory is the same as the pipeline pitch of the processor. CPC 
does not currently implement a cache, but the group is still researching 
this question.  
 
On the other hand, one disadvantage of a CPC is that the random memory 
access pattern of different instruction streams decreases locality of 
memory reference, but this is not a problem if a cache is not used.  The 
architecture group feels that CPC can be very well suited for random 
memory access patterns such as neural network simulations.  
 
CPC STATUS AND PROSPECTS
The work of the computer architecture group has been overshadowed by the 
attention drawn to the hardware. The architecture group has been 
designing a computer architecture which is specifically suited for 
implementation on a machine with Josephson devices that are used both for 
the main processor as well as for the memory. The inherent rapid 
switching capability of Josephson devices means that it might be 
profitable to rethink some fundamental assumptions about the relationship 
of memory to processing.  To most effectively implement their ideas it is 
necessary to have Josephson technology in place, but all other aspects of 
the research are essentially independent of it. In other words, using 
basic assumptions about this technology the group can design and simulate 
using silicon integrated circuits (ICs). Furthermore the group feels that 
it would be reasonable to use CPC even without Josephson technology.  

But in a fully Josephson computer the CPC approach claims to be able to 
increase clock speeds to 10 GHz, with resulting increases of processing 
speed. For example, for a Josephson CPC, matrix multiplication is 
predicted to execute at 20 GFlop peak on a processor equipped with one 
floating point adder and one floating point multiplier when two matrix 
operands can be fetched from memory in parallel. Fast Fourier Transform 
(FFT) performance depends on the number of arithmetic units and the 
number of instruction operands that can be fetched in parallel, but a 
peak performance of 50 GFlop is predicted if 5 operands can be fetched in 
parallel and there are 3 floating point adders and 2 floating point 
multipliers.  
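
One reading that is consistent with both predicted figures is that every 
floating point unit delivers one result per clock at 10 GHz. The report 
does not state how the predictions were derived, so the sketch below is 
an assumption offered only as a plausibility check.

```python
def peak_gflops(clock_ghz, n_adders, n_multipliers):
    """Peak rate, assuming each unit completes one result per clock."""
    return clock_ghz * (n_adders + n_multipliers)


if __name__ == "__main__":
    print(peak_gflops(10, 1, 1))  # matrix multiplication configuration
    print(peak_gflops(10, 3, 2))  # FFT configuration
```

Under this assumption the two configurations give 20 and 50 GFlop, 
matching the predicted peaks, provided that memory can supply operands 
at the stated rates.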

Several versions of CPC have been designed and at least one has been 
built, FLATS-2, using Silicon (ECL) ICs rather than Josephson junction 
technology.  FLATS-2 is a CPC with two virtual processors that share ten 
pipeline stages. Machine cycle time is 65 ns, which is equivalent to 
memory cycle time. Transfer rate of memory is 117MB/sec for instructions 
and data. FLATS-2 consists of 26 logic boards, each of which contains 
between 200 and 400 IC chips, connected by a backplane board and by front 
flat cables, mounted on an air cooled rack chassis (57 x 62 x 37 cm), 
which is then packed into a cubic box along with power supplies.  
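
The quoted cycle time and transfer rate can be related by simple 
arithmetic. The sketch below computes the implied clock rate and the 
number of bytes moved per cycle. Whether "MB" here means 10^6 or 2^20 
bytes is not stated in the report, so both are shown; the near-8-byte 
result in the second case is an inference, not a documented FLATS-2 
parameter.

```python
CYCLE_NS = 65          # machine cycle time from the text
RATE_MB_PER_S = 117    # memory transfer rate from the text


def clock_mhz(cycle_ns):
    """Clock rate in MHz implied by a cycle time in nanoseconds."""
    return 1000.0 / cycle_ns


def bytes_per_cycle(rate_mb_per_s, cycle_ns, bytes_per_mb=10**6):
    """Bytes transferred in one cycle at the given rate."""
    return rate_mb_per_s * bytes_per_mb * cycle_ns * 1e-9


if __name__ == "__main__":
    print(round(clock_mhz(CYCLE_NS), 2))
    print(round(bytes_per_cycle(RATE_MB_PER_S, CYCLE_NS), 3))
    print(round(bytes_per_cycle(RATE_MB_PER_S, CYCLE_NS, 2**20), 3))
```

A 65 ns cycle corresponds to roughly 15.4 MHz, and the transfer rate 
works out to about 7.6 bytes per cycle with decimal megabytes or about 
8 bytes per cycle with binary megabytes.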

FLATS-2 is running. In addition to the operating system [SPEE90], a 
Fortran language based on Jordan's Force with parallel constructs is 
available.  The architecture group has run simulations on various matrix 
computations based on DGEFA and DGESL from LINPACK, conjugate gradient, 
FFTs and Livermore loops. The results are interesting but it is still too 
early to tell if this technique can really be applied without Josephson 
devices. Further, there are some scientists who feel that traditional 
methods will be equally efficient.

But what is important about this research is that it presents an almost 
orthogonal view of how to design very high performance computers. Almost 
without exception today, researchers feel that the future is highly 
parallel, that is, large numbers of processors each with their own 
memory.  The CPC approach instead uses a single processor with shared 
pipelined memory and multiple instruction streams.  Of course to be most 
practical it may have to await Josephson technology.  Nevertheless, as a 
research activity it 
has demonstrated several extremely innovative approaches and should be 
followed closely.  Furthermore, there is a chance that new ECL devices 
could be built that have the ability to function as their own latches, an 
important characteristic of Josephson devices. Goto told me that he had 
recently devised such new devices and that Hitachi was sufficiently 
excited about their potential to involve several others on their research 
staff in a more thorough study of their costs and benefits.  Finally it 
was reported late last year that members of the Goto project had 
successfully fabricated a new chip, 2.5mm square on which four QFP 
devices were set. When cooled in a liquid-helium environment (-269C) all 
of the single devices had a clock frequency of 16GHz corresponding to a 
measured switching speed of 15 picoseconds.  Linewidth of the 
manufactured device is 5 microns, but when 0.5 micron VLSI technology is 
applied it is believed that the speed can be increased by about a factor 
of ten.  

  For additional information about CPC, contact
                Dr. Yasuo Wada
                Technical Manager,
                Quantum Magneto Flux Logic Project,
                Bassin Shinobazu 202,
                2-1-42 Ikenohata, Taito-ku,
                Tokyo 110, Japan
 
 References:

[GIBOR90] 
   A. Gibor, "The ERATO Program", ONRFE Scientific Information Bulletin, 
Vol 15 #3, pp 27-30, 1990.  

[ENGEL90] 
   Alan Engel, "Opportunities for Foreign Researchers in Japan: ERATO", 
in Japanese Information in Science, Technology and Commerce, ed Monch, 
Wattenberg, Brockdorff, Krempien, Walravens, IOS Press, pp 553-558, 1990.  

[HIOE91]
  W. Hioe and E. Goto, "Quantum Flux Parametron", World Scientific,
  Singapore (1991).

[HOSOYA91]
  M. Hosoya, W. Hioe, J. Casas, R. Kamikawai, Y. Harada, Y. Wada, H. 
  Nakane, R. Suda and E. Goto, to be published in IEEE Trans. Appl. 
  Superconductivity.  

[ICHIKAWA87]
   Shuichi Ichikawa, "A Study on the Cyclic Pipeline Computer: FLATS2", 
   Tokyo University, February 1987.  

[JOUPPI89]
   Norman P. Jouppi, "The Nonuniform Distribution of Instruction-Level 
   and Machine Parallelism and Its Effect on Performance", IEEE 
   Transactions on Computers, December 1989, Vol. 38, No. 12.  

[LOE86]
   K. F. Loe and E. Goto, "DC Flux Parametron - A New Approach to 
   Josephson Junction Logic", World Scientific, Singapore, 1986.  

[PFISTER85]
   G. Pfister and V. Norton, "Hot-spot contention and combining in 
   multistage interconnection networks", ACM Transactions on Computer 
   Systems, October 1985, Vol. 3, No. 4.  

[ROWELL88]
 J. Rowell et al., "JTECH Panel Report on the Japanese Exploratory 
 Research for Advanced Technology (ERATO) Program", Science Applications 
 International Corporation, McLean VA (1988).  

[SHIMIZU89]
   Kentaro Shimizu, Eiichi Goto, and Shuichi Ichikawa, "CPC (Cyclic 
   Pipeline Computer) - An Architecture Suited for Josephson and 
   Pipelined Machines", IEEE Transactions on Computers, June 1989, pp.  
   825-832.  

[SPEE90]
   Paul Spee, Mitsuhisa Sato, Norihiro Fukazawa, and Eiichi Goto, "The 
   Design and Implementation of the CPX kernel", Proceedings of the 7th 
   Riken Symposium on Josephson Electronics, Wako-shi, March 23rd, 1990, 
   pp. 10-20.  

-------------------------------------------------------------------------
Papers Presented at the Eighth RIKEN Symposium on Josephson Electronics
 March 15, 1991 (RIKEN, Wako-shi)

1.  Multiple Instruction Streams in a Highly Pipelined Processor
     M. Sato (Research Devel Corp of Japan) (in English)

2.  Evaluation of the Continuation Bit in the Cyclic Pipeline Computer
     P. Spee (Research Devel Corp of Japan) (in English)

3.  Evaluation of FLATS2 Instruction Set Architecture
     S. Ichikawa (Research Devel Corp of Japan) (in English)

4.  Design and Evaluation of High Efficiency Pulse-Tube Refrigerator
     M. Kasuya (Research Devel Corp of Japan) (in English)

5.  Detection and Sweeping of Trapped Flux Quanta in Superconducting 
          Films
     Q. Geng  (Research Devel Corp of Japan) (in English)

6.  High TC Oxide Superconductor Magnetic Shield and SQUID Measurement
     H. Ohta (Riken) (in Japanese)

7.  Results of the Josephson Computer Project in MITI
     S. Takada (ETL) (in Japanese)

8.  Cryoelectronics at UC Berkeley
     E. Fand (UC Berkeley) (in English)

9.  Prototype Model of Three Dimensional QFP Circuits
     M. Hosoya (Research Devel Corp of Japan) (in English)

10. Design of D-Gate Logic Circuit
     W. Hioe  (Research Devel Corp of Japan) (in English)

11. High Speed QFP Testing
     J. Casas (Research Devel Corp of Japan)  (in English)

12. A Fast A/D Converter Using QFP
     Y. Harada  (Research Devel Corp of Japan) (in Japanese)

13. Design and Evaluation of QFP 3D Packaging Aligner
     T. Tajima  (Hitachi) (in Japanese)
-----------------END OF REPORT-------------------------------------------