rick@cs.arizona.edu (Rick Schlichting) (06/04/91)
[Dr. David Kahaner is a numerical analyst visiting Japan for two years under the auspices of the Office of Naval Research-Asia (ONR/Asia). The following is the professional opinion of David Kahaner and in no way has the blessing of the US Government or any agency of it. All information is dated and of limited life time. This disclaimer should be noted on ANY attribution.]

[Copies of previous reports written by Kahaner can be obtained from host cs.arizona.edu using anonymous FTP.]

To: Distribution
From: David K. Kahaner, ONR Asia [kahaner@xroads.cc.u-tokyo.ac.jp]
Re: Cyclic Pipelined Computer and ERATO
3 June 1991

ABSTRACT. A brief description of ERATO and one of its projects, QMFL (Quantum Magneto Flux Logic), with particular emphasis on the Cyclic Pipelined Computer (CPC) is given. CPC is a shared pipelined memory, single processor, multiple instruction stream architecture, originally designed to be compatible with Josephson junction devices. This ERATO project ends this year.

This report is jointly authored by myself and
    Paul Spee
    Computer Architecture Group
    Research Development Corporation of Japan
    1-280 Higashi-Koigakubo
    Kokubunji, Tokyo 185, Japan
    Email: spee@jrdc.go.jp

ERATO.
The Exploratory Research for Advanced Technology (ERATO) Projects were started in 1981 by the Research Development Corporation of Japan (JRDC). JRDC is set up by Japanese law under the administration of the Science and Technology Agency (STA), which is a ministerial agency reporting directly to the prime minister's office; see my report (japgovt.udt, 30 July 1990). ERATO's objective is to conduct interesting basic research. Essentially it is an experiment in the management of R&D in which mostly young researchers from industry, government, and universities gather and conduct multidisciplinary research on high risk projects. A great deal has already been written about ERATO; see, for example, [ENGEL90] and [GIBOR90].
In this report we want to focus on one particular program; nevertheless, for completeness, we present a thumbnail sketch of the general program.

There are about a dozen ERATO projects at any time; the total budget is around $30 million US, so projects' support levels vary around $2-5 million per year. The staff also varies, but may be as large as about 20 researchers during the most active phase of a project. One of the most unusual things about ERATO is that all the projects are of fixed duration, five years. Although the program does not allow for extensions, promising activities might be continued by other organizations. To emphasize the temporary nature, each project rents whatever office and lab space it needs at a university, corporation or research institute.

ERATO focuses on young researchers; the average age is just slightly more than 31. They are given good facilities and good salaries. A JRDC study showed that starting salaries exceeded those of 75% of US PhD chemists in industry, and that salaries of ERATO researchers with three or more years of experience exceed those of 90% of US PhD chemists in industry.

A key ingredient of each ERATO project is its Director. The perfect person is charismatic, with a dynamic personality, eminent in his field, and capable of attracting and inspiring his co-workers. Once found, the Director is more or less free to recruit and organize the team as he sees fit. In fact the projects are informally referred to by the Director's name, e.g., "the Goto Project". Eiichi Goto, who directs the QMFL project, typifies this profile. Goto, who retired from the University of Tokyo in April 1991 and is now at the University of Kanagawa, invented the Parametron about 30 years ago. He is an extremely extroverted person, and still bristles with new ideas. In fact one of the younger scientists complained to me that Goto has so many ideas that it was difficult to keep up with his thinking.
The proceedings of the latest project symposium (titles attached) list Goto as a coauthor on all but one of the papers, including one on a new type of refrigerator.

About half of the ERATO researchers are seconded from industry; a few are from universities or national labs. The remainder are hired as individuals. Most of these are Japanese but about 10% are foreign. The seconding system preserves the researcher's seniority and benefits because ERATO reimburses the company for the researcher. The non-Japanese researchers give the projects a definite international flavor. Several of them speak little or no Japanese, and papers in the symposium mentioned in the preceding paragraph are almost entirely in English, although most of this was done as a preparation for presentations in the US in August.

Patents for ERATO projects are jointly owned by the inventors and JRDC. Researchers share legal expenses for patents they own with JRDC, but they may also assign ownership of the patent to JRDC. Company researchers may assign patent ownership to their company. Through 1988 there were 415 patent applications filed in Japan and 82 outside Japan. Up to 1988 the 338 ERATO researchers had written almost 1400 papers, and of these more than one third were published or presented outside of Japan.

Each year there is an ERATO symposium held in Tokyo. In each of four afternoon sessions, researchers from four different projects present the progress in their respective programs. Individual projects can also have symposia, although these are more informal.

A foreign researcher has in principle a one year contract which may be renewed. In fact, the ERATO budget explicitly allows for foreign researchers to stay for the full length of a project, five years, and through 1989, 27 researchers have participated, but only a few have remained the full five years. (Perhaps there is some concern among these young non-Japanese researchers about the incremental benefit of staying all five years.
Employment opportunities exist within Japanese corporations, but upward mobility is questionable.) A few foreign companies have also sent researchers, including Allelix (Canada), Celltech (UK), Intel (US), and 3M (US). Some formal recruiting occurs but most of the foreign researchers apply because of word of mouth recruiting. In 1989 there were 5 researchers from the US. Foreign researchers receive the same base salary as Japanese but they also receive moving expenses, a housing allowance, and some provision for Japanese language training. Researchers must locate their own housing; there are no special housing facilities because the ERATO projects are widely dispersed.

QUANTUM MAGNETIC FLUX PROJECT (GOTO-QMFL PROJECT)
This project began in 1986 and is directed by Professor E. Goto, recently retired as Professor of Information Science at Tokyo University. Goto is famous for his patenting in the 1950's of the Parametron, which uses resonating circuits in which current phase is used to store information. In fact, the first Japanese computers were based on the Parametron, for example the Hitachi HIPAC-xxx (P = Parametron). However, Hitachi eventually changed to transistor technology (Hitachi HITAC-xxx).

In 1983 Goto proposed a Parametron-like element using Josephson junctions. The binary states of the element are the two locations of magnetic flux. This idea is a natural step in Josephson technology in which devices use a single quantum of flux. In 1982 IBM's Josephson program was abandoned; several Japanese companies have continued their research and have been reporting steady progress. See for example the comments about Hitachi in my report (parallel.903, 6 Nov 1990).

The current Goto-QMFL project is divided into three groups:
    Fundamental Property
    Magnetic Shielding
    Computer Architecture

The first group within the project is working on a new Josephson device called QFP [HIOE91] in which the unit of information is not represented by voltage but by magnetic flux.
The second group is researching a helium liquefying process and magnetic shielding. The third group is researching a new type of architecture called the Cyclic Pipeline Computer (CPC) [SHIMIZU89]. Furthermore, software for this highly pipelined parallel computer is being developed.

The three groups illustrate the temporary nature of ERATO projects. When I first went to visit the Computer Architecture group, it was housed in an ordinary office building in central Tokyo. Last fall the group moved to the Hitachi Central Research Lab in suburban Tokyo. The Fundamental Property Group is also at Hitachi and the Magnetic Shielding Group is at ULVAC.

The overall project's aims are (1) to demonstrate that QFP devices can operate in the range of 10 GHz, (2) to demonstrate the capability of removing magnetic flux from superconductors, and (3) to develop a computer architecture suitable for a QFP computer. The Fundamental Property group has six to seven persons, and the Magnetic Shielding and Architecture groups each have about four people, excluding secretaries.

A discussion of the Fundamental Property and Magnetic Shielding groups, which are essentially associated with building Josephson devices, was given in a recent JTECH report, "The Japanese Exploratory Research For Advanced Technology (ERATO) Program", Dec 1988, in the chapter by Dr. John Rowell, "Goto Quantum Magneto Flux Logic Project" [ROWELL88]. The Architecture group was not in Rowell's area of expertise and was only mentioned in his report. His summary with respect to the Josephson technology is that the project is "plowing new ground (or old ground with new devices), and it will be most interesting to see the magnitude of its impact in ten years' time." A second JTECH study in 1989, "High Temperature Superconductivity in Japan", also has a short summary of the Goto project written by M.
Dresselhaus, again only focusing on the Josephson aspects and concluding that "this technology benefits from very high speeds and extremely small power consumption, and is being examined for a variety of digital applications including next generation computers."

The potential for high performance using Josephson devices comes from this combination of very high clock speeds (tens of gigahertz) and low power (10^(-9) Watts per gate). Another advantage of the QFP device is its flux transfer characteristics, and it has just been reported that a prototype of three dimensional integration was proven by stacking two chips together and observing signal transfer between them [HOSOYA91]. The hope, of course, is to replace the silicon with Josephson devices to build a three dimensional package which is a computer in a one-cm cube.

The Computer Architecture Group investigates new architectures to take advantage of specific features of Josephson devices. The main difference between Josephson devices and conventional devices is that Josephson devices act as a latch. Because there is then no delay caused by separate latches between the pipeline stages in a pipelined computer, the processor may be deeply pipelined.

In pipelining, multiple instructions in a computer are overlapped in execution. Each instruction is broken into parts, called stages. Pipelining is a key implementation technique used to make today's fast CPUs. The figure below shows a simple (and ideal) example of pipelining.

I1. |-IF--|-ID--|-OF--|-EX--|-WB--|
I2.       |-----|-----|-----|xxxxx|-----|
I3.             |-----|-----|-----|xxxxx|-----|
I4.                   |-----|-----|-----|xxxxx|-----|
I5.                         |-----|-----|-----|xxxxx|-----|
I6.                               |-----|-----|-----|xxxxx|-----|

In this figure six instructions (I1 through I6) execute in sequence. The stage of each instruction denoted with x's represents the actual execution (EX), as opposed to instruction fetch (IF), decode, etc.
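
The overlap shown in the pipelining figure can be checked with a few lines of Python. This is a minimal sketch and not part of the original report: it assumes an ideal pipeline with no stalls and one instruction entering per cycle, it reuses the five stage labels from the figure, and the function names are invented for illustration.

```python
# Ideal pipeline timing: one instruction enters per cycle, no stalls.
STAGES = ["IF", "ID", "OF", "EX", "WB"]  # stage labels taken from the figure


def ideal_schedule(num_instructions, stages=STAGES):
    """Map each instruction index to the cycle in which it occupies each stage."""
    return {
        i: {stage: i + s for s, stage in enumerate(stages)}
        for i in range(num_instructions)
    }


def pipelined_cycles(num_instructions, num_stages):
    """Total cycles when instructions overlap (fill the pipe, then one per cycle)."""
    return num_stages + num_instructions - 1


def sequential_cycles(num_instructions, num_stages):
    """Total cycles when each instruction finishes before the next one starts."""
    return num_stages * num_instructions


if __name__ == "__main__":
    n = 6  # the figure shows I1 through I6
    schedule = ideal_schedule(n)
    for i in range(n):
        print(f"I{i + 1}: EX in cycle {schedule[i]['EX']}")
    print(pipelined_cycles(n, len(STAGES)), "cycles with overlap")    # 10
    print(sequential_cycles(n, len(STAGES)), "cycles without overlap")  # 30
```

Under these assumptions the six instructions finish in 10 cycles instead of 30; the saving grows with the number of instructions, which is why deeper pipelines are attractive as long as the impediments discussed next can be controlled.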
In a super-pipelined computer each stage is divided into smaller pipeline segments, as in the figure below, which is also idealized.

I1. |-----|-----|-----|xxxxx|-----|
I2.   |-----|-----|-----|xxxxx|-----|
I3.     |-----|-----|-----|xxxxx|-----|
I4.       |-----|-----|-----|xxxxx|-----|
I5.         |-----|-----|-----|xxxxx|-----|
I6.           |-----|-----|-----|xxxxx|-----|
I7.             |-----|-----|-----|xxxxx|-----|
I8.               |-----|-----|-----|xxxxx|-----|
I9.                 |-----|-----|-----|xxxxx|-----|

Pipelining and super-pipelining permit higher potential performance. The main impediments to achieving this are

(1) The extra overhead associated with a large number of segments. Circuitry, called latches, is needed between the segments.

(2) A situation that prevents the next instruction in the instruction stream from executing during its clock cycle. This could be a hardware resource conflict, a data conflict when an instruction depends on the results of an unfinished instruction, or a control problem when the program counter is changed because of a branch instruction [JOUPPI89].

(3) The memory system. Hennessy and Patterson (Computer Architecture: A Quantitative Approach, Morgan Kaufmann Publ, 1990) claim that the "biggest impact of pipelining on the machine resources is in the memory system". Highly pipelined processors require a much higher memory bandwidth than non-pipelined processors because instructions and data are fetched from and stored to memory at a much higher rate.

Concerning (1). As mentioned above, one of the distinct characteristics of Josephson logic is that each basic logic device acts as its own latch, and, in principle, this permits a very large number of segments with little overhead.

Concerning (2). The CPC has two main characteristics: pipelined memory and a fixed number of instruction streams which share the functional units and main memory. In a CPC, a fixed number of instruction streams share common hardware.
Only the hardware which can be considered part of the context of the particular instruction stream is duplicated. This hardware includes the program counter, processor status, registers, etc. By alternating the instruction streams in a cyclic manner, distinct virtual processors are created. In effect, the CPC implements a multiple instruction multiple data (MIMD) computer. The figure below illustrates this idea with three distinct instruction streams in a pipelined computer. An analogous figure could be given for a super-pipelined CPC.

Instruction stream A
A1. |-----|-----|-----|xxxxx|-----|
A2.                   |-----|-----|-----|xxxxx|-----|
A3.                                     |-----|-----|-----|xxxxx|-----|
Instruction stream B
B1.       |-----|-----|-----|xxxxx|-----|
B2.                         |-----|-----|-----|xxxxx|-----|
B3.                                           |-----|-----|-----|xxxxx|-----|
Instruction stream C
C1.             |-----|-----|-----|xxxxx|-----|
C2.                               |-----|-----|-----|xxxxx|-----|
C3.                                                 |-----|-----|-----|xxxxx|-----|

Concerning (3). If the performance of the CPU can be increased by pipelining, then why not increase the performance, that is, the access rate of the memory, by pipelining as well? If a memory access can be divided into successive independent operations, for example decode column, decode row, access cell, output data, such operations could be executed in parallel, thus pipelining memory. In Josephson computers, the main memory is to be built with the same Josephson logic devices as those used in the processor. For such a computer, both the processor and the main memory would be naturally pipelined with the same pipeline pitch.

Memory is often a bottleneck in high-performance computer systems. By increasing the machine-level parallelism, the number of memory accesses (instruction fetch, operand fetch, operand store) increases, making further demands on the design of efficient memory systems.
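
Both ideas above, cyclic issue from several instruction streams and a pipelined memory, can be sketched in Python. This is illustrative only and is not the CPC design itself: the function names, the stream labels, and the choice of four memory stages (decode column, decode row, access cell, output data, as in the example above) are assumptions made for the sketch.

```python
def cyclic_issue(num_streams, instructions_per_stream):
    """Issue one instruction per cycle, rotating through the streams in order.

    Returns a list of (cycle, stream label, instruction number) tuples.
    Labels are A, B, C, ... so at most 26 streams are supported here.
    """
    labels = [chr(ord("A") + s) for s in range(num_streams)]
    schedule = []
    for cycle in range(num_streams * instructions_per_stream):
        stream = cycle % num_streams        # whose turn it is
        number = cycle // num_streams + 1   # that stream's next instruction
        schedule.append((cycle, labels[stream], number))
    return schedule


def same_stream_gaps(schedule):
    """Set of cycle distances between consecutive instructions of one stream."""
    last_seen = {}
    gaps = set()
    for cycle, stream, _ in schedule:
        if stream in last_seen:
            gaps.add(cycle - last_seen[stream])
        last_seen[stream] = cycle
    return gaps


def memory_cycles(num_accesses, num_stages, pipelined):
    """Cycles needed to complete a run of independent memory accesses."""
    if pipelined:
        return num_stages + num_accesses - 1  # a new access starts every cycle
    return num_stages * num_accesses          # each access waits for the last


if __name__ == "__main__":
    schedule = cyclic_issue(num_streams=3, instructions_per_stream=3)
    print(schedule[:4])
    print(same_stream_gaps(schedule))
    print(memory_cycles(100, 4, pipelined=True))
    print(memory_cycles(100, 4, pipelined=False))
```

With three streams, consecutive instructions of the same stream enter the pipeline three cycles apart, which gives each stream some room to resolve a dependency or branch before its next instruction is issued; whether that room is sufficient depends on the pipeline depth, which this sketch does not model.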
High performance computers often use techniques such as n-way low-order interleaving (the low-order address bits select one of n memory modules) and n-bank memory, where the high order bits specify the bank and the low order bits are offsets into the bank. Low-order interleaving is especially efficient for array and vector processors where memory is often addressed sequentially (access to a vector), while n-bank memory is used in a shared memory multiprocessor where processors and memory modules are connected through an interconnection network. The pipelined memory of the CPC has the advantage that it does not suffer from performance degradation caused by memory access conflicts. Neither does the CPC require an interconnection network, which may suffer either from path conflicts or memory access conflicts [PFISTER85].

Current high-performance computers require cache memory which can keep up with the memory access rate. When the processor requests data which is not in the cache, a cache miss occurs and the data must be fetched from memory. For super-pipelined and superscalar computers, a cache miss can easily cause an overhead of a factor of ten. (In a superscalar machine, the hardware can issue a small number, two to four, of independent instructions in a single clock cycle.) In the CPC, the pipeline pitch of the main memory is the same as the pipeline pitch of the processor. CPC does not currently implement a cache, but the group is still researching this question. On the other hand, one disadvantage of a CPC is that the random memory access pattern of different instruction streams decreases locality of memory reference, but this is not a problem if a cache is not used. The architecture group feels that CPC can be very well suited for random memory access patterns such as neural network simulations.

CPC STATUS AND PROSPECTS
The work of the computer architecture group has been overshadowed by the attention drawn to the hardware.
The architecture group has been designing a computer architecture which is specifically suited for implementation on a machine with Josephson devices that are used both for the main processor and for the memory. The inherent rapid switching capability of Josephson devices means that it might be profitable to rethink some fundamental assumptions about the relationship of memory to processing. To most effectively implement their ideas it is necessary to have Josephson technology in place, but all other aspects of the research are essentially independent of it. In other words, using basic assumptions about this technology the group can design and simulate using silicon integrated circuits (ICs). Furthermore the group feels that it would be reasonable to use CPC even without Josephson technology.

But in a fully Josephson computer the CPC approach claims to be able to increase clock speeds to 10 GHz, with resulting increases of processing speed. For example, for a Josephson CPC, matrix multiplication is predicted to execute at 20 GFlop peak on a processor equipped with one floating point adder and one floating point multiplier when two matrix operands can be fetched from memory in parallel. Fast Fourier Transform (FFT) performance depends on the number of arithmetic units and the number of instruction operands that can be fetched in parallel, but a peak performance of 50 GFlop is predicted if 5 operands can be fetched in parallel, and if there are 3 floating point adders and 2 floating point multipliers.

Several versions of CPC have been designed and at least one, FLATS-2, has been built, using silicon (ECL) ICs rather than Josephson junction technology. FLATS-2 is a CPC with two virtual processors that share ten pipeline stages. Machine cycle time is 65 ns, which is equivalent to memory cycle time. Transfer rate of memory is 117 MB/sec for instructions and data.
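
The peak figures quoted above follow from simple arithmetic, sketched here in Python. The assumptions are not the report's: each floating point unit is taken to deliver one result per clock, and "MB" is tried both as 2^20 bytes and as 10^6 bytes because the report does not say which is meant. The function names are hypothetical.

```python
def peak_gflops(clock_ghz, arithmetic_units):
    """Peak rate, assuming every unit completes one result per clock."""
    return clock_ghz * arithmetic_units


def clock_mhz(cycle_ns):
    """Clock frequency in MHz for a given cycle time in nanoseconds."""
    return 1000.0 / cycle_ns


def bytes_per_cycle(rate_mb_per_s, cycle_ns, bytes_per_mb=2**20):
    """Bytes moved in one machine cycle at the given transfer rate."""
    return rate_mb_per_s * bytes_per_mb * cycle_ns * 1e-9


if __name__ == "__main__":
    # Josephson CPC predictions at a 10 GHz clock.
    print(peak_gflops(10, 1 + 1))  # one adder + one multiplier -> 20
    print(peak_gflops(10, 3 + 2))  # three adders + two multipliers -> 50

    # FLATS-2: 65 ns machine cycle, 117 MB/sec memory transfer rate.
    print(round(clock_mhz(65), 2))
    print(round(bytes_per_cycle(117, 65), 2))                      # MB = 2**20 bytes
    print(round(bytes_per_cycle(117, 65, bytes_per_mb=10**6), 2))  # MB = 10**6 bytes
```

A 65 ns cycle corresponds to a clock of about 15.4 MHz, and 117 MB/sec then works out to just under 8 bytes per cycle if MB means 2^20 bytes (about 7.6 if it means 10^6). That would be consistent with a 64-bit transfer per cycle, although the report does not state the memory word width.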
FLATS-2 consists of 26 logic boards, each of which contains between 200 and 400 IC chips, connected by a backplane board and by front flat cables, mounted on an air cooled rack chassis (57 x 62 x 37 cm), which is then packed into a cubic box along with power supplies. FLATS-2 is running. In addition to the operating system [SPEE90], a Fortran language based on Jordan's Force with parallel constructs is available.

The architecture group has run simulations on various matrix computations based on DGEFA and DGESL from LINPACK, conjugate gradient, FFTs and Livermore loops. The results are interesting but it is still too early to tell if this technique can really be applied without Josephson devices. Further, there are some scientists who feel that traditional methods will be equally efficient.

But what is important about this research is that it presents an almost orthogonal view of how to design very high performance computers. Almost without exception today, researchers feel that highly parallel is the future, that is, large numbers of processors each with their own memory. The CPC approach uses shared pipelined memory and a single processor with multiple instruction streams. Of course to be most practical it may have to await Josephson technology. Nevertheless, as a research activity it has demonstrated several extremely innovative approaches and should be followed closely.

Furthermore, there is a chance that new ECL devices could be built that have the ability to function as their own latches, an important characteristic of Josephson devices. Goto told me that he had recently devised such new devices and that Hitachi was sufficiently excited about their potential to involve several others on their research staff in a more thorough study of their costs and benefits.

Finally it was reported late last year that members of the Goto project had successfully fabricated a new chip, 2.5 mm square, on which four QFP devices were set.
When cooled in a liquid-helium environment (-269C) all of the single devices had a clock frequency of 16 GHz, corresponding to a measured switching speed of 15 picoseconds. Linewidth of the manufactured device is 5 microns, but when 0.5 micron VLSI technology is applied it is believed that the speed can be increased by about a factor of ten.

For additional information about CPC, contact
    Dr. Yasuo Wada
    Technical Manager, Quantum Magneto Flux Logic Project
    Bassin Shinobazu 202, 2-1-42 Ikenohata
    Taito-ku, Tokyo 110, Japan

References:
[GIBOR90] A. Gibor, "The ERATO Program", ONRFE Scientific Information Bulletin, Vol 15 #3, pp. 27-30, 1990.
[ENGEL90] Alan Engel, "Opportunities for Foreign Researchers in Japan: ERATO", in Japanese Information in Science, Technology and Commerce, ed Monch, Wattenberg, Brockdorff, Krempien, Walravens, IOS Press, pp. 553-558, 1990.
[HIOE91] W. Hioe and E. Goto, "Quantum Flux Parametron", World Scientific, Singapore (1991).
[HOSOYA91] H. Hosoya, W. Hioe, J. Casas, R. Kamikawai, Y. Harada, Y. Wada, H. Nakane, R. Suda and E. Goto, to be published in IEEE Trans. Appl. Superconductivity.
[ICHIKAWA87] Shuichi Ichikawa, "A Study on the Cyclic Pipeline Computer: FLATS2", Tokyo University, February 1987.
[JOUPPI89] Norman P. Jouppi, "The Nonuniform Distribution of Instruction-Level and Machine Parallelism and Its Effect on Performance", IEEE Transactions on Computers, December 1989, Vol. 38, No. 12.
[LOE86] K. F. Loe and E. Goto, "DC Flux Parametron - A New Approach to Josephson Junction Logic", World Scientific, Singapore, 1986.
[PFISTER85] G. Pfister and V. Norton, "Hot-spot contention and combining in multistage interconnection networks", ACM Transactions on Computer Systems, October 1985, Vol. 3, No. 4.
[ROWELL88] J. Rowell et al., "JTECH Panel Report on the Japanese Exploratory Research for Advanced Technology (ERATO) Program", Science Applications International Corporation, McLean VA (1988).
[SHIMIZU89] Kentaro Shimizu, Eiichi Goto, and Shuichi Ichikawa, "CPC (Cyclic Pipeline Computer) - An Architecture Suited for Josephson and Pipelined Machines", IEEE Transactions on Computers, June 1989, pp. 825-832.
[SPEE90] Paul Spee, Mitsuhisa Sato, Norihiro Fukazawa, and Eiichi Goto, "The Design and Implementation of the CPX kernel", Proceedings of the 7th Riken Symposium on Josephson Electronics, Wako-shi, March 23rd, 1990, pp. 10-20.

-------------------------------------------------------------------------
Papers Presented at the Eighth RIKEN Symposium on Josephson Electronics
March 15, 1991 (RIKEN, Wako-shi)

1. Multiple Instruction Streams in a Highly Pipelined Processor
   M. Sato (Research Devel Corp of Japan) (in English)
2. Evaluation of the Continuation Bit in the Cyclic Pipeline Computer
   P. Spee (Research Devel Corp of Japan) (in English)
3. Evaluation of FLATS2 Instruction Set Architecture
   S. Ichikawa (Research Devel Corp of Japan) (in English)
4. Design and Evaluation of High Efficiency Pulse-Tube Refrigerator
   M. Kasuya (Research Devel Corp of Japan) (in English)
5. Detection and Sweeping of Trapped Flux Quanta in Superconducting Films
   Q. Geng (Research Devel Corp of Japan) (in English)
6. High TC Oxide Superconductor Magnetic Shield and SQUID Measurement
   H. Ohta (Riken) (in Japanese)
7. Results of the Josephson Computer Project in MITI
   S. Takada (ETL) (in Japanese)
8. Cryoelectronics at UC Berkeley
   E. Fand (UC Berkeley) (in English)
9. Prototype Model of Three Dimensional QFP Circuits
   M. Hosoya (Research Devel Corp of Japan) (in English)
10. Design of D-Gate Logic Circuit
    W. Hioe (Research Devel Corp of Japan) (in English)
11. High Speed QFP Testing
    J. Casas (Research Devel Corp of Japan) (in English)
12. A Fast A/D Converter Using QFP
    Y. Harada (Research Devel Corp of Japan) (in Japanese)
13. Design and Evaluation of QFP 3D Packaging Aligner
    T. Tajima (Hitachi) (in Japanese)

-----------------END OF REPORT-------------------------------------------