eugene@wilbur.nas.nasa.gov (Eugene N. Miya) (06/16/90)
I do not know the structure of the lsi or hardware groups. Someone should (might) repost these over to those groups. --eugene

Software-type
======

To: Distribution
From: David K. Kahaner, ONRFE [kahaner@xroads.cc.u-tokyo.ac.jp]
Re: Two Japanese Approaches to Circuit Simulation
15 June 1990

TWO JAPANESE APPROACHES TO CIRCUIT SIMULATION

  Iain S. Duff
  Rutherford Appleton Laboratory
  duff@antares.mcs.anl.gov

  David K. Kahaner
  Office of Naval Research, Far East
  kahaner@xroads.cc.u-tokyo.ac.jp

ABSTRACT. We review research at Toshiba on vectorizing the circuit simulation program SPICE, and research at NEC on building a special-purpose multiprocessor for circuit simulation modeling.

INTRODUCTION. Circuit simulation is heavily used in electronic design. It allows engineers to evaluate and alter design parameters to obtain high performance or cost-effective utilization of parts. One of its most important applications is in the design of memory chips, because simulation is the only way that chip designers can study variations in MOS transistors. Naturally, the more detailed the design, the more computationally intensive the simulation. Even on supercomputers such as a Cray, simulations can take hours, as it is often necessary to repeat computations with various parameter values. Since these simulations are done frequently, it is not surprising that large electronics companies are engaged in a variety of research to improve their simulation algorithms with respect to computer time, accuracy, and other performance characteristics. This report describes two different approaches taken by NEC and Toshiba, both large Japanese firms with substantial circuit design requirements. The authors of this report are numerical analysts, not circuit designers or computer architects, and the article reflects this orientation.
Toshiba's approach, developed by

  Yoshinari Fukui
  Toshiba CAE Systems
  Kowa-Kawasaki-Nishiguchi Bldg
  66-2, Horikawa-cho, Saiwai-ku, Kawasaki 210, Japan
  Fax: 81-44-548-4069
  FUKUI@TCI.TOSHIBA.CO.JP

begins with the well-known program SPICE 2G6 [1] and rewrites it for Cray X-MP and Y-MP computers. Fukui reports performance gains of up to 300 (see below) compared to old versions of public domain SPICE. Toshiba does not manufacture a supercomputer, but they have a great many circuit design problems, which are usually run on one of their Crays. Fukui's work has already paid off for Toshiba, as the code is being used by engineers as a production tool.

NEC's work, spearheaded by

  Toshiyuki Nakata
  C&C Systems Research Laboratories
  NEC Corporation
  4-1-1 Miyazaki, Miyamae-Ku, Kawasaki, Kanagawa 213, Japan
  NAKATA@CSL.CS.NEC.CO.JP
  (044) 856-2127, Fax: (044) 856-2231

and

  Norio Tanabe
  VLSI CAD Engineering Div.
  NEC
  NAKATA@CSL.CL.NEC.CO.JP

essentially rewrites the circuit simulation algorithms from scratch and runs them on a specially designed 64-node multiprocessor. Nakata feels that vectorization has only limited potential for speed improvements, and that the rate of vectorization will decrease as more detail is added to the circuit models, because of growing numbers of conditional statements. The NEC project is the more experimental of the two but has the potential for greater payoffs in the future. On two models of Dynamic Random Access Memory (DRAM) circuits Nakata reports parallel speedups of over 20 with 64 processors, but feels that he can increase that with future improvements (see below). As a very large electronics organization, NEC has many circuit design problems. One of the specially designed circuit simulation machines has been installed in NEC's VLSI CAD Engineering Division. The description and comments about these projects are based upon discussions with Fukui and Nakata, as well as reviews of published papers and preprints.
In all cases we asked not to be told about any confidential activities.

BRIEF REVIEW OF CIRCUIT SIMULATION. The circuit to be modeled is specified by describing its components and their connectivity. Kirchhoff's current and voltage laws are used to make statements about conservation and total current flow. These laws are static, that is, they hold at all times. In addition there are equations describing the time rate of change of circuit voltage and current, due to conductance, capacitance, and inductive effects. The result is that circuit voltage and current are specified as the solution of a system of ordinary differential equations,

    Ay' = f

where A is a large, sparse (many zero entries) matrix determined by the physical structure of the circuit. This system of differential equations must be solved to determine the evolving characteristics of the circuit. The equations are known to be "stiff", because of the widely differing circuit time constants, and hence can only be solved by special techniques. In particular it is necessary to use implicit integration, such as backward Euler, or higher-order backward differentiation (Gear) methods [11]. This eventually results in a system of nonlinear equations to be solved for the solution y at each time step. The solution of these equations is done by iteration, usually employing Newton's method. At every time step each iteration requires the solution of a system of linear equations. The matrix of the linear equations is formed by adding the identity to a scalar multiple of the Jacobian matrix. The latter is the matrix of partial derivatives of the right-hand side of the differential equations, denoted f above. Thus the essential ingredients in performing transient circuit simulation analysis are

  (a) Setting up the problem. This includes formulating the matrix. It also includes computing equivalent conductances and current sources for linear reactive circuit elements.
  (b) Integration, including error control and time step strategy.
  (c) Solving the nonlinear equations by iteration. Each iteration requires solution of a sparse system of linear equations.

Circuit designers think of the Newton-Raphson iteration as a process that linearizes the nonlinear circuit elements. In practice, deriving the Jacobian matrix often accounts for most of the computation time. In a specific case cited by Nakata this amounted to more than eighty percent of the total computation time; about sixteen percent of the time was spent solving the sparse linear equations. These figures will vary with the complexity of the model. Nevertheless, to make large reductions in the CPU time all aspects must be accelerated.

Parts (b) and (c) are related because if the integration time steps are chosen carefully, Newton's method will converge rapidly and the computed values will be acceptably accurate. Small integration steps give more accuracy and require fewer iterations to converge, but require more computer time. If the step size is chosen too large it can take many iterations to converge, or the iteration may actually diverge; and of course larger steps can reduce accuracy. There is a delicate balance between these factors, and most simulation programs undergo a large amount of experimental "tuning"; many do not use the most modern step size selection strategy.

SPEEDING UP CIRCUIT SIMULATION. The two major approaches to increasing the performance of circuit simulation programs are (1) to take advantage of the partitioning suggested by the physical circuit in order to perform certain functions in parallel, and (2) to take advantage of the structure of the Jacobian matrix to perform as many operations as possible in parallel or in vector mode, in particular assembling and solving the sparse system of linear equations. Of course, the physical circuit also determines the structure of the matrix. NEC focuses on parallelization by use of a specially developed 64-CPU multiprocessor.
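As background for both approaches, the transient-analysis loop from the review section (implicit integration in part (b), Newton iteration in part (c)) can be sketched for a single stiff equation. This is a toy Python illustration with an invented model equation, not code from either project; note that the Newton matrix is exactly "identity plus a scalar multiple of the Jacobian", here a 1-by-1 scalar.

```python
def f(y):          # invented stiff right-hand side: y' = -1000*(y - 1)
    return -1000.0 * (y - 1.0)

def fprime(y):     # its Jacobian (a scalar for this one-equation model)
    return -1000.0

def backward_euler(y0, h, nsteps, tol=1e-12):
    """Implicit (backward Euler) integration; each time step solves
    g(y) = y - y_prev - h*f(y) = 0 by Newton's method."""
    y = y0
    for _ in range(nsteps):
        y_prev = y
        y_new = y_prev                       # initial guess: previous value
        for _ in range(50):                  # Newton iteration
            g = y_new - y_prev - h * f(y_new)
            dg = 1.0 - h * fprime(y_new)     # identity + scalar*Jacobian
            step = g / dg
            y_new -= step
            if abs(step) < tol:              # converged
                break
        y = y_new
    return y

# h = 0.01 is far beyond the explicit-Euler stability limit (h < 0.002
# for this equation), yet backward Euler remains stable and the solution
# settles at the steady state y = 1.
y = backward_euler(0.0, 0.01, 100)   # integrate to t = 1
print(abs(y - 1.0) < 1e-6)           # True
```

Because f here is linear, Newton converges in one iteration per step; in a real circuit model f is nonlinear and each step costs several sparse linear solves, which is why parts (b) and (c) dominate the run time.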
Toshiba has parallelized portions of their program, and has also made major strides in vectorizing the solution of the linear equations.

Many portions of the overall computation can be performed in parallel, in particular the setup or assembly phase. Often the circuit is partitioned into subcircuits, each of which is independently calculated, with an iteration until the various parts converge. This is often called a "relaxation algorithm" [2]. However, if there is substantial feedback between circuit parts, convergence may not occur. Another approach is to subdivide the circuit but force the subcircuits to interact via a network whose parameters are determined anew at each iteration. Nakata calls this "modular simulation" [3].

NEC'S APPROACH. Nakata and Tanabe have concentrated on rearranging the circuit simulation algorithm to take advantage of those aspects of it that have inherent parallelism. They have designed the algorithm to allocate subcircuits to different processors, each of which independently performs a simulation using local memory. The determination of the interconnection between the various subcircuits must be performed, serially, at every iteration by one "interconnection" processor. This processor is locked until all the subcircuits are finished. The actual algorithm is recursive, so there is a tree of such processors. The essential function of the interconnection processor is to solve a system of linear equations. The size of the interconnection matrix is determined by the number of interconnections. For example, in a DRAM circuit with 7,000 transistors, broken into 191 subcircuits, the interconnection matrix is about 200 by 200, with about 20 percent of its entries nonzero. (Research based on the same mathematical foundations, but without the multiple hierarchy of subcircuits, is being done at Texas Instruments [10].)
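At the linear-algebra level, the modular scheme just described amounts to a bordered block-diagonal solve: each subcircuit processor eliminates its own unknowns independently, and the serial interconnection processor then solves the small reduced (Schur-complement) system for the interconnection unknowns. A minimal sketch with scalar "subcircuits" and invented coefficients follows; it illustrates the idea only, not NEC's actual code.

```python
# Two scalar "subcircuits": a_i * x_i + b_i * xc = f_i,
# plus one interconnection row: c1*x1 + c2*x2 + d*xc = fc.
a1, b1, c1, f1 = 4.0, 1.0, 2.0, 6.0
a2, b2, c2, f2 = 5.0, -1.0, 1.0, 3.0
d, fc = 3.0, 4.0

# Each subcircuit processor independently eliminates its own unknown,
# expressing x_i = y_i - z_i * xc with locally computed y_i, z_i.
y1, z1 = f1 / a1, b1 / a1
y2, z2 = f2 / a2, b2 / a2

# The interconnection processor solves the reduced (Schur-complement)
# equation for xc -- this is the serial step in the algorithm.
s = d - c1 * z1 - c2 * z2
xc = (fc - c1 * y1 - c2 * y2) / s

# Back-substitution is again independent per subcircuit.
x1 = y1 - z1 * xc
x2 = y2 - z2 * xc

# Check against the original three equations.
print(abs(a1*x1 + b1*xc - f1) < 1e-12,
      abs(a2*x2 + b2*xc - f2) < 1e-12,
      abs(c1*x1 + c2*x2 + d*xc - fc) < 1e-12)   # True True True
```

With matrix blocks in place of scalars, the reduced system is exactly the roughly 200-by-200 interconnection matrix mentioned above, and its size grows with the number of subcircuit boundary nodes.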
Consideration of the form of the algorithm suggested that for many problems the subcircuit processors would compute independently, then all communicate to the interconnection processor, which would compute and finally communicate back to the subcircuit processors. Thus it will mostly be the case that communication is many-to-one or one-to-many. Nakata felt that a bus structure had better throughput for this pattern than more general organizations, such as a hypercube. He also felt that shared memory would be more efficient than message passing systems.

In 1987 the NEC group developed a prototype system composed of 4 processors. From that experience they have developed a 64-processor system, called Cenju, after the Buddhist goddess with one thousand arms. (Nakata feels that about 100 processors is the maximum number that can be effectively used on circuit problems.) He decided to use a distributed-shared memory. This means that there is a unique global address for the local memory in each processor, and that each processor has a local memory bus so it can access its local memory without worrying about bus contention. Since some interprocessor communication does occasionally take place, and building a common bus with a data transfer rate suitable for so many processors was difficult, he designed the data bus in a hierarchical (tree-like) form.

Cenju is thus composed of 64 processors, divided into 8 "clusters". Processors in a cluster are connected by a bus, and the clusters are connected via a multi-stage network. Each processor is built around a 20MHz Motorola MC68020 CPU with 4MB of dual-ported local memory connected by a local memory bus. The cluster bus only needs to be accessed when communicating with other processors. The design allows up to 256 processors. Since floating point computation is a fundamental part of circuit simulation, Nakata decided to use fast Weitek WTL1167 chips instead of the Motorola MC68882 chips that are most commonly used with MC68000-family systems.
An MC68882 is on the board and is used for math functions such as sin, cos, exp, etc., but floating point addition, subtraction, multiplication, and division are done through calls to the Weitek chip. Other details of the hardware organization can be found in the paper [4].

Nakata is a computer architect, not a numerical analyst or a circuit designer. In developing Cenju he collaborated heavily with Norio Tanabe, who has been doing circuit simulation for many years and is in NEC's VLSI CAD Engineering Division. Some aspects of the algorithm were developed in collaboration with other scientists in NEC's Computer and Communication Systems Research Division, and Scientific Information System Development Ltd (a subsidiary). For complete addresses, see references [3] and [4].

Nakata has tested the running system on two circuits, one with about 1700 transistors and another with about 7000. Both are part of the control circuitry for DRAMs. Speedups were initially reported as 15 for the small circuit and 16 for the larger. The circuit description is entered by the system designer. With respect to domain decomposition, Nakata and Tanabe feel that it is best to take advantage of the hierarchical design used by the designers. When that cannot be done, a domain decomposer should be available to automatically generate the subcircuits, although for these examples they utilized the designers' hierarchical design: 64 subcircuits for the first circuit and 191 for the second. When there are more subcircuits than processors there is quite a bit of extra overhead, and Nakata feels that this accounted for some degradation in performance; the matrix problem to be solved for the interconnection matrix is larger, and that matrix has about 20% nonzero elements. Its solution is done serially with respect to the other phases of the calculation.
Nakata feels that additional improvements would occur if the domain decomposer were more sophisticated and produced fewer subcircuits, as this would directly reduce the size of the interconnection matrix. He has also made some initial efforts to parallelize solving the matrix problem, and work in this direction has improved the speedup for the large problem to over 20. Nakata has not done as much work on solving the matrix problem as Fukui at Toshiba (below), because most of the original sparse matrix solution has been parallelized in the subcircuit processors; what is left is the sparse matrix solution for the interconnection points. However, he feels that by utilizing better domain decomposition he can substantially increase the speedup. Nakata originally built Cenju as a special purpose computer, but now is studying the possibilities of using it for other applications. Nakata notes that inter-cluster read access is very slow due to overhead in bus-error interrupt handling and timeout control. He dealt with this in circuit simulation by utilizing, whenever possible, algorithms in which the producer of data writes that data to its consumer.

TOSHIBA'S APPROACH. One of the early applications of direct methods for the solution of sparse linear systems was in the solution of problems from circuit simulation. In particular, this was the main reason for the interest in sparse matrix research of scientists from the Mathematical Sciences Department at IBM Yorktown Heights. This group, which included Bob Brayton, Gary Hachtel, Fred Gustavson, Werner Liniger, and Ralph Willoughby, conducted much fundamental research into sparse matrix techniques in the late 60s and early 70s. In 1970, Gustavson et al. published a paper [6] describing the use of a code-generation approach, which they termed GNSO (GeNerate and SOlve), for the solution of sparse equations.
The idea of the code-generation approach is to perform a factorization of the sparse matrix using Gaussian elimination and to record each arithmetic operation involved in a linear (loop-free) set of statements. This could be Fortran, assembler code, or a compressed coded format (the interpretative code approach). The initial elimination might be a regular factorization or a symbolic one, depending on sparsity alone and not on the numerical values. As mentioned above, in the circuit context (and in many other application areas) the Newton iteration requires solving several systems with a coefficient matrix of the same structure, and so this list of operations can be used for the subsequent factorizations. In the Toshiba experiments, sometimes about 900 iterations with a Jacobian matrix of the same structure were required; so, in common with most other experiments using this scheme, the initial cost of generating the loop-free code is not reported.

Duff [7] conducted some comparisons on the early code of Gustavson, for example, and concluded that this approach was of limited use for general problems. Duff and his coauthors also comment on the approach in the book [8], section 9.6, where the main problems are stated to be storage (with consequent speed problems if data has to be held on auxiliary devices) and stability. Additionally, loop-free code is not well optimized by most current compilers and may not be vectorized by them. Since about one line of code is generated for each entry of the factors, this approach is most attractive when the factors can be kept relatively sparse. Thus it would not be of interest in the solution of elliptic partial differential equations, but is much more suited to the types of matrices occurring in power systems and circuit simulation problems.
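The generate-and-solve idea can be sketched as follows. This is a toy Python illustration with invented names, not the GNSO code: one symbolic pass over the sparsity pattern records every arithmetic operation of the elimination as a flat, loop-free list, which can then be replayed on any later matrix having the same structure.

```python
# Matrices are dicts mapping (row, col) -> value; diagonal pivoting is
# assumed stable, as the report notes it usually is for circuit matrices.

def generate(pattern, n):
    """Symbolic elimination over a set of (i, j) positions; returns the
    flat operation list (the 'generated code')."""
    nz = set(pattern)
    program = []
    for k in range(n):
        for i in range(k + 1, n):
            if (i, k) not in nz:
                continue
            program.append(('div', (i, k), (k, k)))       # l_ik = a_ik / a_kk
            for j in range(k + 1, n):
                if (k, j) in nz:
                    nz.add((i, j))                        # record fill-in
                    program.append(('update', (i, j), (i, k), (k, j)))
    return program

def solve_replay(program, a):
    """Replay the generated operations on a numeric matrix (in place);
    afterwards a holds the LU factors."""
    for op in program:
        if op[0] == 'div':
            _, tgt, piv = op
            a[tgt] = a.get(tgt, 0.0) / a[piv]
        else:
            _, tgt, l, u = op
            a[tgt] = a.get(tgt, 0.0) - a[l] * a[u]
    return a

# A small 3x3 pattern; generate the straight-line code once ...
pattern = [(0, 0), (1, 1), (2, 2), (2, 0), (0, 2)]
prog = generate(pattern, 3)

# ... then factor a numeric matrix with that structure by pure replay.
a = {(0, 0): 2.0, (1, 1): 3.0, (2, 2): 5.0, (2, 0): 4.0, (0, 2): 1.0}
lu = solve_replay(prog, dict(a))
print(lu[(2, 0)] == 2.0, lu[(2, 2)] == 3.0)   # l_20 = 4/2, u_22 = 5 - 2*1
```

In the Newton setting, `generate` runs once per sparsity structure while `solve_replay` runs at every iteration; the storage cost the report discusses comes from `program` holding roughly one entry per nonzero of the factors.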
Additionally, in many circuit problems, and especially for memory modules, there is a natural stability in the devices, and this is reflected in the equations, so that diagonal pivoting is usually stable. These features make the code-generation approach potentially interesting in this application.

We now discuss the improvements that Fukui et al. have made to the basic method to give them a competitive code. The improvements are: (1) generation of assembler code using vector operations, (2) packing of integer index information, and (3) use of block elimination. We discuss each in turn.

Using the example from their paper [5], a fragment of generated code (in Fortran) may look like

      A(1) = A(1) - B*C(2)
      A(3) = A(3) - B*C(4)
      A(5) = A(5) - B*C(6)
      ........

and, if these statements are grouped suitably, one can see that the indirectly addressed vector A (addressed above through index vector 1 3 5 ...) is updated by a scalar multiple of the indirectly addressed vector C (index vector 2 4 6 ...). Thus, using a GATHER on A and C and a final SCATTER on A, the code can be vectorized on nearly all current vector computers. This gives significant gains in both the storage of the generated code and the speed of execution. On the examples in their paper [5], the savings are a factor of just under three in both storage and time. It should be noted that the difference between generating Fortran code and (scalar) assembler is a factor of about 30 in time.

If two integers are packed into a 64-bit Cray word, the constituent integers can be quickly recovered using a shift instruction. This device is used both to decrease the memory required by the address data and to speed up the code through a reduction in memory references. A 30% speedup in CPU time is reported by the authors through this device.

Fukui says that his block elimination is similar to the use of matrix-vector or level 2 BLAS (Basic Linear Algebra Subroutines) operations in Gaussian elimination on full systems.
However, it is not clear how he would do that with his current approach, and it does not appear to be illustrated in the paper. From the example there, it would appear that what is done is to hold the pivot row in vector registers and use it to update all the appropriate non-pivot rows without further memory references for the pivot row. It would also be possible to update several rows as a single vector operation, but it is not clear if this is done. The gains reported for block elimination are a factor of 1.5 in both storage and time over the vector assembler code.

Fukui et al. also discuss in [5] other issues in their adaptation of SPICE-GT, based on SPICE 2G6 from Berkeley. The other area of main interest is the assembly or formation of the circuit simulation matrices. It is like the assembly of a finite-element matrix and thus admits of easy parallelism. Vectorization is not so easy, because of complicated conditionals. They vectorize by grouping together (using indirect addressing) elements satisfying the same condition and then processing these subsets with vector GATHER/SCATTER. This is very similar to techniques used by Miura and others (for example, [9]) in vectorizing Monte Carlo codes. (At NEC, Nakata claims that the matrix setup phase is difficult to parallelize in a naive manner using a great many processors, because of high overhead associated with shared memory contention.) The gains over a 1984 code (which generated Fortran) were a factor of 4 in matrix assembly, 100 in solution, and under two in other parts of the code. The overall gain of the 1989 code is a factor of around 10. A factor of 300 compared with the original SPICE is claimed, presumably because of other algorithmic (or machine!?) changes not reported on in [5].

SUMMARY. The NEC-Nakata-Tanabe approach is essentially one of understanding the physical problem and then developing a hardware-software solution.
Toshiba-Fukui focuses much more directly on the sparse matrix problem that needs to be solved.

Nakata's work really begins with a clean sheet. Many of the relatively special needs of circuit simulation, such as many-to-one and one-to-many communication, contention-free local memory access, fast floating point, etc., are designed into the Cenju multiprocessor. More work still needs to be done on the serial portion of the algorithm, the sparse linear equation solver for the interconnection points, which is a bottleneck. Work is also going forward on automatic domain decomposition. Nevertheless, the current speedups of over 20 are impressive. The Cenju design does not seem so specialized that it cannot be useful for other problems. This looks like an exciting project with a substantial amount of room for further research.

Fukui's work is very interesting, not least because it illustrates the use of a technique thought impractical by many sparse matrix researchers. It does not seem, however, to be suitable for a wider range of applications, because of stability and storage issues. Also, there is a great deal of machine dependence, although code-generation routines are usually in two parts, the first being machine independent. This seems to be the case here, so that potentially a separate code generator (and index packer) could be used for each machine. It is difficult to say from the data provided how good the generated code is. The improvements recorded are impressive, but would need to be benchmarked against other current codes or on standard test problems to obtain an accurate picture. Fukui has indicated his willingness to collaborate in benchmarking experiments. It should be pointed out that compilers are getting much better at generating vector code from standard matrix factorization routines, so that their output may not now end up so different from the code generated by the Fukui approach!

REFERENCES.
[1] Nagel L., "SPICE2, A Computer Program to Simulate Semiconductor Circuits", Tech. Rpt. ERL-M520, Univ. of California, Berkeley, May 1975.
[2] Deutsch J., Newton A., "A Multiprocessor Implementation of Relaxation-Based Electrical Circuit Simulation", in Proc. 21st Design Automation Conference, pp. 350-357, 1984.
[3] Nakata T., Tanabe N., Onozuka H., Kurobe T., Koike N., "A Multiprocessor System for Modular Circuit Simulation", in Proc. ICCAD-87, pp. 364-367, 1987.
[4] Nakata T., Tanabe N., Kajihara N., Matsushita S., Onozuka H., Asano Y., Koike N., "Cenju: A Multiprocessor System for Modular Circuit Simulation", to be published in Computer Systems in Engineering, 1990.
[5] Fukui Y., Yoshida H., Higono S., "Supercomputing of Circuit Simulation", in Proc. Supercomputing '89, ACM Publication Order Number 415892, pp. 81-85, 1990.
[6] Gustavson F., Liniger W., Willoughby R., "Symbolic Generation of an Optimal Crout Algorithm for Sparse Systems of Linear Equations", JACM, vol. 17, pp. 87-109, 1970.
[7] Duff I., "Practical Comparisons of Codes for the Solution of Sparse Linear Systems", in Sparse Matrix Proceedings 1978, Duff I. and Stewart G., eds., SIAM Press, pp. 107-134, 1979.
[8] Duff I., Erisman A., Reid J., "Direct Methods for Sparse Matrices", Oxford University Press, 1986.
[9] Miura K., Babb R. II, "Tradeoffs in Granularity and Parallelization for a Monte Carlo Shower Simulation Code", Parallel Computing, vol. 8, pp. 91-100, 1988.
[10] Cox P., Butler R., Epler B., "Circuit Partitioning for Parallel Processing", in Proc. ICCAD-86, pp. 186-189, 1986.
[11] Kahaner D., Moler C., Nash S., "Numerical Methods and Software", Prentice-Hall, 1989.

-------------------END OF REPORT------------------------------------