[soc.culture.japan] Kahaner Report on Japanese Circuit sim.

eugene@wilbur.nas.nasa.gov (Eugene N. Miya) (06/16/90)

I do not know the structure of the lsi or hardware groups.
Some one should (might) repost these over to those groups.

--eugene
  Software-type
======

To: Distribution
From: David K. Kahaner, ONRFE [kahaner@xroads.cc.u-tokyo.ac.jp]
Re: Two Japanese Approaches to Circuit Simulation
15 June 1990

            TWO JAPANESE APPROACHES TO CIRCUIT SIMULATION

                            Iain S. Duff
                    Rutherford Appleton Laboratory
                       duff@antares.mcs.anl.gov

                          David K. Kahaner
                  Office of Naval Research, Far East
                    kahaner@xroads.cc.u-tokyo.ac.jp


ABSTRACT.
        We review research at Toshiba on vectorizing the circuit simulation 
program SPICE, and research at NEC on building a special-purpose 
multiprocessor for circuit simulation.  

INTRODUCTION.
        Circuit simulation is heavily used in electronic design. It allows 
engineers to evaluate and alter design parameters to obtain high performance 
or cost effective utilization of parts. One of its most important 
applications is in the design of memory chips, because it is the only way 
that chip designers can study variations in MOS transistors.  Naturally, the 
more detailed the design, the more computationally intensive the simulation. 
Even on supercomputers such as a Cray, simulations can take hours, and it is 
often necessary to repeat computations with various parameter values. Since 
these simulations are done frequently, it is not surprising that large 
electronics companies are engaged in a variety of research to improve their 
simulation algorithms with respect to computer time, accuracy, and other 
performance characteristics. This report describes two different approaches 
taken by NEC and Toshiba, both large Japanese firms with substantial circuit 
design requirements. The authors of this report are numerical analysts, not 
circuit designers or computer architects, and the article reflects this 
orientation.  

        Toshiba's approach, developed by 
                 Yoshinari Fukui
                 Toshiba CAE Systems
                 Kowa-Kawasaki-Nishiguchi Bldg
                 66-2, Horikawa-cho, Saiwai-ku,
                 Kawasaki 210, Japan
                 Fax: 81-44-548-4069
                 FUKUI@TCI.TOSHIBA.CO.JP

begins with the well-known program SPICE 2G.6 [1] and rewrites it for Cray 
X-MP and Y-MP computers. Fukui reports performance gains of up to 300 (see 
below) compared to old versions of public domain SPICE.  Toshiba does not 
manufacture a supercomputer, but they have a great many circuit design 
problems which are usually run on one of their Crays.  Fukui's work has 
already paid off for Toshiba, as the code is being used by engineers as a 
production tool.  

        NEC's work, spearheaded by
                 Toshiyuki Nakata
                 C&C Systems Research Laboratories
                 NEC Corporation
                 4-1-1 Miyazaki, Miyamae-Ku
                 Kawasaki, Kanagawa 213, Japan
                 NAKATA@CSL.CS.NEC.CO.JP
                 (044) 856-2127, Fax: (044) 856-2231

and              Norio Tanabe
                 VLSI CAD Engineering Div. NEC
                 NAKATA@CSL.CL.NEC.CO.JP

essentially rewrites the circuit simulation algorithms from scratch and runs 
them on a specially designed 64-node multiprocessor. Nakata feels that 
vectorization has only limited potential for speed improvements, and that 
the rate of vectorization will decrease as more detail is added to the 
circuit models because of growing numbers of conditional statements.  
The NEC project is the more experimental of the two but has the potential 
for greater payoffs in the future. On two models of Dynamic Random Access 
Memory (DRAM) circuits Nakata reports parallel speedups of over 20 with 64 
processors, but feels that he can increase that by future improvements (see 
below). As a very large electronics organization, NEC has many circuit design 
problems.  One of the specially designed circuit simulation machines has 
been installed in NEC's VLSI CAD Engineering Division.  

The description and comments about these projects are based upon discussions 
with Fukui and Nakata, as well as reviews of published papers and preprints. 
In all cases we asked not to be told about any confidential activities.  

BRIEF REVIEW OF CIRCUIT SIMULATION.
        The circuit to be modeled is specified by describing its components 
and their connectivity. Kirchhoff's current and voltage laws are used to make 
statements about conservation and total current flow.  These laws are 
static, that is, they hold at all times. In addition there are equations 
describing the time rate of change of circuit voltage and current, due to 
conductive, capacitive, and inductive effects. The result is that circuit 
voltage and current are specified as the solution of a system of ordinary 
differential equations, 
   Ay'=f
where A is a large, sparse (many zero entries) matrix determined by the 
physical structure of the circuit. This system of differential equations 
must be solved to determine the evolving characteristics of the circuit.  
The equations are known to be "stiff", because of the widely varying circuit 
time constants, and hence can only be solved efficiently by special 
techniques.  In particular it is necessary to use implicit integration, such 
as backward Euler or higher-order backward differentiation (Gear) methods 
[11]. This 
eventually results in a system of nonlinear equations to be solved for the 
solution y at each time step. The solution of these equations is done by 
iteration, usually employing Newton's method. At every time step each 
iteration requires the solution of a system of linear equations. The matrix 
of the linear equations is formed by adding the identity to a scalar 
multiple of the Jacobian matrix. The latter is the matrix of partial 
derivatives of the right hand side of the differential equations, denoted 
f above.  
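
In LaTeX notation, a minimal sketch of this algebra (assuming, for brevity, 
that the system has been reduced to the normal form $y' = f(y,t)$, with the 
matrix A absorbed into f): backward Euler advances from $t_n$ to 
$t_{n+1} = t_n + h$ by solving 
$$ y_{n+1} = y_n + h\,f(y_{n+1}, t_{n+1}), $$ 
and each Newton iterate $y^{(k)}$ is corrected by solving the linear system 
$$ \bigl(I - h\,J(y^{(k)})\bigr)\,\Delta = 
-\bigl(y^{(k)} - y_n - h\,f(y^{(k)}, t_{n+1})\bigr), \qquad 
y^{(k+1)} = y^{(k)} + \Delta , $$ 
where $J = \partial f/\partial y$ is the Jacobian; this is the "identity plus 
a scalar multiple of the Jacobian" matrix referred to above.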
 

        Thus the essential ingredients in performing transient circuit 
simulation analysis are 
(a) Setting up the problem. This includes formulating the matrix.  It 
     also includes computing equivalent conductances and current sources 
     for linear reactive circuit elements.  
(b) Integration, including error control and time step strategy.
(c) Solving the nonlinear equations by iteration. Each iteration requires 
     solution of a sparse system of linear equations. Circuit designers 
     think of the Newton-Raphson iteration as a process that linearizes the 
     nonlinear circuit elements.  

In practice, deriving the Jacobian matrix often accounts for most of the 
computation time. In a specific case cited by Nakata this amounted to more 
than eighty percent of the total computation time. About sixteen percent of 
the time was spent solving the sparse linear equations. These figures will 
vary with the complexity of the model. Nevertheless, to make large 
reductions in the CPU time all aspects must be accelerated.  Parts (b) and 
(c) are related because if the integration time steps are chosen carefully, 
Newton's method will converge rapidly and the computed values will be 
acceptably accurate.  Small integration steps give more accuracy and require 
fewer iterations to converge, but require more computer time.  If the 
step size is chosen too large, the iteration can take many steps to converge 
or may actually diverge, and of course larger steps can reduce accuracy.  
There is a delicate balance between these factors, and most simulation 
programs undergo a 
large amount of experimental "tuning"; many do not use the most modern step 
size selection strategy.  
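
A concrete, heavily simplified sketch of this time-step/Newton loop is given 
below in C: backward Euler in time, Newton's method at each step, and a 
crude step-halving strategy when Newton fails to converge.  The circuit is 
hypothetical (a capacitor charged through a resistor and loaded by a diode), 
and none of the code is taken from SPICE or from the programs discussed 
below.

    /* One-node transient analysis sketch.  The circuit equation is
          C dv/dt = (Vs - v)/R - Is*(exp(v/Vt) - 1),  i.e.  dv/dt = f(v). */
    #include <stdio.h>
    #include <math.h>

    static const double C  = 1e-6, R = 1e3, Vs = 5.0;   /* R-C charging */
    static const double Is = 1e-12, Vt = 0.0259;         /* diode load   */

    static double f(double v)
    { return ((Vs - v)/R - Is*(exp(v/Vt) - 1.0)) / C; }
    static double dfdv(double v)
    { return (-1.0/R - (Is/Vt)*exp(v/Vt)) / C; }

    int main(void)
    {
        double t = 0.0, tend = 5e-3, h = 1e-5, v = 0.0;
        while (t < tend) {
            double vold = v, vk = v;
            int k, converged = 0;
            for (k = 0; k < 20; k++) {               /* Newton iteration        */
                double res = vk - vold - h*f(vk);    /* backward-Euler residual */
                double jac = 1.0 - h*dfdv(vk);       /* I - h*J, scalar here    */
                double dv  = -res/jac;
                vk += dv;
                if (fabs(dv) < 1e-9) { converged = 1; break; }
            }
            if (!converged) { h *= 0.5; continue; }  /* retry with smaller step */
            v = vk;  t += h;
        }
        printf("v(%g s) = %f V\n", t, v);
        return 0;
    }

A production simulator differs mainly in scale: v becomes a large vector, 
the Jacobian a sparse matrix, and the Jacobian assembly plus the linear 
solve inside the Newton loop dominate the cost.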

SPEEDING UP CIRCUIT SIMULATION.
        The two major approaches to increasing the performance of circuit 
simulation programs are (1) to take advantage of the partitioning suggested 
by the physical circuit in order to perform certain functions in parallel, 
and (2) to take advantage of the structure of the Jacobian matrix to perform 
as many operations as possible in parallel or in vector mode, in particular 
assembling and solving the sparse system of linear equations. Of course, the 
physical circuit also determines the structure of the matrix.  NEC focuses 
on parallelization by use of a specially developed 64 CPU multiprocessor.  
Toshiba has parallelized portions of their program, and also made major 
strides in vectorizing the solution of the linear equations.  

        Many portions of the overall computation can be performed in 
parallel, in particular the setup or assembly phase. Often the circuit is 
partitioned into subcircuits, each of which is calculated independently, 
iterating until the various parts converge. This is often called a 
"relaxation algorithm" [2]. However, if there is substantial feedback between 
circuit parts, convergence may not occur.  Another approach is to subdivide 
the circuit but force the subcircuits to interact via a network whose 
parameters are determined anew at each iteration. Nakata calls this "modular 
simulation" [3].  
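
A toy linear analogue of the relaxation idea, with two one-unknown 
"subcircuits", may make the picture concrete (the coupled equations and 
their coefficients are invented for illustration; real waveform-relaxation 
codes do this with full transient solutions per subcircuit):

    /* coupled equations: a1*x1 + c12*x2 = b1,  c21*x1 + a2*x2 = b2.
       Each "subcircuit" is solved with the other's value frozen from the
       previous sweep, so the two solves could proceed in parallel.       */
    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
        double a1 = 4.0, a2 = 3.0, c12 = 1.0, c21 = 0.5, b1 = 6.0, b2 = 5.0;
        double x1 = 0.0, x2 = 0.0;
        for (int it = 0; it < 100; it++) {
            double x1new = (b1 - c12*x2)/a1;          /* "subcircuit 1" */
            double x2new = (b2 - c21*x1)/a2;          /* "subcircuit 2" */
            double change = fabs(x1new - x1) + fabs(x2new - x2);
            x1 = x1new;  x2 = x2new;
            if (change < 1e-12) { printf("converged, sweep %d\n", it+1); break; }
        }
        printf("x1 = %f, x2 = %f\n", x1, x2);
        return 0;
    }

If the coupling coefficients c12 and c21 are made large relative to a1 and 
a2, the sweeps diverge, which is the feedback problem mentioned above.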


NEC'S APPROACH.
        Nakata and Tanabe have concentrated on rearranging the circuit 
simulation algorithm to take advantage of those aspects of it that have 
inherent parallelism.  They have designed the algorithm to allocate 
subcircuits to different processors each of which independently performs a 
simulation using local memory. The determination of the interconnection 
between various subcircuits must be performed, serially, at every iteration 
by one "interconnection" processor. This processor is locked until all the 
subcircuits are finished. The actual algorithm is recursive, so there is a 
tree of such processors. The essential function of the interconnection 
processor is to solve a system of linear equations. The size of the 
interconnection matrix is determined by the number of interconnections.  For 
example, in a DRAM circuit with 7,000 transistors, broken into 191 
subcircuits, the interconnection matrix is about 200 by 200, with about 20 
percent of its entries nonzero. (Research based on the same mathematical 
foundations, but without the multilevel hierarchy of subcircuits, is being 
done at Texas Instruments [10].) 
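
In LaTeX notation, one standard way to write the linear algebra of such a 
modular scheme is the bordered block form below (shown for just two 
subcircuits; the report does not spell out NEC's exact formulation, so this 
is only an illustrative sketch).  With internal unknowns $x_1, x_2$ and 
interconnection unknowns $y$, 
$$ \begin{pmatrix} A_1 & 0 & B_1 \\ 0 & A_2 & B_2 \\ C_1 & C_2 & D 
\end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ y \end{pmatrix} = 
\begin{pmatrix} f_1 \\ f_2 \\ g \end{pmatrix}, \qquad 
\Bigl(D - \sum_i C_i A_i^{-1} B_i\Bigr)\, y \;=\; 
g - \sum_i C_i A_i^{-1} f_i . $$ 
Each subcircuit processor can factor its own $A_i$ and form its contribution 
to the reduced ("interconnection") system independently; the interconnection 
processor solves the reduced system for $y$, after which each subcircuit 
recovers $x_i = A_i^{-1}(f_i - B_i y)$.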

        Consideration of the form of the algorithm suggested that for many 
problems the subcircuit processors would compute independently and then all 
communicate to the interconnection processor which would then compute and 
finally communicate back to the subcircuit processors. Thus communication is 
mostly many-to-one or one-to-many. Nakata felt that a bus structure had 
better throughput for this pattern than more general organizations, such as 
a hypercube. He also felt that shared memory would be more efficient than 
message-passing systems. In 1987 the NEC group developed a prototype 
system composed of 4 processors. From that experience they have developed a 
64 processor system, called Cenju, after the Buddhist goddess with one 
thousand arms. (Nakata feels that about 100 processors is the maximum number 
that can be used effectively on circuit problems.)  He decided to use a 
distributed-shared memory. This means that the local memory of each processor 
has a unique global address, and that each processor has a local memory bus 
so it can access its own memory without worrying about bus contention.  
Since some interprocessor communication does take place, and building a 
common bus with a data-transfer rate adequate for so many processors was 
difficult, he designed the data bus in a hierarchical (tree-like) form.  
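
A hypothetical C sketch of the distributed-shared addressing idea may help; 
the field widths and decoding below are illustrative only and are not taken 
from Cenju's actual memory map:

    #include <stdint.h>
    #include <stdio.h>

    #define PROC_BITS   8u    /* up to 256 processors (the design limit) */
    #define OFFSET_BITS 22u   /* 4MB of local memory per processor       */

    /* every word of every processor's local memory gets one global address */
    static uint32_t make_global(uint32_t proc, uint32_t offset)
    {
        return ((proc & ((1u << PROC_BITS) - 1u)) << OFFSET_BITS)
             |  (offset & ((1u << OFFSET_BITS) - 1u));
    }

    /* a local access never needs the cluster bus or the network */
    static int is_local(uint32_t addr, uint32_t my_proc)
    {
        return (addr >> OFFSET_BITS) == my_proc;
    }

    int main(void)
    {
        uint32_t a = make_global(17, 0x1234);
        printf("0x%08x local to 17? %d  to 3? %d\n",
               (unsigned)a, is_local(a, 17), is_local(a, 3));
        return 0;
    }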

        Cenju is thus composed of 64 processors. These are divided into 8 
"clusters". Processors within a cluster are connected by a cluster bus, and 
the clusters are connected to one another via a multi-stage network.  Each 
processor is built around a 20-MHz Motorola MC68020 CPU with 4MB of dual-
ported local memory connected by 
a local memory bus. The cluster bus only needs to be accessed when 
communicating with other processors. The design allows up to 256 processors.  
Since floating point computation is a fundamental part of circuit 
simulation, Nakata decided to use fast Weitek WTL1167 chips instead of the 
Motorola MC68882 chips most commonly used with MC68000-family systems.  An 
MC68882 is on the board and is used for math functions such as sin, cos, 
exp, etc., but floating-point addition, subtraction, multiplication, and 
division are done through calls to the Weitek chip.  Other details of the 
hardware organization can be found in the paper [4].  

        Nakata is a computer architect, not a numerical analyst or a circuit 
designer. In developing Cenju he collaborated heavily with Norio Tanabe who 
has been doing circuit simulation for many years, and is in NEC's VLSI CAD 
Engineering Division. Some of the aspects of the algorithm were done in 
collaboration with other scientists in NEC's Computer and Communication 
Systems Research Division, and Scientific Information System Development Ltd 
(a subsidiary). For complete addresses, see references [3] and [4].  

        Nakata has tested the running system on two circuits, one with about 
1700 transistors, and another with about 7000. Both are part of the control 
circuitry for DRAMs. Speedups were initially reported as 15 for the small 
circuit and 16 for the larger. The circuit description is entered by the 
system designer. With respect to domain decomposition, Nakata and Tanabe 
feel that it is best to take advantage of the hierarchical design used by 
the designers. When that cannot be done, a domain decomposer should be 
available to generate the subcircuits automatically, although for these 
examples they used the designers' hierarchy: 64 subcircuits for the first 
and 191 for the 
second. When there are more subcircuits than processors there is quite a bit 
of extra overhead and Nakata feels that this accounted for some degradation 
in performance; the linear system to be solved at the interconnection 
points is larger, and the matrix has about 20% nonzero entries. Its solution 
is done serially with respect to the other phases of the calculation.  
Nakata feels that additional improvements would occur if the domain 
decomposer were more sophisticated and produced fewer subcircuits, as this 
would directly reduce the size of the matrix. He has also made some initial 
efforts to parallelize solving the matrix problem, and work in this 
direction has improved the speedup for the large problem to over 20. Nakata 
has not done as much work on solving the matrix problem as Fukui at Toshiba 
(below), because most of the original sparse matrix solution has been 
parallelized in the subcircuit processors; what is left is the sparse matrix 
solution for the interconnection points.  However, he feels that by 
utilizing better domain decomposition he can substantially increase the 
speedup.  

        Nakata originally built Cenju as a special purpose computer, but now 
is studying the possibilities of using it for other applications.  Nakata 
notes that inter-cluster read access is very slow due to overhead in bus-
error interrupt handling and timeout control. He dealt with this in circuit 
simulation by utilizing, whenever possible, algorithms in which the producer 
of data writes that data to its consumer.  


TOSHIBA'S APPROACH.
        One of the early applications of direct methods for the solution 
of sparse linear systems was in the solution of problems from circuit 
simulation.  In particular, this was the main reason for the interest in 
sparse matrix research of scientists from the Mathematical Sciences 
Department at IBM Yorktown Heights.  This group, which included Bob Brayton, 
Gary Hachtel, Fred Gustavson, Werner Liniger, and Ralph Willoughby conducted 
much fundamental research into sparse matrix techniques in the late 60s and 
early 70s.  In 1970, Gustavson et al. published a paper [6] describing the 
use of a code-generation approach, which they termed GNSO (GeNerate and 
SOlve), to the solution of sparse equations.  The idea of the code-
generation approach is to perform a factorization of the sparse matrix using 
Gaussian elimination and to record each arithmetic operation involved as a 
linear (loop-free) sequence of statements. These statements can be Fortran, 
assembler code, or a compressed coded format (the interpretive-code 
approach).  The initial 
elimination might be a regular factorization or a symbolic one, depending on 
sparsity alone and not on the numerical values. As mentioned above, in the 
circuit context (and in many other application areas) the Newton iteration 
requires solving many systems whose coefficient matrices share the same 
structure, so this list of operations can be reused for the subsequent 
factorizations. In the Toshiba experiments, sometimes about 900 iterations 
with a Jacobian matrix of the same structure were required; thus, in common 
with most other experiments using this scheme, the initial cost of 
generating the loop-free code is not reported.  
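
A small C sketch of the generate-and-solve idea follows; the 4-by-4 
sparsity pattern is invented, and a real generator would work from the 
circuit matrix and also emit the forward- and back-substitution statements:

    /* Run Gaussian elimination once on the nonzero pattern only (no
       numerical values, no pivoting) and emit one straight-line
       Fortran-like statement per arithmetic operation.  The emitted text
       can then be compiled or interpreted and re-executed for every
       Jacobian with the same structure.                                  */
    #include <stdio.h>
    #define N 4

    int main(void)
    {
        int nz[N][N] = {        /* 1 = structurally nonzero */
            {1, 1, 0, 0},
            {1, 1, 1, 0},
            {0, 1, 1, 1},
            {0, 0, 1, 1}
        };
        int ops = 0;
        for (int k = 0; k < N; k++)
            for (int i = k + 1; i < N; i++) {
                if (!nz[i][k]) continue;
                printf("      M = A(%d,%d) / A(%d,%d)\n", i+1, k+1, k+1, k+1);
                ops++;
                for (int j = k + 1; j < N; j++) {
                    if (!nz[k][j]) continue;
                    nz[i][j] = 1;       /* record fill-in */
                    printf("      A(%d,%d) = A(%d,%d) - M * A(%d,%d)\n",
                           i+1, j+1, i+1, j+1, k+1, j+1);
                    ops++;
                }
            }
        printf("* %d statements generated\n", ops);
        return 0;
    }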

        Duff [7] conducted some comparisons including the early code of 
Gustavson et al., and concluded that their approach was of limited use for 
general problems. Duff and his coauthors also comment on this approach in 
the book [8], section 9.6, where the main problems are stated to be storage 
(with consequent speed penalties if data must be held on auxiliary devices) 
and stability.  Additionally, loop-free code is not well optimized by most 
current compilers and may not be vectorized by them.  Since about one line 
of code is generated for each entry of the factors, this approach is most 
attractive when the factors can be kept relatively sparse.  Thus it would 
not be of interest in the solution of elliptic partial differential 
equations but is much more suited to the types of matrices occurring in 
power systems and circuit simulation problems.  Additionally, in many 
circuit problems, and especially for memory modules, there is a natural 
stability in the devices and this is reflected in the equations, so that 
diagonal pivoting is usually stable.  

        These features make the code-generation approach potentially 
interesting in this application. We now discuss the improvements that Fukui 
et al. have made to the basic method to give them a competitive code.  
   The improvements are:
 (1) generation of assembler code using vector operations,
 (2) packing of integer index information, and
 (3) use of block elimination.
We discuss each in turn.

        Using the example from their paper [5], a fragment of generated code 
(in Fortran) may look like 
           A(1) = A(1) - B*C(2)
           A(3) = A(3) - B*C(4)
           A(5) = A(5) - B*C(6)
           ........
and, if the programmer groups these statements suitably, one can see that 
the indirectly addressed vector A (addressed above through index vector 1 3 
5 ...) is updated by a scalar multiple of the indirectly addressed vector C 
(index vector 2 4 6...). Thus, using a GATHER on A and C and a final SCATTER 
on A, the code can be vectorized on nearly all current vector computers.  
This gives significant gains in both the storage of the generated code and 
the speed of execution.  On the examples in their paper [5], the savings are 
a factor of just under three in both storage and time. It should be noted 
that the difference between generating Fortran code and assembler (scalar 
assembler) is a factor of about 30 in time.  
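
A scalar C sketch of that reformulation is given below; the index values 
are invented, and on a Cray the three short loops would become a GATHER, 
one contiguous multiply-add loop, and a SCATTER:

    #include <stdio.h>
    #define M 4                           /* statements in one group */

    int main(void)
    {
        double A[8] = {10, 11, 12, 13, 14, 15, 16, 17};
        double C[8] = { 1,  2,  3,  4,  5,  6,  7,  8};
        double B = 2.0;
        int ia[M] = {0, 2, 4, 6};         /* 0-based analogue of 1,3,5,... */
        int ic[M] = {1, 3, 5, 7};         /*                     2,4,6,... */
        double ta[M], tc[M];

        for (int k = 0; k < M; k++) {     /* GATHER                        */
            ta[k] = A[ia[k]];
            tc[k] = C[ic[k]];
        }
        for (int k = 0; k < M; k++)       /* contiguous, vectorizable      */
            ta[k] -= B * tc[k];
        for (int k = 0; k < M; k++)       /* SCATTER                       */
            A[ia[k]] = ta[k];

        for (int k = 0; k < 8; k++)
            printf("A[%d] = %g\n", k, A[k]);
        return 0;
    }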

        If two integers are packed into a 64-bit CRAY word, the constituent 
integers can be quickly recovered using a shift instruction.  This device is 
used both to decrease the memory required by the address data and to speed 
up the code through a reduction in memory references. The authors report a 
30% speedup in CPU time from this device.  
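
For example, in C (the word layout is illustrative; the paper presumably 
packs pairs of row/column indices or addresses):

    #include <stdint.h>
    #include <stdio.h>

    /* two 32-bit indices in one 64-bit word */
    static uint64_t pack(uint32_t hi, uint32_t lo)
    {
        return ((uint64_t)hi << 32) | lo;
    }

    int main(void)
    {
        uint64_t w  = pack(12345, 67890);
        uint32_t hi = (uint32_t)(w >> 32);           /* one shift */
        uint32_t lo = (uint32_t)(w & 0xffffffffu);   /* one mask  */
        printf("%u %u\n", (unsigned)hi, (unsigned)lo);
        return 0;
    }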

        Fukui says that his block elimination is similar to the use of 
matrix-vector or Level 2 BLAS (Basic Linear Algebra Subprograms) in Gaussian 
elimination on full systems.  However, it is not clear how he would do that 
with his current approach and it does not appear to be illustrated in the 
paper.  From the example there, it would appear that what is done is to hold 
the pivot row in vector registers and use it to update all the appropriate 
non-pivot rows without further memory references for the pivot row.  It 
would also be possible to update several rows in a single vector operation, 
but it is not clear whether this is done.  The gains reported for block 
elimination are a factor of 1.5 in both storage and time over the vector 
assembler code.  

       Fukui et al. also discuss in [5] other issues in their adaptation of 
SPICE-GT, which is based on SPICE 2G.6 from Berkeley. The other main area of 
interest is in the assembly or formation of the circuit simulation matrices.  
It is like the assembly of a finite-element matrix and thus admits of easy 
parallelism.  Vectorization is not so easy because of complicated 
conditionals.  They vectorize by grouping together (using indirect 
addressing) elements satisfying the same condition and then vectorize these 
subsets using vector GATHER/SCATTER.  This is very similar to techniques 
used by Miura and others (for example, [9]) in vectorizing Monte-Carlo 
codes. (At NEC, Nakata claims that the matrix setup phase is difficult to 
parallelize in a naive manner using a great many processors because of high 
overhead associated with shared memory contention.) 
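
A small C sketch of the grouping technique follows; the two "device models" 
are invented placeholders:

    /* The data-dependent IF is hoisted out of the element loop: a first
       pass collects the indices of elements satisfying each condition, and
       each group is then processed in its own branch-free, vectorizable
       loop through indirect addressing.                                  */
    #include <stdio.h>
    #define N 8

    int main(void)
    {
        double v[N] = {-2, 3, -1, 4, 5, -6, 7, -8};
        double g[N];
        int on[N], off[N], non = 0, noff = 0;

        for (int i = 0; i < N; i++)           /* classification pass      */
            if (v[i] > 0.0) on[non++] = i; else off[noff++] = i;

        for (int k = 0; k < non; k++)         /* "on" model, no branches  */
            g[on[k]] = 2.0 * v[on[k]];
        for (int k = 0; k < noff; k++)        /* "off" model, no branches */
            g[off[k]] = 0.0;

        for (int i = 0; i < N; i++)
            printf("g[%d] = %g\n", i, g[i]);
        return 0;
    }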

      The gains over a 1984 code (which generated Fortran) were a factor of 
4 in matrix assembly, 100 in solution, and under two in other parts of the 
code.  The overall gain of the 1989 code is a factor of around 10. A factor 
of 300 compared with the original SPICE is claimed, presumably because of 
other algorithmic (or machine!?) changes not reported on in [5].  

SUMMARY
        The NEC-Nakata-Tanabe approach is essentially one of understanding 
the physical problem and then developing a hardware-software solution.  
Toshiba-Fukui focuses much more directly on the sparse matrix problem that 
needs to be solved.  

        Nakata's work really begins with a clean sheet. Many of the 
relatively special needs of circuit simulation such as many-to-one and one-
to-many communication, non-contentious local memory access, fast floating 
point, etc., are designed into the Cenju multiprocessor. More work still 
needs to be done on the serial portion of the algorithm, the sparse linear 
equation solver for the interconnection points, which is a bottleneck. Work 
is also going forward on automatic domain decomposition.  Nevertheless, the 
current speedups of over 20 are impressive.  The Cenju design does not seem 
so specialized that it cannot be useful for other problems.  This looks like 
an exciting project with substantial room for further research.  

        Fukui's work is very interesting, not least because it illustrates 
the use of a technique thought impractical by many sparse matrix 
researchers. It does not seem, however, to be suitable for a wider range of 
applications because of stability and storage issues.  Also, there is a 
great deal of machine dependence, although code-generation routines are 
usually in two parts, the first being machine independent.  This seems to be 
the case here, so that potentially a separate code generator (and index 
packer) could be used for each machine. It is difficult to say from the data 
provided how good the generated code is.  The improvements recorded are 
impressive but would need to be benchmarked against other current codes or 
on standard test problems to obtain an accurate picture.  Fukui has 
indicated his willingness to collaborate in benchmarking experiments. It 
should be pointed out that compilers are getting much better at generating 
vector code from standard matrix factorization routines, so that their 
output may not now end up so different from the code generated by the Fukui 
approach!!  

REFERENCES.

[1] Nagel L, "SPICE2, A Computer Program to Simulate Semiconductor 
Circuits", Tech Rpt ERL-M520, Univ Cal Berkeley, May 1975.

[2] Deutsch J., Newton A., "A Multiprocessor Implementation of 
Relaxation-Based Electrical Circuit Simulation", in Proc 21st DA 
Conference, pp350-357, 1984.

[3] Nakata T., Tanabe N., Onozuka H., Kurobe T., Koike N., "A 
Multiprocessor System for Modular Circuit Simulation", in Proc ICCAD 87 
pp364-367, 1987.

[4] Nakata T., Tanabe N., Kajihara N., Matsushita S., Onozuka H., Asano 
Y., Koike N., "Cenju: A Multiprocessor System for Modular Circuit 
Simulation", to be published in "Computer Systems in Engineering", 1990.

[5] Fukui Y., Yoshida H., Higono S., "Supercomputing of Circuit 
Simulation", in Proc Supercomputing '89, ACM Publication Order Number 
415892, pp81-85, 1990.

[6] Gustavson F., Liniger W., Willoughby R., "Symbolic Generation of an 
Optimal Crout Algorithm for Sparse Systems of Linear Equations", JACM vol 
17, pp87-109, 1970.  

[7] Duff I., "Practical Comparisons of Codes for the Solution of Sparse 
Linear Systems", In Sparse Matrix Proceedings 1978, Duff I., and Stewart 
G., ed.,  SIAM Press, pp107-134, 1979.  

[8] Duff I, Erisman A, Reid J., "Direct Methods for Sparse Matrices", 
Oxford University Press, 1986.  

[9] Miura K., Babb R II, "Tradeoffs in Granularity and Parallelization 
for a Monte Carlo Shower Simulation Code", Parallel Computing, vol 8, 
pp91-100, 1988.  

[10] Cox P., Butler R., Epler B., "Circuit Partitioning for Parallel 
Processing", Proc. ICCAD-86, pp186-189, 1986.

[11] Kahaner D., Moler C., Nash S., "Numerical Methods and Software" 
Prentice-Hall, 1989.
-------------------END OF REPORT------------------------------------