eugene@wilbur.nas.nasa.gov (Eugene N. Miya) (06/16/90)
I do not know the structure of the lsi or hardware groups.
Someone should (might) repost these over to those groups.
--eugene
Software-type
======
To: Distribution
From: David K. Kahaner, ONRFE [kahaner@xroads.cc.u-tokyo.ac.jp]
Re: Two Japanese Approaches to Circuit Simulation
15 June 1990
TWO JAPANESE APPROACHES TO CIRCUIT SIMULATION
Iain S. Duff
Rutherford Appleton Laboratory
duff@antares.mcs.anl.gov
David K. Kahaner
Office of Naval Research, Far East
kahaner@xroads.cc.u-tokyo.ac.jp
ABSTRACT.
We review research at Toshiba in vectorizing the circuit simulation
program SPICE, and research at NEC at building a special purpose
multiprocessor for circuit simulation modeling.
INTRODUCTION.
Circuit simulation is heavily used in electronic design. It allows
engineers to evaluate and alter design parameters to obtain high performance
or cost effective utilization of parts. One of its most important
applications is in the design of memory chips, because it is the only way
that chip designers can study variations in MOS transistors. Naturally, the
more detailed the design, the more computationally intensive the simulation.
Even on supercomputers such as Cray, simulations can take hours, as it is
often necessary to repeat computations with various parameter values. Since
these simulations are done frequently it is not surprising that large
electronics companies are engaged in a variety of research to improve their
simulation algorithms with respect to computer time, accuracy, and other
performance characteristics. This report describes two different approaches
taken by NEC and Toshiba, both large Japanese firms with substantial circuit
design requirements. The authors of this report are numerical analysts, not
circuit designers or computer architects, and the article reflects this
orientation.
Toshiba's approach, developed by
Yoshinari Fukui
Toshiba CAE Systems
Kowa-Kawasaki-Nishiguchi Bldg
66-2, Horikawa-cho, Saiwai-ku,
Kawasaki 210, Japan
Fax: 81-44-548-4069
FUKUI@TCI.TOSHIBA.CO.JP
begins with the well-known program SPICE 2G.6 [1] and rewrites it for Cray
X/MP and Y/MP computers. Fukui reports performance gains of up to 300 (see
below) compared to old versions of public domain SPICE. Toshiba does not
manufacture a supercomputer, but they have a great many circuit design
problems which are usually run on one of their Crays. Fukui's work has
already paid off for Toshiba, as the code is being used by engineers as a
production tool.
NEC's work, spearheaded by
Toshiyuki Nakata
C&C Systems Research Laboratories
NEC Corporation
4-1-1 Miyazaki, Miyamae-Ku
Kawasaki, Kanagawa 213, Japan
NAKATA@CSL.CS.NEC.CO.JP
(044) 856-2127, Fax: (044) 856-2231
and Norio Tanabe
VLSI CAD Engineering Div. NEC
NAKATA@CSL.CL.NEC.CO.JP
essentially rewrites the circuit simulation algorithms from scratch and runs
them on a specially designed 64 node multiprocessor. Nakata feels that
vectorization has only limited potential for speed improvements, and that
the rate of vectorization will decrease as more detail is added to the
circuit models because of growing numbers of conditional statements.
The NEC project is the more experimental of the two but has the potential
for greater payoffs in the future. On two models of Dynamic Random Access
Memory (DRAM) circuits Nakata reports parallel speedups of over 20 with 64
processors, but feels that he can increase that by future improvements (see
below). As a very large electronics organization NEC has many circuit design
problems. One of the specially designed circuit simulation machines has
been installed in NEC's VLSI CAD Engineering Division.
The description and comments about these projects are based upon discussions
with Fukui and Nakata, as well as reviews of published papers and preprints.
In all cases we asked not to be told about any confidential activities.
BRIEF REVIEW OF CIRCUIT SIMULATION.
The circuit to be modeled is specified by describing its components
and their connectivity. Kirchhoff's current and voltage laws are used to make
statements about conservation and total current flow. These laws are
static, that is they hold at all times. In addition there are equations
describing the time rate of change of circuit voltage and current, due to
conductance, capacitance, and inductive effects. The result is that circuit
voltage and current are specified as the solution of a system of ordinary
differential equations,
Ay'=f
where A is a large, sparse (many zero entries) matrix determined by the
physical structure of the circuit. This system of differential equations
must be solved to determine the evolving characteristics of the circuit.
The equations are known to be "stiff" because of the widely varying circuit
time constants, and hence can only be solved efficiently by special techniques.
In particular it is necessary to use implicit integration, such as backward
Euler, or higher order backward differentiation (Gear) methods [11]. This
eventually results in a system of nonlinear equations to be solved for the
solution y at each time step. The solution of these equations is done by
iteration, usually employing Newton's method. At every time step each
iteration requires the solution of a system of linear equations. The matrix
of the linear equations is formed by adding the identity to a scalar
multiple of the Jacobian matrix. The latter is the matrix of partial
derivatives of the right hand side of the differential equations, denoted
f above.
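The steps above, implicit integration plus a Newton iteration that solves a
linear system built from the Jacobian at each step, can be sketched in Python.
This is an illustrative scalar-capable sketch, not code from either project;
the test problem and all names are our own.

```python
import numpy as np

def backward_euler_step(f, jac, y, t, h, tol=1e-10, max_iter=20):
    """One backward-Euler step for y' = f(t, y): solve
    y_new - y - h*f(t+h, y_new) = 0 by Newton's method.  Each iteration
    solves a linear system whose matrix combines the identity with a
    scalar multiple of the Jacobian, as described in the text."""
    y_new = y.copy()                              # initial guess: old value
    for _ in range(max_iter):
        r = y_new - y - h * f(t + h, y_new)       # Newton residual
        M = np.eye(len(y)) - h * jac(t + h, y_new)
        dy = np.linalg.solve(M, -r)               # the per-iteration linear solve
        y_new = y_new + dy
        if np.linalg.norm(dy) < tol:
            break
    return y_new

# A stiff test problem: y' = -1000*(y - cos(t)), whose fast transient forces
# an implicit method.  Backward Euler remains stable at h = 0.01.
f = lambda t, y: -1000.0 * (y - np.cos(t))
jac = lambda t, y: np.array([[-1000.0]])
y = np.array([0.0])
t, h = 0.0, 0.01
for _ in range(100):
    y = backward_euler_step(f, jac, y, t, h)
    t += h
```

An explicit method such as forward Euler would require h below about 0.002
here to remain stable at all, which is why circuit codes use implicit schemes.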
Thus the essential ingredients in performing transient circuit
simulation analysis are
(a) Setting up the problem. This includes formulating the matrix. It
also includes computing equivalent conductances and current sources
for linear reactive circuit elements.
(b) Integration, including error control and time step strategy.
(c) Solving the nonlinear equations by iteration. Each iteration requires
solution of a sparse system of linear equations. Circuit designers
think of the Newton-Raphson iteration as a process that linearizes the
nonlinear circuit elements.
In practice deriving the Jacobian matrix often accounts for most of the
computation time. In a specific case cited by Nakata this amounted to more
than eighty percent of the total computation time. About sixteen percent of
the time was spent solving the sparse linear equations. These figures will
vary with the complexity of the model. Nevertheless, to make large
reductions in the CPU time all aspects must be accelerated. Parts (b) and
(c) are related because if the integration time steps are chosen carefully,
Newton's method will converge rapidly and the computed values will be
acceptably accurate. Small integration steps give more accuracy and require
fewer Newton iterations to converge, but require more computer time. If the
stepsize is chosen too large, the iteration can take many steps to converge or
may actually diverge, and of course larger steps also reduce accuracy. There is
a delicate balance between these and most simulation programs undergo a
large amount of experimental "tuning"; many do not use the most modern step
size selection strategy.
SPEEDING UP CIRCUIT SIMULATION.
The two major approaches to increasing the performance of circuit
simulation programs are (1) to take advantage of the partitioning suggested
by the physical circuit in order to perform certain functions in parallel,
and (2) to take advantage of the structure of the Jacobian matrix to perform
as many operations as possible in parallel or in vector mode, in particular
assembling and solving the sparse system of linear equations. Of course, the
physical circuit also determines the structure of the matrix. NEC focuses
on parallelization by use of a specially developed 64 CPU multiprocessor.
Toshiba has parallelized portions of their program, and also made major
strides in vectorizing the solution of the linear equations.
Many portions of the overall computation can be performed in
parallel, in particular the setup or assembly phase. Often the circuit is
partitioned into subcircuits each of which is independently calculated, with
an iteration until the various parts converge. This is often called a
"relaxation algorithm" [2]. However if there is substantial feedback between
circuit parts convergence may not occur. Another approach is to subdivide
the circuit but force the subcircuits to interact via a network whose
parameters are determined anew at each iteration. Nakata calls this "modular
simulation" [3].
NEC'S APPROACH.
Nakata and Tanabe have concentrated on rearranging the circuit
simulation algorithm to take advantage of those aspects of it that have
inherent parallelism. They have designed the algorithm to allocate
subcircuits to different processors each of which independently performs a
simulation using local memory. The determination of the interconnection
between various subcircuits must be performed, serially, at every iteration
by one "interconnection" processor. This processor is locked until all the
subcircuits are finished. The actual algorithm is recursive, so there is a
tree of such processors. The essential function of the interconnection
processor is to solve a system of linear equations. The size of the
interconnection matrix is determined by the number of interconnections. For
example, in a DRAM circuit with 7,000 transistors, broken into 191
subcircuits, the interconnection matrix is about 200 by 200, with about 20
percent of its entries nonzero. (Research based on the same mathematical
foundations, but without the multiple hierarchy of subcircuits is being done
at Texas Instruments [10].)
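The division of labor just described, subcircuit processors eliminating their
internal variables in parallel while one interconnection processor solves a
small coupled system, corresponds mathematically to a Schur-complement solve
on a bordered block-diagonal matrix. The sketch below is our own reading of
that structure, with invented random matrices, not NEC's algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)
nb, k, m = 2, 3, 2   # two subcircuit blocks of size 3, m interconnection nodes
D = [np.eye(k) * 5 + rng.random((k, k)) for _ in range(nb)]  # internal blocks
E = [rng.random((k, m)) for _ in range(nb)]   # coupling to interconnections
F = [rng.random((m, k)) for _ in range(nb)]
C = np.eye(m) * 5 + rng.random((m, m))        # interconnection block
f = [rng.random(k) for _ in range(nb)]
fc = rng.random(m)

# Phase 1 (parallel across subcircuit processors): each block eliminates its
# internal variables and sends a contribution to the interconnection system.
S = C.copy()
g = fc.copy()
DiE, Dif = [], []
for i in range(nb):
    DiE.append(np.linalg.solve(D[i], E[i]))
    Dif.append(np.linalg.solve(D[i], f[i]))
    S -= F[i] @ DiE[i]        # Schur complement of the interconnection matrix
    g -= F[i] @ Dif[i]

# Phase 2 (serial, on the "interconnection" processor): solve the small
# interconnection system; this is the bottleneck discussed in the text.
xc = np.linalg.solve(S, g)

# Phase 3 (parallel again): back-substitute inside each subcircuit.
x = [Dif[i] - DiE[i] @ xc for i in range(nb)]
```

The size of S is set by the number of interconnections, which is why a domain
decomposition producing fewer cut points directly shrinks the serial phase.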
Consideration of the form of the algorithm suggested that for many
problems the subcircuit processors would compute independently and then all
communicate to the interconnection processor which would then compute and
finally communicate back to the subcircuit processors. Thus it will mostly
be the case that communication is many-to-one or one-to-many. Nakata felt
that a bus structure had better throughput than more general organizations,
such as a hypercube. He also felt that shared memory would be more efficient
than message passing systems. In 1987 the NEC group developed a prototype
system composed of 4 processors. From that experience they have developed a
64 processor system, called Cenju, after the Buddhist goddess with one
thousand arms. (Nakata feels that about 100 processors is the maximum number
that can be effectively used on circuit problems.) He decided to use a
distributed-shared memory. This means that there is a unique global address
for the local memory in each processor, and that each processor has a local
memory bus so it can access its local memory without worrying about bus
contention. Since some interprocessor communication does take place, and
building a common bus with a data-transfer rate high enough for so many
processors was difficult, he designed the data bus in a hierarchical
(tree-like) form.
Cenju is thus composed of 64 processors. These are divided into 8
"clusters". Processors in a cluster are connected by a bus, and the clusters
themselves are interconnected via a multi-stage network. Each processor is
built around a 20MHz
Motorola MC68020 CPU which has 4MB of dual ported local memory connected by
a local memory bus. The cluster bus only needs to be accessed when
communicating with other processors. The design allows up to 256 processors.
Since floating point computation is a fundamental part of circuit
simulation, Nakata decided to use fast Weitek WTL1167 chips instead of the
Motorola MC68882 chips that are most commonly used with MC68000 systems. An
MC68882 is on the board and is used for math functions such as sin, cos,
exp, etc., but floating point addition, subtraction, multiplication, and
division are done through calls to the Weitek chip. Other details of the
hardware organization can be found in the paper [4].
Nakata is a computer architect, not a numerical analyst or a circuit
designer. In developing Cenju he collaborated heavily with Norio Tanabe who
has been doing circuit simulation for many years, and is in NEC's VLSI CAD
Engineering Division. Some of the aspects of the algorithm were done in
collaboration with other scientists in NEC's Computer and Communication
Systems Research Division, and Scientific Information System Development Ltd
(a subsidiary). For complete addresses, see references [3] and [4].
Nakata has tested the running system on two circuits, one with about
1700 transistors, and another with about 7000. Both are part of the control
circuitry for DRAMs. Speedups were initially reported as 15 for the small
circuit and 16 for the larger. The circuit description is entered by the
system designer. With respect to domain decomposition, Nakata and Tanabe
feel that it is best to take advantage of the hierarchical design used by
the designers. When that cannot be done, a domain processor should be
available to automatically generate the subcircuits, although for these
examples they used the designers' hierarchy, 64 subcircuits for the first and 191 for the
second. When there are more subcircuits than processors there is quite a bit
of extra overhead and Nakata feels that this accounted for some degradation
in performance; the matrix problem to be solved for the interconnection
matrix is larger and the matrix has about 20% nonzero elements. Its solution
is done serially with respect to the other phases of the calculation.
Nakata feels that additional improvements would occur if the domain
decomposer was more sophisticated and produced fewer subcircuits as this
would directly reduce the size of the matrix. He has also made some initial
efforts to parallelize solving the matrix problem, and work in this
direction has improved the speedup for the large problem to over 20. Nakata
has not done as much work on solving the matrix problem as Fukui at Toshiba
(below), because most of the original sparse matrix solution has been
parallelized in the subcircuit processors; what is left is the sparse matrix
solution for the interconnection points. However, he feels that by
utilizing better domain decomposition he can substantially increase the
speedup.
Nakata originally built Cenju as a special purpose computer, but now
is studying the possibilities of using it for other applications. Nakata
notes that inter-cluster read access is very slow due to overhead in
bus-error interrupt handling and timeout control. He dealt with this in circuit
simulation by utilizing, whenever possible, algorithms in which the producer
of data writes that data to its consumer.
TOSHIBA'S APPROACH.
One of the early applications of direct methods for the solution
of sparse linear systems was in the solution of problems from circuit
simulation. In particular, this was the main reason for the interest in
sparse matrix research of scientists from the Mathematical Sciences
Department at IBM Yorktown Heights. This group, which included Bob Brayton,
Gary Hachtel, Fred Gustavson, Werner Liniger, and Ralph Willoughby conducted
much fundamental research into sparse matrix techniques in the late 60s and
early 70s. In 1970, Gustavson et al. published a paper [6] describing the
use of a code-generation approach, which they termed GNSO (GeNerate and
SOlve), to the solution of sparse equations. The idea of the code-
generation approach is to perform a factorization of the sparse matrix using
Gaussian elimination and to record each arithmetic operation involved in a
linear (loop-free) set of statements. This could be Fortran, assembler code,
or in a compressed coded format (interpretative code approach). The initial
elimination might be a regular factorization or a symbolic one, depending on
sparsity alone and not on the numerical values. As mentioned above in the
circuit context (and in many other application areas) during the Newton
iteration it is required to solve several systems with a coefficient matrix
of the same structure and so this list of operations can be used for these
subsequent factorizations. In the Toshiba experiments, sometimes about 900
iterations with a Jacobian matrix of the same structure were required so, in
common with most other experiments using this scheme, the initial cost of
generating the loop-free code is not reported.
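The generate-and-solve idea can be made concrete with a small sketch: factorize
one matrix, recording every arithmetic operation on the nonzero entries as a
flat list, then re-execute that list on later matrices with the same sparsity
pattern. This is our own illustration of the technique, not the GNSO code; a
real implementation would work from the symbolic pattern and emit assembler or
interpretative code rather than a Python list.

```python
import numpy as np

def generate_program(A):
    """Factorize A once by Gaussian elimination without pivoting (acceptable
    for the diagonally dominant matrices typical of circuit problems) and
    record each operation on a nonzero entry -- the "generate" half."""
    A = A.copy()
    n = A.shape[0]
    prog = []
    for k in range(n):
        for i in range(k + 1, n):
            if A[i, k] != 0.0:
                prog.append(("div", i, k, k))         # multiplier: A[i,k] /= A[k,k]
                A[i, k] /= A[k, k]
                for j in range(k + 1, n):
                    if A[k, j] != 0.0:
                        prog.append(("sub", i, j, k)) # A[i,j] -= A[i,k]*A[k,j]
                        A[i, j] -= A[i, k] * A[k, j]
    return prog

def replay(prog, A):
    """The "solve" half: re-factorize a matrix of the same structure by
    executing the recorded loop-free operation list, with no searching,
    branching, or symbolic overhead."""
    A = A.copy()
    for op, i, j, k in prog:
        if op == "div":
            A[i, j] /= A[j, j]
        else:
            A[i, j] -= A[i, k] * A[k, j]
    return A
```

One list entry per arithmetic operation is the source of both the speed (no
loop or index overhead) and the storage problem the text goes on to discuss.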
Duff [7] conducted some comparisons on the early code of Gustavson,
for example, and concluded that the approach was of limited use for
general problems. Duff and his coauthors also comment on this approach in
the book [8], section 9.6, where the main problems are stated to be storage
(with consequent speed problems if data has to be held on auxiliary devices)
and stability. Additionally, loop-free code is not well optimized by most
current compilers and may not be vectorized by them. Since about one line
of code is generated for each entry of the factors, this approach is most
attractive when the factors can be kept relatively sparse. Thus it would
not be of interest in the solution of elliptic partial differential
equations but is much more suited to the types of matrices occurring in
power systems and circuit simulation problems. Additionally, in many
circuit problems, and especially for memory modules, there is a natural
stability in the devices and this is reflected in the equations, so that
diagonal pivoting is usually stable.
These features make the code-generation approach potentially
interesting in this application. We now discuss the improvements that Fukui
et al. have made to the basic method to give them a competitive code.
The improvements are:
(1) generation of assembler code using vector operations,
(2) packing of integer index information, and
(3) use of block elimination.
We discuss each in turn.
Using the example from their paper [5], a fragment of generated code
(in Fortran) may look like
A(1) = A(1) - B*C(2)
A(3) = A(3) - B*C(4)
A(5) = A(5) - B*C(6)
........
and, if the programmer groups these statements suitably, one can see that
the indirectly addressed vector A (addressed above through index vector 1 3
5 ...) is updated by a scalar multiple of the indirectly addressed vector C
(index vector 2 4 6...). Thus, using a GATHER on A and C and a final SCATTER
on A, the code can be vectorized on nearly all current vector computers.
This gives significant gains in both the storage of the generated code and
the speed of execution. On the examples in their paper [5], the savings are
a factor of just under three in both storage and time. It should be noted
that the difference between generating Fortran code and assembler (scalar
assembler) is a factor of about 30 in time.
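The transformation from the scalar statements to one vector operation can be
shown with NumPy, whose fancy indexing plays the role of the Cray GATHER and
SCATTER instructions. The data and index vectors here are invented for
illustration (and 0-based, unlike the Fortran fragment above).

```python
import numpy as np

A = np.arange(10, dtype=float)      # indirectly addressed result vector
C = np.arange(10, dtype=float) * 2.0
B = 0.5

ia = np.array([1, 3, 5])            # index vector addressing A
ic = np.array([2, 4, 6])            # index vector addressing C

# Scalar, loop-free form (the shape of the generated code):
A_scalar = A.copy()
for i, j in zip(ia, ic):
    A_scalar[i] -= B * C[j]

# Vectorized form: GATHER A[ia] and C[ic], do one vector subtract of a
# scalar multiple, then SCATTER the result back into A.
A_vec = A.copy()
A_vec[ia] = A_vec[ia] - B * C[ic]

assert np.array_equal(A_scalar, A_vec)
```

The scatter is only safe because each generated statement updates a distinct
entry of A, so the index vector contains no duplicates; the code generator
guarantees this by construction.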
If two integers are packed into a 64-bit CRAY word, the constituent
integers can be quickly recovered using a shift instruction. This device is
used both to decrease the memory required by the address data and to speed
up the code through a reduction in memory references. A 30% speedup in cpu
time is reported by the authors through this device.
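The packing device itself is simple; in Python it looks like the following
(illustrative only, since on the Cray this is a shift instruction on a 64-bit
machine word rather than operations on arbitrary-precision integers).

```python
def pack(i, j):
    """Pack two nonnegative 32-bit indices into one 64-bit word, halving
    the storage and memory traffic for the generated code's address lists."""
    assert 0 <= i < 2**32 and 0 <= j < 2**32
    return (i << 32) | j

def unpack(w):
    """Recover the two indices with a shift and a mask."""
    return w >> 32, w & 0xFFFFFFFF

w = pack(12345, 67890)
```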
Fukui says that his block elimination was similar to the use of
matrix-vector or level 2 BLAS (Basic Linear Algebra Subroutines) in Gaussian
elimination on full systems. However, it is not clear how he would do that
with his current approach and it does not appear to be illustrated in the
paper. From the example there, it would appear that what is done is to hold
the pivot row in vector registers and use it to update all the appropriate
non-pivot rows without further memory references for the pivot row. It
would also be possible to update several rows as a single vector operation
but it is also not clear if this is done. The gains reported for block
elimination are a factor of 1.5 in both storage and time over the vector
assembler code.
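Since the paper does not spell out the block scheme, the following sketch
shows only the reading guessed at above: the pivot row is read once (as if
held in vector registers) and used to update all target rows in a single
rank-1 outer-product update. This is our interpretation, not Fukui's code.

```python
import numpy as np

def eliminate_column(A, k):
    """Eliminate column k of A in place.  The pivot row A[k, k+1:] is loaded
    once and reused against every row below it; the update of all non-pivot
    rows is expressed as one outer-product (rank-1) operation."""
    pivot_row = A[k, k + 1:].copy()        # fetched once, reused throughout
    mult = A[k + 1:, k] / A[k, k]          # column of multipliers
    A[k + 1:, k] = mult                    # store multipliers (the L factor)
    A[k + 1:, k + 1:] -= np.outer(mult, pivot_row)
    return A
```

Applying this for k = 0, 1, ... n-2 yields the usual LU factorization; the
gain comes from avoiding repeated memory references for the pivot row.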
Fukui et al. also discuss in [5] other issues in their adaptation of
SPICE-GT, based on SPICE 2G.6 from Berkeley. The one other area of main
interest is in the assembly or formation of the circuit simulation matrices.
It is like the assembly of a finite-element matrix and thus admits of easy
parallelism. Vectorization is not so easy because of complicated
conditionals. They vectorize by grouping together (using indirect
addressing) elements satisfying the same condition and then vectorize these
subsets using vector GATHER/SCATTER. This is very similar to techniques
used by Miura and others (for example, [9]) in vectorizing Monte-Carlo
codes. (At NEC, Nakata claims that the matrix setup phase is difficult to
parallelize in a naive manner using a great many processors because of high
overhead associated with shared memory contention.)
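The grouping technique for conditionals can be shown in NumPy: rather than
branching per element, indices satisfying each condition are gathered, each
group is evaluated as one branch-free vector operation, and the results are
scattered back. The device model below is a toy of our own invention, not a
formula from [5].

```python
import numpy as np

# Device evaluation with a data-dependent branch, e.g. a transistor model
# using different formulas in the "off" and "on" regions (toy formulas here).
v = np.linspace(-1.0, 3.0, 8)
vt = 0.7                                   # hypothetical threshold voltage

# Scalar version: the conditional inside the loop defeats vectorization.
out_scalar = np.empty_like(v)
for i, vi in enumerate(v):
    out_scalar[i] = 0.0 if vi < vt else (vi - vt) ** 2

# Vectorized version: GATHER the indices in the "on" region, evaluate that
# whole group with one vector formula, and SCATTER the results back.
out_vec = np.zeros_like(v)
idx_on = np.nonzero(v >= vt)[0]
out_vec[idx_on] = (v[idx_on] - vt) ** 2

assert np.allclose(out_scalar, out_vec)
```

The vector speedup depends on each group being long enough to amortize the
gather/scatter cost, which is why adding many small conditional regions to a
device model erodes vectorization, as Nakata argues.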
The gains over a 1984 code (which generated Fortran) were a factor of
4 in matrix assembly, 100 in solution, and under two in other parts of the
code. The overall gain of the 1989 code is a factor of around 10. A factor
of 300 compared with the original SPICE is claimed, presumably because of
other algorithmic (or machine!?) changes not reported on in [5].
SUMMARY
The NEC-Nakata-Tanabe approach is essentially one of understanding
the physical problem and then developing a hardware-software solution.
Toshiba-Fukui focus much more directly on the sparse matrix problem that
needs to be solved.
Nakata's work really begins with a clean sheet. Many of the
relatively special needs of circuit simulation such as many-to-one and one-
to-many communication, non-contentious local memory access, fast floating
point, etc., are designed into the Cenju multiprocessor. More work still
needs to be done on the serial portion of the algorithm, the sparse linear
equation solver for the interconnection points, which is a bottleneck. Work
is also going forward on automatic domain decomposition. Nevertheless, the
current speedups of over 20 are impressive. The Cenju design does not seem
so specialized that it cannot be useful for other problems. This looks like
an exciting project with a substantial amount of room for further research.
Fukui's work is very interesting, not least because it illustrates
the use of a technique thought impractical by many sparse matrix
researchers. It does not seem, however, to be suitable for a wider range of
applications because of stability and storage issues. Also, there is a
great deal of machine dependence, although code-generation routines are
usually in two parts, the first being machine independent. This seems to be
the case here, so that potentially a separate code generator (and index
packer) could be used for each machine. It is difficult to say from the data
provided how good the generated code is. The improvements recorded are
impressive but would need to be benchmarked against other current codes or
on standard test problems to obtain an accurate picture. Fukui has
indicated his willingness to collaborate in benchmarking experiments. It
should be pointed out that compilers are getting much better at generating
vector code from standard matrix factorization routines, so that their
output may not now end up so different from the code generated by the Fukui
approach!!
REFERENCES.
[1] Nagel L, "SPICE2, A Computer Program to Simulate Semiconductor
Circuits", Tech Rpt ERL-M520, Univ Cal Berkeley, May 1975.
[2] Deutsch J., Newton A., "A Multiprocessor Implementation of
Relaxation-Based Electrical Circuit Simulation", in Proc 21st DA
Conference, pp350-357, 1984.
[3] Nakata T., Tanabe N., Onozuka H., Kurobe T., Koike N., "A
Multiprocessor System for Modular Circuit Simulation", in Proc ICCAD 87
pp364-367, 1987.
[4] Nakata T., Tanabe N., Kajihara N., Matsushita S., Onozuka H., Asano
Y., Koike N., "Cenju: A Multiprocessor System for Modular Circuit
Simulation", to be published in "Computer Systems in Engineering", 1990.
[5] Fukui Y., Yoshida H., Higono S. "Supercomputing of Circuit
Simulation", in Proc Supercomputing '89, ACM Publication Order Number
415892, pp81-85, 1990.
[6] Gustavson F., Liniger W., Willoughby R., "Symbolic Generation of an
Optimal Crout Algorithm for Sparse Systems of Linear Equations", JACM vol
17, pp87-109, 1970.
[7] Duff I., "Practical Comparisons of Codes for the Solution of Sparse
Linear Systems", In Sparse Matrix Proceedings 1978, Duff I., and Stewart
G., ed., SIAM Press, pp107-134, 1979.
[8] Duff I, Erisman A, Reid J., "Direct Methods for Sparse Matrices",
Oxford University Press, 1986.
[9] Miura K., Babb R II, "Tradeoffs in Granularity and Parallelization
for a Monte Carlo Shower Simulation Code", Parallel Computing, vol 8,
pp91-100, 1988.
[10] Cox P., Butler R., Epler B., "Circuit Partitioning for Parallel
Processing", Proc. ICCAD-86, pp186-189, 1986.
[11] Kahaner D., Moler C., Nash S., "Numerical Methods and Software",
Prentice-Hall, 1989.
-------------------END OF REPORT------------------------------------