leff@smu.CSNET.UUCP (07/10/87)
Naomi Schulman
Publications
COMPUTER SYSTEMS LABORATORY
STANFORD UNIVERSITY
Stanford, CA 94305
RECENT REPORTS & NOTES
LIST #11
JULY 1987
To order reports, see end of this announcement.
ABSTRACTS
CSL-TR-87-323
Improving Garbage Collector Performance in Virtual Memory
Robert A. Shaw
March 1987 42 pages.....$4.60
Garbage collection and virtual memory have long been in an adversarial
relationship. Implementers of virtual memory systems have cited garbage
collectors as an excellent test case for undermining page replacement
algorithms and users of Lisp systems have turned off garbage collection (and
suffered the consequences) rather than live with the slowdown of garbage
collector paging.
This paper describes a way in which the garbage collector and virtual memory
system can work together to improve overall system performance. By using a
simple layout for storage and information already maintained by most virtual
memory systems, a garbage collector can substantially reduce the amount of
effort necessary to reclaim a large majority of the available space. The
techniques presented require no special hardware and minimal disruption of the
runtime environment.
-------------------------------------------------------------------------------
CSL-TR-87-324
LISP on a Reduced-Instruction-Set Processor: Characterization and
Optimization
by Peter Steenkiste
March 1987 173 pages....$11.15
As a result of advances in compiler technology, almost all programs are written
in high-level languages, and the effectiveness of a computer architecture is
determined by its suitability as a compiler target. The central role of
compilers in the use of computers has led computer architects to study the
implementation of high-level language programs. This thesis presents
measurements for a set of Portable Standard Lisp programs that were executed on
a reduced-instruction-set processor (MIPS-X),examining what instructions LISP
uses at the assembly level, and how much time is spent on the most common
primitive LISP operations. This information makes it possible to determine
which operations are time-critical and to evaluate how well architectural
features address these operations.
Based on these data, three areas for optimization are proposed: the
implementation of the tags used for run-time type checking, reducing the cost
of procedure calls, and interprocedural register allocation. A number of
methods to implement tags, both with and without hardware support, are
presented, and the performance of the different implementation strategies is
compared. To reduce the cost of procedure calls, time-critical LISP system
functions were optimized and inlined, and procedure integration under control
of the compiler was implemented. The effectiveness of these optimizations, and
their effect on the miss rate in the MIPS-X on-chip instruction cache are
studied. A simple register allocator uses interprocedural information to
reduce the cost of saving and restoring registers across procedure calls. This
register allocation scheme is evaluated, and its performance is compared with
hardware register windows.
-------------------------------------------------------------------------------
CSL-TR-87-325
Spectral Lower-Bound Techniques for Logic Circuits
by Yigal Brandman
March 1987 74 pages.....$6.20
This work shows a relationship between the implementation of any Boolean
function f as a decision tree or as a two-level logic circuit and the
power-spectrum coefficients of its n-dimensional Fourier transform. Based on
this relationship, a universal lower-bound method is introduced for the size of
any decision tree that computes f, the average number of decisions in any
decision tree that computes f, the length of any path in any decision tree that
computes f, the number of AND gates in any two level AND/OR logic circuit that
computes f, and the number of OR gates in any two-evel OR/AND logic circuit
that computes f.
The bounding techniques are also applicable to the following distributed-
communication problem. Assume that n inputs are distributed among n persons,
each person knows only the value of one input bit. On the average, what is the
minimum number of bits that must be exchanged among the individuals to compute
f?
The bounding method is applied to several important functions. The bounds vary
from constant to exponential and are tight in many cases.
-------------------------------------------------------------------------------
CSL-TR-87-326
SRT Division Diagrams and their Usage in Designing Custom Integrated
Circuits for Division
Ted E. Williams and Mark Horowitz
April 1987 24 pages.....$3.70
This paper describes the construction and analysis of several diagrams which
depict SRT division algorithms. These diagrams yield insight into the
operation of the algorithms and the many implementation tradeoffs available in
custom circuit design. Examples of simple low radix diagrams are shown, as
well as tables for higher radices. The tables were generated by a program
which can create and verify the diagrams for different division schemes. Also
discussed is a custom CMOS integrated circuit designed which performs SRT
division using self-timed circuit techniques. This chip implements an
intermediate approach between a fully combinational array and a fully iterative
in time method in order to get both speed and small silicon area.
-------------------------------------------------------------------------------
CSL-TR-87-327
Object Communication in Allegro
Mark A. Linton
March 1987 11 pages.....$3.05
Large scale object-oriented systems must be able to span machines while
providing efficient and transparent access to small objects. To build a
distributed programming environment, we are using the concept of an object
space that provides remote access to a group of objects. Object spaces are
independent servers that unify the traditional concepts of commands and files,
thereby simplifying the problems of data management and concurrency control.
Objects communicate with remote objects synchronously or asynchronously,
multiplexing messages through an underlying connection between object spaces.
We are implementing a prototype system, named Allegro, in which program objects
are distributed across multiple object spaces, and spaces can be distributed
across a network of machines. In this paper, we describe the Allegro object
model, the protocol we use for accessing remote objects, and the runtime
support necessary for building servers. We have implemented an object space
for a set of primitive user interface objects that demonstrates the viability
of our approach.
-------------------------------------------------------------------------------
CSL-TR-87-328
PARTITIONING AND SCHEDULING PARALLEL PROGRAMS FOR EXECUTION
ON MULTIPROCESSORS
by Vivek Sarkar
April 1987 196 pages.....$12.30
There are three fundamental problems to be solved in the execution of a
parallel program on a multiprocessor -- identifying the parallelism in the
program, partitioning the program into tasks and scheduling the tasks on
processors. Whereas the problem of identifying parallelism is a programming
language issue, the partitioning and scheduling problems are intimately related
to parameters of the target multiprocessor, like the number of processors and
synchronisation (1) and communication overhead. It is desirable for the
partitioning and scheduling to be performed automatically, so that the same
parallel program can execute efficiently on different multiprocessors. This
dissertation presents two solutions to the partitioning and scheduling
problems. The first approach is based on a macro-dataflow model, where the
program is partitioned into tasks at compile-time and the tasks are scheduled
on processors at run-time. The second approach is based on a compile-time
scheduling model, where the partitioning of the program and the scheduling of
tasks on processors are both performed at compile-time.
Both approaches have been implemented to partition programs written in the
single-assignment language, SISAL. The inputs to the partitioning and
scheduling algorithms are a graphical representation of the program and a list
of parameters describing the target multiprocessor. Execution profile
information is used to derive compile-time estimates of execution times and
data sizes in the program. Both the macro-dataflow and compile-time scheduling
problems are expressed as optimisation problems, which are proved to be
NP-complete in the strong sense. This dissertation presents efficient
approximation algorithms for these problems. The effectiveness of the
partitioning and scheduling algorithms is studied by multiprocessor simulations
of various SISAL benchmark programs for different target multiprocessor
parameters.
-------------------------------------------------------------------------------
CSL-TR-87-330
PERFORMANCE-ORIENTED SYNTHESIS OF LARGE SCALE DOMINO CMOS
CIRCUITS
Giovanni De Micheli
May 1987 46 pages.....$4.80
The quality of the design of large scale integrated circuits is determined by
some figures of merit, such as silicon area, power consumption and
switching-time performance. We address here the problem of automatic synthesis
of digital circuits with the goal of achieving high performance designs. We
assume we are given an intermediate circuit representation that optimizes area
and/or power. We use timing optimization techniques to improve the circuit
performance, possibly at the expense of the other figures of merit.
We consider general classes of digital circuits, with a given partition into
registers, combinational blocks and I/O ports. Circuit performance is related
to the worst-case propagation delay of signals between two register boundaries.
In this context, circuit performance optimization is equivalent to minimizing
the critical path delay through the combinational circuits. We assume a
multiple-level implementation of the combinational logic, by means of an
interconnection of logic gates implementing arbitrary multiple-input
single-output logic functions. We consider dynamic CMOS implementation of the
logic gates, operating in the domino mode.
We present a global approach to timing performance optimization, which involves
operations at the logic, topological and physical level of abstraction of the
circuit. In particular, at the logic level, we look for optimal structures of
multiple-level combinational networks. At the topological level, we search for
the optimal positions of gates or groups of gates. At the physical design
level, we optimize MOS device sizes.
The algorithms are described as well as their implementation and the interface
to the Yorktown Silicon Compiler system, which is an automated synthesis system
described in [BRAY87]. The results of applying timing-performance optimization
to a 32-bit microprocessor design are reported.
-------------------------------------------------------------------------------
CSL-TR-87-331
THROUGHPUT ANALYSIS OF THE IEEE 802.4 TOKEN BUS STANDARD UNDER HEAVY LOAD
Joseph Pang and Fouad Tobagi
May 1987 43 pages.....$4.65
It has become clear in the last few years that there is a trend towards
integrated digital services. Parallel to the development of public Integrated
Services Digital Network (ISDN) is service integration in the local area (e.g.
a campus, a building, an aircraft). The types of services to be integrated
depend very much on the specific local environment. However, applications tend
to generate data traffic belonging to one of two classes. According to IEEE
802.4 terminology, the first major class of traffic is termed synchronous, such
as packetized voice and data generated from other applications with real-time
constraints, and the second class is called asynchronous which includes most
computer data traffic such as file transfer or facsimile.
In this report, we examine the IEEE 802.4 token bus protocol which has been
designed to support both synchronous and asynchronous traffic. The protocol is
basically a timer-controlled token bus access scheme. By a suitable choice of
the design parameters, it can be shown that access delay is bounded for
synchronous traffic. As well, the bandwidth allocated to asynchronous traffic
can be controlled. We present a throughput analysis of the protocol under
heavy load with constant channel occupation of synchronous traffic and constant
token-passing times.
-------------------------------------------------------------------------------
CSL-TR-87-332
ANALYSIS OF CACHE PERFORMANCE FOR OPERATING SYSTEMS AND
MULTIPROGRAMMING
Anant Agarwal
May 1987 170 pages.....$11.00
Advances in high-performance processors continue to create an increased need
for memory bandwidth. Caches can provide this bandwidth cost-effectively.
However, minimizing the performance loss due to caching requires that our
analysis and prediction of cache performance become more exact. Although
previous studies have shown that operating system and multiprogramming
activities affect the cache performance, those studies did not deal with these
issues in detail, largely because of the unavailability of efficient analysis
techniques and the difficulty in collecting data for this analysis. To obtain
the higher hit rates needed to sustain the effectiveness of caches, we must
address these issues completely.
This dissertation investigates the performance of large caches for realistic
operating system and multiprogramming workloads. A suite of efficient and
accurate cache analysis techniques is developed. These include: a new data
collection method, a mathematical cache model, and a trace sampling and a trace
stitching procedure. The analyses use a data collection technique called ATUM
to obtain realistic system traces of multitasking workloads with little
distortion.
Aaccurately characterizing cache behavior using ATUM traces shows that both
operating system and multiprogramming activity significantly degrade cache
performance, with an even greater proportional impact on large caches. From a
careful analysis of the causes of this degradation, we explore various
techniques to reduce this loss. While seemingly little can be done to mitigate
the effect of system references, multitasking cache misses can be reduced with
little effort. We also demonstrate how analytical cache modeling, and trace
sampling -- with a new approach to cold-start and warm-start analysis -- can be
used to make large cache studies insightful and efficient.
-------------------------------------------------------------------------------
Publications
COMPUTER SYSTEMS LABORATORY
Stanford University
Stanford, CA 94305
415-723-1430
(Hours: M-Th, 8-1)
ORDER FORM
To Order Reports: Print or type your name and address in the space provided.
Check or circle the report(s) you wish to purchase, whether hardcopy or
microfiche. All orders must be PREPAID. CALIFORNIA RESIDENTS must add sales
tax of their local county. Return this order form with your check or money
order made payable to Stanford University. FOREIGN ORDERS must be paid with an
international money order or a check drawn on a U.S. bank, payable in dollars.
Please type or print your name and complete address:
______________________________________________
______________________________________________
______________________________________________
______________________________________________
______________________________________________
______________________________________________
Report # Hardcopy Microfiche Report # Hardcopy Microfiche
-------- -------- ---------- -------- -------- ----------
TR 323 $4.60 $2.00 TR 328 $12.30 $3.00
TR 324 $11.15 $3.00 TR 329 $14.50 $3.00
TR 325 $6.20 $2.00 TR 330 $4.80 $2.00
TR 326 $3.70 $2.00 TR 331 $4.65 $2.00
TR 327 $3.05 $2.00 TR 332 $11.00 $3.00
Subtotal __________
Your Local County CA tax __________
Total __________
-------