leff@smu.CSNET.UUCP (07/10/87)
Naomi Schulman Publications COMPUTER SYSTEMS LABORATORY STANFORD UNIVERSITY Stanford, CA 94305 RECENT REPORTS & NOTES LIST #11 JULY 1987 To order reports, see end of this announcement. ABSTRACTS CSL-TR-87-323 Improving Garbage Collector Performance in Virtual Memory Robert A. Shaw March 1987 42 pages.....$4.60 Garbage collection and virtual memory have long been in an adversarial relationship. Implementers of virtual memory systems have cited garbage collectors as an excellent test case for undermining page replacement algorithms and users of Lisp systems have turned off garbage collection (and suffered the consequences) rather than live with the slowdown of garbage collector paging. This paper describes a way in which the garbage collector and virtual memory system can work together to improve overall system performance. By using a simple layout for storage and information already maintained by most virtual memory systems, a garbage collector can substantially reduce the amount of effort necessary to reclaim a large majority of the available space. The techniques presented require no special hardware and minimal disruption of the runtime environment. ------------------------------------------------------------------------------- CSL-TR-87-324 LISP on a Reduced-Instruction-Set Processor: Characterization and Optimization by Peter Steenkiste March 1987 173 pages....$11.15 As a result of advances in compiler technology, almost all programs are written in high-level languages, and the effectiveness of a computer architecture is determined by its suitability as a compiler target. The central role of compilers in the use of computers has led computer architects to study the implementation of high-level language programs. This thesis presents measurements for a set of Portable Standard Lisp programs that were executed on a reduced-instruction-set processor (MIPS-X),examining what instructions LISP uses at the assembly level, and how much time is spent on the most common primitive LISP operations. This information makes it possible to determine which operations are time-critical and to evaluate how well architectural features address these operations. Based on these data, three areas for optimization are proposed: the implementation of the tags used for run-time type checking, reducing the cost of procedure calls, and interprocedural register allocation. A number of methods to implement tags, both with and without hardware support, are presented, and the performance of the different implementation strategies is compared. To reduce the cost of procedure calls, time-critical LISP system functions were optimized and inlined, and procedure integration under control of the compiler was implemented. The effectiveness of these optimizations, and their effect on the miss rate in the MIPS-X on-chip instruction cache are studied. A simple register allocator uses interprocedural information to reduce the cost of saving and restoring registers across procedure calls. This register allocation scheme is evaluated, and its performance is compared with hardware register windows. ------------------------------------------------------------------------------- CSL-TR-87-325 Spectral Lower-Bound Techniques for Logic Circuits by Yigal Brandman March 1987 74 pages.....$6.20 This work shows a relationship between the implementation of any Boolean function f as a decision tree or as a two-level logic circuit and the power-spectrum coefficients of its n-dimensional Fourier transform. Based on this relationship, a universal lower-bound method is introduced for the size of any decision tree that computes f, the average number of decisions in any decision tree that computes f, the length of any path in any decision tree that computes f, the number of AND gates in any two level AND/OR logic circuit that computes f, and the number of OR gates in any two-evel OR/AND logic circuit that computes f. The bounding techniques are also applicable to the following distributed- communication problem. Assume that n inputs are distributed among n persons, each person knows only the value of one input bit. On the average, what is the minimum number of bits that must be exchanged among the individuals to compute f? The bounding method is applied to several important functions. The bounds vary from constant to exponential and are tight in many cases. ------------------------------------------------------------------------------- CSL-TR-87-326 SRT Division Diagrams and their Usage in Designing Custom Integrated Circuits for Division Ted E. Williams and Mark Horowitz April 1987 24 pages.....$3.70 This paper describes the construction and analysis of several diagrams which depict SRT division algorithms. These diagrams yield insight into the operation of the algorithms and the many implementation tradeoffs available in custom circuit design. Examples of simple low radix diagrams are shown, as well as tables for higher radices. The tables were generated by a program which can create and verify the diagrams for different division schemes. Also discussed is a custom CMOS integrated circuit designed which performs SRT division using self-timed circuit techniques. This chip implements an intermediate approach between a fully combinational array and a fully iterative in time method in order to get both speed and small silicon area. ------------------------------------------------------------------------------- CSL-TR-87-327 Object Communication in Allegro Mark A. Linton March 1987 11 pages.....$3.05 Large scale object-oriented systems must be able to span machines while providing efficient and transparent access to small objects. To build a distributed programming environment, we are using the concept of an object space that provides remote access to a group of objects. Object spaces are independent servers that unify the traditional concepts of commands and files, thereby simplifying the problems of data management and concurrency control. Objects communicate with remote objects synchronously or asynchronously, multiplexing messages through an underlying connection between object spaces. We are implementing a prototype system, named Allegro, in which program objects are distributed across multiple object spaces, and spaces can be distributed across a network of machines. In this paper, we describe the Allegro object model, the protocol we use for accessing remote objects, and the runtime support necessary for building servers. We have implemented an object space for a set of primitive user interface objects that demonstrates the viability of our approach. ------------------------------------------------------------------------------- CSL-TR-87-328 PARTITIONING AND SCHEDULING PARALLEL PROGRAMS FOR EXECUTION ON MULTIPROCESSORS by Vivek Sarkar April 1987 196 pages.....$12.30 There are three fundamental problems to be solved in the execution of a parallel program on a multiprocessor -- identifying the parallelism in the program, partitioning the program into tasks and scheduling the tasks on processors. Whereas the problem of identifying parallelism is a programming language issue, the partitioning and scheduling problems are intimately related to parameters of the target multiprocessor, like the number of processors and synchronisation (1) and communication overhead. It is desirable for the partitioning and scheduling to be performed automatically, so that the same parallel program can execute efficiently on different multiprocessors. This dissertation presents two solutions to the partitioning and scheduling problems. The first approach is based on a macro-dataflow model, where the program is partitioned into tasks at compile-time and the tasks are scheduled on processors at run-time. The second approach is based on a compile-time scheduling model, where the partitioning of the program and the scheduling of tasks on processors are both performed at compile-time. Both approaches have been implemented to partition programs written in the single-assignment language, SISAL. The inputs to the partitioning and scheduling algorithms are a graphical representation of the program and a list of parameters describing the target multiprocessor. Execution profile information is used to derive compile-time estimates of execution times and data sizes in the program. Both the macro-dataflow and compile-time scheduling problems are expressed as optimisation problems, which are proved to be NP-complete in the strong sense. This dissertation presents efficient approximation algorithms for these problems. The effectiveness of the partitioning and scheduling algorithms is studied by multiprocessor simulations of various SISAL benchmark programs for different target multiprocessor parameters. ------------------------------------------------------------------------------- CSL-TR-87-330 PERFORMANCE-ORIENTED SYNTHESIS OF LARGE SCALE DOMINO CMOS CIRCUITS Giovanni De Micheli May 1987 46 pages.....$4.80 The quality of the design of large scale integrated circuits is determined by some figures of merit, such as silicon area, power consumption and switching-time performance. We address here the problem of automatic synthesis of digital circuits with the goal of achieving high performance designs. We assume we are given an intermediate circuit representation that optimizes area and/or power. We use timing optimization techniques to improve the circuit performance, possibly at the expense of the other figures of merit. We consider general classes of digital circuits, with a given partition into registers, combinational blocks and I/O ports. Circuit performance is related to the worst-case propagation delay of signals between two register boundaries. In this context, circuit performance optimization is equivalent to minimizing the critical path delay through the combinational circuits. We assume a multiple-level implementation of the combinational logic, by means of an interconnection of logic gates implementing arbitrary multiple-input single-output logic functions. We consider dynamic CMOS implementation of the logic gates, operating in the domino mode. We present a global approach to timing performance optimization, which involves operations at the logic, topological and physical level of abstraction of the circuit. In particular, at the logic level, we look for optimal structures of multiple-level combinational networks. At the topological level, we search for the optimal positions of gates or groups of gates. At the physical design level, we optimize MOS device sizes. The algorithms are described as well as their implementation and the interface to the Yorktown Silicon Compiler system, which is an automated synthesis system described in [BRAY87]. The results of applying timing-performance optimization to a 32-bit microprocessor design are reported. ------------------------------------------------------------------------------- CSL-TR-87-331 THROUGHPUT ANALYSIS OF THE IEEE 802.4 TOKEN BUS STANDARD UNDER HEAVY LOAD Joseph Pang and Fouad Tobagi May 1987 43 pages.....$4.65 It has become clear in the last few years that there is a trend towards integrated digital services. Parallel to the development of public Integrated Services Digital Network (ISDN) is service integration in the local area (e.g. a campus, a building, an aircraft). The types of services to be integrated depend very much on the specific local environment. However, applications tend to generate data traffic belonging to one of two classes. According to IEEE 802.4 terminology, the first major class of traffic is termed synchronous, such as packetized voice and data generated from other applications with real-time constraints, and the second class is called asynchronous which includes most computer data traffic such as file transfer or facsimile. In this report, we examine the IEEE 802.4 token bus protocol which has been designed to support both synchronous and asynchronous traffic. The protocol is basically a timer-controlled token bus access scheme. By a suitable choice of the design parameters, it can be shown that access delay is bounded for synchronous traffic. As well, the bandwidth allocated to asynchronous traffic can be controlled. We present a throughput analysis of the protocol under heavy load with constant channel occupation of synchronous traffic and constant token-passing times. ------------------------------------------------------------------------------- CSL-TR-87-332 ANALYSIS OF CACHE PERFORMANCE FOR OPERATING SYSTEMS AND MULTIPROGRAMMING Anant Agarwal May 1987 170 pages.....$11.00 Advances in high-performance processors continue to create an increased need for memory bandwidth. Caches can provide this bandwidth cost-effectively. However, minimizing the performance loss due to caching requires that our analysis and prediction of cache performance become more exact. Although previous studies have shown that operating system and multiprogramming activities affect the cache performance, those studies did not deal with these issues in detail, largely because of the unavailability of efficient analysis techniques and the difficulty in collecting data for this analysis. To obtain the higher hit rates needed to sustain the effectiveness of caches, we must address these issues completely. This dissertation investigates the performance of large caches for realistic operating system and multiprogramming workloads. A suite of efficient and accurate cache analysis techniques is developed. These include: a new data collection method, a mathematical cache model, and a trace sampling and a trace stitching procedure. The analyses use a data collection technique called ATUM to obtain realistic system traces of multitasking workloads with little distortion. Aaccurately characterizing cache behavior using ATUM traces shows that both operating system and multiprogramming activity significantly degrade cache performance, with an even greater proportional impact on large caches. From a careful analysis of the causes of this degradation, we explore various techniques to reduce this loss. While seemingly little can be done to mitigate the effect of system references, multitasking cache misses can be reduced with little effort. We also demonstrate how analytical cache modeling, and trace sampling -- with a new approach to cold-start and warm-start analysis -- can be used to make large cache studies insightful and efficient. ------------------------------------------------------------------------------- Publications COMPUTER SYSTEMS LABORATORY Stanford University Stanford, CA 94305 415-723-1430 (Hours: M-Th, 8-1) ORDER FORM To Order Reports: Print or type your name and address in the space provided. Check or circle the report(s) you wish to purchase, whether hardcopy or microfiche. All orders must be PREPAID. CALIFORNIA RESIDENTS must add sales tax of their local county. Return this order form with your check or money order made payable to Stanford University. FOREIGN ORDERS must be paid with an international money order or a check drawn on a U.S. bank, payable in dollars. Please type or print your name and complete address: ______________________________________________ ______________________________________________ ______________________________________________ ______________________________________________ ______________________________________________ ______________________________________________ Report # Hardcopy Microfiche Report # Hardcopy Microfiche -------- -------- ---------- -------- -------- ---------- TR 323 $4.60 $2.00 TR 328 $12.30 $3.00 TR 324 $11.15 $3.00 TR 329 $14.50 $3.00 TR 325 $6.20 $2.00 TR 330 $4.80 $2.00 TR 326 $3.70 $2.00 TR 331 $4.65 $2.00 TR 327 $3.05 $2.00 TR 332 $11.00 $3.00 Subtotal __________ Your Local County CA tax __________ Total __________ -------