[comp.doc.techreports] stanford8

leff@smu.CSNET.UUCP (07/10/87)
                                Naomi Schulman
                                 Publications
                          COMPUTER SYSTEMS LABORATORY
                              STANFORD UNIVERSITY
                              Stanford, CA 94305

                            RECENT REPORTS & NOTES
                                   LIST #11
                                   JULY 1987

To order reports, see end of this announcement.

                                   ABSTRACTS


CSL-TR-87-323
Improving Garbage Collector Performance in Virtual Memory
Robert A. Shaw
March 1987                                                   42 pages.....$4.60
Garbage  collection  and  virtual  memory  have  long  been  in  an adversarial
relationship.  Implementers  of  virtual  memory  systems  have  cited  garbage
collectors   as  an  excellent  test  case  for  undermining  page  replacement
algorithms and users of Lisp systems have turned off  garbage  collection  (and
suffered  the  consequences)  rather  than  live  with  the slowdown of garbage
collector paging.
This paper describes a way in which the garbage collector  and  virtual  memory
system  can  work  together  to improve overall system performance.  By using a
simple layout for storage and information already maintained  by  most  virtual
memory  systems,  a  garbage  collector  can substantially reduce the amount of
effort necessary to reclaim a large majority  of  the  available  space.    The
techniques  presented require no special hardware and minimal disruption of the
runtime environment.  
-------------------------------------------------------------------------------
CSL-TR-87-324
LISP on a Reduced-Instruction-Set Processor: Characterization and
Optimization
by Peter Steenkiste
March 1987                                                  173 pages....$11.15
As a result of advances in compiler technology, almost all programs are written
in high-level languages, and the effectiveness of a  computer  architecture  is
determined  by  its  suitability  as  a  compiler  target.  The central role of
compilers in the use of computers has led  computer  architects  to  study  the
implementation   of   high-level  language  programs.    This  thesis  presents
measurements for a set of Portable Standard Lisp programs that were executed on
a reduced-instruction-set processor (MIPS-X),examining what  instructions  LISP
uses  at  the  assembly  level,  and  how much time is spent on the most common
primitive LISP operations.  This information makes  it  possible  to  determine
which  operations  are  time-critical  and  to  evaluate how well architectural
features address these operations.
Based  on  these  data,  three  areas  for  optimization  are  proposed:    the
implementation  of  the tags used for run-time type checking, reducing the cost
of procedure calls, and interprocedural  register  allocation.    A  number  of
methods  to  implement  tags,  both  with  and  without  hardware  support, are
presented, and the performance of the different  implementation  strategies  is
compared.    To  reduce  the cost of procedure calls, time-critical LISP system
functions were optimized and inlined, and procedure integration  under  control
of the compiler was implemented.  The effectiveness of these optimizations, and
their  effect  on  the  miss  rate  in the MIPS-X on-chip instruction cache are
studied.  A simple  register  allocator  uses  interprocedural  information  to
reduce the cost of saving and restoring registers across procedure calls.  This
register  allocation  scheme is evaluated, and its performance is compared with
hardware register windows.  
-------------------------------------------------------------------------------
CSL-TR-87-325
Spectral Lower-Bound Techniques for Logic Circuits
by Yigal Brandman
March 1987                                                   74 pages.....$6.20
This work shows a  relationship  between  the  implementation  of  any  Boolean
function  f  as  a  decision  tree  or  as  a  two-level  logic circuit and the
power-spectrum coefficients of its n-dimensional Fourier transform.   Based  on
this relationship, a universal lower-bound method is introduced for the size of
any  decision  tree  that  computes  f,  the average number of decisions in any
decision tree that computes f, the length of any path in any decision tree that
computes f, the number of AND gates in any two level AND/OR logic circuit  that
computes  f,  and  the  number of OR gates in any two-evel OR/AND logic circuit
that computes f. 
The bounding techniques are  also  applicable  to  the  following  distributed-
communication  problem.   Assume that n inputs are distributed among n persons,
each person knows only the value of one input bit.  On the average, what is the
minimum number of bits that must be exchanged among the individuals to  compute
f?
The bounding method is applied to several important functions.  The bounds vary
from constant to exponential and are tight in many cases.  
-------------------------------------------------------------------------------

CSL-TR-87-326
SRT Division Diagrams and their Usage in Designing Custom Integrated
Circuits for Division
Ted E. Williams and Mark Horowitz
April 1987                                                   24 pages.....$3.70
This  paper  describes  the construction and analysis of several diagrams which
depict SRT  division  algorithms.    These  diagrams  yield  insight  into  the
operation  of the algorithms and the many implementation tradeoffs available in
custom circuit design.  Examples of simple low radix  diagrams  are  shown,  as
well  as  tables  for  higher  radices.  The tables were generated by a program
which can create and verify the diagrams for different division schemes.   Also
discussed  is  a  custom  CMOS  integrated  circuit designed which performs SRT
division  using  self-timed  circuit  techniques.    This  chip  implements  an
intermediate approach between a fully combinational array and a fully iterative
in time method in order to get both speed and small silicon area.  
-------------------------------------------------------------------------------

CSL-TR-87-327
Object Communication in Allegro
Mark A. Linton
March 1987                                                   11 pages.....$3.05
Large  scale  object-oriented  systems  must  be  able  to  span machines while
providing efficient and transparent access  to  small  objects.    To  build  a
distributed  programming  environment,  we  are  using the concept of an object
space that provides remote access to a group of objects.    Object  spaces  are
independent  servers that unify the traditional concepts of commands and files,
thereby simplifying the problems of data management  and  concurrency  control.
Objects  communicate  with  remote  objects  synchronously  or  asynchronously,
multiplexing messages through an underlying connection between object spaces.
We are implementing a prototype system, named Allegro, in which program objects
are distributed across multiple object spaces, and spaces  can  be  distributed
across  a  network  of machines.  In this paper, we describe the Allegro object
model, the protocol we use  for  accessing  remote  objects,  and  the  runtime
support  necessary  for  building servers.  We have implemented an object space
for a set of primitive user interface objects that demonstrates  the  viability
of our approach.  
-------------------------------------------------------------------------------

CSL-TR-87-328
PARTITIONING AND SCHEDULING PARALLEL PROGRAMS FOR EXECUTION
ON MULTIPROCESSORS
by Vivek Sarkar
April 1987                                                 196 pages.....$12.30
There  are  three  fundamental  problems  to  be  solved  in the execution of a
parallel program on a multiprocessor --  identifying  the  parallelism  in  the
program,  partitioning  the  program  into  tasks  and  scheduling the tasks on
processors.  Whereas the problem of identifying parallelism  is  a  programming
language issue, the partitioning and scheduling problems are intimately related
to  parameters  of the target multiprocessor, like the number of processors and
synchronisation (1) and communication  overhead.    It  is  desirable  for  the
partitioning  and  scheduling  to  be performed automatically, so that the same
parallel program can execute efficiently on different  multiprocessors.    This
dissertation   presents  two  solutions  to  the  partitioning  and  scheduling
problems.  The first approach is based on a  macro-dataflow  model,  where  the
program  is  partitioned into tasks at compile-time and the tasks are scheduled
on processors at run-time.  The second approach  is  based  on  a  compile-time
scheduling  model,  where the partitioning of the program and the scheduling of
tasks on processors are both performed at compile-time.
Both approaches have been implemented to  partition  programs  written  in  the
single-assignment  language,  SISAL.    The  inputs  to  the  partitioning  and
scheduling algorithms are a graphical representation of the program and a  list
of   parameters  describing  the  target  multiprocessor.    Execution  profile
information is used to derive compile-time estimates  of  execution  times  and
data sizes in the program.  Both the macro-dataflow and compile-time scheduling
problems  are  expressed  as  optimisation  problems,  which  are  proved to be
NP-complete  in  the  strong  sense.    This  dissertation  presents  efficient
approximation  algorithms  for  these  problems.    The  effectiveness  of  the
partitioning and scheduling algorithms is studied by multiprocessor simulations
of  various  SISAL  benchmark  programs  for  different  target  multiprocessor
parameters.  
-------------------------------------------------------------------------------

CSL-TR-87-330
PERFORMANCE-ORIENTED SYNTHESIS OF LARGE SCALE DOMINO CMOS
CIRCUITS
Giovanni De Micheli
May 1987                                                     46 pages.....$4.80
The  quality  of the design of large scale integrated circuits is determined by
some  figures  of  merit,  such  as  silicon  area,   power   consumption   and
switching-time performance.  We address here the problem of automatic synthesis
of  digital  circuits  with the goal of achieving high performance designs.  We
assume we are given an intermediate circuit representation that optimizes  area
and/or  power.    We  use timing optimization techniques to improve the circuit
performance, possibly at the expense of the other figures of merit.
We consider general classes of digital circuits, with a  given  partition  into
registers,  combinational blocks and I/O ports.  Circuit performance is related
to the worst-case propagation delay of signals between two register boundaries.
In this context, circuit performance optimization is equivalent  to  minimizing
the  critical  path  delay  through  the  combinational  circuits.  We assume a
multiple-level implementation of  the  combinational  logic,  by  means  of  an
interconnection   of   logic   gates   implementing   arbitrary  multiple-input
single-output logic functions.  We consider dynamic CMOS implementation of  the
logic gates, operating in the domino mode.
We present a global approach to timing performance optimization, which involves
operations  at  the logic, topological and physical level of abstraction of the
circuit.  In particular, at the logic level, we look for optimal structures  of
multiple-level combinational networks.  At the topological level, we search for
the  optimal  positions  of  gates  or  groups of gates. At the physical design
level, we optimize MOS device sizes.
The algorithms are described as well as their implementation and the  interface
to the Yorktown Silicon Compiler system, which is an automated synthesis system
described in [BRAY87].  The results of applying timing-performance optimization
to a 32-bit microprocessor design are reported.  
-------------------------------------------------------------------------------

CSL-TR-87-331
THROUGHPUT ANALYSIS OF THE IEEE 802.4 TOKEN BUS STANDARD UNDER HEAVY LOAD
Joseph Pang and Fouad Tobagi
May 1987                                                     43 pages.....$4.65
It  has  become  clear  in  the  last  few  years that there is a trend towards
integrated digital services.  Parallel to the development of public  Integrated
Services  Digital Network (ISDN) is service integration in the local area (e.g.
a campus, a building, an aircraft).  The types of  services  to  be  integrated
depend very much on the specific local environment.  However, applications tend
to  generate  data  traffic belonging to one of two classes.  According to IEEE
802.4 terminology, the first major class of traffic is termed synchronous, such
as packetized voice and data generated from other applications  with  real-time
constraints,  and  the  second class is called asynchronous which includes most
computer data traffic such as file transfer or facsimile.
In this report, we examine the IEEE 802.4 token bus  protocol  which  has  been
designed to support both synchronous and asynchronous traffic.  The protocol is
basically  a timer-controlled token bus access scheme.  By a suitable choice of
the design parameters, it can  be  shown  that  access  delay  is  bounded  for
synchronous  traffic.  As well, the bandwidth allocated to asynchronous traffic
can be controlled.  We present a throughput  analysis  of  the  protocol  under
heavy load with constant channel occupation of synchronous traffic and constant
token-passing times.  
-------------------------------------------------------------------------------

CSL-TR-87-332
ANALYSIS OF CACHE PERFORMANCE FOR OPERATING SYSTEMS AND
       MULTIPROGRAMMING
Anant Agarwal
May 1987                                                   170 pages.....$11.00
Advances  in  high-performance  processors continue to create an increased need
for memory bandwidth.  Caches  can  provide  this  bandwidth  cost-effectively.
However,  minimizing  the  performance  loss  due  to caching requires that our
analysis and prediction of cache  performance  become  more  exact.    Although
previous   studies  have  shown  that  operating  system  and  multiprogramming
activities affect the cache performance, those studies did not deal with  these
issues  in  detail, largely because of the unavailability of efficient analysis
techniques and the difficulty in collecting data for this analysis.  To  obtain
the  higher  hit  rates  needed to sustain the effectiveness of caches, we must
address these issues completely.
This dissertation investigates the performance of large  caches  for  realistic
operating  system  and  multiprogramming  workloads.   A suite of efficient and
accurate cache analysis techniques is developed.  These include:   a  new  data
collection method, a mathematical cache model, and a trace sampling and a trace
stitching  procedure.  The analyses use a data collection technique called ATUM
to obtain  realistic  system  traces  of  multitasking  workloads  with  little
distortion.
Aaccurately  characterizing  cache  behavior  using ATUM traces shows that both
operating system and  multiprogramming  activity  significantly  degrade  cache
performance,  with an even greater proportional impact on large caches.  From a
careful analysis  of  the  causes  of  this  degradation,  we  explore  various
techniques to reduce this loss.  While seemingly little can be done to mitigate
the  effect of system references, multitasking cache misses can be reduced with
little effort.  We also demonstrate how analytical cache  modeling,  and  trace
sampling -- with a new approach to cold-start and warm-start analysis -- can be
used to make large cache studies insightful and efficient.  
-------------------------------------------------------------------------------

                                 Publications

                          COMPUTER SYSTEMS LABORATORY

                              Stanford University

                              Stanford, CA 94305

                                 415-723-1430

                              (Hours: M-Th, 8-1)

                                  ORDER FORM

To Order Reports: Print or type your name and address in the space provided.
Check  or  circle  the  report(s)  you  wish  to  purchase, whether hardcopy or
microfiche.  All orders must be PREPAID.  CALIFORNIA RESIDENTS must  add  sales
tax  of  their  local  county.  Return this order form with your check or money
order made payable to Stanford University. FOREIGN ORDERS must be paid with  an
international money order or a check drawn on a U.S. bank, payable in dollars.
Please type or print your name and complete address:
______________________________________________
______________________________________________
______________________________________________
______________________________________________
______________________________________________
______________________________________________
Report #  Hardcopy  Microfiche     Report #  Hardcopy  Microfiche
--------  --------  ----------     --------  --------  ----------
TR 323      $4.60     $2.00        TR 328     $12.30     $3.00
TR 324     $11.15     $3.00        TR 329     $14.50     $3.00
TR 325      $6.20     $2.00        TR 330      $4.80     $2.00
TR 326      $3.70     $2.00        TR 331      $4.65     $2.00
TR 327      $3.05     $2.00        TR 332     $11.00     $3.00


                          Subtotal __________
          Your Local County CA tax __________
                             Total __________
-------