[comp.lang.lisp] big lisp, locality, and multiprocessors

wilson@uicbert.eecs.uic.edu (12/27/88)

I am looking for general views of the future of Lisp systems, especially
parallel systems.  A major concern is the effect of Lisp's locality
problems on the performance of multiprocessors, and I am wondering
about the compatibility of "big" systems (with a single many-megabyte
heap) with multiprocessors.

What worries me is the continuing trend for processor speeds to increase
faster than memory speeds.  For now, people mostly build faster and
faster uniprocessors (or not-many-processors) with bigger and bigger
main and cache memories.

Generational garbage collection helps keep things in RAM.  With big
enough caches, I guess we can keep the youngest generation in cache,
but the locality problem will probably just keep replicating itself
at a new level.  And when do we have to make some painful choices
between clustering and scattering data, to balance locality against
contention in multiprocessors?

Bob Courts' adaptive incremental collector seems to give another nice
improvement in locality, halving RAM requirements.  (I have a couple
of ideas of my own, one of which I'm pretty sure will also give a
reasonable boost.)  But are these techniques going to be enough? 

Some questions:

  1.  What is the irreducible locality cost of garbage
      collection?  (i.e., if you gc a generation too
      often, you copy too much data without giving it time to die,
      but if you gc it less often, you have too much garbage
      interspersed with live data.)

  2.  What does the heap of a big Lisp system (e.g., a 3600 or Explorer)
      mostly hold?  Would locality be much better if there were
      heavyweight process "firewalls" separating the data and code
      of different utilities (like the editor, browser, compiler)?
      I've heard Kyoto Common Lisp forks a separate process for
      compiling, and usually manages to finish a compile and just
      throw away the process without gc'ing it.  Does that really
      work well, and does a generational gc remove the need for it?

  3.  Does anybody have any data on the effect of different scavenging
      algorithms on cache-block-scale locality and/or contention?  On
      page-scale contention in distributed Lisp systems?  (Do people
      do distributed Lisp systems?)

  4.  Just how serious is the locality issue, really, both for now
      and in the future?  (How much time does your system lose to
      paging, and what happens when we have big multiprocessors?) 


I'd appreciate any views and/or references on these issues, especially
references to overviews of relevant topics.

Thanks prematurely,

Paul


Paul R. Wilson                         
Human-Computer Interaction Laboratory
U. of Illin. at C. EECS Dept. (M/C 154)   wilson%uicbert@uxc.cso.uiuc.edu
Box 4348   Chicago,IL 60680 

jeff@aipna.ed.ac.uk (Jeff Dalton) (01/05/89)

In article <63200004@uicbert.eecs.uic.edu> wilson@uicbert.eecs.uic.edu writes:
>
>      I've heard Kyoto Common Lisp forks a separate process for
>      compiling, and usually manages to finish a compile and just
>      throw away the process without gc'ing it.

I don't know about Ibuki Common Lisp, but the standard KCl does not
fork a Lisp process for compiling (though it does fork a C compilation).

In my experience, KCl compilations spend most of their time in the C
compiler, and garbage collection isn't very noticeable during the Lisp
phases.  However, for very large images the GCs might take longer.

One more thing.  The KCl compiler is often run as a separate process
by the user, using the shell command "lc".  It is also used in makefiles.
And then, of course, the compilations are separate processes.

wilson@uicbert.eecs.uic.edu (01/06/89)

 Thanks to those who replied to the posting on locality, Lisp, and
multiprocessors.  I got more questions than answers, though, so I'll
post some clarifications of what I'm talking about.  This is long,
but in skippable chunks about different topics.

-----------------------------------------------------------------------
About the adaptive incremental gc to improve locality:

   Bob Courts' adaptive incremental gc is the one used on the TI Explorer,
and it was described in the September 1988 CACM ("Improving Locality of
Reference in a Garbage-Collecting Memory Management System," _CACM_ 31,9,
pp 1128-1138.)

   This garbage collector is similar to Moon's Ephemeral Garbage
Collector (for background, you might want to look at his paper in the 1984
Lisp and Functional Programming conference proceedings.)

   Both of these gc's are incremental in the basic manner of Baker's 
algorithm.  (After a generation has been flipped, the program can resume while
objects are still being transitively traversed and copied from fromspace to
tospace.  This could cause trouble if the program reached the old (fromspace)
version of an already-transported object through a pointer that had not yet
been updated.  To avoid that, the program is never allowed
to see a pointer into fromspace -- if the program grabs such a pointer
from memory, it is transparently "fixed" before the program is allowed
to use it.  That is, the object in fromspace is examined; if it has
not been moved to tospace yet, it is moved; either way, the program is
handed a pointer to the new version of the object (in tospace) and allowed
to resume.)
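
   To make the mechanism concrete, here is a rough sketch in C of what that
check (the "read barrier") does.  The object layout and names here are made
up for illustration; real Lisp machines do this check in microcode or special
hardware rather than in software like this.

    #include <stddef.h>
    #include <string.h>

    typedef struct object {
        struct object *forward;   /* NULL until transported; then it points
                                     to the tospace copy                    */
        int            nfields;
        struct object *field[];   /* pointers to other objects              */
    } object;

    extern char *fromspace_lo, *fromspace_hi;  /* bounds of fromspace       */
    extern char *tospace_free;                 /* fill pointer in tospace   */

    static int in_fromspace(object *p) {
        return (char *)p >= fromspace_lo && (char *)p < fromspace_hi;
    }

    /* Copy an object to tospace if it hasn't been copied yet, and
       return the address of the tospace version. */
    static object *transport(object *p) {
        if (p->forward) return p->forward;
        size_t size = sizeof(object) + p->nfields * sizeof(object *);
        object *copy = (object *)tospace_free;
        memcpy(copy, p, size);
        tospace_free += size;
        p->forward = copy;
        return copy;
    }

    /* Every pointer the program loads from memory goes through this,
       so the program never sees a fromspace pointer. */
    object *barrier_read(object **slot) {
        object *p = *slot;
        if (p && in_fromspace(p))
            *slot = p = transport(p);
        return p;
    }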

   The nice thing about such a system is that the running program
causes fromspace objects that it references to be transported to tospace
*in the order that it first touches them*.  This reorganizes the objects
with very good locality.  The standard Baker algorithm doesn't derive
this benefit because of another aspect of its design.  All of these
systems must have a "background" scavenger that traverses *all* of the
live data that are not reached by the running program, to ensure that
they too are copied to tospace.  (You have to do this because the program
may not touch such an object during an incremental gc, but may do so afterwards.)

   The problem with the Baker algorithm is that it interleaves the data
copied by the background scavenger with data that were copied by the
running program.  This interleaves inactive data with active data, degrading
locality.  The Explorer GC copies data reached by the running program
into a different area than data reached first by the background scavenger.
There are several other refinements, but that's the basic idea.  Frequently
accessed data are likely to be encountered first by the running program
and are copied in order of access to one area of memory.  Other data are
likely to be encountered first by the scavenger, and copied to another.
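
   The change can be sketched as a small variation on the transport routine
above (again, the names are hypothetical; this only illustrates the policy,
not the Explorer's actual code): give tospace two fill pointers, and choose
between them depending on who asked for the transport.

    extern char *mutator_free;    /* region for data touched by the program  */
    extern char *scavenger_free;  /* region for data found by the scavenger  */

    static object *transport_to(object *p, char **free_ptr) {
        if (p->forward) return p->forward;
        size_t size = sizeof(object) + p->nfields * sizeof(object *);
        object *copy = (object *)*free_ptr;
        memcpy(copy, p, size);
        *free_ptr += size;
        p->forward = copy;
        return copy;
    }

    /* The read barrier calls        transport_to(p, &mutator_free);
       the background scavenger calls transport_to(p, &scavenger_free). */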

   Unfortunately, fine-grained incremental collection causes considerable
overhead if you don't have some special-purpose hardware (as Lisp machines
do) to check for fromspace pointers in parallel.  On the other hand, the
RAM costs appear to be greatly reduced by Courts' technique, so a little
bit of processor complexity may save hundreds or thousands of dollars in
RAM costs.

-----------------------------------------------------------------------
About the irreducible cost of garbage collection:

Lisp code often uses heap allocation for things that would be stack-
allocated in other languages.  In Pascal, for instance, stack-allocated
data are reclaimed immediately on procedure exit rather than at some
indefinite point in the future (i.e., scavenge time).  With a generational
garbage collector, you can approximate this arbitrarily closely by
garbage collecting your youngest generation more often.

   This becomes prohibitive, however, if taken too far.  You end up copying
other data too soon, without giving them time to die.  The data
that in Pascal would be stack allocated force more frequent
scavenges, which means you copy the "true" heap data more often
than necessary.  ("True" heap data being data that would be heap allocated
in Pascal as well.)  To some degree, static analysis by a compiler can
be used to allocate these objects on the stack, but the problem generally 
replicates itself at larger time scales.  (The analysis is likely to
help cache-scale locality but not the pagewise locality in older, larger
generations.)
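
   A back-of-the-envelope model makes the tradeoff visible (my own
simplification, not taken from any of the papers cited here): charge each
scavenge with recopying everything in the generation that is going to
survive it anyway.

    #include <stdio.h>

    /* Copying work per byte allocated is roughly live/alloc, while the
       average amount of dead data lingering in the generation is roughly
       alloc/2 (assuming most allocations die before the next scavenge).
       Scavenging four times as often, to reclaim stack-like data quickly,
       recopies the longer-lived data four times as often. */
    static void cost_model(double live, double alloc) {
        printf("copy cost per byte allocated: %.3f\n", live / alloc);
        printf("average dead data resident:   %.0f bytes\n", alloc / 2.0);
    }

    int main(void) {
        cost_model(500e3, 1e6);    /* 500KB live, scavenge every 1MB alloc'd */
        cost_model(500e3, 250e3);  /* scavenge 4x as often                   */
        return 0;
    }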

    One problem with most copying gc's is that they're very good at
reclaiming the space used by short-lived data generated by applicative-
style programming, but they're not particularly smart about data whose
lifetimes are closely tied to procedure calls and returns.  (I talk
about this, and some things to do about it, in a working paper in the
last regular issue of SIGPLAN Notices.  The basic idea was put forth
in Lieberman and Hewitt's original 1983 CACM paper on generational garbage
collection; my paper is largely about efficient implementation tricks.)

   A related but more general problem is that most gc's lose efficiency
by having awkward policies for advancing data from one generation to the
next.  All of the collectors I've mentioned so far advance all copied
data to the next generation at every scavenge.  This means that brand-
new data are advanced along with data that were allocated just before
the previous scavenge.  Judging by the empirical data I've seen, *most*
of these data die comparatively soon thereafter, so that most of your
memory is holding garbage much of the time.  This degrades locality
considerably, I suspect.

   Ungar's Generation Scavenging system has more precise control over
advancement because it associates a count with every object.  The
count reflects the number of times an object has survived a scavenge
in a generation, and the object is advanced to the next generation
when its count exceeds a threshold.  This avoids most premature
advancement, but other aspects of Ungar's design are troublesome.  In
the original system, at least, there were only two generations.  This
is part of Ungar's strategy of avoiding the need for an incremental
collector, which is expensive on stock hardware.  Only the new generation
is scavenged during a user session, and it is hoped that the old generation
will not fill up during a user session; it is only gc'd when the user
is not using the system (e.g., at night).  The new generation is kept
small enough that collecting it does not take very long, and the user
will not generally notice it much.
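
   In sketch form (hypothetical names; the real system also has special
creation and survivor spaces, remembered sets, and so on), the advancement
rule looks something like this:

    #define ADVANCE_THRESHOLD 4   /* scavenges survived before promotion */

    typedef struct obj {
        struct obj *forward;      /* NULL until copied at this scavenge  */
        unsigned    survivals;    /* scavenges survived so far           */
        /* ... the object's fields ... */
    } obj;

    /* Placeholders for the two possible destinations. */
    extern obj *copy_to_old_generation(obj *p);
    extern obj *copy_to_young_survivor_space(obj *p);

    /* Called by the scavenger for each live young-generation object. */
    obj *scavenge_object(obj *p) {
        if (p->forward) return p->forward;
        obj *copy;
        if (++p->survivals >= ADVANCE_THRESHOLD)
            copy = copy_to_old_generation(p);        /* old enough: promote */
        else
            copy = copy_to_young_survivor_space(p);  /* keep it young       */
        p->forward = copy;
        return copy;
    }

The point is just that nothing is promoted until it has demonstrably survived
several scavenges, rather than being promoted the first time it happens to be
live at a scavenge.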

The two-generation strategy appears to have two problems:

  1) some programs generate a lot of data that survive long enough to
be advanced; this fills up old memory and forces a time-consuming
gc of a large generation during a user session, and
  2) postponing any scavenging of this large, old generation degrades
locality, because it fills up with "intermediate lifetime" data that become
garbage soon on the timescale of the old generation.

Ungar and Jackson discuss this in their 1988 OOPSLA paper.  The system
I describe in the above-mentioned SIGPLAN Notices paper also addresses
this issue, from a different angle.  My approach is to go ahead and
have more generations, but to schedule scavenges cleverly so that a) the
user is unlikely to notice a pause, and b) the pauses are generally shorter
because the system is more efficient.

(The Tektronix implementation of Smalltalk (described in Caudill & Wirfs-Brock's
1986 OOPSLA paper) uses several generations with a guarantee of age, but doesn't
attempt to schedule scavenges cleverly.)

A drawback to Ungar's scheme (and Tektronix') is the storage and manipulation
of per-object counts.  In a forthcoming paper I discuss a different technique
of guaranteeing age precision, which is to use a "bucket brigade" of
spaces to segregate objects *within* a generation by approximate age.  I
argue that in a system with several generations, a very simple system with
two buckets is not only efficient and easy to implement, it's what you
probably want anyway.  (This paper will be in SIGPLAN Notices in a while,
but I can send copies to anybody who's interested.)  I'm currently
implementing my design for Scheme-48, and it will serve as a testbed
for the ideas here and several others.
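
   In rough sketch form, a two-bucket generation might look like this
(names are hypothetical and the space management details are omitted; this
is only an illustration of the idea, not the Scheme-48 implementation):

    typedef struct generation generation;
    struct space;   /* an area of memory; representation omitted */

    extern void copy_live_objects(struct space *from, struct space *to);
    extern void reset_space(struct space *s);
    extern struct space *allocation_space(generation *g);

    struct generation {
        struct space *young_bucket;  /* arrived since the last scavenge here */
        struct space *old_bucket;    /* survived at least one scavenge here  */
        generation   *next;          /* the next older generation            */
    };

    void scavenge_generation(generation *g) {
        /* Objects in the old bucket have been resident for at least one
           full scavenge cycle, so their survivors are advanced. */
        copy_live_objects(g->old_bucket, allocation_space(g->next));
        reset_space(g->old_bucket);

        /* Survivors of the young bucket are not advanced yet; they move
           into the now-empty old bucket, which records their age without
           any per-object counts. */
        copy_live_objects(g->young_bucket, g->old_bucket);
        reset_space(g->young_bucket);
    }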

----------------------------------------------------------------------
About problems with parallel Lisps:

I don't actually know how parallel Lisp systems are different from other
parallel systems, but I expect the issues are many and subtle.  For example,
suppose you have adapted a Courts-style adaptive collector in a fairly
straightforward way to run on a shared-memory multiprocessor.  The objects
that will *always* migrate into the most-active region of the oldest
generation are those that are shared and frequently accessed by all
processes.  (No matter what process is running during an incremental
scavenge, it will access and transport those data to the most active 
region of the next generation.)  But maybe these are exactly the data
you *don't* want all lumped together, at least at the resolution of
cache blocks, because all of the processors will be trying to access
them frequently.  If two such data-objects-of-contention live in the
same cache block, the contention goes up unnecessarily.
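
   One crude mitigation (my own illustration of the clustering-versus-
scattering tradeoff, not something from Courts' paper) would be to pad the
transport of known-shared objects out to cache-block boundaries, so that no
two of them share a block -- at the cost of wasted space, which is exactly
the scattering side of the tradeoff:

    #include <stddef.h>
    #include <stdint.h>

    #define CACHE_BLOCK 64          /* assumed cache block size, in bytes */

    extern char *tospace_free;      /* fill pointer for the target region */

    /* Allocate space for a transported object, aligned so that objects
       allocated this way never share a cache block with each other. */
    void *allocate_on_block_boundary(size_t size) {
        uintptr_t p = (uintptr_t)tospace_free;
        p = (p + CACHE_BLOCK - 1) & ~(uintptr_t)(CACHE_BLOCK - 1);
        tospace_free = (char *)p + size;
        return (void *)p;
    }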

If you have a lot of preemptive multitasking, you can also just confuse
both your cache-block and page-replacement algorithms.  If you do a lot
of thread switching with Courts' collector, you'll get a lot of "phase
changes" that aren't likely to be repeated -- sequences of objects
accessed by one processor interleaved with sequences accessed by another.
The next time the process executes, it's likely to be preempted at
a different point in its execution by a different process, so unrelated
data have gotten interleaved.  If one of the processes is migrated
to another machine, the contention could be pretty bad (I guess).
Heavyweight processes avoid this because the data don't even exist in the
same address space.  (Not that I think it's necessarily worth the costs.)

But I don't really know much about this, or how people deal with such
issues (which is why I asked for references and views on these things).
One important question is whether a garbage collector lumps unrelated
data together, so that even processors working on unrelated tasks
contend for data access or swap in useless stuff.

------------------------------------------------------------------

So that's what I'm thinking about.  What do others think?

  -- Paul

Paul R. Wilson                         
Electronic Mind Control Laboratory*       off. ph. (312) 413-0042
U. of Illin. at C. EECS Dept. (M/C 154)   wilson%uicbert@uxc.cso.uiuc.edu
Box 4348   Chicago,IL 60680      (* a.k.a. Human Computer Interaction Lab)