wilson@uicbert.eecs.uic.edu (12/27/88)
I am looking for general views of the future of Lisp systems, especially
parallel systems.  A major concern is the effect of Lisp's locality
problems on the performance of multiprocessors, and I am wondering about
the compatibility of "big" systems (with a single many-megabyte heap)
with multiprocessors.

What worries me is the continuing trend for processor speeds to increase
faster than memory speeds.  For now, people mostly build faster and
faster uniprocessors (or not-many-processors) with bigger and bigger
main and cache memories.  Generational garbage collection helps keep
things in RAM.  With big enough caches, I guess we can keep the youngest
generation in cache, but the locality problem will probably just keep
replicating itself at a new level.  And when do we have to make some
painful tradeoffs between clustering and scattering data, to optimize
the tradeoff between locality and contention in multiprocessors?

Bob Courts' adaptive incremental collector seems to give another nice
improvement in locality, halving RAM requirements.  (I have a couple of
ideas of my own, one of which I'm pretty sure will also give a
reasonable boost.)  But are these techniques going to be enough?

Some questions:

1. What is the irreducible locality cost of garbage collection?  (That
   is, if you gc a generation too often, you copy too much data without
   giving it time to die, but if you gc it less often, you have too much
   garbage interspersed with live data.)

2. What does the heap of a big Lisp system (e.g., 3600, Explorer)
   mostly hold?  Would locality be much better if there were
   heavyweight process "firewalls" separating the data and code of
   different utilities (like the editor, browser, compiler)?  I've
   heard Kyoto Common Lisp forks a separate process for compiling, and
   usually manages to finish a compile and just throw away the process
   without gc'ing it.  Does that really work well, and does a
   generational gc remove the need for it?

3. Does anybody have any data on the effect of different scavenging
   algorithms on cache-block-scale locality and/or contention?  On
   page-scale contention in distributed Lisp systems?  (Do people do
   distributed Lisp systems?)

4. Just how serious is the locality issue, really, both for now and in
   the future?  (How much time does your system lose to paging, and
   what happens when we have big multiprocessors?)

I'd appreciate any views and/or references on these issues, especially
references to overviews of relevant topics.

Thanks prematurely,
Paul

Paul R. Wilson
Human-Computer Interaction Laboratory       wilson%uicbert@uxc.cso.uiuc.edu
U. of Illin. at C. EECS Dept. (M/C 154)
Box 4348, Chicago, IL 60680
jeff@aipna.ed.ac.uk (Jeff Dalton) (01/05/89)
In article <63200004@uicbert.eecs.uic.edu> wilson@uicbert.eecs.uic.edu
writes:
> I've heard Kyoto Common Lisp forks a separate process for
> compiling, and usually manages to finish a compile and just
> throw away the process without gc'ing it.

I don't know about Ibuki Common Lisp, but the standard KCl does not
fork a Lisp process for compiling (though it does fork a C
compilation).  In my experience, KCl compilations spend most of their
time in the C compiler, and garbage collection isn't very noticeable
during the Lisp phases.  However, for very large images the GCs might
take longer.

One more thing.  The KCl compiler is often run as a separate process by
the user, using the shell command "lc".  It is also used in makefiles.
And then, of course, the compilations are separate processes.
wilson@uicbert.eecs.uic.edu (01/06/89)
Thanks to those who replied to the posting on locality and lisp and
multiprocessors.  I got more questions than answers, though, so I'll
post some clarifications of what I'm talking about.  This is long, but
in skippable chunks about different topics.

-----------------------------------------------------------------------
About the adaptive incremental gc to improve locality:

Bob Courts' adaptive incremental gc is the one used on the TI Explorer,
and it was described in the September 1988 CACM ("Improving Locality of
Reference in a Garbage-Collecting Memory Management System," _CACM_
31,9, pp. 1128-1138).  This garbage collector is similar to Moon's
Ephemeral Garbage Collector (for background, you might want to look at
his paper in the 1984 Lisp and Functional Programming conference
proceedings).

Both of these gc's are incremental in the basic manner of Baker's
algorithm: after a generation has been flipped, the program can resume
while objects are still being transitively traversed and copied from
fromspace to tospace.  This could cause trouble if a program reached
the old version of an already-transported object through a pointer that
had not yet been updated.  To avoid that, the program is never allowed
to see a pointer into fromspace -- if the program grabs such a pointer
from memory, it is transparently "fixed" before the program is allowed
to use it.  That is, the object in fromspace is examined; if it has not
been moved to tospace yet, it is moved; either way, the program is
handed a pointer to the new version of the object (in tospace) and
allowed to resume.

The nice thing about such a system is that the running program causes
fromspace objects that it references to be transported to tospace *in
the order that it first touches them*.  This reorganizes the objects
with very good locality.  The standard Baker algorithm doesn't derive
this benefit because of another aspect of its design.
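To make the read-barrier mechanism concrete, here is a toy model in
Python.  All the names are mine, and it models only the pointer fix-up
logic, not how any real Lisp system implements it (Lisp machines do the
fromspace check in hardware):

```python
# Toy model of a Baker-style read barrier.  A fromspace object gets a
# forwarding pointer once it has been transported; the barrier ensures
# the mutator only ever sees tospace versions of objects.

class Obj:
    def __init__(self, name):
        self.name = name
        self.forward = None        # set once the object moves to tospace

fromspace_ids = set()              # identities of objects in fromspace
tospace = []

def transport(obj):
    """Copy obj to tospace if it hasn't been copied yet; return the copy."""
    if obj.forward is None:
        obj.forward = Obj(obj.name)
        tospace.append(obj.forward)
    return obj.forward

def read_barrier(ptr):
    """Applied to every pointer the mutator loads: a fromspace pointer is
    transparently replaced by a tospace one, transporting on first touch."""
    return transport(ptr) if id(ptr) in fromspace_ids else ptr

a = Obj("a")
fromspace_ids.add(id(a))           # pretend 'a' lives in fromspace
p = read_barrier(a)                # first touch transports 'a'
q = read_barrier(a)                # later touches find the same copy
```

Note that in this scheme objects end up in tospace in the order the
running program first touches them, which is where the locality benefit
comes from.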
All of these systems must have a "background" scavenger that traverses
*all* of the live data that are not reached by the running program, to
ensure that they too are copied to tospace.  (You have to do this
because the program may not touch such an object during an incremental
gc, but may do so afterwards.)  The problem with the Baker algorithm is
that it interleaves the data copied by the background scavenger with
data that were copied by the running program.  This interleaves
inactive data with active data, degrading locality.

The Explorer GC copies data reached by the running program into a
different area than data reached first by the background scavenger.
There are several other refinements, but that's the basic idea.
Frequently accessed data are likely to be encountered first by the
running program and are copied, in order of access, to one area of
memory.  Other data are likely to be encountered first by the
scavenger, and copied to another.

Unfortunately, fine-grained incremental collection causes considerable
overhead if you don't have some special-purpose hardware (as Lisp
machines do) to check for fromspace pointers in parallel.  On the other
hand, the RAM costs appear to be greatly reduced by Courts' technique,
so a little bit of processor complexity may save hundreds or thousands
of dollars in RAM costs.

-----------------------------------------------------------------------
About the irreducible cost of garbage collection:

Lisp code often uses heap allocation for things that would be
stack-allocated in other languages.  In Pascal, for instance,
stack-allocated data are reclaimed immediately on procedure exit rather
than at some indefinite point in the future (i.e., scavenge time).
With a generational garbage collector, you can approximate this
arbitrarily closely by garbage collecting your youngest generation more
often.  This becomes prohibitive, however, if taken too far: you end up
copying other data too soon, without giving them time to die.
The data that in Pascal would be stack allocated force more frequent
scavenges, which means you copy the "true" heap data more often than
necessary.  ("True" heap data being data that would be heap allocated
in Pascal as well.)  To some degree, static analysis by a compiler can
be used to allocate these objects on the stack, but the problem
generally replicates itself at larger time scales.  (The analysis is
likely to help cache-scale locality but not the pagewise locality in
older, larger generations.)

One problem with most copying gc's is that they're very good at
reclaiming the space used by short-lived data generated by
applicative-style programming, but they're not particularly smart about
data whose lifetimes are closely tied to procedure calls and returns.
(I talk about this, and some things to do about it, in a working paper
in the last regular issue of SIGPLAN Notices.  The basic idea was put
forth in Lieberman and Hewitt's original 1983 CACM paper on
generational garbage collection; my paper is largely about efficient
implementation tricks.)

A related but more general problem is that most gc's lose efficiency by
having awkward policies for advancing data from one generation to the
next.  All of the collectors I've mentioned so far advance all copied
data to the next generation at every scavenge.  This means that
brand-new data are advanced along with data that were allocated just
before the previous scavenge.  Judging by the empirical data I've seen,
*most* of these data die comparatively soon thereafter, so that most of
your memory is holding garbage much of the time.  This degrades
locality considerably, I suspect.

Ungar's Generation Scavenging system has more precise control over
advancement because it associates a count with every object.  The count
reflects the number of times an object has survived a scavenge in a
generation, and the object is advanced to the next generation when its
count exceeds a threshold.
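As I understand it, the count mechanism amounts to something like the
following toy model in Python (the names and the threshold value are
mine, for illustration; this is not Ungar's actual implementation):

```python
# Toy model of Ungar-style advancement: each object carries a count of
# the scavenges it has survived in its generation, and is tenured to
# the next generation once the count reaches a threshold.

THRESHOLD = 2

class Obj:
    def __init__(self, name):
        self.name = name
        self.age = 0               # scavenges survived so far

def scavenge(young, old, live):
    """Copy the live survivors of the young generation, tenuring any
    that have now survived THRESHOLD scavenges."""
    survivors = []
    for obj in young:
        if obj in live:
            obj.age += 1
            (old if obj.age >= THRESHOLD else survivors).append(obj)
    return survivors

young = [Obj("a"), Obj("b")]
old = []
live = set(young)                   # pretend everything stays reachable
young = scavenge(young, old, live)  # both survive their first scavenge
young = scavenge(young, old, live)  # second survival: both are tenured
```

An object is tenured only after surviving the threshold number of
scavenges, so brand-new data are no longer advanced prematurely; the
price is storing and updating a count for every object.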
This avoids most premature advancement, but other aspects of Ungar's
design are troublesome.  In the original system, at least, there were
only two generations.  This is part of Ungar's strategy of avoiding the
need for an incremental collector, which is expensive on stock
hardware.  Only the new generation is scavenged during a user session,
and it is hoped that the old generation will not fill up during a user
session; it is only gc'd when the user is not using the system (e.g.,
at night).  The new generation is kept small enough that collecting it
does not take very long, so the user will not generally notice it much.

This strategy appears to have two problems: 1) some programs generate a
lot of data that survive long enough to be advanced; this fills up old
memory and forces a time-consuming gc of a large generation during a
user session, and 2) postponing any scavenging of this large, old
generation degrades locality, because the generation fills up with
"intermediate lifetime" data that become garbage soon on the timescale
of the old generation.  Ungar and Jackson discuss this in their 1988
OOPSLA paper.

The system I describe in the above-mentioned SIGPLAN Notices paper also
addresses this issue, from a different angle.  My approach is to go
ahead and have more generations, but to schedule scavenges cleverly so
that a) the user is unlikely to notice a pause, and b) the pauses are
generally shorter because the system is more efficient.  (The Tektronix
implementation of Smalltalk (described in Caudill and Wirfs-Brock's
1986 OOPSLA paper) uses several generations with a guarantee of age,
but doesn't attempt to schedule scavenges cleverly.)

A drawback to Ungar's scheme (and Tektronix's) is the storage and
manipulation of per-object counts.  In a forthcoming paper I discuss a
different technique for guaranteeing age precision, which is to use a
"bucket brigade" of spaces to segregate objects *within* a generation
by approximate age.
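Since that paper hasn't appeared yet, here is only a rough toy sketch
of the general idea, in Python with hypothetical names (two buckets per
generation; the real design has refinements this leaves out):

```python
# Toy model of a two-bucket "bucket brigade" within one generation.
# Survivors step creation -> aging -> next generation, so an object is
# advanced only after surviving at least two scavenges, without any
# per-object age count being stored.

def scavenge(creation, aging, next_gen, live):
    """One scavenge of the generation; returns the new buckets."""
    next_gen.extend(o for o in aging if o in live)   # old enough: advance
    survivors = [o for o in creation if o in live]   # newborns move to aging
    return [], survivors

creation, aging, next_gen = ["a", "b"], [], []
live = {"a"}                       # "b" dies before the first scavenge
creation, aging = scavenge(creation, aging, next_gen, live)
creation, aging = scavenge(creation, aging, next_gen, live)
# "a" survived two scavenges and has been advanced; "b" was never copied
```

The age information is carried by *which bucket* an object occupies
rather than by a count stored in the object, which is what makes the
scheme cheap to implement.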
I argue that in a system with several generations, a very simple system
with two buckets is not only efficient and easy to implement, it's what
you probably want anyway.  (This paper will be in SIGPLAN Notices in a
while, but I can send copies to anybody who's interested.)  I'm
currently implementing my design for Scheme-48, and it will serve as a
testbed for the ideas here and several others.

----------------------------------------------------------------------
About problems with parallel Lisps:

I don't actually know how parallel Lisp systems are different from
other parallel systems, but I expect the issues are many and subtle.

For example, suppose you have adapted a Courts-style adaptive collector
in a fairly straightforward way to run on a shared-memory
multiprocessor.  The objects that will *always* migrate into the
most-active region of the oldest generation are those that are shared
and frequently accessed by all processes.  (No matter what process is
running during an incremental scavenge, it will access and transport
those data to the most active region of the next generation.)  But
maybe these are exactly the data you *don't* want all lumped together,
at least at the resolution of cache blocks, because all of the
processors will be trying to access them frequently.  If two such
data-objects-of-contention live in the same cache block, the contention
goes up unnecessarily.

If you have a lot of preemptive multitasking, you can also just confuse
both your cache-block and page-replacement algorithms.  If you do a lot
of thread switching with Courts' collector, you'll get a lot of "phase
changes" that aren't likely to be repeated -- sequences of objects
accessed by one processor interleaved with sequences accessed by
another.  The next time the process executes, it's likely to be
preempted at a different point in its execution by a different process,
so unrelated data have gotten interleaved.
If one of the processes is migrated to another machine, the contention
could be pretty bad (I guess).  Heavyweight processes avoid this
because the data don't even exist in the same address space.  (Not that
I think it's necessarily worth the costs.)  But I don't really know
much about this, or how people deal with such issues (which is why I
asked for references and views on these things).

One important question is whether a garbage collector lumps unrelated
things together, so that even if processors are working on unrelated
things, they contend for data access or swap in useless stuff.

------------------------------------------------------------------
So that's what I'm thinking about.  What do others think?

  -- Paul

Paul R. Wilson                              off. ph. (312) 413-0042
Electronic Mind Control Laboratory*         wilson%uicbert@uxc.cso.uiuc.edu
U. of Illin. at C. EECS Dept. (M/C 154)
Box 4348, Chicago, IL 60680
(* a.k.a. Human Computer Interaction Lab)