[comp.object] [generational] Garbage collection -- difficulty, cost, generality

wilson@carcoar.Stanford.EDU (Paul Wilson) (03/16/90)

In article <2356@syma.sussex.ac.uk> aarons@syma.sussex.ac.uk (Aaron Sloman) writes:
> I (Paul Wilson) write:
>
>> ...................Don't expect a fast language on stock
>> hardware to have gc speed costs of less than about 10% on average.
>> I think in the 8-15 percent range myself.
>    ..........
>
>I thought 8-15 percent was rather high, so I thought I would do some
>experiments in Pop-11 (a Lisp-like language with a Pascal-like
>syntax) in the Poplog system on a Sun4/260 (stock hardware I think).
>
> [ ...description of programs run, short resulting gc times...]

Two points:  I probably didn't make it clear in my posting that I was
referring to *generational* gc's.  They have a higher base overhead,
but perform better when faced with large, data-hungry applications.
(They also tend to have short pause times that are less likely to
interfere with interactive response.)  Also, I'm unclear about the
system you're running -- I don't know much about Pop.

The programs you're testing appear not to stress a gc much.  This is
likely because they're short and they don't ever have much live
data.  Running a small program several times is NOT at all like running a
big program once.  Such experiments can be *very* misleadingly optimistic.
(That's one of the reasons why I'm trying to collect scalable (Scheme)
programs for benchmarking gc's on different problem sizes.)

Think about it this way.  After every execution of a program, all of its 
intermediate data objects, and probably its results, are garbage.  They
will incur _no_ copying cost. If you execute small programs repeatedly
between gc's, you generally only copy live data for those few iterations 
that a gc happens to occur during.  Objects from other iterations are
all dead.
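
To make that concrete, here's a toy cost model -- all numbers invented for
illustration -- of a copying collector running a small benchmark many times
between scavenges.  The copying work tracks the data live at each scavenge,
not the total allocation:

    /* Toy cost model of a copying collector running a small benchmark
     * many times between scavenges.  All figures are invented. */
    #include <stdio.h>

    int main(void)
    {
        const double alloc_per_run_kb = 10.0 * 1024; /* each run allocates ~10 megabytes  */
        const double live_kb          = 100.0;       /* but only ~100 KB is live at once  */
        const double semispace_kb     = 8.0 * 1024;  /* 8 MB of allocation per scavenge   */
        const int    runs             = 50;

        double total_alloc_kb = alloc_per_run_kb * runs;
        int    scavenges      = (int)(total_alloc_kb / semispace_kb);
        double copied_kb      = scavenges * live_kb; /* only live data gets copied */

        printf("allocated %.0f KB in %d runs, %d scavenges, copied %.0f KB total\n",
               total_alloc_kb, runs, scavenges, copied_kb);
        return 0;
    }

Half a gigabyte gets allocated, but only a few megabytes ever get copied,
which is why the measured gc percentage looks so flatteringly small.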

Running programs that do a lot of number consing is particularly bad,
because those objects are typically short-lived and will flatter
your gc efficiency.  (Unless you store pointers to the heap-allocated
numbers into objects in an older generation.  Then your performance
can get bad fast.)

Garbage collection times correspond roughly to the average amount of
data that is live during execution.  If you run a benchmark a bunch
of times between scavenges, the raw memory usage scales up, but the 
copying cost does not.  This creates an illusion that the gc is working 
very well.  In real long-running programs, the amount of live data
*is* typically larger, sometimes dramatically larger, and the copying 
cost goes up significantly.


A big hunk of the cost of a generational gc is incurred _during_
_program_execution_, not just at scavenges.  (Stores into the heap have
to be checked and/or recorded, to keep track of intergenerational 
pointers.)  So if you just record pause times, you miss this component 
of gc cost entirely.  For little programs with little live data, this is 
likely to be the major cost of generational garbage collection.  (I
estimate a minimum of about 3%, but probably more like 6%, and maybe more,
for fast optimized code.)
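
For concreteness, here is a minimal sketch of the kind of store check I
mean; the card size, table layout, and names are my own invention, not any
particular system's barrier:

    /* Sketch of a card-marking write barrier.  Sizes and names invented. */
    #include <stdint.h>

    #define CARD_BITS   9                      /* assume 512-byte cards     */
    #define HEAP_BYTES  (16u * 1024 * 1024)    /* assume a 16-megabyte heap */

    static char          heap[HEAP_BYTES];
    static unsigned char card_table[HEAP_BYTES >> CARD_BITS];

    /* Every store of a pointer into a heap object goes through this.  The
     * dirty-card marks let a scavenge find old->young pointers later
     * without scanning the whole older generation. */
    static void write_barrier(void **slot, void *new_value)
    {
        *slot = new_value;
        card_table[((uintptr_t)slot - (uintptr_t)heap) >> CARD_BITS] = 1;
    }

    int main(void)
    {
        void **slot = (void **)heap;       /* pretend: a field of an old object  */
        write_barrier(slot, heap + 64);    /* the store itself, plus the marking */
        return card_table[0] ? 0 : 1;      /* card 0 is now dirty */
    }

Every pointer store pays for that marking whether or not a scavenge ever
runs; that is the continual overhead I'm talking about.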

  Again, you can get _very_optimistic_ estimates of gc performance if 
you ignore this cost.  The right way to test this is to recompile the
code, leaving out the checking/marking code emitted for generational 
gc, and see how much faster your programs run without that continual 
overhead.
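
One way to set that comparison up (a sketch only, reusing the write_barrier
above; the macro name is mine) is to route every pointer store through a
single macro and build the system twice:

    /* Build once normally and once with -DNO_WRITE_BARRIER, on a heap big
     * enough that no scavenge ever fires; the difference in run time is
     * the continual cost of the barrier. */
    #ifdef NO_WRITE_BARRIER
    #  define STORE_PTR(slot, val)  (*(slot) = (val))
    #else
    #  define STORE_PTR(slot, val)  write_barrier((slot), (val))
    #endif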

(But maybe Pop-11 doesn't have a generational gc.  If not, I guess
it doesn't have a lot of heap-resident system stuff like big editors,
browsers and compilers;  that would increase the pause times.)

   -- Paul

Paul R. Wilson                         
Software Systems Laboratory               lab ph.: (312) 996-9216
U. of Illin. at C. EECS Dept. (M/C 154)   wilson@bert.eecs.uic.edu
Box 4348   Chicago,IL 60680 

gza@mentor.cc.purdue.edu (William R Burdick) (03/20/90)

Dave Ungar's paper in OOPSLA 89 (first paper in the book, I forget the
name, but it starts with "Tenuring") is about a scheme to tenure
objects based on continual feedback from the garbage collector.  He
uses only 2 generations (young and old) and claims an *average* of 3%
overhead for interactive programs in Smalltalk-80 (which generates
garbage much faster than LISP -- I've heard estimates of 10 times that
of LISP).
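
The core of the feedback loop is simple enough to sketch.  Here is my own
paraphrase of the idea, with invented constants -- this is not Ungar's
actual code:

    /* Demographic-feedback tenuring, roughly: after each scavenge, look at
     * how much data survived.  If a lot keeps surviving, tenure sooner so
     * it stops being re-copied every scavenge; if little survives, back
     * off so objects get more chances to die young. */
    static int tenure_threshold = 8;   /* scavenges survived before promotion */

    static void adjust_tenuring(long surviving_bytes, long survivor_space_bytes)
    {
        if (surviving_bytes > survivor_space_bytes / 2 && tenure_threshold > 1)
            tenure_threshold--;        /* survivors piling up: promote more eagerly */
        else if (surviving_bytes < survivor_space_bytes / 8 && tenure_threshold < 16)
            tenure_threshold++;        /* little survives: let objects die young */
    }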

For details about how his scheme works, you can also look at his paper
in ACM Software Engineering Notes, May 1984 (pages 157-167).  Or if
enough people want it, I'll post a summary of the techniques.

      	
--
	-- Bill Burdick
	burdick@cello.ecn.purdue.edu

aarons@syma.sussex.ac.uk (Aaron Sloman) (03/20/90)

wilson@carcoar.Stanford.EDU (Paul Wilson) writes:

> Date: 15 Mar 90 23:13:31 GMT
>
> In article <2356@syma.sussex.ac.uk> aarons@syma.sussex.ac.uk (Aaron Sloman)
> writes:
> > I (Paul Wilson) write:
> >
> >> ...................Don't expect a fast language on stock
> >> hardware to have gc speed costs of less than about 10% on average.
> >> I think in the 8-15 percent range myself.
> >    ..........

> >(Sloman:)
> >I thought 8-15 percent was rather high, so I thought I would do some
> >experiments in Pop-11 (a Lisp-like language with a Pascal-like
> >syntax) in the Poplog system on a Sun4/260 (stock hardware I think).
> >
> > [ ...description of programs run, short resulting gc times...]
>
> (Wilson:)
> Two points:  I probably didn't make it clear in my posting that I was
> referring to *generational* gc's.

I did not realize that was your intention. Poplog does not at present
have a generational gc. It has a stop-and-copy gc for use when there's
lots of memory/swap space available and a shuffling Morris gc for use
when there isn't, or paging is too slow. It also allows the heap to be
"locked": e.g. if you have compiled a very large suite of programs, lock
the heap after compilation, and then only a small amount of store will
need to be copied on each GC, though all still has to be scanned. (This,
I believe, produces the advantages of a generational GC, except that it
isn't automated.)

Poplog's heap can have "holes" in it for items allocated by external
programs linked in (e.g. C, Fortran). It can also store non-relocatable
Poplog data in such holes, e.g. for use in connection with external
programs.

> ........   Also, I'm unclear about the
> system you're running -- I don't know much about Pop.
>
> The programs you're testing appear not to stress a gc much.

It's hard to stress the Poplog GC as it is very fast. So only
frequent GC's cause problems, and, as I said, you can do things
to reduce frequency, like expanding heap space.

> ... This is
> likely because they're short and they don't ever have much live
> data.

One reason I chose compilation for part of the test is that the amount
of "live" data steadily increases, at least with Poplog's incremental
compiler, which allocates procedures in heap space, along with
everything else. During compilation of the Prolog subsystem of Poplog,
GC time was about 3 per cent. When compiling the Common Lisp extension,
which increases the used heap area by a megabyte, the GC time was about
3.7 percent. Both percentages can be decreased by allocating more heap
space, and/or by locking the heap after each file is compiled.

> Running a small program several times is NOT at all like running a
> big program once.  Such experiments can be *very* misleadingly optimistic.

I agree, but part of my point was also that for a given program you can
vary the percentage of GC time by setting a small or large limit for the
heap size. By allocating a VERY large heap (system memory and swap space
permitting) I can reduce the frequency of garbage collections and
thereby substantially reduce the percentage of time spent on gc.
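
To make that trade-off explicit, here is a toy amortized-cost model (all
figures invented) for a two-space copying collector: each collection copies
only the live data, and one is triggered roughly every (half the heap minus
the live data) bytes of allocation, so the gc fraction falls as the heap
grows:

    /* GC time fraction of a stop-and-copy collector as the heap grows.
     * All figures are invented guesses. */
    #include <stdio.h>

    int main(void)
    {
        const double live_kb        = 2.0 * 1024;  /* ~2 megabytes live on average */
        const double copy_ms_per_kb = 0.02;        /* assumed copying cost         */
        const double alloc_kb_per_s = 4.0 * 1024;  /* assumed allocation rate      */

        for (double heap_mb = 8; heap_mb <= 128; heap_mb *= 2) {
            double alloc_between_gcs_kb = heap_mb * 1024 / 2 - live_kb;
            double secs_between_gcs     = alloc_between_gcs_kb / alloc_kb_per_s;
            double gc_secs              = live_kb * copy_ms_per_kb / 1000.0;
            printf("heap %4.0f MB: gc every %6.1f s, gc fraction %5.2f%%\n",
                   heap_mb, secs_between_gcs,
                   100.0 * gc_secs / (secs_between_gcs + gc_secs));
        }
        return 0;
    }

With these (invented) figures the gc fraction drops from roughly 8 per cent
with an 8 megabyte heap to well under 1 per cent at 128 megabytes, provided
of course that the machine can hold the larger heap without paging.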

So statistics are more or less useless without full information about
such variables.

> Running programs that do a lot of number consing is particularly bad,
> because those objects are typically short-lived and will flatter
> your gc efficiency.  (Unless you store pointers to the heap-allocated
> numbers into objects in an older generation.  Then your performance
> can get bad fast.)

Alternatively, you may have very little spare heap space and therefore
have to reclaim space FREQUENTLY, in which case the performance will
also get bad.

> ....  In real long-running programs, the amount of live data
> *is* typically larger, sometimes dramatically larger, and the copying
> cost goes up significantly.

I guess what's real, or typical, is a matter of style, which is why
benchmarking is so difficult. I do a lot of real work in the Poplog
editor - reading and sending mail, editing, developing programs,
examining online documentation, browsing through news bulletins, running
a Pop-11-based troff previewer, etc. - and many of the files I
regularly read into the editor for browsing and editing are a few
hundred kilobytes or more in size. The editor uses a vector of strings
for a file, and while you work on a line, the corresponding string will
be discarded every time it needs to grow, and every time you move off a
string with wasted space to edit another line. So garbage is being
created all the time. Yet with a big heap you can work for ages without
invoking a GC and if you don't actually trace GC's you won't notice them
because they are so fast (except on an overloaded shared machine).

> (But maybe Pop-11 doesn't have a generational gc.  If not, I guess
> it doesn't have a lot of heap-resident system stuff like big editors,
> browsers and compilers;  that would increase the pause times.)

No, it doesn't have a generational gc, and most of the core system
stuff, like the editor, compilers, utilities, etc., is not in the
heap. However, there are large libraries of utilities, which get
added to the heap if used. (On VMS, SunOS, and Dynix, saved images
can be created in which user code and data are in a shared,
non-garbage-collectable segment.)

Anyhow, it appears, from what you say, that generational GCs have a
serious performance penalty, and that it may be preferable to have a
very fast, infrequently invoked, alternative mechanism.

Apologies for letting this grow too long. It's all that heap size
available....

Aaron Sloman,
School of Cognitive and Computing Sciences,
Univ of Sussex, Brighton, BN1 9QH, England
    EMAIL   aarons@cogs.sussex.ac.uk
            aarons%uk.ac.sussex.cogs@nsfnet-relay.ac.uk

wilson@carcoar.Stanford.EDU (Paul Wilson) (03/21/90)

In article <2372@syma.sussex.ac.uk> aarons@syma.sussex.ac.uk (Aaron Sloman) writes:
>wilson@carcoar.Stanford.EDU (Paul Wilson) writes:
>
>> Date: 15 Mar 90 23:13:31 GMT
>>
>> In article <2356@syma.sussex.ac.uk> aarons@syma.sussex.ac.uk (Aaron Sloman)
>> writes:
>> > I (Paul Wilson) write:
>> >
>> >> ...................Don't expect a fast language on stock
>> >> hardware to have gc speed costs of less than about 10% on average.
>> >> I think in the 8-15 percent range myself.
>> >    ..........
>
>> >(Sloman:)
>> >I thought 8-15 percent was rather high, so I thought I would do some
>> >experiments in Pop-11 (a Lisp-like language with a Pascal-like
>> >syntax) in the Poplog system on a Sun4/260 (stock hardware I think).
>> >
>> > [ ...description of programs run, short resulting gc times...]
>>
>> (Wilson:)
>> Two points:  I probably didn't make it clear in my posting that I was
>> referring to *generational* gc's.
>
>I did not realize that was your intention. Poplog does not at present
>have a generational gc. It has a stop-and-copy gc for use when there's
>lots of memory/swap space available and a shuffling Morris gc for use
>when there isn't, or paging is too slow. It also allows the heap to be
>"locked": e.g. if you have compiled a very large suite of programs, lock
>the heap after compilation, and then only a small amount of store will
>need to be copied on each GC, though all still has to be scanned. (This,
>I believe, produces the advantages of a generational GC, except that it
>isn't automated.)

This is a reasonable intermediate strategy, between a simple gc
and a generational one.  Since scanning is several times faster
than copying (as long as no paging is involved), it can be a win.
It avoids the generation checking and/or marking of a fully generational
gc, but still gives you part of the benefit.  For a large system
with large heap images (either system goop or application data),
though, it will not do well.  Scanning a many-megabyte image can
thrash the system at every gc.


>> ........   Also, I'm unclear about the
>> system you're running -- I don't know much about Pop.
>>
>> The programs you're testing appear not to stress a gc much.
>
>It's hard to stress the Poplog GC as it is very fast. So only
>frequent GC's cause problems, and, as I said, you can do things
>to reduce frequency, like expanding heap space.

But memory is not free.  In a high-performance native-code-compiled
system, your performance will generally drop off quickly at the
point where you're allocating and scavenging more virtual memory
than you can keep in physical memory.  This is becoming more severe
all the time, as processor speed gains continue to outstrip disk
speed gains.  Decreasing memory costs tend to counter this
trend, but it's unclear what the bottom line is.  (And the
various trends have different impacts on different users.)

>> ... This is
>> likely because they're short and they don't ever have much live
>> data.
>
>One reason I chose compilation for part of the test is that the amount
>of "live" data steadily increases, at least with Poplog's incremental
>compiler, which allocates procedures in heap space, along with
>everything else. During compilation of the Prolog subsystem of Poplog,
>GC time was about 3 per cent. When compiling the Common Lisp extension,
>which increases the used heap area by a megabyte, the GC time was about
>3.7 percent. Both percentages can be decreased by allocating more heap
>space, and/or by locking the heap after each file is compiled.

>> Running a small program several times is NOT at all like running a
>> big program once.  Such experiments can be *very* misleadingly optimistic.
>
>I agree, but part of my point was also that for a given program you can
>vary the percentage of GC time by setting a small or large limit for the
>heap size. By allocating a VERY large heap (system memory and swap space
>permitting) I can reduce the frequency of garbage collections and
>thereby substantially reduce the percentage of time spent on gc.

Until you outrun physical memory and start paging.  You have to trade
off increased copying for increased locality.  The interaction
between heap sizes and gc efficiency is important, but can't
be viewed in isolation.  And a simple compiler may not be a good
test for two reasons.  As I said previously, the average amount of
live intermediate data may be small, when compared to an optimizing
compiler.  But in your case, it would appear that the dominant
cost would be the repeated copying of the "results" (the already-
created procedure objects, etc.) while creating more results.  This
could get much worse, too.  An optimizing compiler may do several
times as much computation, allocating lots of space and forcing
several times as many gc's.  So the dominant cost may be increased
by several times.

>So statistics are more or less useless without full information about
>such variables.

Right.  You need to know when to bill the gc for whatever page faults
occur, rather than just measuring pause times.  There's no simple
answer, because it depends on how much memory you have, and how much
of that you had to buy precisely because of the characteristics of
the system you're measuring.

>> Running programs that do a lot of number consing is particularly bad,
>> because those objects are typically short-lived and will flatter
>> your gc efficiency.  (Unless you store pointers to the heap-allocated
>> numbers into objects in an older generation.  Then your performance
>> can get bad fast.)
>Alternatively you have very little spare heap space and therefore
>have to reclaim space FREQUENTLY, in which case the performance will
>also get bad.

Right.  A generational gc will tend to compensate somewhat for short-
lived numbers, as long as you don't store pointers to them into old
objects.  But a non-generational gc will simply copy (or scan) all
the data that much more often.  (Though serious number consing may
slow the program down so much, relative to well-compiled code, that
it flatters the gc significantly more.  Less number consing
tends to go hand-in-hand with faster code, increasing the relative
importance of the *rest* of the gc's costs... yikes!)

>> ....  In real long-running programs, the amount of live data
>> *is* typically larger, sometimes dramatically larger, and the copying
>> cost goes up significantly.

>I guess what's real, or typical, is a matter of style, which is why
>benchmarking is so difficult. I do a lot of real work in the poplog
>editor - reading and sending mail, editing, developing programs,
>examining online documentation, browsing through news bulletins, running
>a Pop-11  based troff previewer, etc. etc. and many of the files I
>regularly read into the editor for browsing and editing are a few
>hundred kilobytes or more in size. The editor uses a vector of strings
>for a file, and while you work on a line, the corresponding string will
>be discarded every time it needs to grow, and every time you move off a
>string with wasted space to edit another line. So garbage is being
>created all the time. Yet with a big heap you can work for ages without
>invoking a GC and if you don't actually trace GC's you won't notice them
>because they are so fast (except on an overloaded shared machine).

A lot of the (mostly interactive) things you mention *do* account for
a lot of people's primary usage patterns.  But for most of these
things, raw general computation speed is likely not to be the limiting 
factor either -- the continual overhead of a generational gc probably 
won't be a big deal.  You're more likely to be concerned with the speed of
graphics & disk operations, etc.  And a generational gc will allow you
a more integrated system, with utilities implemented in the language
and resident in the heap.

>> (But maybe Pop-11 doesn't have a generational gc.  If not, I guess
>> it doesn't have a lot of heap-resident system stuff like big editors,
>> browsers and compilers;  that would increase the pause times.)
>
>No, it doesn't have a generational gc, and most of the core system
>stuff, like the editor, compilers, utilities, etc are not in the
>heap. However, there are large libraries of utilities, which get
>added to the heap if used. (On VMS, SunOS, and Dynix, saved images
>can be created in which user code and data are in a shared, non
>garbage collectable, segment.)
>
>Anyhow, it appears, from what you say, that generational GCs have a
>serious performance penalty, and that it may be preferable to have a
>very fast, infrequently invoked, alternative mechanism.

Right -- if you're willing to live with a less congenial development
environment in creating the system.  (You give up writing almost
everything in the same language, with all of your debugging tools
available.)  And for a lot of people, it's still not going to work
well.  If your applications have megabytes of live data, or if
they include large, heap-resident libraries,  you're going
to run into annoying gc pauses and overhead, or thrash.  My own
preference is for a generational gc, at least in a general
development environment;  its performance may have a higher
baseline overhead, but it's more robust in the face of big hairy
programs.  You trade a little processor overhead for a lot of
locality, and save significant money on memory.

For a system that generates stand-alone programs, though, you'd
like to have the option of specifying which gc strategy it should
use, and have three choices.  A simple gc will give the fastest
performance for programs with little data live at any given
time;  a generational gc will be better for programs with
lots of live data, and a fine-grained incremental gc will
be necessary for difficult real-time applications.  (The
performance will be degraded significantly, but response
times will be reliably predictable.)
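
Concretely, the kind of knob I have in mind might look something like this
(a hypothetical interface, not any existing system's):

    /* Hypothetical build-time option for a stand-alone-program generator:
     * let the programmer pick the collector that suits the application. */
    enum gc_strategy {
        GC_SIMPLE,        /* plain stop-and-copy: fastest with little live data   */
        GC_GENERATIONAL,  /* write barrier + nursery: wins with lots of live data */
        GC_INCREMENTAL    /* fine-grained, bounded pauses for real-time work      */
    };

    struct build_options {
        enum gc_strategy gc;
        long             heap_limit_bytes;  /* keep this under physical memory */
    };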


    -- Paul

Paul R. Wilson                         
Software Systems Laboratory               lab ph.: (312) 996-9216
U. of Illin. at C. EECS Dept. (M/C 154)   wilson@bert.eecs.uic.edu
Box 4348   Chicago,IL 60680