[comp.lang.perl] Reducing data space of a perl program

don@zip.eecs.umich.edu (Don Winsor) (01/13/90)

We have been working with a large and complicated perl program that
uses lots of memory (the resident set is up to 8 megabytes).  To try
to optimize it, we would like to find out which data structures are
using the most space.  Has anyone done this?  Does anyone know of a
good way to determine how much storage a perl array or associative
array is consuming?  The ideas that come to mind are either to
write some kind of perl utility to traverse arrays and total their
size, or to compile perl with -g and use dbx to dig through its
memory.  Any advice or suggestions would be appreciated.

Don Winsor
Department of Electrical Engineering and Computer Science
University of Michigan, Ann Arbor
don@zip.eecs.umich.edu

lwall@jpl-devvax.JPL.NASA.GOV (Larry Wall) (01/13/90)

In article <1220@zip.eecs.umich.edu> don@zip.eecs.umich.edu (Don Winsor) writes:
: We have been working with a large and complicated perl program that
: uses lots of memory (the resident set is up to 8 megabytes).  To try
: to optimize it, we would like to find out which data structures are
: using the most space.  Has anyone done this?  Does anyone know of a
: good way to determine how much storage a perl array or associative
: array is consuming?  The ideas that come to mind are either to
: write some kind of perl utility to traverse arrays and total their
: size, or to compile perl with -g and use dbx to dig through its
: memory.  Any advice or suggestions would be appreciated.

Which way you want to do it depends on how accurate you want to be.

    The overhead for a normal array @foo (assuming non-sparse, unshrunk):
	sizeof(STAB)+sizeof(STBP)+sizeof(ARRAY)+
	    elems * sizeof(STR*) * Aslop * Mslop +
	    elems * sizeof(STR) * Mslop +
	    sum over $i = 0 .. $#foo of maxlen($foo[$i]) * Mslop

	  where
	    elems is the largest number of elements the array has held,
	    maxlen($foo[$i]) is the length of the longest string value
		ever stored there plus a few bytes, or a minimum of 24 if
		the element was assigned a floating point value and was
		subsequently referenced as a string,
	    Aslop is the amount wasted at the end of the array of pointers
		to STR, in the range of (1.00 .. 1.20) (unless perl detects
		that you are shifting off the front and pushing onto the
		end, in which case Aslop gets pushed up to about 10.0!!!),
	    Mslop is the amount wasted at the end of all malloced strings,
		in the range of (1.00 .. 2.00) on BSD systems.
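
For instance, here's a rough back-of-the-envelope rendering of that
formula in perl itself.  The struct sizes and slop factors below are
just guesses for a typical 32-bit BSD machine, not measured values,
so plug in your own:

    # Guessed constants -- adjust for your machine and your malloc.
    $sizeof_stab = 40; $sizeof_stbp = 40; $sizeof_array = 20;
    $sizeof_strptr = 4; $sizeof_str = 24;
    $aslop = 1.1;                   # assumes no shift/push pathology
    $mslop = 1.5;                   # statistical average, not exact

    $bytes = $sizeof_stab + $sizeof_stbp + $sizeof_array;
    $elems = $#foo + 1;             # assumes the array never held more
    $bytes += $elems * $sizeof_strptr * $aslop * $mslop;
    $bytes += $elems * $sizeof_str * $mslop;
    for ($i = 0; $i <= $#foo; $i++) {
        $len = length($foo[$i]) + 3;    # current length plus a few bytes
                                        # stands in for maxlen
        $len = 24 if $len < 24;         # pessimistic floor; really only
                                        # applies to float-then-string use
        $bytes += $len * $mslop;
    }
    printf "estimated size of \@foo: %d bytes\n", $bytes;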

From within the language it's possible (knowing how the elements have been
used) to estimate maxlen for each string, but not possible to determine
it precisely.  It's not possible to determine the precise value of Aslop
unless you know the history of which subscripts were created when.  It's
possible to estimate individual Mslop's if you know the Aslop of the array
and the maxlen's of the individual strings, since it's a matter of rounding
up to the next power of 2 (on BSD).  Statistically, your average Mslop will
be about 1.5, but it would be easy to construct pathological cases.
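
If you want to fold that rounding into the estimate, a little helper
like this will do; the 8-byte minimum chunk is an assumption about your
malloc, not a fact about all of them:

    # Guess the malloced size of an n-byte string by rounding up to
    # the next power of two, BSD-style; 8 is an assumed minimum chunk.
    sub malloc_size {
        local($n) = @_;
        local($size) = 8;
        $size *= 2 while $size < $n;
        $size;
    }
    # e.g. malloc_size(40) is 64, so that string's Mslop is 64/40 = 1.6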

The situation for associative arrays is similar.  The differences are that
it's more difficult to tell how many strings are actually in use, and there's
an extra structure for each element to hold the key pointer and hash value,
and the extra string for the key itself.

    The overhead for an associative array %bar:
	sizeof(STAB)+sizeof(STBP)+sizeof(HASH)+
	    tblsize * sizeof(STR*) * Mslop +
	    elems * sizeof(HENT) * Mslop +
	    elems * sizeof(STR) * Mslop +

	    sum over the keys of maxlen($bar{$key}) * Mslop +
	    sum over the keys of maxlen($key) * Mslop

	  where
	    tblsize is the size of the hash table,
	    elems is the number of elements in the array,
	    and the other fields are as before.

The tblsize can be determined by saying
    ($buckets_used, $tblsize) = split(/\//, %bar);

elems will be a little larger than $buckets_used, because most buckets will
only contain one element, but to get an exact figure you'd have to say
    @foo = values(%bar);
    $elems = $#foo + 1;
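
Given $tblsize and $elems, you can finish out the estimate the same way
as for @foo, with the same guessed constants as in the array sketch
above; sizeof(HASH) and sizeof(HENT) below are likewise guesses, not
gospel:

    # More guessed constants -- adjust for your machine.
    $sizeof_hash = 20; $sizeof_hent = 16;
    $bytes = $sizeof_stab + $sizeof_stbp + $sizeof_hash;
    $bytes += $tblsize * $sizeof_strptr * $mslop;
    $bytes += $elems * ($sizeof_hent + $sizeof_str) * $mslop;
    while (($key, $val) = each(%bar)) {
        # a few bytes of string overhead on each key and value, guessed
        $bytes += (length($key) + length($val) + 6) * $mslop;
    }
    printf "estimated size of %%bar: %d bytes\n", $bytes;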

The Mslop for the hash table size should be fairly constant since the
hash tables are always a power of two in size.  If you have a memory
allocator that allocates in 2^n-k chunks, where k is greater than 0,
you could waste 2^n-2k space for each table of size 2^n.

As you can see, determining memory usage from within a script is an
exercise in estimation.

As for using dbx, I wouldn't want to try it.  Just locating the symbol
table entry for an array is a non-trivial operation.  One tires of typing...
Now, if you ran it under Saber-C, it'd be easier to follow all those
pointers around.  (Wish I had Saber-C, sigh...)

Your best bet might be to try to instrument the perl code.  There's already
instrumentation code in there when you say -DLEAKTEST.  Unfortunately,
it just counts the number of allocations, not their sizes.  But it would
be fairly easy to add in the sizes.  Then you could find out the size of
various things that were allocated in various places.  (Unfortunately,
it wouldn't give you a collective figure for the size of an array.)

But you could write a routine to go through and find that out.  If you
knew what your malloc headers looked like, you could probably even account
for Mslop.  One could do something silly like making the defined operator
return the size (used and allocated) of an object if, say, -D64 was given.
One could even define a new operator for it, if it turned out to be
generally useful.

Well, that's enough for the moment.

Larry