[comp.unix.cray] Malloc

refson@castle.ed.ac.uk (Keith Refson) (08/24/90)

Does anybody have or know of a FAST replacement for malloc() which
runs on Crays under UNICOS?  I have a program in which memory
allocation appears to take an inordinate proportion of the execution
time, and need to speed it up.   I would appreciate any advice or
hints from anybody with experience in memory allocation under UNICOS.

Thanks in advance.
----------------------------------------------------------------------------
|	Keith   Refson			  |ROYAL MAIL:                     |
|-----------------------------------------|   Department of Earth Sciences |
| JANET   : keith@uk.ac.ox.earth          |   Parks Road                   |
| INTERNET: keith@earth.ox.ac.uk          |   Oxford OX1 3PR               |
| BITNET  : keith%uk.ac.ox.earth@ukacrl   |   UK                           |
| UUCP    : K.Refson@ed.uucp              |PHONE   : +44 865 272026/272016 |
|         : keith%uk.ac.ox.earth@ukc.uucp |FAX     : +44 865 272072        |
----------------------------------------------------------------------------

ttw@lanl.gov (Tony Warnock) (08/28/90)

     Generally, system actions such as MALLOC() cannot be speeded up.
     These ususally require at least a trip through the operating
     system. If the current location of the program in memory does not
     have enough room for the program to expand, the job must be swapped
     to disk and then put back into a larger slot.

     The only good way to remove time from MALLOC() is not to call it
     as often. You will have to re-write your algorithm not to do
     so much memory management.

refson@castle.ed.ac.uk (Keith Refson) (08/29/90)

I would like to thank the many people who responded with helpful
suggestions to my query about speeding up memory allocation.

Unfortunately since I did not give details of the my program they did
not exactly address the problem.  However with some of the advice I
received I have made considerable progress which I thought it might be
useful to pass on.  

My program is a molecular dynamics simulation code - a repetitive
procedure which runs for several tens of thousands of iterations each
of which may take seconds of CPU and each of which has an identical
call sequence.  It uses dynamic memory as run-time dimensioned arrays
in a stack-like manner; that is memory is allocated on function entry
and freed upon exit.  The upshot of this is that the 100000th
iteration should use no more memory than the first and the heap does
not need to grow as the run proceeds.  Furthermore it is allocating a
few big chunks rather than many small ones.  In my test case run of 10
iterations which took 1.3s of CPU calloc() was called only 1154 times
and yet was using about 10% of the CPU.

Now comes the curious part.  To see what was actually going on I
replaced the calloc() call with malloc() and memset() (in my interface
function which handles all memory allocation).  The time spent in
memory allocation (malloc()+memset()) decreased by a factor of 3
compared with calloc()!   Now surely calloc() should be entirely
equivalent to calling malloc()+memset().  Is there some gross
inefficiency in Cray's implementation of calloc() ?  I am using C
version 5.0, but I don't know the library version.

Working with the malloc()+memset() version I then tried out some of
the suggestions to see if I could get a further improvement.  By far
the most useful was that of Steve Larson of CRI which was to use the
loader directives "HEAP=50000+50000;STACK=10000+10000" to set the heap
and stack's initial size and increment to values more appropriate to
my code.  This not only reduced the amount of time spent in malloc()
by a further factor of three but reduced the stack management overhead
(I presume - times of subroutines $STKOFEN $STKUFEX $STKCR%) from 6%
of total execution time to nothing.

I still don't understand though why a code which performs so few
malloc() calls should be affected by this.  Using the procstat and
procrpt commands (thank you again Steve) showed that with the default
heap size there were 18 calls to the memory processor irrespective of
run length (10 or 100 iterations).  Nevertheless the memory processor
used 0.85s for the hundred step run compared with 0.007s for the 10
step run (out of 10s and 1s respectively).  Setting the heap and stack
as above reduced the number of calls to 1 and the time spent to
effectively zero.  Doeas anybody out there know?, Steve?

Kent Koeniger suggested that doing one enormous malloc() call at the
beginning and immediately freeing it would help by reducing the
system-call overhead.   I think that the loader directives achieve the
same effect more elegantly.

Others (Rod Meyer, Ethan Miller, Ted Stockwell) gave suggestions which
amounted to re-implementing malloc(), tailored to my application.
Some (please forgive me if I slander you unduly) appeared to think
that malloc() was an expensive system call to be avoided at all costs,
rather than a library function.  I thought that malloc() was an
interface to the real system call sbrk() designed precisely to avoid
excessive system overhead.  Surely an efficient malloc() should be
part of every C library.

To summarize.  There is a curious inefficiency in calloc().  Use
malloc() and memset() instead.  Set the heap and stack initial sizes
and increments to suitable values.  I imagine there are cases where
there could be a much bigger gain than mine.  

Thanks to all those who responded.
----------------------------------------------------------------------------
|	Keith   Refson			  |ROYAL MAIL:                     |
|-----------------------------------------|   Department of Earth Sciences |
| JANET   : keith@uk.ac.ox.earth          |   Parks Road                   |
| INTERNET: keith@earth.ox.ac.uk          |   Oxford OX1 3PR               |
| BITNET  : keith%uk.ac.ox.earth@ukacrl   |   UK                           |
| UUCP    : K.Refson@ed.uucp              |PHONE   : +44 865 272026/272016 |
|         : keith%uk.ac.ox.earth@ukc.uucp |FAX     : +44 865 272072        |
----------------------------------------------------------------------------

mpp@uf.msc.umn.edu (Mike Pritchard) (08/30/90)

refson@castle.ed.ac.uk (Keith Refson) writes:

>Now comes the curious part.  To see what was actually going on I
>replaced the calloc() call with malloc() and memset() (in my interface
>function which handles all memory allocation).  The time spent in
>memory allocation (malloc()+memset()) decreased by a factor of 3
>compared with calloc()!   Now surely calloc() should be entirely
>equivalent to calling malloc()+memset().  Is there some gross
>inefficiency in Cray's implementation of calloc() ?  I am using C
>version 5.0, but I don't know the library version.

This is a problem with the X-MP calloc().  It clears the memory
out in a big for loop instead of using memset().  Calloc() is implemented
as a malloc() and memset() call on the Cray-2.  I just filed a problem
report on this, so Cray should fix this in the next release.

-Mike
--
Mike Pritchard
Internet: mpp@msc.umn.edu  
"Go that way.  Really fast.  If something gets in your way -- turn."

fouts@bozeman.bozeman.ingr.UUCP (Martin Fouts) (09/10/90)

>>>>> On 28 Aug 90 16:51:52 GMT, ttw@lanl.gov (Tony Warnock) said:

Tony>      Generally, system actions such as MALLOC() cannot be speeded up.
Tony>      These ususally require at least a trip through the operating
Tony>      system. If the current location of the program in memory does not
Tony>      have enough room for the program to expand, the job must be swapped
Tony>      to disk and then put back into a larger slot.

Malloc is not strictly a system operation, and doesn't always require
a trip through the operating system.  The underlying system call is
brk/sbrk, which adds memory to (subtracts memory from) the end of the
data section of the program's address space.  malloc/free/realloc
manage that memory once it has been allocated.

Malloc manages the pool of memory which has been allocated by the os.
When more memory is requested, malloc tries to find it from a pool
already in your address space.  If it can't find the free memory
within the address space it makes the system call to acquire more.
When free is called the memory is reallocated to the free pool.  The
system call to shrink the size of your application is not usually called.

Tony>      The only good way to remove time from MALLOC() is not to call it
Tony>      as often. You will have to re-write your algorithm not to do
Tony>      so much memory management.

There are two other ways.  One is to call it with larger chunks of
memory in requests, often requiring memory management modifications to
algorithms.  The other is to replace it with a variation which uses
structure imbedded in your algorithm to produce a more efficient
memory manager.

If the original programmer is only calling malloc and never calling
free or realloc, it should be fairly easy to find a faster
implementation.  (As a quick hack, the program might first malloc and
then free a huge chunk of memory -- depending on which mallocator is
currently running on your system, this will greatly reduce the number
of system calls made, but might increase the amount of time spent
swapping.)

--
Martin Fouts

 UUCP:  ...!pyramid!garth!fouts (or) uunet!ingr!apd!fouts
 ARPA:  apd!fouts@ingr.com
PHONE:  (415) 852-2310            FAX:  (415) 856-9224
 MAIL:  2400 Geng Road, Palo Alto, CA, 94303

Moving to Montana;  Goin' to be a Dental Floss Tycoon.
  -  Frank Zappa

elm@sprite.berkeley.edu (ethan miller) (09/11/90)

In article <729@garth.UUCP> fouts@bozeman.bozeman.ingr.UUCP (Martin Fouts) writes:
%>>>>> On 28 Aug 90 16:51:52 GMT, ttw@lanl.gov (Tony Warnock) said:
%
%Tony>      Generally, system actions such as MALLOC() cannot be speeded up.
%Tony>      These ususally require at least a trip through the operating
%Tony>      system. If the current location of the program in memory does not
%Tony>      have enough room for the program to expand, the job must be swapped
%Tony>      to disk and then put back into a larger slot.
%
%Malloc is not strictly a system operation, and doesn't always require
%a trip through the operating system.  The underlying system call is
%brk/sbrk, which adds memory to (subtracts memory from) the end of the
%data section of the program's address space.  malloc/free/realloc
%manage that memory once it has been allocated.

On the Y-MP, at least, there are *always* two traps to the operating
system.  The malloc() call must use semaphores to guard the allocation
routine, and manipulating them requires interrupts to be off so they
can be set and cleared atomically.  Because the interrupt mask can't
be changed at user level, there are two (undoubtedly very short)
traps to the system (it may be four--two to set and two to clear, but
I don't remember, though I can check).

This is in addition to any calls to brk() or sbrk().

ethan
=================================
ethan miller--cs grad student   elm@sprite.berkeley.edu
#include <std/disclaimer.h>     {...}!ucbvax!sprite!elm
Witty signature line condemned due to major quake damage.