[comp.unix.microport] optimized malloc for SV/AT: required for uEmacs and other big programs

hedrick@athos.rutgers.edu (Charles Hedrick) (02/21/88)
Over the next week or so, I'm going to post the results of my various
porting efforts.  The most significant problem I ran into was
performance problems with microEmacs.  In fact microEmacs will run, in
either small or large model, with little or no change.  But it was
incredibly slow in reading files, and sometimes claimed to be out of
memory when it wasn't.  This turned out to be due to problems in
malloc, not in microEmacs.  So the subject of this posting is a new
malloc.  There are two different problems with the one supplied with
the system:

The system version of malloc is very slow.  When you try to read a
file that is even moderately large, you end up doing a huge amount of
swapping.  I am not exactly sure why this is.  It obviously has to do
with the details of the way Microport (or ATT?) has chosen to
implement memory allocation.  On paging systems, memory is allocated
in fixed-size pages.  When you go to a new page, that page can be
allocated anywhere in memory.  At worst, you have to page out one page
from somebody else's core image, and typically not even that.  On
older systems, memory for a process had to be contiguous.  This meant
that when additional memory is to be allocated, sometimes a process
had to be swapped out in order to have free memory adjacent to your
existing memory.  It is not clear to me why the 286 should need to do
old-style allocation, but it certainly appears that it does.  Each
time you do sbrk to allocate memory, there is a good chance that this
will cause swapping activity.  In order to fix this, I have a
replacement version of sbrk.  The system sbrk goes to a new segment
every time you allocate a chunk of memory.  Mine attempts to allocate
chunks in the same segment, and goes to a new segment only when there
is no space left in the existing one.  When it goes to a new segment,
it immediately allocates all 64K in that segment.  When it leaves the
segment, it deallocates any memory in the segment that it didn't use.
This turns out to make an incredible difference to performance.  The
reason is that swapping can happen only when you go to a new segment,
rather than every time sbrk is called.  And in practice it doesn't
seem to happen very often even then.  My sbrk is intended to be
sufficiently general to be used as a complete replacement for sbrk.
However it has only been tested with malloc.

The second problem with the original malloc is that it seems to be
buggy.  It was common that after reading a file of about 50KB, when I
tried to read in another file, I would be told there is no more memory
available.  It was easy to see with ps that there was plenty of free
memory.  It is known that many systems have buggy malloc's.  It
appears from my experience that this is one of them.  I am supplying a
new malloc.  This one was taken from Unipress Emacs (with the
permission of Unipress).  It is based on an malloc done at Caltech.
Variants of this malloc are used in 4.3BSD and Gnu Emacs.  Unipress's
version has had an incredible amount of work done on portability.
It can run on just about any conceivable system.  The result is that
the file is a patchwork of conditionals, and requires 3 different
header files.  To save net bandwidth, and make the program easier to
manage, I have cut it down to exactly what is needed on SV/AT.  (Note
that this is not exactly the same as any combination of options that
Unipress supports.  I have made some minor changes myself.)

Note that my sbrk and malloc are intended only for use with large
model programs.  With the small model, the system versions should be
fine.  I am cc'ing Microport on this posting, in hopes that they
will make this available to customers on their bulletin board, and
consider including it in future releases.

#! /bin/sh
# This is a shell archive, meaning:
# 1. Remove everything above the #! /bin/sh line.
# 2. Save the resulting text in a file.
# 3. Execute the file with /bin/sh (not csh) to create the files:
#	makefile
#	malloc.c
#	sbrk.c
#	test.c
# This archive created: Sat Feb 20 17:01:19 1988
export PATH; PATH=/bin:$PATH
if test -f 'makefile'
then
	echo shar: will not over-write existing file "'makefile'"
else
cat << \SHAR_EOF > 'makefile'
libxmalloc.a:  malloc.o sbrk.o
	rm -f libxmalloc.a
	ar r libxmalloc.a malloc.o sbrk.o

malloc.o:  malloc.c
	cc -c -Ml malloc.c

sbrk.o:  sbrk.c
	cc -c -Ml sbrk.c

test: libxmalloc.a test.c
	cc -o test -Ml test.c libxmalloc.a
SHAR_EOF
fi # end of overwriting check
if test -f 'malloc.c'
then
	echo shar: will not over-write existing file "'malloc.c'"
else
cat << \SHAR_EOF > 'malloc.c'
/* 2malloc.c	nee @(#)nmalloc.c 1 (Caltech) 2/21/82
 *
 * This is a very fast storage allocator.  It allocates blocks of a small 
 * number of different sizes, and keeps free lists of each size.  Blocks
 * that don't exactly fit are passed up to the next larger size.  In this 
 * implementation, the available sizes are (2^n)-4 (or -16) bytes long.
 * This is designed for use in a program that uses vast quantities of
 * memory, but bombs when it runs out.  To make it a little better, it
 * warns the user when he starts to get near the end.
 *
 *   June 84, ACT: modified rcheck code to check the range given to malloc,
 *  		   rather than the range determined by the 2-power used.
 *   Nov 83, Mike@BRL, Added support for 4.1C/4.2 BSD.
 *   20 Jun 1983 ACT: strange hacks for Emacs
 *
 *   26-Jun-86 Ted Lemon @ New Media Graphics
 *		made Xenix/286 version use brkctl(), which allows
 *		allocation of exactly 32768 bytes.
 *   15-Feb-86	Mike Gallaher @ Unipress Software
 *		extensively reorganized, macroized system-dependent code,
 *		cleaned up morecore and malloc.
 *   15-Feb-88  C. Hedrick @ Rutgers - stripped down version specialized
 *		for Microport and with any code that might be Unipress-
 *		specific removed.
 */

#define NEWCALLOC   FALSE	    /* to use our own calloc/cfree */
#define NSLOTS	30	/* probably a bit excessive for a 16-bit machine */

/***********************************************************************\
* 								        *
* 		     		2malloc					*
* 								        *
\***********************************************************************/


/* remnants of options edited out for this Usenet posting */

#define msize	unsigned int
#define hidden	static
#define visible	
#define NULL 0L

/*******************   2malloc block definition  ********************/

/**********************  block size index definition  ***********************/

typedef char sizeindex;		/* A number identifying a class of blocks of
				 * a certain size.  Each block in class of
				 * index n has 2**(n+3) bytes.  Thus, index 0
				 * refers to 8-byte blocks, 1 has 16-byte
				 * blocks, etc.  NOTE: If you change this type,
				 * look at what you have done to union mhead!
				 */

/* PowerOfTwo performs the assignment  " result = 1 << shift ".
 */
#   define PowerOfTwo(result, shift)    result = 1 << (shift)

/* SizeFromIndex assigns to the given lvalue s the number of bytes
 * contained in a block whose index is x.
 *
 * IndexFromSize assigns to the given lvalue x the index of the block whose
 * size is s.  s must be a power of two.
 *
 * The 80286 allows blocks up to 64K.  But the appropriate power of
 * 2 would be 64K, which is in hex 0x10000.  This overflows a 16-bit
 * word.  We check for this case explicitly and use 0xffff.  It appears
 * that this is good enough.
 */
#define SizeFromIndex(s,x)      if (x == 13) s = 0xffff; else PowerOfTwo (s, (x) + 3)

#   define IndexFromSize(x,s)  if (1) { \
	register msize shiftr = (s - 1) >> 2; \
	x = 0;\
	while (shiftr >>= 1) x++; } else

/****************   2malloc block structure definition  ********************/

#define ISALLOC ((char) 0xf7)	/* magic byte that implies allocation */

/* Each block of memory allocated by a client is prefaced by a union
 * mhead, which contains overhead information used by the malloc package.
 * Each allocated block may be somewhat larger than the size that the client
 * asked for, since it is actually the smallest power-of-two-sized block that
 * will hold the requested size.  The mhead structure indicates the total
 * (power-of-two) size of the block.  For small blocks, it also indicates
 * the exact number of bytes actually requested by (and presumably in use by)
 * the client.  The mhead also contains some magic bytes whose presence is
 * used to verify that it is indeed a valid mhead, and that it has not been
 * corrupted by errant client code.
 *
 * Power-of-two blocks that have been acquired from the system but which
 * have not yet been assigned to a client, or have been freed, are kept in
 * a free list.  Each size category (power-of-two) has its own free list.
 * nextf is the array that holds the head of the free list for each category.
 * Malloc decides which power-of-two-sized block will fit the requested size,
 * and gets one out of the free list if it is not empty.  Otherwise, it invokes
 * morecore to get some memory from the system and place it in the free list.
 *
 * If a block has not been allocated to a client, then the mhead contains
 * only a pointer to the next block.  This assumes that the client won't stomp
 * on blocks that have not been explicitly allocated, which is not really
 * a good assumption -- usually when the mhead is corrupted, it is by code
 * that overshot some block that *was* allocated, which happened to lie before
 * the victimized block.  Note that the mhead of unallocated blocks does not
 * have any indication of how big the block is -- that is implied by the
 * free list in which it appears.
 *
 * If range checking is not turned on, the mhead has only a flag indicating
 * whether memory is allocated, an index in nextf[], and a size field; to
 * realloc() memory we copy either size bytes or SizeFromIndex(index) bytes
 * depending on whether the former can hold the exact size (given the value of
 * 'index').  If range checking is on, we always need to know how much space
 * is allocated, so the 'size' field is never used.  See the BlockSize macro.
 *
 * NOTE: If you change mhead, be careful to keep its size a multiple of 4!
 * We explicitly give type unsigned long to nbytes, even though we should
 * strictly use msize, since msize might be 16-bits on some machines (8086)
 * if LMALLOC is FALSE.  We also retain the size member even if RCHECK is on
 * even though it isn't used in that case, just to preserve the alignment.
 */
union mhead {
    struct {
	char            alloc;		/* ISALLOC => is allocated */
	sizeindex       index;		/* index in nextf[] */
	unsigned short  size;		/* size, if < 0x10000 */
    }               mh_m;
    union mhead    *mh_nextfree;
};

#define mh_alloc mh_m.alloc
#define mh_index mh_m.index
#define mh_size  mh_m.size

/* BlockSize assigns to the given lvalue sz the number of bytes actually
 * used in the block whose mhead is given.  For blocks less than 64k,
 * this will be the exact number of bytes allocated by the client.  For
 * larger blocks, the best we can do is the power-of-two size of the entire
 * block, which will generally be somewhat larger than the number of bytes
 * actually used by the client.  If RCHECK is on, then we can always tell
 * the exact number of allocated bytes.
 */

    /* If the size of the block fits in 16 bits (which we can tell roughly
     * from its index), we will find its exact size in mh_size.  If it is
     * bigger, then all we know is the size of the power-of-two-sized block 
     * that contains it; we can subtract the size of the overhead (mhead) in
     * that case, though, since it's not available for use by the client.
     */
#   define BlockSize(sz, mhp) 	if ((mhp)->mh_index >= 13) { \
				    SizeFromIndex (sz, (mhp)->mh_index); \
				    sz -= sizeof (union mhead); \
				} else sz = (mhp)->mh_size

/***********************  free list definition  ***************************/

/* nextf[i] points at the next free block of size 2^(i+3).  The smallest
 * allocatable block is 8 bytes.  The overhead information will go in the
 * first sizeof(union mhead) bytes of the block (4 if range-checking is off
 * on most machines).  The pointer returned by malloc will point to the byte
 * immediately after that.
 */

hidden union mhead *nextf[30];

/***********************************************************************\
* 								        *
* 			   malloc and its staff				*
* 								        *
\***********************************************************************/

/***************  system-dependent allocation macros  ********************/

/*  GetMem	allocates the amount of memory given by siz, assigning the
		address of the beginning of the block to cp.  If that
		amount cannot be allocated for some reason, executes the
		statement given by fail.
 */

    /* Use sbrk to get memory from the system.  sbrk normally tacks a
     * block of contiguous memory onto the end of the existing bss
     * (uninitialized data) segment.  This is not the case in Xenix
     * or Uport Unix.
     * Fortunately we don't rely on this property in this package...
     * GRIPE: sbrk should have been designed to return NULL on failure - argh!
     * Note by the way that this file should be loaded with our hacked-up
     * version of sbrk.
     */

    extern char *sbrk();
#   define GetMem(cp,siz,fail) \
	if ((long) (cp = sbrk (siz)) == -1L) fail

/* GETMEMZEROS is TRUE if the GetMem function zeros the newly allocated
 * memory.  For now we just assume that it doesn't.
 */
#define GETMEMZEROS FALSE

/****************************  morecore  *****************************/

/* morecore is what gets large chunks of memory from the operating system.
 * malloc and realloc then carve their blocks from that large chunk.  This
 * asks the system for a chunk of memory, into which it constructs one or
 * more blocks of the power-of-two size dictated by the given index (see
 * SizeFromIndex, above).  This sets the pointer in nextf to point at the
 * newly allocated set of blocks.  This will never allocate less than 2k:
 * if the given index specifies a block size less than 2k, it allocates 2k
 * and divides it up into as many blocks of the given size as will fit in 2k.
 * For blocks larger than 2k, allocates enough memory for one such block.
 */
/* We could avoid the overhead of asking the system for more memory
 * by grabbing a larger block out of its free list (assuming there
 * is such a block) and splitting it up into several blocks of the
 * desired size.  If there are often lots of unused large blocks lying
 * around, that scheme might win.
 */

hidden
morecore (nux)
register sizeindex    nux;		/* size index to get more of  */
{
    register char	*cp = (char *) NULL;
    register int	 nblks;
    register msize	 siz;

    /* Try to find a larger block that has already been acquired from the
     * system but has not yet been allocated to a client by looking at the
     * free list entries with higher indices.  (We assume that the list for
     * the index we're attempting to fill is empty, or we wouldn't have been
     * called by malloc.)  If we find one, steal it from the free list we found
     * it in, and leave cp pointing at it.  Set nblks to the number of blocks
     * of the desired size that it contains.  Below, we'll split it up and
     * place it in the free list for the desired index just as if we had gotten
     * it from the system.
     *
     * NOTE: We may not want to split up larger blocks in some cases, since
     * it might lead to lots of tiny blocks that won't get used.  Perhaps we
     * only want to split a larger block if it won't turn into more than
     * 8 or 16 smaller blocks (i.e., exit loop if ix < nux + 3).
     */
    {
	register sizeindex ix;

	for (ix = nux+1; ix < NSLOTS; ix++)
	    if (cp = (char *) nextf[ix]) {
		/* we've found a free block, steal it from its list */
		nextf[ix] = nextf[ix]->mh_nextfree;
		PowerOfTwo (nblks, ix - nux);
		break;
	    }
    }

    /* if we couldn't find larger block to steal, cp is NULL */

    if (cp == (char *) NULL) {

	/* Compute from the size index the additional number of bytes we want
	 * to get from the system.  Adjust it if necessary so that we get at
	 * least 2k, and set nblks to the number of blocks of the desired size
	 * that fit in that amount of memory.  Thus, for blocks larger than 2k,
	 * we allocate only enough for one, but for blocks smaller than 2k, we
	 * allocate enough to hold several.
	 */
	{
	    register sizeindex    szx = nux;

	    nblks = 1;
	    if (szx < 8) {
		szx = 8;
		PowerOfTwo (nblks,  8 - nux);
	    }

	    SizeFromIndex (siz, szx);
	    GetMem (cp, siz, return);
	}

	/* Make sure the block begins on an eight-byte boundary.  (It shouldn't
	 * happen otherwise, but just in case...)
	 */
	if ((int) cp & 7) {
	    cp = (char *) (((long) cp + 8) &~7);
	    nblks--;
	}
    }

    /* Now cp points to a raw chunk of memory, and nblks holds the number
     * of power-of-two blocks of the given index that will fit in that chunk.
     * Place the newly allocated block (or the first of the group) onto
     * the head of the free list (which must have already been empty or we
     * wouldn't have been called).  Then construct power-of-two block(s) out
     * of the raw allocated chunk by placing mhead structures at the beginning
     * of each block.  Since the blocks are initially free (unallocated to
     * clients), the mhead contains only the pointer to the next free block.
     */
    nextf[nux] = (union mhead *) cp;

    SizeFromIndex (siz, nux);		/* siz = number of bytes per block */
    while (--nblks > 0) {
	((union mhead *) cp)->mh_nextfree = (union mhead *) (cp + siz);
	cp += siz;
    }

#   if !GETMEMZEROS
    /* Zero the mh_nextfree pointer in the last block in the list.
     * If GetMem zeros the memory it allocates, this is already done for us.
     */
    ((union mhead *) cp)->mh_nextfree = (union mhead *) NULL;
#   endif
}

/******************************  malloc  ******************************/

visible char   *
malloc (n)			/* get a block */
msize     n;
{
    register union mhead   *p;
    register msize	    nbytes;	/* size in bytes of the power-of-two
					   sized block we need */
    register sizeindex	    szx;	/* size index (into nextf) of the size
					   of block we need */

    /* Figure out how many bytes are required, rounding up to the nearest
     * multiple of 4.  Then set szx to the index that corresponds to the
     * smallest power-of-two-sized block that will hold that number of
     * bytes.  This is the index into nextf of the area from which the
     * block is to be taken.
     */
    nbytes = (n + sizeof *p + 3) & ~3;

    IndexFromSize (szx, nbytes);	/* szx = index for nbytes */

    /* If there are no blocks of the appropriate size, go get some. */
    if (nextf[szx] == (union mhead *) NULL)
	morecore (szx);

    /* Get one block off the list, and set the new list head */
    if ((p = nextf[szx]) == (union mhead *) NULL)
	return (NULL);
    nextf[szx] = p->mh_nextfree;

    /* Fill in the info, and if range checking, set up the magic numbers */
    p->mh_alloc = ISALLOC;
    p->mh_index = szx;
    p->mh_size = n;

    /* return the address of the byte just past the union mhead. */
    return ((char *) (p + 1));
}

/*******************************  free  ******************************/

visible
free (mem)
char           *mem;
{
    register union mhead *p;

    {
	register char  *ap = mem;

	p = (union mhead *) ap - 1;
    }
    /* Link the now free block into the free list maintained for blocks in
     * its size category.
     */
    {
	register sizeindex    szx = p->mh_index;

	p->mh_nextfree = nextf[szx];
	nextf[szx] = p;

    }
}

/***************************  realloc  **********************************/

/* One potential optimization here would be to notice blocks that are not
   actually changing "real sizes".  --ACT */
visible char   *
realloc (mem, n)
char           *mem;
register msize  n;
{
    register union mhead *p;
    register msize tocopy;

    if ((p = (union mhead *) mem) == (union mhead *) NULL)
	return malloc (n);
    p--;

    BlockSize (tocopy, p);		/* set tocopy to our best guess at the
					   number of bytes used by client.
					   If RCHECK, this will be exact;
					   otherwise it will be somewhat larger
					   than the client actually uses. */
    if (n < tocopy)
	tocopy = n;
    {
	register char  *new;

	if ((new = malloc (n)) == (char *) NULL)
	    return (NULL);
	memcpy (new, mem, tocopy);
	free (mem);
	return (new);
    }
}

/***********************************************************************\
* 								        *
* 			      New calloc/cfree			        *
* 								        *
\***********************************************************************/


#if	NEWCALLOC
/* The standard calloc clears memory by zeroing some number of integers.
   However, it doesn't round up to integers before calling malloc!  The
   following code, while somewhat sloppy, works.  May not work for XENIX. */
char *
calloc (num, size)
msize num, size;
{
    register msize  n;
    register int   *p;
    char   *rv;

    n = num * size;
    n += sizeof (int) / sizeof (char) - 1;
    n /= sizeof (int) / sizeof (char);
    if ((rv = malloc (n * sizeof (int))) == (char *) NULL)
	return (NULL);
    p = (int *) rv;
    while (--n >= 0)
	*p++ = 0;
    return (rv);
}

cfree (p)
char *p;
{
    (void) free (p);
}
#endif	/* NEWCALLOC */

SHAR_EOF
fi # end of overwriting check
if test -f 'sbrk.c'
then
	echo shar: will not over-write existing file "'sbrk.c'"
else
cat << \SHAR_EOF > 'sbrk.c'
/*
 * This is a replacement for sbrk that attempts to assign the requested
 * memory in the same segment.
 */

unsigned long real_sbrk();

unsigned long sbrk_next = 1L;
unsigned long sbrk_limit = 0L;

unsigned long sbrk (size)
  unsigned int size;
{
unsigned long retval;

if (sbrk_next + size > sbrk_limit) {
    if (sbrk_limit)  /* except during initialization */
	brk(sbrk_next);   /* releases unused memory in this segment */
    sbrk_next = real_sbrk(0);  /* go to next segment */
    sbrk_limit = sbrk_next + 0xffff;  /* take that whole segment */
    if (brk(sbrk_limit) != 0)
	return(-1L);
}
retval = sbrk_next;
sbrk_next += size;

return(retval);
}    

asm("	.text");
asm("real_sbrk:");
asm("	push %bp");
asm("	mov %sp,%bp");
asm("	push 6(%bp)");
asm("	push $0x1");
asm("	push $0x11");
asm("	lcall 0x2c8,0");
asm("	jnb noerror");
asm("	jmp _cerror");
asm("noerror:");
asm("	leave");
asm("	ret");

asm("	.text");
asm("   .globl brk");
asm("brk:");
asm("	push %bp");
asm("	mov %sp,%bp");
asm("	push 8(%bp)");
asm("	push 6(%bp)");
asm("	push $0x0");
asm("	push $0x11");
asm("	lcall 0x2c8,0");
asm("	jnb noerror2");
asm("	jmp _cerror");
asm("noerror2:");
asm("	xor %ax,%ax");
asm("	leave");
asm("	ret");

SHAR_EOF
fi # end of overwriting check
if test -f 'test.c'
then
	echo shar: will not over-write existing file "'test.c'"
else
cat << \SHAR_EOF > 'test.c'
extern char *malloc();
extern unsigned long sbrk();

main () {
  char *x,*y;
  unsigned long l;

x = malloc(0xf000);
y = malloc(23);
printf("%lx %lx\n", (unsigned long)x, (unsigned long)y);
l = sbrk(0);
printf("%lx\n", l);
strcpy(x,"foo");
strcpy(x+0xef00,"foo");
x = malloc(0xf000);
y = malloc(23);
printf("%lx\n", l);
strcpy(y,"foo");
strcpy(y+18,"foo");
printf("%lx %lx\n", (unsigned long)x, (unsigned long)y);
}

SHAR_EOF
fi # end of overwriting check
#	End of shell archive
exit 0