[comp.unix.wizards] signal 10 in malloc call???

toma@killer.UUCP (Tom Armistead) (05/05/88)

I've been having an *interesting* thing happen under System V Release 3.1
Unix on a 3B2/310, and was wondering if I could get any insight from
anyone as to the problem (or its cause)???

I am getting signal 10 (bus error) in the middle of a malloc call.
It doesn't happen under any regular set of circumstances as far as I can
tell. From sdb I can tell that everything was set up OK (but how can
you mess up a malloc call?). One case in mind was:

	structptr = (struct st *)malloc( sizeof( struct st ) );

structptr was NULL before the call and was still NULL after the crash.
The size of the struct is 230 bytes. We don't have kernel source so I am
not able to go into malloc except by disassembly. The instruction
the thing dies on is a BITW (I think?) maybe something like:
	BITW	0(%r7),1

The system has 2 meg of memory and at the time of the crash is
running 20 processes and is very heavily loaded. This has happened
in about half of those 20 processes at random times in random
places in the code (C, of course). All the processes use malloc,
realloc and free a WHOLE lot.

Call me crazy, but shouldn't malloc just return me an error if there 
are problems???

Any ideas??? This thing is about to drive me to COBOL!!!

Any help would be GREATLY appreciated!

Thanks,
Tom
----
UUCP: ...!ihnp4!killer!toma

gwyn@brl-smoke.ARPA (Doug Gwyn) (05/05/88)

In article <3989@killer.UUCP> toma@killer.UUCP (Tom Armistead) writes:
>Call me crazy, but shouldn't malloc just return me an error if there 
>are problems???

Since I haven't finished writing /dev/telepathy, I can't remotely
debug your application, but in general problems like the one you
report result from a bug in the application code.  malloc()
maintains a linked list of storage (heap) blocks with "busy" bit
indicators attached to the block headers (in addition to the links).
If your application scribbles on one of these headers, or if it even
frees an already-free block, then a later (perhaps MUCH later)
invocation of malloc() can run amok.

Source code licensees can recompile libc/gen/malloc.c to enable
a slew of debugging checks that often detect such heap abuse early.
This really should be provided as a separate /usr/lib/*.[oa] for
binary customers to use, but it probably hasn't been.  One trick
you can try is to provide your own simple memory allocator called
MyAlloc()/MyFree() that performs stringent consistency checks but
uses malloc() to obtain an initial large chunk of heap space to be
subdivided and reallocated by your allocator.  Then recompile your
application with "-Dmalloc=MyAlloc -Dfree=MyFree" in CFLAGS in your
Makefile, link it with your debugging-allocator object, and see
what turns up.  Note that the C library will continue to use the
real malloc(), but presumably it knows what it's doing and will
use it correctly.  (The only way I think this checking could fail
is if the malloc arena corruption is due to abusing some other C
library routine.)  Good luck.
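
A minimal sketch of the kind of checking allocator described above, not a
tested replacement. Everything in it is an assumption made for illustration:
the names MyCheck, ARENA_SIZE and the *_MAGIC values, the fixed 64K arena,
and the decision never to reuse freed blocks (a real one would subdivide and
reallocate them). The file holding it must be compiled WITHOUT the
-Dmalloc=MyAlloc flag, since it calls the real malloc() once.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define ARENA_SIZE  ((size_t)64 * 1024) /* assumed size of the one big chunk */
#define HEAD_MAGIC  0x48454144UL        /* "HEAD" */
#define TAIL_MAGIC  0x5441494cUL        /* "TAIL" */
#define FREE_MAGIC  0x46524545UL        /* "FREE" */

/* Header in front of every block.  The union pads the header so that
   the caller's data that follows it stays suitably aligned. */
union header {
    struct { unsigned long magic; size_t size; } h;
    long double align_ld;
    void *align_p;
};

static char *arena;         /* obtained once from the real malloc() */
static size_t arena_used;   /* always a multiple of sizeof(union header) */

static size_t round_up(size_t n)
{
    size_t a = sizeof(union header);
    return (n + a - 1) / a * a;
}

/* Returns 0 if the block at p looks intact, otherwise a nonzero code. */
int MyCheck(void *p)
{
    union header *hp;
    unsigned long tail;

    if (p == NULL || arena == NULL)
        return 1;
    if ((char *)p < arena + sizeof *hp || (char *)p >= arena + arena_used)
        return 3;                   /* not a pointer we handed out */
    hp = (union header *)p - 1;
    if (hp->h.magic == FREE_MAGIC)
        return 2;                   /* already freed */
    if (hp->h.magic != HEAD_MAGIC)
        return 3;                   /* header scribbled on */
    memcpy(&tail, (char *)p + hp->h.size, sizeof tail);
    if (tail != TAIL_MAGIC)
        return 4;                   /* caller wrote past the end */
    return 0;
}

void *MyAlloc(size_t size)
{
    union header *hp;
    unsigned long tail = TAIL_MAGIC;
    size_t need;

    if (size > ARENA_SIZE)
        return NULL;
    if (arena == NULL && (arena = malloc(ARENA_SIZE)) == NULL)
        return NULL;
    need = sizeof *hp + round_up(size + sizeof tail);
    if (need > ARENA_SIZE - arena_used)
        return NULL;                /* arena exhausted */
    hp = (union header *)(arena + arena_used);
    arena_used += need;
    hp->h.magic = HEAD_MAGIC;
    hp->h.size = size;
    /* The guard sits right after the caller's bytes, possibly unaligned. */
    memcpy((char *)(hp + 1) + size, &tail, sizeof tail);
    return hp + 1;
}

void MyFree(void *p)
{
    int why = MyCheck(p);

    if (why != 0) {
        fprintf(stderr, "MyFree: heap abuse detected (code %d)\n", why);
        abort();
    }
    ((union header *)p - 1)->h.magic = FREE_MAGIC;  /* space is not reused */
}
```

Because MyFree() aborts at the moment a damaged block is released, the core
dump lands much closer to the offending code than a crash inside a later
malloc() would. Calling MyCheck(p) by hand on a suspect pointer narrows it
further.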

friedl@vsi.UUCP (Stephen J. Friedl) (05/06/88)

In article <3989@killer.UUCP>, toma@killer.UUCP (Tom Armistead) writes:
> I am getting signal 10 (bus error) in the middle of a malloc call.
> It doesn't happen under any regular set of circumstances as far as I can
> tell. From sdb I can tell that everything was set up OK (but how can
> you mess up a malloc call?)

It is almost certainly a corruption of malloc's arena pointers by
a program bug.  Malloc keeps its blocks in a linked list, and the
word just before its return to you points to the *next* area:

		+---------+
		| pointer |--->-\
		+---------+     |
malloc return-->|         |     |
		|   Your  |     |
		|  memory |     |
		|  chunk  |     v
		|   here  |     |
		|         |     |
		+---------+     |
		|         |<----/
	
If these pointers get messed up (easy to do, just overwrite a
chunk or free() a random pointer), it becomes a core-dump party.
	
> The instruction
> the thing dies on is a BITW (I think?) maybe something like:
> 	BITW	0(%r7),1

The low bit of the "pointer" above indicates whether the block is
free or busy.  This instruction is almost certainly testing this
bit on a crazy, overwritten, invalid pointer.
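
To make the busy-bit idea concrete, here is a small sketch of how a header
word can carry both a link and a flag. It assumes the layout described above
and is not taken from the System V source; hdr_word, make_hdr, is_busy and
next_block are names invented for the illustration.

```c
#include <stddef.h>
#include <stdint.h>

#define BUSY ((uintptr_t)1)

/* One heap header word: the address of the next block, with the low bit
   borrowed as a busy flag.  This only works because block addresses are
   at least 2-byte aligned, so the low bit of a real address is always 0. */
typedef uintptr_t hdr_word;

/* Build a header word from a link and a busy flag. */
static hdr_word make_hdr(void *next, int busy)
{
    return (uintptr_t)next | (busy ? BUSY : 0);
}

/* Test the flag; this is the kind of check a bit-test instruction does. */
static int is_busy(hdr_word w)
{
    return (w & BUSY) != 0;
}

/* Strip the flag to recover the real address of the next block. */
static void *next_block(hdr_word w)
{
    return (void *)(w & ~BUSY);
}
```

The allocator has to read each header word through the previous link before
it can test the flag. If a stray write has replaced a link with garbage, that
read touches an invalid or misaligned address, which would be consistent
with a bus error on the bit-test instruction itself.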

>  All the processes use malloc, realloc and free a WHOLE lot.

Oh boy :-(.  The bummer here is that the failure happens long
after the corruption occurs, and these can be the most difficult
bugs to track down.  The best bet (on the 3B2, at least), is to
use the specialized malloc(3x) functions with the -lmalloc
library.  These are implemented differently and may help the bugs
show up in different ways.

If life gets really rough you can write a routine that will run
through the malloc chain looking for problems.  This will help track
down where a random memory write is trashing the malloc chains:

	checkmalloc();
	crazy_function();
	checkmalloc();

If the first passes and the second doesn't, you're getting closer.
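
The real malloc chain has a private layout, so a portable checkmalloc()
cannot walk it directly. The sketch below swaps in a different technique: a
wrapper records every live block in a side table and plants guard bytes
after it, and checkmalloc() sweeps the table. The names trk_alloc and
trk_free, the table size, and the guard pattern are all assumptions, and it
only catches writes that run off the END of a block.

```c
#include <stdlib.h>
#include <string.h>

#define MAX_TRACKED 1024    /* assumed upper bound on live blocks */
#define GUARD_LEN   8       /* guard bytes planted after each block */
#define GUARD_BYTE  0xA5

static struct { unsigned char *p; size_t size; } live[MAX_TRACKED];

/* Allocate size bytes plus a guard region and remember the block. */
void *trk_alloc(size_t size)
{
    int i;
    unsigned char *p;

    for (i = 0; i < MAX_TRACKED && live[i].p != NULL; i++)
        ;                               /* find a free table slot */
    if (i == MAX_TRACKED || size > (size_t)-1 - GUARD_LEN)
        return NULL;
    if ((p = malloc(size + GUARD_LEN)) == NULL)
        return NULL;
    memset(p + size, GUARD_BYTE, GUARD_LEN);
    live[i].p = p;
    live[i].size = size;
    return p;
}

/* Release a tracked block; abort on a pointer we never handed out. */
void trk_free(void *ptr)
{
    int i;

    for (i = 0; i < MAX_TRACKED; i++)
        if (ptr != NULL && live[i].p == ptr) {
            live[i].p = NULL;
            free(ptr);
            return;
        }
    abort();
}

/* Returns the number of live blocks whose guard bytes were overwritten. */
int checkmalloc(void)
{
    int i, j, bad = 0;

    for (i = 0; i < MAX_TRACKED; i++) {
        if (live[i].p == NULL)
            continue;
        for (j = 0; j < GUARD_LEN; j++)
            if (live[i].p[live[i].size + j] != GUARD_BYTE) {
                bad++;
                break;
            }
    }
    return bad;
}
```

Bracketing a suspect call with checkmalloc() and comparing the two counts
gives the "first passes, second doesn't" test described above.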

Good luck.
-- 
Steve Friedl    V-Systems, Inc. (714) 545-6442    3B2-kind-of-guy
friedl@vsi.com    {backbones}!vsi.com!friedl   attmail!vsi!friedl

gandalf@csli.STANFORD.EDU (Juergen Wagner) (05/06/88)

The most likely sources of strange effects like the ones you describe are
calls to strcpy/strcat/scanf/fgets/... that write beyond the end of some
character buffer.

However, this does not necessarily have to be the cause of your problems.
Consider also the following:

o  functions using more arguments than the caller provided,
o  functions called with a variable number of args but popping them
   with the wrong size,
o  scanf/sscanf reading a double into a float (e.g. %lf given a float *),
o  some buffer that isn't large enough.

All of these may clobber the call stack rather than the malloc arena, in
which case you may not detect the damage until much later, and in an
unexpected manner. Some time ago, somebody posted a malloc package with
debugging aids, and I'll be glad to forward it to you.
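
Two of the items above have simple defensive forms, sketched here. The
helper names copy_bounded and parse_pair are made up for the example; the
sscanf conversions are standard C.

```c
#include <stdio.h>
#include <string.h>

/* Copy src into dst (capacity cap bytes, cap > 0), always NUL-terminating.
   Returns 0 if it fit, -1 if it had to truncate.  Unlike strcpy(), it can
   never write past dst[cap - 1]. */
int copy_bounded(char *dst, size_t cap, const char *src)
{
    size_t n = strlen(src);

    if (n >= cap) {
        memcpy(dst, src, cap - 1);
        dst[cap - 1] = '\0';
        return -1;
    }
    memcpy(dst, src, n + 1);    /* includes the terminating NUL */
    return 0;
}

/* Parse one float and one double from text.  The conversion must match
   the object: %f stores a float, %lf stores a double.  Handing %lf a
   float * makes sscanf write sizeof(double) bytes into a float-sized
   object, clobbering whatever lies next to it.
   Returns the number of items converted (2 on success). */
int parse_pair(const char *text, float *f, double *d)
{
    return sscanf(text, "%f %lf", f, d);
}
```

The mismatched-conversion bug is nasty because the extra bytes usually land
on a neighboring local variable or saved register, so the symptom shows up
somewhere unrelated to the scanf call.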

-- 
Juergen "Gandalf" Wagner,		   gandalf@csli.stanford.edu
Center for the Study of Language and Information (CSLI), Stanford CA

lm@arizona.edu (Larry McVoy) (05/06/88)

In article <3989@killer.UUCP> toma@killer.UUCP (Tom Armistead) writes:
>I've been having an *interesting* thing happen under System V Release 3.1
>Unix on a 3B2/310, and was wondering if I could get any insight from
>anyone as to the problem (or its cause)???

I dunno if this is it or not, but I have found out (the hard way) that malloc
keeps info in the memory it allocates (actually: in the memory it is about to
allocate).  The bottom line is that if you overrun a malloced area you will
cause crashes that seem to stem from inside the malloc library itself.  My
problems occurred on a VAX running 4.3+NFS.
-- 
	"Peace and Unity - Neon Prophet, Tucson AZ"

Larry McVoy	lm@arizona.edu or ...!{uwvax,sun}!arizona.edu!lm

toma@killer.UUCP (Tom Armistead) (05/08/88)

I want to thank all of you for the responses...

Summary of my Quest...

I went through that stuff with a FINE-tooth comb and didn't find
diddly - (excuse the Texas accent...)

After much time on the phone with an AT&T techie, I was told that they
had experienced problems in malloc (core dumps and such) when it had
been used excessively with small blocks, as I was doing, and that this
was something to do with the free memory stuff getting garbled up...

He told me to use the special malloc(3X) library with
'-lmalloc' on the cc command line. I did this and the problem has gone
away!!!  "Thank you Auntie Em, I'm not CRAZY!!!"

Thanks again for all the help...
Tom
---
UUCP: ...!ihnp4!killer!toma

dce@mips.COM (David Elliott) (05/08/88)

In article <4016@killer.UUCP> toma@killer.UUCP (Tom Armistead) writes:
>He told me to use the special malloc(3X) library with
>'-lmalloc' on the cc command line. I did this and the problem has gone
>away!!!  "Thank you Auntie Em, I'm not CRAZY!!!"

I hate to burst your bubble, Tom, but this doesn't really show that the
standard libc malloc() is broken.

It turns out that malloc(3X) is slightly different.  For example, if
you malloc() 0 bytes, one of the mallocs returns NULL and one returns
a pointer.  I have also seen cases (some compiler product) where using
-lmalloc fixed a bug (similar to your case), and it turned out that there
was actually a bug in the code.
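
One way to keep the zero-byte difference from mattering is to never ask for
zero bytes. A tiny sketch (alloc_nonzero is a made-up name, not part of
either library):

```c
#include <stdlib.h>

/* Never pass 0 to the underlying allocator, so that a NULL return always
   means "out of memory" no matter which malloc is linked in. */
void *alloc_nonzero(size_t n)
{
    return malloc(n ? n : 1);
}
```

Code that treats a NULL from malloc(0) as a fatal error, or that
dereferences the pointer one of the libraries happens to return, can change
behavior when the library is swapped without any heap corruption being
involved.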

Sure, it may be that you've triggered a rare bug in malloc(), but malloc()
is used so much that it really should be bomb-proof.

-- 
David Elliott		dce@mips.com  or  {ames,prls,pyramid,decwrl}!mips!dce

jfh@rpp386.UUCP (John F. Haugh II) (05/10/88)

In article <2149@quacky.mips.COM> dce@mips.COM (David Elliott) writes:
>In article <4016@killer.UUCP> toma@killer.UUCP (Tom Armistead) writes:
>>He told me to use the special malloc(3X) library with
>>'-lmalloc' on the cc command line. I did this and the problem has gone
>>away!!!  "Thank you Auntie Em, I'm not CRAZY!!!"
>
>I hate to burst your bubble, Tom, but this doesn't really show that the
>standard libc malloc() is broken.

[ and later he goes on to say it also doesn't prove you don't have a bug in
  your code. ]

below is some code i use to check the consistency of mallocs in a large
database i am working on.  it checks the number of malloc/free pairs, and
the leading edge for consistency.  this code is copyright john f. haugh ii,
1987, 1988, all rights reserved (by the way ;-)  [ and for the `but your
code has X problem' people - it works on everything it has been run on.
trouble is getting it to run on a 9370 and a PC/XT without major changes. ]


extern	char	*malloc ();

static	int	d_malcnt;

char	*x_malloc (size)
int	size;
{
	char	*cp;
	char	**tp;

	if (! (cp = malloc (size + sizeof (char *))))
		abort ();

	d_malcnt++;
	tp = (char **) cp;
	*tp = &cp[sizeof (char *)];
	return (*tp);
}

x_free (cp)
char	*cp;
{
	char	**tp;

	if (cp == (char *) 0)
		abort ();

	tp = (char **) (cp - sizeof (char *));

	if (*tp != cp)
		abort ();

	*tp = (char *) 0;
	free (tp);
	if (! d_malcnt--)
		abort ();
}

using this particular code (with the comments still present no less ;->)
has helped me locate countless bugs in the code.  adding code
to check for the upper edge would help some too.

- john.
-- 
John F. Haugh II                 | "You see, I want a lot. Perhaps I want every
River Parishes Programming       | -thing.  The darkness that comes with every
UUCP:   ihnp4!killer!rpp386!jfh  | infinite fall and the shivering blaze of
DOMAIN: jfh@rpp386               | every step up ..." -- Rainer Maria Rilke

fox@alice.marlow.reuters.co.uk (Paul Fox) (05/16/88)

In article <1620@rpp386.UUCP> jfh@rpp386.UUCP (The Beach Bum) writes:
>In article <2149@quacky.mips.COM> dce@mips.COM (David Elliott) writes:
>
>below is some code i use to check the consistency of mallocs in a large
>database i am working on.  

Oh well, I may as well post some code ... the following is my front
end to malloc/free/realloc. I use this to ensure that I do not corrupt
my memory areas, or try to free something that's never been allocated.
This code is portable - but requires you to call chk_alloc/chk_free and 
chk_realloc, although #defines could be used to avoid changing existing code.

If this code is used, then any memory freed by a normal free() must have
been allocated by a malloc() (not chk_alloc()). It's usually best to
recompile everything and link in the library. 

If a section 3 function calls malloc() it will bypass chk_alloc, and so
the freeing of this memory must be done by free() (not chk_free()). 
-----cut here------

# include	<stdio.h>
extern char	*malloc();

# define	MAGIC	0x464f5859L	/* FOXY */
# define	FREED	0x46524545L	/* FREE */

int	cnt_alloc = 0;

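char	*
chk_alloc(n)
{	/* reserve room for one long in front of the caller's block */
	register char	*cp = malloc(n + sizeof(long));
	register long	*lp = (long *) cp;

	if (lp) {
		*lp++ = MAGIC;	/* stamp the header and step past it */
		cnt_alloc++;
		}

	return (char *) lp;
}
char	*
chk_realloc(ptr, n)
char	*ptr;
{	char	*realloc();
	long	*lp = (long *) ptr;

	/* the word in front of the block must still carry our stamp */
	if (*--lp != MAGIC)
		chk_failed("Realloc non-alloced memory.");
	lp = (long *) realloc((char *) lp, n + sizeof(long));
	if (lp == NULL)		/* failed realloc: don't return NULL + 1 */
		return NULL;
	return (char *) (lp + 1);
}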
chk_free(ptr)
char	*ptr;
{	long	*lp = (long *) ptr;

	if (*--lp == FREED)
		chk_failed("Trying to free already freed memory.");
	if (*lp != MAGIC)
		chk_failed("Freeing non-alloced memory.");
	cnt_alloc--;
	*lp = FREED;
	free((char *) lp);
}
chk_failed(str)
char	*str;
{
	fprintf(stderr, "CHK_ALLOC: %s\r\n", str);
	abort();
}

---- cut here ------

=====================
     //        o      All opinions are my own.
   (O)        ( )     The powers that be ...
  /    \_____( )
 o  \         |
    /\____\__/      
  _/_/   _/_/         UUCP:     fox@alice.marlow.reuters.co.uk

chris@mimsy.UUCP (Chris Torek) (05/18/88)

In article <350@alice.marlow.reuters.co.uk> fox@alice.marlow.reuters.co.uk
(Paul Fox) provides a simple checking version of malloc.  Note that
it assumes that one can store a single `long' in the address returned
by malloc, and increment the result by the size of that long, and that
the resulting pointer is still `well aligned'.  As far as I can tell
there is no way to avoid some similar assumption.  It might be nice
to have an include file with a macro or function that aligns a pointer:

	#include <align.h>
	...
		void *p, *q;
		q = align(p);
	/* or possibly better */
		void *p; int off;
		off = align_off(p);

where <align.h> might read

	#define align(p) ((void *)(((long)(p) + 3) & ~3))

for a machine with four-byte alignment, or

	#define align_off(p) ((8 - ((long)(p) & 7)) & 7)

for a machine with eight-byte alignment, and maybe even

	int align_off(void *p);

for a machine with wacky alignment constraints.
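
As a sketch of how such a facility could be used by a checking wrapper, the
functions below compute an alignment offset and the amount of room to
reserve for a header so the caller's pointer stays aligned. ALIGNMENT = 8
and the name header_room are assumptions for the example, and converting a
pointer to an integer is itself one of the machine-dependent assumptions
mentioned above.

```c
#include <stddef.h>
#include <stdint.h>

#define ALIGNMENT 8     /* assumed; must be a power of two */

/* Number of bytes to add to p to reach the next ALIGNMENT boundary
   (0 if p is already aligned). */
size_t align_off(const void *p)
{
    return (ALIGNMENT - ((uintptr_t)p & (ALIGNMENT - 1))) & (ALIGNMENT - 1);
}

/* Smallest multiple of ALIGNMENT that can hold n bytes: the space a
   checking wrapper should reserve in front of the caller's data so that
   the pointer it returns is as well aligned as the one malloc gave it. */
size_t header_room(size_t n)
{
    return (n + ALIGNMENT - 1) & ~(size_t)(ALIGNMENT - 1);
}
```

A wrapper that stores a single long as its stamp would then reserve
header_room(sizeof(long)) bytes rather than sizeof(long), and put the stamp
in the last sizeof(long) bytes of that space.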
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163)
Domain:	chris@mimsy.umd.edu	Path:	uunet!mimsy!chris