[comp.sys.sgi] problem with malloc

ianh@bhpmrl.oz.au (Ian Hoyle) (10/24/90)

I've got an ongoing (well it was there with 3.2.2 and we just went to
3.3.1 today) problem with malloc. The application in question is
Rich Burridge's mp postscript filter program, patchlevel 13.

The following code fragment is causing problems:

      len = strlen(ptr) ;
      FPRINTF(stderr, "%s [] %s []  %i\n", ptr, whoami,strlen(whoami)) ;
      timenow = malloc((unsigned)(len + 6 + strlen(whoami) + 1)) ;
      FPRINTF(stderr,"After that error\n");

If I run mp using a news article that I know has been causing problems (mail1)
I get the following:

morgana->  ./mp -a < mail1 | lp
 8 Oct 90 23:26:48 GMT [] root []  4
lp: standard input is empty
lp: request not accepted
Segmentation fault (core dumped) 
morgana-> 


morgana-> dbx ./mp
dbx version 2.0 8/6/90 14:02
Copyright 1987 Silicon Graphics Inc.
Copyright 1987 MIPS Computer Systems Inc.
Type 'help' for help.
Reading symbolic information of `./mp' . . .
Process name from core dump: mp
Process died at pc 0x40340c of signal : segmentation violation
[using memory image in core]
(dbx) t
>  0 .malloc.malloc(0x10000bd4, 0x100005a0, 0x100086f8, 0x21, 0x16, 0x10) ["malloc.c":120, 0x403408]
   1 do_date() ["misc.c":45, 0x401978]
   2 set_defs() ["print.c":144, 0x402cfc]
   3 startpage() ["print.c":194, 0x402f40]
   4 printfile() ["main.c":193, 0x4014f4]
   5 .main.main(argc = 2, argv = 0x7fffc794) ["main.c":154, 0x401388]
(dbx) quit

The FPRINTF is correct before the malloc call, but it then crashes with a
core dump before it can print "After that error". This error though is not
consistent. Some news articles can be printed, but others mysteriously cause
mp to bomb.

I linked it using -lmalloc.  I'm completely at a loss as to what may be occurring.
Perhaps differing int arguments to malloc are causing it to go haywire. Any 
suggestions anyone ??

		ian
--
                Ian Hoyle
     /\/\       Image Processing & Data Analysis Group
    / / /\      BHP Melbourne Research Laboratories
   / / /  \     245 Wellington Rd, Mulgrave, 3170
  / / / /\ \    AUSTRALIA
  \ \/ / / /
   \  / / /     Phone   :  +61-3-560-7066
    \/\/\/      FAX     :  +61-3-561-6709
                E-mail  :  ianh@bhpmrl.oz.au

scotth@harlie.corp.sgi.com (Scott Henry) (10/25/90)

In article <1658@merlin.bhpmrl.oz.au>, ianh@bhpmrl.oz.au (Ian Hoyle) writes:
|> I've got an ongoing (well it was there with 3.2.2 and we just went to
|> 3.3.1 today) problem with malloc. The application in question is
|> Rich Burridge's mp postscript filter program, patchlevel 13.
[details deleted]

(having just gone through this...) The problem with debugging malloc errors is
that the culprit is never the call that bombs. The culprit is a previously
malloc()ed piece of memory that has been overrun. The layout of malloc()ed
memory is basically a header (containing some pointers to maintain the free
list), followed by the area actually allocated, followed by the next header,
etc. Writing past the end of the allocated memory will step on the next header.
This is frequently caused by the following code fragment:

	char *ptr = (char *)malloc((u_int)(strlen(s)));
	strcpy(ptr,s);

Because of the existence of the header, the actual amount allocated is rounded
up to some boundary (frequently long (4-byte) or double (8-byte)). Therefore,
75% or 87.5% of the time, there is room for that pesky trailing null even
though you didn't allow for it. And on many architectures, the lowest address
byte of the header contains a zero anyway, but on an IRIS, that byte is always
>0. Putting the trailing null in that location causes a de-reference to
somewhere outside any of the data segments, and you get a segmentation
violation when you attempt to malloc or free the memory area whose header just
got stepped on.
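
The 75% and 87.5% figures follow from the rounding alone. A minimal sketch,
assuming the allocator simply rounds each request up to a multiple of the
alignment and ignoring the header itself: a request of exactly strlen(s) bytes
has a spare byte for the null unless the length is already a multiple of the
alignment. The function names round_up and lengths_with_room_for_nul are made
up for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Round n up to the next multiple of align (align must be > 0). */
size_t round_up(size_t n, size_t align)
{
    assert(align > 0);
    return (n + align - 1) / align * align;
}

/*
 * Count the string lengths in 1..max_len for which a request of exactly
 * strlen(s) bytes still has at least one spare byte for the '\0' once the
 * request has been rounded up to a multiple of align.
 */
size_t lengths_with_room_for_nul(size_t max_len, size_t align)
{
    size_t len;
    size_t lucky = 0;

    for (len = 1; len <= max_len; len++) {
        if (round_up(len, align) - len >= 1)
            lucky++;
    }
    return lucky;
}
```

With 4-byte rounding, 3 lengths in every 4 get away with it; with 8-byte
rounding, 7 in every 8. That is why the bug shows up on some inputs and not
on others.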

My first step in fixing this kind of problem has become:

grep 'alloc(.*strlen' *.c
grep '(char *\*).*alloc(' *.c

(make sure you get all of the source files), and ensure that every occurrence
of mallocing string storage includes space for the null.
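
One way to make "space for the null" hard to forget is to do every string
copy through a single helper. The sketch below is only an illustration:
dup_string is a hypothetical name, not something from mp or from the SGI
libraries.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * dup_string: hypothetical helper, not part of mp.
 * Returns a malloc()ed copy of s, or NULL if the allocation fails.
 * The caller is responsible for free()ing the result.
 */
char *dup_string(const char *s)
{
    size_t size;
    char *copy;

    assert(s != NULL);
    size = strlen(s) + 1;        /* +1 for the terminating '\0' */
    copy = malloc(size);
    if (copy != NULL)
        memcpy(copy, s, size);   /* copies the '\0' as well */
    return copy;
}
```

Each malloc(strlen(s)) / strcpy pair turned up by the greps can then be
replaced with ptr = dup_string(s), followed by a check for NULL.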

I could go on, but I think you get the picture. Just to repeat: the call to malloc() or free() that causes the segmentation violation is _NEVER_ the one at fault. (Assuming you're not doing funky casts and stuff).

-- 
 Scott Henry <scotth@sgi.com> / Traveller on Dragon Wings
 Information Services,       / Help! My disclaimer is missing!
 Silicon Graphics, Inc      / 'Under-achiever and proud of it!' -- Bart Simpson

ianh@bhpmrl.oz.au (Ian Hoyle) (10/26/90)

ianh@bhpmrl.oz.au (Ian Hoyle) writes:

>I've got an ongoing (well it was there with 3.2.2 and we just went to
>3.3.1 today) problem with malloc. The application in question is
>Rich Burridge's mp postscript filter program, patchlevel 13.

>The following code fragment is causing problems:

>      len = strlen(ptr) ;
>      FPRINTF(stderr, "%s [] %s []  %i\n", ptr, whoami,strlen(whoami)) ;
>      timenow = malloc((unsigned)(len + 6 + strlen(whoami) + 1)) ;
>      FPRINTF(stderr,"After that error\n");

[.... other stuff from dbx ]

>I linked it using -lmalloc.  I'm completely at a loss as to what may be occurring.

Well I've fixed the problem. 

When mp was being linked, the command line in the Makefile was

       $(CC) $(LDFLAGS) -o mp $(OBJS)

where LDFLAGS = -lmalloc

My guess is that there is an implicit -lc tacked on the end of this line thus
using the malloc(3C) and *not* malloc(3X). It should have read

       $(CC) -o mp $(OBJS) $(LDFLAGS)

thus loading libmalloc *last*.

I don't think it's a case of RTFM because I've just done that and couldn't 
find reference to exactly where the library definition should go in a cc
compile line. eg. from the manual entry for malloc(3X) :

"It is found in the library "libmalloc.a", and is loaded if the option
-lmalloc is used with cc(1) or ld(1)".

But, it's probably there somewhere and I missed it ..... damn it :-(

Thanks to those people who replied to me and made lots of useful suggestions
to chase up,

			ian

mike@SNOWHITE.CIS.UOGUELPH.CA (10/26/90)

Ian:

    I probably won't be the first or the last to say this, but you've probably
got a loose pointer on your deck somewhere.  If you write into memory that you
didn't allocate (by far the most popular way to do this is to scan either
forwards or backwards off of a chunk of memory you got from 'malloc', but there
are other, more interesting, ways to do it too), then sooner or later you'll
stomp on the memory that 'malloc' and 'free' use to maintain their sanity.
The result usually looks something like what you describe, plus or minus a
few Rolaids.

    I have found (through painful personal experience) that tracing this kind
of problem is difficult but not impossible.  The call that core dumped is 
almost always totally blameless, because the memory stomp could have happened
quite some time ago in program terms.  To help you out with this, SGI provides
a really neat option for memory debugging called 'mallopt'.  If you look at
the man page for this baby you'll find that just by putting:
	mallopt(M_DEBUG, 1);
early in 'main', 'malloc' and 'free' will perform a full scan of their internal
structures each time that they are called.  This should help you in tracing
down your varmint pointer by letting you know that something is goofed up 
as soon as possible after it happened.  Of course, you pay for this checking
by having to put up with a (ahem) noticeable decrease in speed of operation
of the two library routines in question.

    In more desperate straits I have even put my own front end on malloc so
that I allocated an extra 2k each time I asked for space, and positioned the
requested space in the middle of the big chunk (leaving 1k of safety buffer
on each side), just to see if it was an overscan problem.  If things get 
really bad you could try that too.
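
A front end of that kind might look roughly like the sketch below. The names
guarded_malloc, guarded_check and guarded_free are invented here, and filling
the padding with a known byte so that it can be checked later is an addition
for illustration; the description above only pads the block.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define GUARD_SIZE 1024   /* 1k of safety buffer on each side */
#define GUARD_BYTE 0xA5   /* arbitrary fill pattern (an assumption) */

/*
 * Allocate size bytes with GUARD_SIZE bytes of padding before and after.
 * GUARD_SIZE is a multiple of every fundamental alignment, so the pointer
 * handed back is aligned as well as the one malloc returned.
 */
void *guarded_malloc(size_t size)
{
    unsigned char *raw;

    if (size > (size_t)-1 - 2 * GUARD_SIZE)
        return NULL;                     /* size + padding would overflow */
    raw = malloc(size + 2 * GUARD_SIZE);
    if (raw == NULL)
        return NULL;
    memset(raw, GUARD_BYTE, GUARD_SIZE);                      /* front */
    memset(raw + GUARD_SIZE + size, GUARD_BYTE, GUARD_SIZE);  /* rear  */
    return raw + GUARD_SIZE;
}

/* Return 1 if both guard areas are intact, 0 if either was written to. */
int guarded_check(const void *ptr, size_t size)
{
    const unsigned char *front;
    const unsigned char *rear;
    size_t i;

    assert(ptr != NULL);
    front = (const unsigned char *)ptr - GUARD_SIZE;
    rear = (const unsigned char *)ptr + size;
    for (i = 0; i < GUARD_SIZE; i++) {
        if (front[i] != GUARD_BYTE || rear[i] != GUARD_BYTE)
            return 0;
    }
    return 1;
}

/* Release a block obtained from guarded_malloc. */
void guarded_free(void *ptr)
{
    if (ptr != NULL)
        free((unsigned char *)ptr - GUARD_SIZE);
}
```

If the crashes disappear with the padding in place, or guarded_check reports
a damaged guard, an overscan is the likely cause. The caller has to remember
the size it asked for, which keeps the sketch short at the cost of some
convenience.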

    Good luck.

From an informal introduction to C:      | Mike Chapman
        C looks a lot like               | Grab Student, University of Guelph
        Pascal with a hangover.          | mike@snowhite.cis.uoguelph.ca

shenkin@cunixf.cc.columbia.edu (Peter S. Shenkin) (10/26/90)

In article <9010252231.AA23626@snowhite.cis.uoguelph.ca> mike@SNOWHITE.CIS.UOGUELPH.CA writes:
>    I probably won't be the first or the last to say this, but you've probably
>got a loose pointer on your deck somewhere.  If you write into memory that you
.....
>quite some time ago in program terms.  To help you out with this, SGI provides
>a really neat option for memory debugging called 'mallopt'.  If you look at

See also the public-domain debugging malloc routines that are available from
the comp.sources.unix archives.  After you use one of these things (or mallopt),
and debug your code, you go back to using the system malloc routine, since
these range-checking guys are inefficient.

	-P.
************************f*u*cn*rd*ths*u*cn*gt*a*gd*jb**************************
Peter S. Shenkin, Department of Chemistry, Barnard College, New York, NY  10027
(212)854-1418  shenkin@cunixc.cc.columbia.edu(Internet)  shenkin@cunixc(Bitnet)
***"In scenic New York... where the third world is only a subway ride away."***

msc@ramoth.esd.sgi.com (Mark Callow) (10/31/90)

In article <ianh.656882312@morgana>, ianh@bhpmrl.oz.au (Ian Hoyle) writes:
|>
|> Well I've fixed the problem. 
|> [stuff deleted]
|> thus loading libmalloc *last*.
|> 
|> I don't think it's a case of RTFM because I've just done that and couldn't 
|> find reference to exactly where the library definition should go in a cc
|> compile line. eg. from the manual entry for malloc(3X) :

It's in the cc(1) entry I expect.

The real point of this message is that I suspect your program still has a bug
in it.  All you've done by linking with a different malloc is change the
circumstances under which the bug rears its ugly head.  Read and pay attention
to Scott Henry's excellent primer on debugging malloc problems.
-- 
From the TARDIS of Mark Callow
msc@ramoth.sgi.com, ...{ames,decwrl}!sgi!msc
"There is much virtue in a window.  It is to a human being as a frame is to
a painting, as a proscenium to a play.  It strongly defines its content."