[comp.sys.pyramid] Memory fault - core dumped

news@massey.ac.nz (USENET News System) (12/19/90)

A few weeks ago, perhaps coincidentally right after a brown out,
perhaps not (our Pyramid did not go down as a result, though
other machines did, causing the Pyramid to need rebooting), we
started seeing some programs core dump with a memory fault.
This stopped when the kernel was rebuilt, but has just occurred
again on one of the same programs as before.  It happens to be
a telepager at the end of a mailer, so the message comes back
in the form of a bounce:

telepage: 23739 Memory fault - core dumped

I believe the number is the pid, but have been unable to find out
anything more, even with the debugger.  Interestingly, recompiling
the program fixes the problem - for a while.

I've never seen this error message before this started.  Can anyone
shed any light?  We have a 9815 with 5.0.  Thanx.
-- 
K.Spagnolo@massey.ac.nz

csg@pyramid.pyramid.com (Carl S. Gutekunst) (12/20/90)

In article <1990Dec19.031913.24197@massey.ac.nz> K.Spagnolo@massey.ac.nz (Ken Spagnolo) writes:
>telepage: 23739 Memory fault - core dumped

This is adb's message when it receives a SIGSEGV, or segmentation violation.
What this means is that the program tried to reference memory outside of its
allocated bounds. In C, this almost always means dereferencing a bad pointer.
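
As a rough illustration of where such a line can come from: whichever
process waits on the program sees the signal in the wait status and can
format a report from it. The sketch below assumes a POSIX-style wait
status; describe_status() is a made-up helper, and WCOREDUMP is a common
extension rather than something every system provides.

```c
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>

/* Hypothetical helper: turn a wait() status into a one-line report.
 * Returns 1 if the child was killed by a signal, 0 otherwise. */
int describe_status(const char *name, long pid, int status,
                    char *buf, size_t buflen)
{
    const char *core = "";

    assert(name != NULL && buf != NULL && buflen > 0);
    buf[0] = '\0';
    if (!WIFSIGNALED(status))
        return 0;               /* exited normally or was stopped */
#ifdef WCOREDUMP
    /* WCOREDUMP is an extension; skip the suffix where it is missing. */
    if (WCOREDUMP(status))
        core = " - core dumped";
#endif
    if (WTERMSIG(status) == SIGSEGV)
        snprintf(buf, buflen, "%s: %ld Memory fault%s", name, pid, core);
    else
        snprintf(buf, buflen, "%s: %ld signal %d%s", name, pid,
                 WTERMSIG(status), core);
    return 1;
}
```

A caller would pass the status filled in by wait() or waitpid() along
with the child's pid and program name.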

One of the most mystifying problems in UNIX programs is how they can be given
apparently identical conditions, and produce varying results. The answer, of
course, is that the conditions aren't *quite* identical; and for this class of
bugs, I frequently find that my dereferenced pointer is poking around in the
environment table. Big environment, program runs. Small environment, program
gets a SIGSEGV. Note that programs spawned by cron typically have very small
environments (HOME, PATH, SHELL, and not much more), while those spawned by
users typically have huge environments.

To try reproducing this, start a new C shell, then 'unsetenv' everything in
the environment. (You can use printenv to find your environment variables.)
Then invoke your program, and see what happens.

You can also look around malloc() calls, to see if pointers are not always
being handed back correctly; also check to make sure that the return value
from malloc is error checked! A lot of programs don't do this, resulting in
inexplicable and unreproducible errors when virtual memory gets tight.
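
For instance, a checked allocation might look something like this. It is
a minimal sketch; copy_string() is a made-up helper and has nothing to do
with the telepage source.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: duplicate a string, reporting allocation
 * failure instead of returning a pointer that was never checked. */
char *copy_string(const char *src)
{
    size_t len;
    char *dst;

    assert(src != NULL);
    len = strlen(src) + 1;      /* include the terminating NUL */
    dst = malloc(len);
    if (dst == NULL) {
        /* Without this test, the memcpy below would write through a
         * null pointer and the program would die with a SIGSEGV. */
        fprintf(stderr, "copy_string: out of memory (%lu bytes)\n",
                (unsigned long)len);
        return NULL;
    }
    memcpy(dst, src, len);
    return dst;
}
```

The caller still has to test the result for NULL and free() it when
done.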

<csg>

news@massey.ac.nz (USENET News System) (12/21/90)

In article <138294@pyramid.pyramid.com> csg@pyramid.pyramid.com (Carl S. Gutekunst) writes:
>In article <1990Dec19.031913.24197@massey.ac.nz> K.Spagnolo@massey.ac.nz (Ken Spagnolo) writes:
>>telepage: 23739 Memory fault - core dumped
>
>To try reproducing this, start a new C shell, then 'unsetenv' everything in
>the environment. (You can use printenv to find your environment variables.)
>Then invoke your program, and see what happens.
>
>You can also look around malloc() calls, to see if pointers are not always
>being handed back correctly; also check to make sure that the return value
>from malloc is error checked! A lot of programs don't do this, resulting in
>inexplicable and unreproducible errors when virtual memory gets tight.

Thanx for your response.  I had the programmer involved do what you
said, but to no avail.  All the pointers, etc. look ok and there are
no mallocs or other dynamic memory allocations going on either.  It's
a pretty straightforward piece of code.

Perhaps some more info is needed.  Though the problem has recently
occurred only with the telepage program, when it began (before the
kernel was rebuilt), the first thing to get hit was sh.  This
happened upon reboot while running rc.  Another program to be hit
was EaseScreen, which is third-party software.  One thing these
three programs have in common is that they are all part of the att
universe (our system administration runs under att, telepage was
compiled under att and EaseScreen is from a SysV house).  So
perhaps there is a problem in an att library somewhere?

Of course, this could be a physical problem caused by the brown
out (sh dumped on reboot right afterward), but I'd like to hear
any comments, before I go to our local rep with a problem that
is not easy to reproduce.

Thanx again.  Happy Holidays.

Ken

PS  Who is actually reporting this (not terribly informative) error
    message?  The kernel?  In what way would adb be involved?