news@massey.ac.nz (USENET News System) (12/19/90)
A few weeks ago, perhaps coincidentally right after a brownout, perhaps not (our Pyramid did not go down as a result, though other machines did, causing the Pyramid to need rebooting), we started seeing some programs core dump with a memory fault. This stopped when the kernel was rebuilt, but has just occurred again in one of the same programs as before. It happens to be a telepager at the end of a mailer, so the message comes back in the form of a bounce:

telepage: 23739 Memory fault - core dumped

I believe the number is the pid, but I have been unable to find out anything more, even with the debugger. Interestingly, recompiling the program fixes the problem - for a while. I've never seen this error message before this started. Can anyone shed any light? We have a 9815 with 5.0.

Thanx.
--
K.Spagnolo@massey.ac.nz
csg@pyramid.pyramid.com (Carl S. Gutekunst) (12/20/90)
In article <1990Dec19.031913.24197@massey.ac.nz> K.Spagnolo@massey.ac.nz (Ken Spagnolo) writes:
>telepage: 23739 Memory fault - core dumped

This is adb's message when it receives a SIGSEGV, or segmentation violation. What this means is that the program tried to reference memory outside of its allocated bounds. In C, this almost always means dereferencing a bad pointer.

One of the most mystifying problems in UNIX programs is how they can be given apparently identical conditions and yet produce varying results. The answer, of course, is that the conditions aren't *quite* identical; and for this class of bugs, I frequently find that my dereferenced pointer is poking around in the environment table. Big environment, program runs. Small environment, program gets a SIGSEGV. Note that programs spawned by cron typically have very small environments (HOME, PATH, SHELL, and not much more), while those spawned by users typically have huge environments.

To try reproducing this, start a new C shell, then 'unsetenv' everything in the environment. (You can use printenv to find your environment variables.) Then invoke your program, and see what happens.

You can also look around malloc() calls, to see if pointers are not always being handed back correctly; also check to make sure that the return value from malloc is error checked! A lot of programs don't do this, resulting in inexplicable and unreproducible errors when virtual memory gets tight.

<csg>
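As an illustration of the malloc advice above, here is a minimal sketch in present-day ISO C (not the compiler dialect of the system under discussion). The helper name dup_string is hypothetical and is not taken from telepage or any other program in this thread; it only shows the habit of testing the return value before the pointer is used.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: copy a string into freshly allocated memory.
 * Returns NULL (after reporting to stderr) if the allocation fails,
 * instead of letting a null pointer be dereferenced later on. */
char *dup_string(const char *src)
{
    size_t len;
    char *copy;

    assert(src != NULL);            /* caller must pass a real string */

    len = strlen(src) + 1;          /* include the terminating NUL */
    copy = malloc(len);
    if (copy == NULL) {
        fprintf(stderr, "dup_string: out of memory (%lu bytes)\n",
                (unsigned long)len);
        return NULL;
    }
    memcpy(copy, src, len);
    return copy;
}
```

A caller then has to test the result as well, for example by checking that dup_string("telepage") did not return NULL before using the copy, and free() it when done.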
news@massey.ac.nz (USENET News System) (12/21/90)
In article <138294@pyramid.pyramid.com> csg@pyramid.pyramid.com (Carl S. Gutekunst) writes:
>In article <1990Dec19.031913.24197@massey.ac.nz> K.Spagnolo@massey.ac.nz (Ken Spagnolo) writes:
>>telepage: 23739 Memory fault - core dumped
>
>To try reproducing this, start a new C shell, then 'unsetenv' everything in
>the environment. (You can use printenv to find your environment variables.)
>Then invoke your program, and see what happens.
>
>You can also look around malloc() calls, to see if pointers are not always
>being handed back correctly; also check to make sure that the return value
>from malloc is error checked! A lot of programs don't do this, resulting in
>inexplicable and unreproducable errors when virtual memory gets tight.

Thanx for your response. I had the programmer involved do what you said, but to no avail. All the pointers, etc. look ok, and there are no mallocs or other dynamic memory allocations going on either. It's a pretty straightforward piece of code.

Perhaps some more info is needed. Though the problem has recently occurred only with the telepage program, when it began (before the kernel was rebuilt), the first thing to get hit was sh. This happened upon reboot while running rc. Another program to be hit was EaseScreen, which is third-party software. One thing these three programs have in common is that they are all part of the att universe (our system administration runs under att, telepage was compiled under att, and EaseScreen is from a SysV house). So perhaps there is a problem in an att library somewhere?

Of course, this could be a physical problem caused by the brownout (sh dumped on reboot right afterward), but I'd like to hear any comments before I go to our local rep with a problem that is not easy to reproduce.

Thanx again. Happy Holidays.

Ken

PS Who is actually reporting this (not terribly informative) error message? The kernel? In what way would adb be involved?
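The small-environment experiment quoted above can also be scripted. The following is only a sketch, written against POSIX interfaces as found on current UNIX-like systems; whether the 1990 Pyramid system behaves identically is an assumption. The names run_with_small_env and describe_status are hypothetical, and the one-variable environment is an assumed stand-in for what cron might supply. It also shows one way such a message can arise: on typical systems the kernel delivers the signal, and the parent process (often a shell) learns of it from the wait status and prints the text.

```c
#include <assert.h>
#include <signal.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: run argv[0] with a deliberately tiny environment.
 * Stores the child's pid in *pid_out and returns the raw wait status,
 * or -1 if fork() or waitpid() failed. */
int run_with_small_env(char *const argv[], pid_t *pid_out)
{
    /* Assumed minimal environment, roughly what cron might provide. */
    char *const small_env[] = { "PATH=/bin:/usr/bin", NULL };
    int status;
    pid_t pid;

    assert(argv != NULL && argv[0] != NULL && pid_out != NULL);

    pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        execve(argv[0], argv, small_env);
        _exit(127);                 /* only reached if execve failed */
    }
    *pid_out = pid;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return status;
}

/* Format a report in the style of the bounce message, working only
 * from the wait status the parent received. */
void describe_status(const char *name, long pid, int status,
                     char *buf, size_t len)
{
    if (WIFSIGNALED(status) && WTERMSIG(status) == SIGSEGV) {
        int dumped = 0;
#ifdef WCOREDUMP                    /* common, but not required by POSIX */
        dumped = WCOREDUMP(status);
#endif
        snprintf(buf, len, "%s: %ld Memory fault%s", name, pid,
                 dumped ? " - core dumped" : "");
    } else if (WIFSIGNALED(status)) {
        snprintf(buf, len, "%s: %ld killed by signal %d", name, pid,
                 WTERMSIG(status));
    } else {
        snprintf(buf, len, "%s: %ld exited with status %d", name, pid,
                 WEXITSTATUS(status));
    }
}
```

To use it, build an argv array whose first element is the full path of the program under suspicion, call run_with_small_env, and pass the result to describe_status. If the program dies this way but runs cleanly from a login shell, the environment-size theory becomes worth pursuing.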