[comp.unix.ultrix] emacs or csh problem?

hagan@scotty.dccs.upenn.edu (John Dotts Hagan) (07/26/90)

OK - I could use some help from you, the people who use DECstations.

I have submitted a problem to Digital about trouble starting jobs from
/bin/csh, specifically gnu emacs.  Unfortunately, Digital seems highly
skeptical that I am really having a problem:

   "I think I mentioned that we have hundreds of
   people here using emacs and the reported seg fault problem has never been
   seen here."

	-- unnamed Digital Ultrix engineer's latest response to the bug report

If other people are having the problem, please let me know (and perhaps
Digital).  I will also let them know that I am not alone with the
problem (or maybe I am!).  Here is the problem:

You are running /bin/csh as usual, and try to start an application (like
the gnu emacs Digital distributes in an unsupported subset):

% emacs
Segmentation fault
% 

In fact, if you keep trying to run emacs, it just keeps faulting:

% emacs
Segmentation fault
% emacs
Segmentation fault
% emacs
Segmentation fault
% emacs
Segmentation fault
% 

However, if you run another task:

% emacs
Segmentation fault
% emacs
Segmentation fault
% emacs
Segmentation fault
% emacs
Segmentation fault
% ls
fish
% emacs
<emacs starts up OK>

The Segmentation fault occurs almost instantly, and core rarely dumps.  The
few times core has dumped, it is always a core dump of /bin/csh, not emacs.
Since the dumps are of the csh, and the fault happens so fast, I believe the
csh is having trouble starting the emacs task, rather than emacs causing the
problem.

I have built two newer versions of gnu emacs and it still happens.  I should
also mention that these faults happen quite rarely to some users (I may see
it once a month), while other users see it a few times a week or even daily!
But no one at my site can make it happen at will.  We have looked for a
pattern, like it happens when you first log on, or after you have been
logged on for days, or just to vt100s, or just X Windowing emacs, etc.  No
pattern is obvious to us.  Also, we have looked at everyone's .cshrc and .login and
they are all quite different, but everyone sees this problem to some extent.
Also, we have diskless systems (with local swap disk), "dataless" systems
(with a local swap and root partition, but NFS mounting /usr), and "diskfull"
(all disk locally attached) that all experience the problem.

We have been suffering with this problem since the first release of UWS for
RISC - and only our RISC DECstation 3100s show this bug.  It has never happened
on our VAX/Ultrix systems.  We have installed UWS 2.1, UWS 2.2, and
Ultrix 3.1D/2.2D on each DECstation, and no change.

Once, we believe another task failed the same way.  It was the expire program
that runs nightly on one DECstation to clean up network news.  Also possibly
related is a strange observation that occasionally things do not seem to get
started from crontab.  Some of our scripts that run out of crontab append to a
log file as their first step - the next day after the job did not run, we look
at the log file and it is not touched!  It also seems to happen rarely, but
once in a while the same task at night will not run for 2 or 3 days, then it
will run nightly for months correctly.  The expire program that failed to start
was in a /bin/csh script started by cron, and all of these other scripts that
failed are /bin/csh scripts.

Again, if you think you are experiencing these problems, please let me
know so I can let Digital know it is not a problem unique to my site.  Also,
if you know the problem, PLEASE LET ME KNOW - it's driving us crazy!!!!!!!!!

--Kid.

diamond@tkou02.enet.dec.com (diamond@tkovoa) (07/27/90)

In article <27522@netnews.upenn.edu> hagan@scotty.dccs.upenn.edu (John Dotts Hagan) writes:

>I have submitted a problem to Digital about trouble starting jobs from
>/bin/csh, specifically gnu emacs.  Unfortunately, Digital seems highly
>skeptical that I am really having a problem:
>   "I think I mentioned that we have hundreds of
>   people here using emacs and the reported seg fault problem has never been
>   seen here."

Well, the response was almost suitable.  The problem is not in emacs.
You say so yourself:

>The Segmentation fault occurs almost instantly, and core rarely dumps.  The
>few times core has dumped, it is always a core dump of /bin/csh, not emacs.
>Since the dumps are of the csh, and the fault happens so fast, I believe the
>csh is having trouble starting the emacs task, rather than emacs causing the
>problem.
>...  The expire program that failed to start
>was in a /bin/csh script started by cron, and all of these other scripts that
>failed are /bin/csh scripts.

I had similar problems on a MIPS box at a previous employer.  The failures
were often in csh but also often in other programs such as the ld step of
a cc command.  I haven't seen it on a DECstation (yet) but it looks like
this narrows it down to code that was inherited from MIPS.  Sometimes I had
the same problem while debugging a program that I wrote myself:  If the
program aborted, and I tried running it under dbx, it would get a segmentation
fault as soon as I said "run" -- repeatedly.  Perhaps certain environments
set up segments of some particular lengths, or by some other unknown means
they manifest an intermittent bug in address mapping/paging/swapping/
who knows.  Or perhaps the kernel forgets to delete a signal that has been
delivered, so it gets delivered again when a new process bears a certain
characteristic.  I hope these guesses might help locate the problem.
-- 
Norman Diamond, Nihon DEC     diamond@tkou02.enet.dec.com
This is me speaking.  If you want to hear the company speak, you need DECtalk.

lat@creatures.cs.vt.edu (Laurie Zirkle) (07/27/90)

In article <27522@netnews.upenn.edu> hagan@scotty.dccs.upenn.edu (John Dotts Hagan) writes:
>
>OK - I could use some help from you, the people who use DECstations.
>
>You are running /bin/csh as usual, and try to start an application (like
>the gnu emacs Digital distributes in an unsupported subset):
>
>% emacs
>Segmentation fault
>% 

I have seen the same problem here on a DECstation 3100 running 3.1/2.1, 3.1/2.2,
and 3.1d/2.2d.  I have emacs-18.55 compiled instead of the unsupported emacs
that DEC supplies, and it's compiled with the X-windowing support.  It doesn't
always happen, but it happens enough to annoy the owner/user of the 3100
(especially since she is a heavy emacs user).

Laurie Zirkle				lat@vtopus.cs.vt.edu
Computer Systems Engineer		lat@vtcs1.bitnet
VA Tech Computer Science Dept
Blacksburg, VA 24060			703-231-6370

grue@nirvana.cool.engin.umich.edu (Paul Howell) (07/28/90)

In article <27522@netnews.upenn.edu>, hagan@scotty.dccs.upenn.edu (John Dotts Hagan) writes:
|> 
|> 
|> You are running /bin/csh as usual, and try to start an application (like
|> the gnu emacs Digital distributes in an unsupported subset):
|> ...
|> % emacs
|> Segmentation fault
|> % 
|> 
|> 
|> However, if you run another task:
|> 
|> % emacs
|> Segmentation fault
|> % emacs
|> Segmentation fault
|> % emacs
|> Segmentation fault
|> % emacs
|> Segmentation fault
|> % ls
|> fish
|> % emacs
|> <emacs starts up OK>
|>  ... 
|> The Segmentation fault occurs almost instantly, and core rarely dumps.  The
|> few times core has dumped, it is always a core dump of /bin/csh, not emacs.
|> Since the dumps are of the csh, and the fault happens so fast, I believe the
|> csh is having trouble starting the emacs task, rather than emacs causing the
|> problem.
|> ...
|> 
|> Again, if you think you are experiencing these problems, please let me
|> know so I can let Digital know it is not a problem unique to my site.  Also,
|> if you know the problem, PLEASE LET ME KNOW - it's driving us crazy!!!!!!!!!
|> 
|> --Kid.

We had the same problem here with emacs giving a segmentation fault.  Only
after exec'ing something else would emacs work.  Our thoughts were that it was/is
a problem with csh.  sh didn't have the problem.

We're now running Ultrix 3.1d and haven't seen the problem again.  Not to say 
it's fixed, but I'm not an emacs user so I can only go by what others say.


---
Paul Howell
grue@caen.engin.umich.edu

cole@dip.cs.wisc.edu (Bruce Cole) (07/29/90)

In article <27522@netnews.upenn.edu> hagan@scotty.dccs.upenn.edu (John Dotts Hagan) writes:
>   "I think I mentioned that we have hundreds of
>   people here using emacs and the reported seg fault problem has never been
>   seen here."
It is very distressing to hear DEC say this, since I QAR'ed this problem to
DEC some time ago, and gave them the fixes to the Ultrix kernel problem which
causes this to occur.

I posted this description to info-gnu-emacs:

From: cole (Bruce Cole)
To: gordon!jpd
Cc: cole, info-gnu-emacs@prep.ai.mit.edu
Subject: RISC Ultrix- Emacs problems
Date: Sat, 7 Jul 90 12:11:25 -0500

 > We're running emacs 18.54 on an Ultrix Risc Decsystem 5400. Three times
 > we've had the machine hang with the following message:
 > 
 >   panic: tblmod on invalid pte
 > 
 > Ultrix support tells us this is caused by emacs. Has anyone experienced
 > this? DEC says it only happens on RISC boxes.
This is due to a MIPS-specific Ultrix kernel bug.  I sent DEC a description
of the bug with a bug fix.  The kernel bug manifests itself with emacs
since emacs uses a non-standard data start address on Ultrix MIPS machines.

I haven't often seen emacs cause MIPS machines to panic.  Usually you just
see one of the following errors when you try to start up emacs:
	segmentation fault (core dumped)
	emacs: Bad address
	Out of memory
	data size rlimit exceeded, pid 6523, process tcsh (for example)

Until DEC fixes their kernel, you can avoid the bug by changing the data
start address used by emacs.  Change m-pmax.h to define these values:
[...]
Here are diffs to emacs 18.55:

*** m-pmax.h	Thu Jun  8 11:53:55 1989
--- m-pmax.h.new	Mon Jul  9 10:21:21 1990
***************
*** 1,3 ****
--- 1,7 ----
  #include "m-mips.h"
  #undef LIBS_MACHINE
  #undef BIG_ENDIAN
+ #undef LD_SWITCH_MACHINE
+ #undef DATA_START
+ #define DATA_START 0x10000000
+ #define DATA_SEG_BITS 0x10000000

>I have built two newer versions of gnu emacs and it still happens.  I should
>also mention that these faults happen quite rarely to some users (I may see
>it once a month), while other users see it a few times a week or even daily!

The problem only occurs when a MIPS machine is doing a lot of paging.  Users
who don't cause their workstation to page will not see this problem.
--
Bruce Cole
Computer Sciences Dept.
U. of Wisconsin - Madison

gk5g+@andrew.cmu.edu (Gary Keim) (08/08/90)

Excerpts from netnews.comp.unix.ultrix: 28-Jul-90 Re: emacs or csh
problem? Bruce Cole@dip.cs.wisc.e (2336)

> >   "I think I mentioned that we have hundreds of
> >   people here using emacs and the reported seg fault problem has never been
> >   seen here."
> It is very distressing to hear DEC say this, since I QAR'ed this problem to
> DEC some time ago, and gave them the fixes to the Ultrix kernel problem which
> causes this to occur.

Andrew Toolkit applications suffer from this problem.  Can someone tell
me whether, and in which version of Ultrix, this paging bug has been fixed?
Might it be fixed in 4.0?

Gary Keim
ATK Group