[comp.windows.x] Running dclock on OpenWindows causing SunOS crash

epstein@trwacs.UUCP (Jeremy Epstein) (05/15/91)

Environment: Sun 4/260 running SunOS 4.1.1 and OpenWindows 2.0

We have a *very* strange problem...one of our users runs "dclock"
(I believe patchlevel 4, but I'm not positive), and that user's
server crashes every day precisely at 1:00pm.  If the user kills
dclock before that time, then the system doesn't crash.  I've
checked, and there are no cron jobs or anything else running.

If I look at the logs in /usr/adm/messages*, some of the days
don't have any crash indications.  Those which do all look the same:
	BAD TRAP
	pid xxxx, `xnews': Data fault
	kernel read fault at addr=0x0, pme=0x70000000
	Bus Error Reg 80<INVALID>
and so on with registers, tracebacks, etc.  We haven't tried this
on any machines other than one 4/260 (we're planning to), so I
suppose it's possible that it could be hardware related.

I haven't yet tried to figure out exactly where it's crashing,
but something is really wrong if xnews (the OpenWindows server)
is causing the system to crash.

Has anyone else seen this problem (or perhaps other programs which
cause OpenWindows to crash SunOS)?  Can anyone reproduce the problem
(if anyone is willing to try :-) )?  Is there any known solution?

We could stop running dclock (that's no big deal), but we're
concerned that there's a lurking bug which could manifest itself
in other ways once the system is out in the field.

Thanks for any help...in the meantime we'll be working on analyzing
the traceback.

--Jeremy
-- 
Jeremy Epstein			UUCP: uunet!trwacs!epstein
Trusted X Research Group	Internet: epstein@trwacs.fp.trw.com
TRW Systems Division		Voice: +1 703/876-8776
Fairfax Virginia

mouse@lightning.mcrcim.mcgill.EDU (der Mouse) (05/16/91)

> Environment: Sun 4/260 running SunOS 4.1.1 and OpenWindows 2.0

> We have a *very* strange problem...one of our users runs "dclock" (I
> believe patchlevel 4, but I'm not positive), and that user's server
> crashes every day precisely at 1:00pm.  If the user kills dclock
> before that time, then the system doesn't crash.  I've checked, and
> there are no cron jobs or anything else running.

> If I look at the logs in /usr/adm/messages*, some of the days don't
> have any crash indications.  Those which do all look the same:
> 	BAD TRAP
> 	pid xxxx, `xnews': Data fault
> 	kernel read fault at addr=0x0, pme=0x70000000
> 	Bus Error Reg 80<INVALID>

This looks singularly like something we observe here.  We have two
SPARCserver 470s running 4.1 (not 4.1.1...yet) and my mterm will, about
one time in ten, produce a strikingly similar panic on startup.
(Always just at startup.  If it survives the first second after the
window comes up, it'll survive fine...until the next mterm starts up.)
Stack traces from adb -k seem to imply that the crash is occurring
somewhere deep inside the scheduler, which, coupled with Sun's choice
to release a binary-only system, isn't much help.

Once, some process running emacs (we have at least two executables
called emacs, though I suspect that in this case it was GNU emacs)
produced a similarly inexplicable crash.

I, too, would be most interested in any solutions or rumors thereof.

					der Mouse

			old: mcgill-vision!mouse
			new: mouse@larry.mcrcim.mcgill.edu

pete@iris49.biosym.COM (Pete Ware) (05/18/91)

Your problem with dclock crashing your X/NeWS server at 1:00 is quite
interesting.  Mostly because I'm running a beta version of SGI's X
server and at exactly 1:00pm every day I get a white band displayed
across the screen.

Looks like I'm going to have to look at dclock instead of just blaming
it on SGI.

--pete

dwig@b11.ingr.com (David Wiggins) (05/23/91)

pete@iris49.biosym.COM (Pete Ware) writes:

>Your problem with dclock crashing your X/NeWS server at 1:00 is quite
>interesting.  Mostly because I'm running a beta version of SGI's X
>server and at exactly 1:00pm every day I get a white band displayed
>across the screen.

>Looks like I'm going to have to look at dclock instead of just blaming
>it on SGI.

Our server also had this problem.  dclock sends a CopyArea with a negative
width (or very large positive, depending on your interpretation), which
apparently confuses a lot of vendors' servers.
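
To make the "negative or very large positive" remark concrete, here is a
minimal sketch of the arithmetic, not dclock's actual code.  It assumes the
client computes the width as a signed int; the core protocol's CopyArea
request carries width and height as 16-bit unsigned fields, so the value is
reduced modulo 65536 on the way to the server.  The names wire_width,
copy_width_ok and self_check are hypothetical, made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: the value a server would receive when a signed
   client-side width is stored in a 16-bit unsigned request field. */
static uint16_t wire_width(int width)
{
    return (uint16_t)width;   /* conversion is modulo 65536 */
}

/* Hypothetical guard a client could apply before issuing the copy. */
static int copy_width_ok(int width)
{
    return width > 0 && width <= UINT16_MAX;
}

/* Self-check of the arithmetic above. */
static void self_check(void)
{
    assert(wire_width(-5) == 65531);  /* 65536 - 5 */
    assert(!copy_width_ok(-5));       /* a negative width is rejected */
    assert(copy_width_ok(40));        /* an ordinary width passes */
}
```

A client that checks its computed width with something like copy_width_ok
before calling XCopyArea would simply skip the bogus copy, and a server that
clips the request to the drawable would be unaffected either way.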

>--pete

David P. Wiggins
Intergraph