[comp.sys.sgi] Crashing the window mgr from GL programs

slevy@poincare.geom.umn.edu (Stuart Levy) (08/03/90)

We use a locally-written 3-d object viewer on our Irises (personal and GTX).
For some aberrant objects, or possibly some xform matrices pushed on the stack,
we find it causes the window server to crash -- with messages resembling
"timeout: graphics FIFO still > 1/2 full" and/or "window server killed with
signal 15".  In extreme cases it can cause our GTX Iris to lock up such that
we must reboot to recover the graphic display, though normally we're just
kicked back to a login: prompt.

Does anyone know what kinds of geometric data can wedge the graphics subsystem
this way?  If we knew what to avoid we might be able to change our application
to prevent crashes.

	Stuart Levy, Geometry Group, University of Minnesota
	slevy@geom.umn.edu

kurt@cashew.asd.sgi.com (Kurt Akeley) (08/03/90)

In article <1990Aug3.075057.11705@cs.umn.edu>,
slevy@poincare.geom.umn.edu (Stuart Levy) writes:
|> We use a locally-written 3-d object viewer on our Irises (personal and GTX).
|> For some aberrant objects, or possibly some xform matrices pushed on the stack,
|> we find it causes the window server to crash -- with messages resembling
|> "timeout: graphics FIFO still > 1/2 full" and/or "window server killed with
|> signal 15".  In extreme cases it can cause our GTX Iris to lock up such that
|> we must reboot to recover the graphic display, though normally we're just
|> kicked back to a login: prompt.
|> 
|> Does anyone know what kinds of geometric data can wedge the graphics subsystem
|> this way?  If we knew what to avoid we might be able to change our application
|> to prevent crashes.
|> 
|> 	Stuart Levy, Geometry Group, University of Minnesota
|> 	slevy@geom.umn.edu

as you suggest, there may be some geometric data or transformation that
causes the pipe to lock up.  in my experience, however, it is more likely
that you are calling GL routines in an order that is not supported, with
the same result.  a common mistake is to include GL commands other than
c(), color(), cpack(), lmbind(), lmcolor(), lmdef(), n(), RGBcolor(), t(),
or v() between bgnpolygon() and endpolygon() calls (ditto for points, lines,
closedlines, tmeshes, and qstrips).  you might expect, for example, to be
able to change the depthcue parameters within a line or polygon - sorry,
not allowed (a new GL depthcue feature will correct this and other issues
regarding the current depthcue).
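
[Editorial sketch] The rule above can be checked mechanically while
debugging.  The following is only an illustration, not part of GL: the
names chk_begin(), chk_end() and chk_call() are hypothetical, and the
list of permitted call families is copied from the post rather than from
any SGI documentation.  It makes no GL calls, so it compiles on any
machine.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical bookkeeping, not a GL feature: nonzero while the
 * application is between a bgn...() call and the matching end...(). */
static int in_primitive = 0;

/* Call families the post lists as legal inside a bgn/end pair. */
static const char *const allowed_inside[] = {
    "c", "color", "cpack", "lmbind", "lmcolor", "lmdef",
    "n", "RGBcolor", "t", "v"
};

void chk_begin(void) { in_primitive = 1; }
void chk_end(void)   { in_primitive = 0; }

/* Returns 1 if a call from the named family is acceptable in the
 * current state, 0 (after printing a warning) if it is not. */
int chk_call(const char *family)
{
    size_t i;

    assert(family != NULL);
    if (!in_primitive)
        return 1;               /* outside bgn/end nothing is restricted here */
    for (i = 0; i < sizeof allowed_inside / sizeof allowed_inside[0]; i++)
        if (strcmp(family, allowed_inside[i]) == 0)
            return 1;
    fprintf(stderr, "unsupported inside bgn/end: %s\n", family);
    return 0;
}
```

An application would call chk_begin() next to bgnpolygon(), chk_end()
next to endpolygon(), and chk_call("v") or chk_call("depthcue") next to
the corresponding GL call, and would treat a 0 return as a bug to fix
before the call ever reaches the pipe.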

of course the sequence theory makes more sense if the failure is associated
with a viewing mode, rather than with particular data sets.  if failures
can be isolated to a subset of the viewing modes, it might be worthwhile to
review the related code for unsupported GL sequences.

-- kurt

blbates@AERO4.LARC.NASA.GOV ("Brent L. Bates AAD/TAB MS361 x42854") (08/04/90)

    We had a similar problem with a demo Personal Iris.  I was told
it was a problem in the combination of the Personal Iris and the gl
library.  This problem is supposed to be fixed in the 3.3 OS.  In our
case we weren't doing anything wrong or illegal (the programs were
written for a 3130); it was just a problem in the system software.
--

	Brent L. Bates
	NASA-Langley Research Center
	M.S. 361
	Hampton, Virginia  23665-5225
	(804) 864-2854
	E-mail: blbates@aero4.larc.nasa.gov or blbates@aero2.larc.nasa.gov

jim@baroque.Stanford.EDU (James Helman) (08/04/90)

> it might be worthwhile to review the related code for unsupported GL
> sequences.

It would be more worthwhile for SGI to make GL safer to use, even at
some expense in performance.  Having bad sequences lock up the pipe
and bomb you back to login may have been acceptable back when IRISes
had one window, i.e. the screen itself.  But when an easily made error
in the GL program you're debugging causes your entire "desktop" of
networked windows, edits and remote jobs (including, of course, the
window you were debugging in) to go south, it's just plain lousy.
(Actually, when it happens I have a slightly stronger word for it.)

Count one vote for guardrails for the autobahn.

Jim Helman
Department of Applied Physics			Durand 012
Stanford University				FAX: (415) 725-3377
(jim@KAOS.stanford.edu) 			Voice: (415) 723-9127

drb@eecg.toronto.edu (David R. Blythe) (08/04/90)

In article <JIM.90Aug3152228@baroque.Stanford.EDU> jim@baroque.Stanford.EDU (James Helman) writes:
>
>> it might be worthwhile to review the related code for unsupported GL
>> sequences.
>
>It would be more worthwhile for SGI to make GL safer to use, even at
>some expense in performance.
I disagree.  Having a separate checking version of the library (say the
unshared version) would be better.  Then once you have some confidence your
code works you can link with the high performance version which doesn't
do checking.
>
>Jim Helman
>Department of Applied Physics			Durand 012
>Stanford University				FAX: (415) 725-3377
>(jim@KAOS.stanford.edu) 			Voice: (415) 723-9127

	-drb
	drb@clsc.utoronto.ca

jim@baroque.Stanford.EDU (James Helman) (08/05/90)

>>It would be more worthwhile for SGI to make GL safer to use, even at
>>some expense in performance.

> I disagree.  Having a separate checking version of the library (say the
> unshared version) would be better.  Then once you have some confidence your
> code works you can link with the high performance version which doesn't
> do checking.

I can't object to giving the user the freedom to decide the
safety/performance level.  If you *know* your program has no bugs
whatsoever, go for it.

But I don't want it to be a choice between a no-checks-made
"production" version GL which runs at full speed, and a test version
that runs at half speed.

I would argue that safety features which slow performance down
slightly (say less than 10%) should be included standard.  From a
marketing perspective, SGI outclasses other platforms by so much that
10% less performance would be well worth the improved reputation that
would come from more robustness.  And I'm sure a lot of it could be
designed into the hardware with virtually no performance loss at all.

I've seen too many complex programs in which the "unsafe" sequence was
not tickled until demo time.  Presumably, the authors also had "some
confidence" in their code before they put it on display.  But when one
sees a failure, which is indistinguishable from a hardware or system
software problem, it doesn't matter who's to blame.  It makes the
machine look like an unstable platform.  After a non-technical
management type sees a machine apparently crash and burn during a
demo, how much good will it do to explain: ... application bug ... bad
sequence ... locked pipe ... may be bad hardware.... but it's 10%
faster!

When a user program can easily and accidentally bring the entire
windowing system down, in my book, it's a bug.

Speaking of which, SGI's X server is much improved in IRIX 3.3.  Some
bugs remain, but (knock on wood) I haven't experienced a single core
dump.  Hats off to SGI's X team.  When's the next release?

Jim Helman
Department of Applied Physics			Durand 012
Stanford University				FAX: (415) 725-3377
(jim@KAOS.stanford.edu) 			Voice: (415) 723-9127

swed@aerospace.aero.org (Gregory D. Swedberg) (08/06/90)

	I have run into exactly this problem when the wrong type is
passed to a GL function.  The window manager especially hates it when
doubles are passed to routines expecting floats.  The same thing also
happens if a float value is NaN.

balaguer@disuns2.epfl.ch (Jean-Francis Balaguer) (08/29/90)

In article <1990Aug3.075057.11705@cs.umn.edu>, slevy@poincare.geom.umn.edu (Stuart Levy) writes:
> We use a locally-written 3-d object viewer on our Irises (personal and GTX).
> For some aberrant objects, or possibly some xform matrices pushed on the stack,
> we find it causes the window server to crash -- with messages resembling
> "timeout: graphics FIFO still > 1/2 full" and/or "window server killed with
> signal 15".  In extreme cases it can cause our GTX Iris to lock up such that
> we must reboot to recover the graphic display, though normally we're just
> kicked back to a login: prompt.
> 
> Does anyone know what kinds of geometric data can wedge the graphics subsystem
> this way?  If we knew what to avoid we might be able to change our application
> to prevent crashes.
> 
> 	Stuart Levy, Geometry Group, University of Minnesota
> 	slevy@geom.umn.edu


We had the same problem here, not specifically on the Personal Iris but on
every kind of SGI machine.
It came from an accumulation of incorrect GL calls. The most dangerous one
was n3f called with NaN coordinates.

We think SGI should provide a debug version of GL in which every inconsistent
call to the library is trapped; it took us more than 3 days to find the
problem.
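
[Editorial sketch] Until such a library exists, an application can
approximate it for the calls it cares about.  The sketch below is an
assumption-laden illustration, not SGI's design: checked_n3f(), the
N3F() macro and the GL_NO_CHECKS switch are all hypothetical names, and
fake_n3f() stands in for the real n3f() so that the example compiles
without GL.  The wrapper reports the source file and line of a NaN
normal instead of forwarding it.

```c
#include <assert.h>
#include <stdio.h>

/* Stand-in for the real routine so the sketch builds without GL; in an
 * application this would be the library's n3f(). */
int calls_forwarded = 0;
void fake_n3f(const float n[3]) { (void)n; calls_forwarded++; }

/* Forwards the normal and returns 1 if every component is a number.
 * Otherwise prints the call site, forwards nothing, and returns 0. */
int checked_n3f(const float n[3], const char *file, int line)
{
    int i;

    assert(n != NULL && file != NULL);
    for (i = 0; i < 3; i++) {
        if (n[i] != n[i]) {     /* NaN is the only value unequal to itself */
            fprintf(stderr, "%s:%d: n3f component %d is NaN\n",
                    file, line, i);
            return 0;
        }
    }
    fake_n3f(n);
    return 1;
}

/* Build with -DGL_NO_CHECKS (a hypothetical switch) to drop the check. */
#ifdef GL_NO_CHECKS
#define N3F(n) (fake_n3f(n), 1)
#else
#define N3F(n) checked_n3f((n), __FILE__, __LINE__)
#endif
```

Writing N3F(normal) everywhere in place of a direct call keeps the
checked and unchecked builds identical at the source level, which is the
trade-off discussed earlier in the thread.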

----------------------------------------------------------------------------
Francis Balaguer
Departement d'Informatique                      Tel : 41-21-6935244
Laboratoire d'Infographie                       FAX : 41-21-6933909
Ecole Polytechnique Federale de Lausanne
CH-1015 LAUSANNE

E-Mail : balaguer@ligsg2.epfl.CH
         balaguer@elma.epfl.CH
----------------------------------------------------------------------------