[comp.os.vms] resexh - fatal bugcheck

cetron@CS.UTAH.EDU (Edward J Cetron) (10/31/87)

Since I never saw this come back, I will resubmit it (again).
Apologies to anyone who did see it.

	We are suddenly experiencing some bizarre crashes and hung
computers.  A quick synopsis of the configuration:

	3 MicroVAX IIs, one with 13Mb, the others with 9Mb
	All three are in a LAVC, with the boot node having 9Mb
	The boot node has 1 RD54 and 2 RD53s; the others
		each have 3 RD53s.
	All nine disks are served across the entire cluster
	All three MicroVAXes have VAXstation II/GPX upgrades.
	We are running VMS 4.6 and VWS 3.2 with 19 point font support.

OK, when running programs which use the GPX screen heavily (especially for a
clear/erase and repaint operation) the systems fail in one of two ways:

	a fatal bugcheck of RESEXH - resources exhausted, system shutting down

	or a non-fatal bugcheck of INCONSTATE - inconsistent I/O data base
		which fills the errlog file in minutes and renders the system
		almost useless in short order. Any process trying to get to
		the VAA0: (GPX screen) is in a weird state which uses CPU
		time, never does any I/O and can't be stop/id'ed.

I have tried just about everything - I've increased pool (both paged and
nonpaged), I've increased the IRP, SRP, and LRP lists, and I've upped the
number of resource blocks and lock id blocks. I can supply all of the
various values if desired, but in general an analy/system shows the system
running at the limits: 80-92% full on the SRP and IRP lists, 27% utilized on
the LRP list, and between 90-95% utilized when I finally get the resources
exhausted bugcheck.

	The only odd thing about analy/sys or analy/crash is that when I
have SDA do a 'sho res' I get a lot of resource blocks which have a seq number
(which is undocumented) and then the notation 'Not valid' right after the
sequence number.  Also, the ASCII text of a lot of them (there are 300-500 of
them) seems to be disk resource blocks (i.e. F11B$ CEDCAD_SYS, where
cedcad_sys is the vol name of one of our served disks) and some have
'bad' ASCII in them (F11B$s0+.).

Has anyone seen anything like this? Does anyone have ANY ideas? At first I
thought it was obviously the GPX driver/workstation software, but this resource
block stuff seems to be disk server/cluster related. Then again, I do see some
error counts for the VAA0: device, but no entries in the log file.

I'd be happy to supply lots more information, but I don't even have a clue
as to what to look for anymore.  Unfortunately, we moved to VMS 4.6 and
VWS 3.2 at the same time, so I can't really isolate it to one or
the other.

I apologize for the length, but our facility has come to an absolute halt
until we can figure out what the problem is. Any insight, comments, or
suggestions, ANYTHING, will be greatly appreciated.

thanx,

-ed cetron
center for engineering design
univ of utah

cetron@cs.utah.edu
cetron@utahcca.bitnet
801-581-5304 or 801-581-6499





