[comp.os.vms] LAVC related crashes, HELP?

heselton@admin.okanagan.bcc.CDN (Mike Heselton) (09/29/87)

Having seen very few complaints about LAVC clusters, I am beginning to believe
that we are the only site in the world with this problem.  Can anyone make
us feel a little less alone?

We are running the following (tiny) cluster:

	VAX-11/750	(boot node)		MicroVax II
	- 8mb Memory				- 9mb Memory
	- UDA-50 				- Emulex QD32 disk Controller
	- 2 RA81s				- 2 Fujitsu M2361As
	- DELUA					- DEQNA
	- 1 DMF32				- Emulex TC03 tape Coupler
	- 2 DZ-11s				- Fujitsu M244X tape drive

We are running LAVC under VMS V4.5A and have been experiencing the following
crashes on our MVII.  As far as we can tell, they occur at times when the disk
traffic from the 750 to the MVII gets high.


VAX/VMS System dump analyzer
 
Dump taken on 23-SEP-1987 10:59:45.23
INVEXCEPTN, Exception while above ASTDEL or on interrupt stack

System crash information
------------------------
Time of system crash: 23-SEP-1987 10:59:45.23


Version of system: VAX/VMS VERSION V4.5


VAXcluster node name: OKMV01


Reason for BUGCHECK exception:

       INVEXCEPTN, Exception while above ASTDEL or on interrupt stack


Process currently executing: GOODALL


Current image file: OKCADM$DUA0:[SYS2.SYSCOMMON.][RAF]RAFPC.EXE


Current IPL: 20  (decimal)


General registers:

	R0  = 00000014   R1  = 00140000   R2  = 80036320   R3  = 801D8950
	R4  = 8032F600   R5  = 801CDA40   R6  = 8032F83E   R7  = 00000001
	R8  = 0000A000   R9  = 801DBD50   R10 = 8035E2E0   R11 = 801C9E00
	AP  = 00000000   FP  = 000001CC   SP  = 7FFE7C0C   PC  = 80004862
	PSL = 00140009


Processor registers:		  MicroVAX II


	P0BR   = 80714600     SBR    = 008EA800     ASTLVL = 00000004
	P0LR   = 000006DF     SLR    = 00005280     SISR   = 00000100
	P1BR   = 7FF2B600     PCBB   = 00666878     ICCS   = 00000040
	P1LR   = 001FFA13     SCBB   = 008E7200     SID    = 08000000

	TODR   = 98B4EE16     SYSTYPE= 01010000

	ISP    = 80465A00
	KSP    = 7FFE7C0C
	ESP    = 7FFE9E00
	SSP    = 7FFED032
	USP    = 7FF443FC
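
For anyone reading the dump by hand, the PSL value above can be unpacked
into its fields.  What follows is only a minimal sketch in Python, assuming
the standard VAX PSL layout (condition codes in bits 0-3, IPL in bits 16-20,
previous mode in bits 22-23, current mode in bits 24-25, interrupt-stack
flag in bit 26).  The function name decode_psl is made up for illustration
and is not part of SDA or VMS.

```python
# Sketch only: decode a VAX processor status longword (PSL) as printed by SDA.
# Bit positions are the standard VAX PSL layout; decode_psl is a hypothetical
# helper name, not an SDA or VMS routine.

MODES = ["kernel", "executive", "supervisor", "user"]


def decode_psl(psl):
    """Return the main PSL fields as a dictionary."""
    # Condition codes live in the low four bits: N=bit 3, Z=2, V=1, C=0.
    flags = "".join(
        name
        for bit, name in ((3, "N"), (2, "Z"), (1, "V"), (0, "C"))
        if psl & (1 << bit)
    )
    return {
        "ipl": (psl >> 16) & 0x1F,                  # bits 16-20
        "previous_mode": MODES[(psl >> 22) & 0x3],  # bits 22-23
        "current_mode": MODES[(psl >> 24) & 0x3],   # bits 24-25
        "interrupt_stack": bool(psl & (1 << 26)),   # bit 26 (IS)
        "condition_codes": flags,
    }


# PSL taken from the register dump above.
fields = decode_psl(0x00140009)
for name, value in fields.items():
    print(f"{name:16} {value}")
```

For this dump the sketch gives IPL 20 (matching the "Current IPL" line),
kernel mode, and the interrupt-stack flag clear.  With SP equal to KSP as
well, this looks like the "above ASTDEL" half of the INVEXCEPTN message
(ASTDEL is IPL 2) rather than an exception on the interrupt stack.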


We first noticed these crashes when we installed the cluster and performed
backups of our disks to tape.  On an unloaded system we cannot reliably
perform a BACKUP/BUFFER=5/NOCRC of the RA81s to the tape drive at 6250
bpi; the system crashes as above.  We can perform the same backup at 1600
bpi with no problems.  If we remove the BUFFER=5 we can, for the most part,
make it through the backup without a crash as long as the system is not
completely idle (i.e., I read my morning deluge of INFO-VAX mail; thank
god for the volume of mail).  We have also noticed these crashes at other
times, on and off.

We have 3 MVIIs and all 3 behave the same.  We also have an 11/780 that we
have tried as a boot node for a diskless MVII, but as soon as we get a few
users (students) trying to access the 780's disks we get the identical crash.
We don't believe it could be the FUJI disk drives or controller, and have
a hard time thinking it could be the tape drive or controller, as we have
tried it with a diskless MVII with a TK50 as its only tape drive and it
still crashes with a few students running.

Has anyone out there seen or heard of any similar problems or even solutions?
We are planning to SPR the problem, as we have had it since spring when
we first set up our little cluster, but we thought we would first see if
anyone else had seen it.

Thanks for any help you can give us.


Mike Heselton
Programmer/Analyst
Okanagan College
1000 K.L.O. Road
Kelowna, B.C., Canada
V1Y 4X8

HESELTON@ADMIN.OKANAGAN.BCC.CDN

russell@CINCOM.UMD.EDU ("CHRIS RUSSELL") (10/09/87)

(My apologies for posting, but I haven't gotten the knack of sending
to CSNET yet.)

>Having seen very few complaints about LAVC clusters, I am beginning to believe
>that we are the only site in the world with this problem.  Can anyone make
>us feel a little less alone.
>
>We are running the following (tiny) cluster:
>
> [Description of Cluster]
>
>We are running LAVC under VMS V4.5A and have been experiencing the following
>crashes on our MVII. As far as we can tell, at times when the disk traffic from
>the 750 to the MVII gets high. 

Mike,

	First of all, "NO", you are not the only person struggling with
an LAVC.  I went through Hell and back again to get our 7-satellite
LAVC up and running.  Especially since our boot node is an 11/750 which
is only supposed to handle 5 satellites... :-)

	I don't know if the problems we encountered are causing your
difficulties, but here's a couple of things to look out for that we
ran into:

	1)  The Rev Level on your Microvax DEQNA must be Rev E1 or
		later.  We were running on C2 DEQNAs, and things
		appeared to be working.  However, every once in a
		while, every satellite would lose touch with every
		other satellite, causing about 5 or 6 pages of
		console printout.  This happened several times an
		hour.  Once we upgraded the DEQNAs, the problem
		went away.

	2)  The other thing I would suggest is that you get VMS 4.5C
		from DEC.  That's what we're running.  I really don't
		know the exact differences, but I do know that we are
		running 4.5C and it's running very smoothly now.

	Hope this helps.  

					~chris


-----------------------------------------------------------------------
INFO-VAX: Love it or Leave it - But No More Meta-Discussions!  Please!
-----------------------------------------------------------------------
Christopher Russell              ARPA: SYSMGR@KING.EE.UMD.EDU
Operations Manager               JNET: RUSSELL@UMCINCOM
Computer Aided Design Lab        UUCP: ...!seismo!umcp-cs!eneevax!russell
University of Maryland           FONE: (301)454-8886/454-8950

        "If growing up were fun, I'd have done it already."

-----------------------------------------------------------------------