[comp.unix.ultrix] Help with mvaxII/Ultrix crashing irreversibly

art@maccs.McMaster.CA (Art Mulder) (04/07/89)

Hi.  Help!

	I am running a MvaxII, running Ultrix 1.2

	It has been doing a good impression of a yo-yo for the past several
months.  In fact since January, it has yet to be up for any 7-day period.
These are not minor crashes either.  These are "boot sector of disk munched"
crashes.  I think I've reinstalled ultrix on this baby more times in the
past 3 months than anyone should have to.

Hardware service has not been much help.
	CPU board has been replaced
	4mb Ram board has been replaced
	Tk50 drive and controller have been replaced
	disk drive and controller have been replaced
	Mux has been replaced
	Power supply has been tested - fine.
The only thing that has not been replaced is a DiLog CQ1610 16-port mux.

These are the error messages we got today when it crashed most recently:
  :
  kda500: hard error, ra0a: hard error sn5886    (<- whole bunch of these)
  :
  start=0, len=120, fs=/usr
  panic: allocg: map corrupted
  Syncing disks...
  
And when I try to reboot, it says "Not a directory, boot not found".


I'm really getting tired of this.  Any suggestions?
please e-mail, thanx.

---------------------------------------------------------------------------
Art Mulder,     art@maccs.uucp,    ...neat.ai.toronto.edu!maccs!art
                art@maccs.DCSS.mcmaster.ca,   uwocc1gate%"art@maccs.uucp"

dan@rna.UUCP (Dan Ts'o) (04/09/89)

In article <2355@maccs.McMaster.CA> art@maccs.UUCP (Art Mulder) writes:
)	I am running a MvaxII, running Ultrix 1.2
)
)	It has been doing a good impression of a yo-yo for the past several
)months.  In fact since January, it has yet to be up for any 7-day period.
)These are not minor crashes either.  These are "boot sector of disk munched"
)crashes.  I think I've reinstalled ultrix on this baby more times in the
)past 3 months than anyone should have to.
)
)Hardware service has not been much help.
)	CPU board has been replaced
)	4mb Ram board has been replaced
)	Tk50 drive and controller have been replaced
)	disk drive and controller have been replaced
)	Mux has been replaced
)	Power supply has been tested - fine.
)The only thing that has not been replaced is a DiLog CQ1610 16-port mux.
)
)This is the error messages that we had today when it crashed most recently:
)  :
)  kda500: hard error, ra0a: hard error sn5886    (<- whole bunch of these)
)  :
)  start=0, len=120, fs=/usr
)  panic: allocg: map corrupted
)  Syncing disks...

	You didn't say what type of disk you are using.

	Ultrix 1.2, at least at our site, was a fairly solid release, as long
as you do fairly standard stuff. How do you use your system? Anything
not straightforward?

	We had a similar situation with a MVAXII, though our crashes were
usually not as disastrous as yours. After several CPU and memory swaps, the
problem went away. Who does your servicing? Can you be 100% sure that the
"new" boards swapped in are perfect?

	How was the power supply checked? Just using a multimeter is not
enough. You should get a power supply monitor and run it 24 hours a day.
Power supply problems are often difficult to find but quite common. Are you
anywhere near the power supply capacity of your box (BA23? I think 25A or 35A
at +5)? Did you try changing boxes or backplanes? Maybe you have a flaky
backplane...
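
The capacity question above is simple arithmetic: add up what each board draws
on the +5 rail and compare it with the supply rating. A rough sketch in Python,
where the 25A rating is only the unverified figure recalled above and every
per-slot draw is a made-up placeholder to be replaced with numbers from the
actual board documentation:

```python
def plus5_headroom(capacity_amps, draws_amps):
    """Return the amps left on the +5 rail after all boards are counted.

    A negative result means the boards together exceed the supply rating.
    """
    return capacity_amps - sum(draws_amps.values())


# Placeholder values only; none of these come from the thread or a data sheet.
hypothetical_draws = {
    "slot 1": 6.0,
    "slot 2": 3.5,
    "slot 3": 4.0,
    "slot 4": 3.0,
    "slot 5": 3.0,
}

# 25.0 is the lower of the two ratings guessed at above, not a confirmed spec.
spare = plus5_headroom(25.0, hypothetical_draws)
print("spare +5 current: %.1f A" % spare)
```

A small positive margin is still worth worrying about, since a static sum like
this says nothing about sag or noise under load, which is the reason for the
24-hour monitor.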

	Does every crash result in a "Boot sector of disk munched"? You should
keep a spare bootable partition or floppy. Is every crash identical?

	If you provide more information, I might be able to help...

				Cheers,
				Dan Ts'o		212-570-7671
				Dept. Neurobiology	dan@rna.rockefeller.edu
				Rockefeller Univ.	...cmcl2!rna!dan
				1230 York Ave.		rna!dan@nyu.edu
				NY, NY 10021		tso@rockefeller.arpa
							tso@rockvax.bitnet

grr@cbmvax.UUCP (George Robbins) (04/10/89)

In article <2355@maccs.McMaster.CA> art@maccs.UUCP (Art Mulder) writes:
> 
> 	It has been doing a good impression of a yo-yo for the past several
> months.

>   kda500: hard error, ra0a: hard error sn5886    (<- whole bunch of these)
>   start=0, len=120, fs=/usr
>   panic: allocg: map corrupted

The hard error is pretty indicative:

1) If the sector number is within the address range of the disk, then your
   disk drive is screwed up and you have to either map out the bad block
   or get a new drive.  The hard error says that it could not complete the
   operation, so it either read or wrote trash...

2) If the sector number is outside the address range of the disk, then
   you have corrupted your file structure somehow.  Normal causes are
   overlapping partitions, data structures in memory being corrupted -
   either by hardware or software or a sick disk controller.
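
The two cases above come down to a single comparison. A minimal sketch in
Python, assuming you know the size in sectors of whatever unit the reported
number is relative to (partition or whole disk; the error message alone does
not say). The function name and the size used below are hypothetical:

```python
def classify_sector(sector, unit_sectors):
    """Say which of the two cases a failing sector number falls into.

    sector       -- the number from the "hard error sn..." message
    unit_sectors -- size of the partition or disk it is relative to
    """
    if sector < 0:
        raise ValueError("sector number cannot be negative")
    if sector < unit_sectors:
        return "in range: suspect the drive or a bad block"
    return "out of range: suspect a corrupted file structure"


# 5886 is the sector from the posted message; 32768 is a placeholder size,
# not the real size of ra0a, which was never posted.
print(classify_sector(5886, 32768))
```

With the real size from the partition table in place of the placeholder, the
result says which of the two lists of suspects to work through first.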

Check all the hardware stuff again, also check for dead fans or marginal
power supplies.  Run a diagnostic on the disk drive and see if it has
trouble with that bad block.  Have a guru look at the data on the disk
to see whether something identifiable left its spoor...

The actual panic message says something about a cylinder group map being
confused, which could point either to memory corruption, or to that bad disk
block being in part of the disk structure overhead rather than in a
regular data block.

I had endless grief trying to get Ultrix running on a 785; it turned out
to be a bad memory controller that gradually forgot stuff left lying in
the buffer pool for too long.  The results when you finally checked the
disk were truly ugly.

-- 
George Robbins - now working for,	uucp: {uunet|pyramid|rutgers}!cbmvax!grr
but no way officially representing	arpa: cbmvax!grr@uunet.uu.net
Commodore, Engineering Department	fone: 215-431-9255 (only by moonlite)

stefan@wheaton.UUCP (Stefan Brandle ) (04/11/89)

I realize this is unlikely, but you might have your swap area overlaying your
root partition, or something to that effect.  DEC suggested that to me 3 years
ago when we were having some problems. I didn't have my swap over any other
partitions, but they said they had seen it happen and it produced wild
results.
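
An overlap like the one described above can be checked mechanically once the
partition table is written down as start/length pairs. A minimal sketch in
Python; the partition names and numbers are invented for illustration and are
not taken from any real table:

```python
def overlapping(partitions):
    """Return pairs of partition names whose sector ranges overlap.

    partitions maps a name to (start_sector, length_in_sectors).
    Zero-length entries are treated as unused and ignored.
    """
    used = [(name, start, start + length)
            for name, (start, length) in partitions.items() if length > 0]
    used.sort(key=lambda entry: entry[1])  # order by starting sector
    clashes = []
    for i, (name_a, _start_a, end_a) in enumerate(used):
        for name_b, start_b, _end_b in used[i + 1:]:
            if start_b >= end_a:
                break  # sorted by start, so nothing later can overlap either
            clashes.append((name_a, name_b))
    return clashes


# Hypothetical table: swap starts before root ends.
table = {
    "root": (0, 15884),
    "swap": (15000, 33440),
    "usr": (49324, 100000),
}
print(overlapping(table))
```

If the table includes a partition that spans the whole disk, it overlaps
everything by design and should be left out before running a check like this.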

Stefan Brandle
-- 
--------------------------------------------------------------------------------
Stefan Brandle                           UUCP: ...!{spl1,obdient}!wheaton!stefan
Wheaton College                          "But I never claimed to be sane!"
---------------------------------------------- MA Bell: (312) 260-4992 ---------