art@maccs.McMaster.CA (Art Mulder) (04/07/89)
Hi. Help!

I am running a MvaxII, running Ultrix 1.2.

It has been doing a good impression of a yo-yo for the past several
months. In fact since January, it has yet to be up for any 7-day period.
These are not minor crashes either. These are "boot sector of disk munched"
crashes. I think I've reinstalled ultrix on this baby more times in the
past 3 months than anyone should have to.

Hardware service has not been much help.
    CPU board has been replaced
    4mb Ram board has been replaced
    Tk50 drive and controller have been replaced
    disk drive and controller have been replaced
    Mux has been replaced
    Power supply has been tested - fine.
The only thing that has not been replaced is a DiLog CQ1610 16-port mux.

This is the error messages that we had today when it crashed most recently:
    :
    kda500: hard error, ra0a: hard error sn5886 (<- whole bunch of these)
    :
    start=0, len=120, fs=/usr
    panic: allocg: map corrupted
    Syncing disks...

and when I try to reboot it says "Not a directory, boot not found"

I'm really getting tired of this. Any suggestions? please e-mail, thanx.

---------------------------------------------------------------------------
Art Mulder, art@maccs.uucp, ...neat.ai.toronto.edu!maccs!art
art@maccs.DCSS.mcmaster.ca, uwocc1gate%"art@maccs.uucp"
dan@rna.UUCP (Dan Ts'o) (04/09/89)
In article <2355@maccs.McMaster.CA> art@maccs.UUCP (Art Mulder) writes:
) I am running a MvaxII, running Ultrix 1.2
)
) It has been doing a good impression of a yo-yo for the past several
)months. In fact since January, it has yet to be up for any 7-day period.
)These are not minor crashes either. These are "boot sector of disk munched"
)crashes. I think I've reinstalled ultrix on this baby more times in the
)past 3 months than anyone should have to.
)
)Hardware service has not been much help.
) CPU board has been replaced
) 4mb Ram board has been replaced
) Tk50 drive and controller have been replaced
) disk drive and controller have been replaced
) Mux has been replaced
) Power supply has been tested - fine.
)The only thing that has not been replaced is a DiLog CQ1610 16-port mux.
)
)This is the error messages that we had today when it crashed most recently:
) :
) kda500: hard error, ra0a: hard error sn5886 (<- whole bunch of these)
) :
) start=0, len=120, fs=/usr
) panic: allocg: map corrupted
) Syncing disks...
You didn't say what type of disk you are using.
Ultrix 1.2, at least at our site, was a fairly solid release, as long
as you do fairly standard stuff. How do you use your system ? Anything
not straightforward ?
We had a similar situation with a MVAXII, though our crashes were
usually not as disastrous as yours. After several CPU and memory swaps, the
problem went away. Who does your servicing ? Can you be 100% sure that the
"new" boards swapped in are perfect ?
How was the power supply checked ? Just using a multimeter is not
enough. You should get a power supply monitor and run it 24-hours/day. Often
power supply problems are difficult to find but quite common. Are you anywhere
near the power supply capacity of your box (BA23 ? I think 25A or 35A at +5V) ?
Did you try changing boxes or backplanes ? Maybe you have a flaky backplane...
Does every crash result in a "Boot sector of disk munched" ? You should
keep a spare bootable partition or floppy. Is every crash identical ?
If you provide more information, I might be able to help...
Cheers,
Dan Ts'o 212-570-7671
Dept. Neurobiology dan@rna.rockefeller.edu
Rockefeller Univ. ...cmcl2!rna!dan
1230 York Ave. rna!dan@nyu.edu
NY, NY 10021 tso@rockefeller.arpa
tso@rockvax.bitnet
grr@cbmvax.UUCP (George Robbins) (04/10/89)
In article <2355@maccs.McMaster.CA> art@maccs.UUCP (Art Mulder) writes:
>
> It has been doing a good impression of a yo-yo for the past several
> months.
> kda500: hard error, ra0a: hard error sn5886 (<- whole bunch of these)
> start=0, len=120, fs=/usr
> panic: allocg: map corrupted

The hard error is pretty indicative:

1) If the sector number is within the address range of the disk, then your
   disk drive is screwed up and you have to either map out the bad block or
   get a new drive. The hard error says that it could not complete the
   operation, so it either read or wrote trash...

2) If the sector number is outside the address range of the disk, then you
   have corrupted your file structure somehow. Normal causes are overlapping
   partitions, or data structures in memory being corrupted, either by
   hardware or software or a sick disk controller.

Check all the hardware stuff again, also check for dead fans or marginal
power supplies. Run a diagnostic on the disk drive and see if it has
trouble with that bad block. Have a guru look at the data on the disk to
see whether something identifiable left its spoor...

The actual panic message says something about a cylinder group map being
confused, which could point to either memory corruption, or that bad disk
block being in part of the disk structure overhead rather than in a regular
data block.

I had endless grief trying to get Ultrix running on a 785. It turned out
to be a bad memory controller that gradually forgot stuff left lying in
the buffer pool for too long. The results when you finally checked the disk
were truly ugly.

--
George Robbins - now working for,     uucp: {uunet|pyramid|rutgers}!cbmvax!grr
but no way officially representing    arpa: cbmvax!grr@uunet.uu.net
Commodore, Engineering Department     fone: 215-431-9255 (only by moonlite)
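
The two cases described in this reply can be sketched as a simple range
check. This is only an illustration, not anything taken from Ultrix: the
function name classify_hard_error and the disk size are hypothetical, and it
assumes the reported sector number (sn5886 in the log) is an absolute sector
on the drive, which the error message alone does not confirm.

```python
def classify_hard_error(sector: int, disk_sectors: int) -> str:
    """Return which of the two cases a failing sector number falls into."""
    if 0 <= sector < disk_sectors:
        # Case 1: a real location on the disk could not be read or written.
        return "in range: suspect the drive or a bad block"
    # Case 2: the system asked for a sector the disk does not have.
    return "out of range: suspect corrupted file structure, memory, or controller"


# Hypothetical disk size, for illustration only; the thread never states it.
ASSUMED_DISK_SECTORS = 100_000

print(classify_hard_error(5886, ASSUMED_DISK_SECTORS))
```

With the real sector count of the drive substituted for the assumed one, the
same check tells you which list of suspects to start with.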
stefan@wheaton.UUCP (Stefan Brandle ) (04/11/89)
I realize this is unlikely, but you might have your swap area overlaying
your root partition, or something to that effect. DEC suggested that to me
3 years ago when we were having some problems. I didn't have my swap over
any other partitions, but they said they had seen it happen and it produced
wild results.

Stefan Brandle
--
--------------------------------------------------------------------------------
Stefan Brandle                    UUCP: ...!{spl1,obdient}!wheaton!stefan
Wheaton College                   "But I never claimed to be sane!"
---------------------------------------------- MA Bell: (312) 260-4992 ---------
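
A check for a swap area overlaying another partition can be sketched as an
interval-overlap test over the partition table. This is a minimal sketch
under stated assumptions: the Partition type, the find_overlaps function, and
every start and length value below are hypothetical, made up for
illustration, and would have to be replaced by the real table from the
system in question.

```python
from itertools import combinations
from typing import List, NamedTuple, Tuple


class Partition(NamedTuple):
    name: str
    start: int   # first sector of the partition
    length: int  # size in sectors


def find_overlaps(parts: List[Partition]) -> List[Tuple[str, str]]:
    """Return the name pairs of partitions whose sector ranges intersect."""
    clashes = []
    for a, b in combinations(parts, 2):
        # Half-open ranges [start, start + length) intersect when each
        # one begins before the other ends.
        if a.start < b.start + b.length and b.start < a.start + a.length:
            clashes.append((a.name, b.name))
    return clashes


# Hypothetical layout: swap ("b") starts inside root ("a").
layout = [
    Partition("a", 0, 16_000),
    Partition("b", 12_000, 32_000),
    Partition("g", 44_000, 50_000),
]
print(find_overlaps(layout))
```

If the layout includes a partition that deliberately spans the whole disk,
leave it out of the list, since it would clash with every other partition by
design.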