[unix-pc.general] Fixdisk problems?

john@spirit.UUCP (John F. Godfrey) (02/06/90)

Early last week I installed Fixdisk 2.0 on spirit (3.5mb RAM and a
67mb ST-4096).  I have a DOS-73 installed as well.  Shortly after
installation I received the panic message which will follow.  It took
three reboots to get me up and running again, and shortly thereafter,
it panicked again.  I removed the fixdisk and everything ran (and has
been running) fine, since.

Here is the panic message:
----------------------------------------------------------------------
#WD1010 ST=/Sekg/Err/ EF=/Id?/ cy=710. sc=14. hd=7. dr#=0. MCR2:0x0
#HDERR ST:51 EF:10 CL:C6 CH:2 SN:E SC:2 SDH:27 DMACNT:FFFF DCRREG:9F MCCREG:8300

panic: Hard disk timeout
----------------------------------------------------------------------

Anyone have any idea what this is from and what, if anything, I can do to 
correct it?

Thanks,
John
-- 
John F. Godfrey, Pastor
"Jesus said to him, 'I am the way and the truth and the life.  No man
comes to the Father except through Me'" (John 14:6).
PHONE: 1-(616)-896-8309
NET ADDRESS: john@spirit.UUCP		..sharkey!spirit!john
ATTMAIL: attmail!spirit!john		US DOMAIN spirit.grle.mi.us

jcm@mtune.ATT.COM (John McMillan) (02/07/90)

In article <111@spirit.UUCP> john@spirit.UUCP (John F. Godfrey) writes:
>Early last week I installed Fixdisk 2.0 on spirit (3.5mb RAM and a
>67mb ST-4096).  I have a DOS-73 installed as well.  Shortly after
>installation I received the panic message which will follow.  It took
>three reboots to get me up and running again, and shortly thereafter,
>it panicked again.  I removed the fixdisk and everything ran (and has
>been running) fine, since.
>
>Here is the panic message:
>----------------------------------------------------------------------
>#WD1010 ST=/Sekg/Err/ EF=/Id?/ cy=710. sc=14. hd=7. dr#=0. MCR2:0x0
>#HDERR ST:51 EF:10 CL:C6 CH:2 SN:E SC:2 SDH:27 DMACNT:FFFF DCRREG:9F MCCREG:8300
>
>panic: Hard disk timeout
>----------------------------------------------------------------------
>
>Anyone have any idea what this is from and what, if anything, I can do to 
>correct it?
:
	It could not find a sector-id on the disk.  The only ways
	this can be "fixed" are to re-write the sector-id (aka
	re-FORMAT the disk) or to enter the sector in the
	bad-block list.

	It looks like bad karma: in loading the FixDisk you finally
	USED this bad sector.  If you test this sector with the OLD
	kernel you should get some comparable message/warning:
		dd  < /dev/rfp002 > /dev/null

	ST=/Sekg/Err/	-- While Seeking [a sector] an Error occurred.
	EF=/Id?/	-- An Error occurred: Missing [sector] Id.

	There's no reason the new kernel should exacerbate such
	problems, just a coincidence -- I hope 8-]
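
As a cross-check, the raw #HDERR values appear to encode the same
location as the decoded #WD1010 line.  The following is a minimal
sketch, assuming the usual WD1010-style register layout: CH/CL as the
high and low cylinder bytes, SN as the sector number, the low three
bits of SDH as the head and the next two as the drive, and bit 0x10 of
EF as "ID not found".  That layout is an assumption, not something
stated in this thread.

```shell
#!/bin/sh
# Values copied from the #HDERR line in the post (all hexadecimal).
CL=0xC6 CH=0x2 SN=0xE SDH=0x27 EF=0x10

# ASSUMED register layout (WD1010-style task file), not confirmed here:
#   cylinder = CH:CL, sector = SN,
#   SDH bits 0-2 = head, SDH bits 3-4 = drive,
#   EF bit 0x10 = sector ID not found.
cyl=$(( ($CH << 8) | $CL ))
sec=$(( $SN ))
hd=$(( $SDH & 0x07 ))
drive=$(( ($SDH >> 3) & 0x03 ))
idnf=$(( ($EF & 0x10) != 0 ))

echo "cy=$cyl sc=$sec hd=$hd dr=$drive id_not_found=$idnf"
```

Under those assumptions this prints
cy=710 sc=14 hd=7 dr=0 id_not_found=1, which agrees with the
"cy=710. sc=14. hd=7. dr#=0." fields and with the EF=/Id?/ reading.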

john mcmillan -- att!mtune!jcm

pat@rwing.UUCP (Pat Myrto) (02/08/90)

In article <111@spirit.UUCP>, john@spirit.UUCP (John F. Godfrey) writes:
>	... [ edited to reduce length ] ...
> Early last week I installed Fixdisk 2.0 on spirit ... [with a] 67mb
> ST-4096 and DOS-73...  Shortly after installation I received the panic
> message which will follow... [after reboot] ... it panicked again.
> 
> Here is the panic message:
> ----------------------------------------------------------------------
> #WD1010 ST=/Sekg/Err/ EF=/Id?/ cy=710. sc=14. hd=7. dr#=0. MCR2:0x0
> #HDERR ST:51 EF:10 CL:C6 CH:2 SN:E SC:2 SDH:27 DMACNT:FFFF DCRREG:9F
> MCCREG:8300
> 
> panic: Hard disk timeout
> ----------------------------------------------------------------------

It's hard to say - I have seen that sort of panic before, but only
once.  It sounds like the drive wasn't seeking - as if the seek
mechanism was jammed, or something.  I had it happen with an ST-251,
and rebooting didn't help until the power was cycled - from the sounds
it made, that sort of "kicked" it loose.  It is possible your problems
are of a similar nature.  Even though things appeared to be fixed when
you changed back to the old kernel, it's still possible that this was
a coincidence, and that the operations, reboot cycles, etc. performed
while you restored the old kernel were what restored sanity.

I have also installed the new fixdisk, and it ran fine for over a
week, until today, when I got a "kernel parity" panic.  I didn't copy
the message down, but it mentioned a disk parity error (though nothing
was in unix.log).  I am convinced that things such as this do
occasionally happen.  If it happens again, with the same problem, then
I will be concerned.  Obviously things are running fine now, as the
system involved is the one I am typing this prose on.

Once I had an entry appear in unix.log where it couldn't read head 0,
sector 0, cylinder 0, and bailed out with a "drive not ready" error -
if for real, a very grave symptom.  However, this was months ago, and
after rebooting, it hasn't happened since.

I installed the fixdisk selectively, instead of using the provided
Install script (because I no longer use some of the stuff in the
FIXDISK pkg, and because I tend to be leery of Install scripts in
general, especially ones that do such sweeping things as the FIXDISK
one must do).

Following is what *I* would do, if I were in the same situation.  I
am probably going into excess detail, but in this case that might be
preferable to assuming too much.  The procedure I used for
installing the FIXDISK worked for me, and this is being written in
good faith, but since I have no control over how this will be read or
interpreted, *YOU ARE ON YOUR OWN*.  NO CLAIMS ARE MADE AS TO THIS
BEING CORRECT OR BEING FREE OF LOGICAL OR TYPOGRAPHICAL ERRORS, OR
BEING WITHOUT CRITICAL OMISSIONS.

Before writing off the FIXDISK2.0, I would suggest re-trying the
FIXDISK (a different copy of it, if it was a downloaded copy), and
installing it BY HAND, rather than using the Install script - this
allows one to selectively install fixes, and to do it in stages, as I
suggest below, starting with the kernel, which provides most of the
major fixes, other than the uucico (uucico not being relevant if HDB is
installed), and the fix for the occasional corrupted /etc/utmp file.

I suggest you try unarchiving the fixdisk into a work subdir.  It's a
cpio archive, and assuming FIXDISK2.0+IN is in the parent subdir, the
command ``cpio -iBcdm <../FIXDISK2.0+IN'', run as root in an empty
subdir, will extract the contents, preserving the original dates,
perms, and ownership of the files.  If it's on the floppies, replace
the "../FIXDISK2.0+IN" with "/dev/rfp021".

In the subdir 'kernel', unpack the kernel file (``unpack UNIX3.51m'')
and then copy the new kernel to /UNIX3.51m.  Verify the permissions
are at least 754, owner/group root/sys (depending on how things are
set up, you may need to have world read perms on the kernel).  Follow
with ``mv /unix /unix.old'' (to preserve the old kernel, in case the
UNIX3.5?  link isn't there) and then do ``ln /UNIX3.51m /unix''.  Once
the above steps are done and checked for correctness, do a normal
shutdown and reboot.

If the system comes up OK, and gets past the time interval where you
originally experienced the problems, then I would try replacing
/etc/lddrv/wind.o, /etc/init, /bin/login, and /bin/getty, etc.,
MANUALLY, BY HAND, with the files provided in the kernel, utmp, and
other subdirs, preserving the original versions as /bin/login.old,
/etc/lddrv/wind.o.old, etc.  You can inspect the Install script for
the proper permissions and owner/group to use on each file (most will
be owner=bin, group=bin).  Be sure, after the new init is copied in,
to rm /bin/telinit and then do ``ln /etc/init /bin/telinit'' (some
stuff does look for /bin/telinit, possibly even during the reboot
sequence).  After verifying everything is right, again do the
shutdown and reboot.  If the panics happen again, I have no
suggestions.
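
The kernel-only first stage described above can be sketched as a small
script.  This is only a dry-run outline under stated assumptions, not a
tested installer: the archive name (FIXDISK2.0+IN), the work directory
name (fixdisk.work), the step/stage1 helpers, and the FIXDISK_RUN
switch are all hypothetical names chosen for illustration.  By default
it only prints each command; nothing is executed unless FIXDISK_RUN=1
is set, and then it must be run as root.

```shell
#!/bin/sh
# Stage 1 of the by-hand install: new kernel only.
# Dry run by default: each step is printed, not executed.
ARCHIVE=${ARCHIVE:-../FIXDISK2.0+IN}  # assumed name; /dev/rfp021 for floppies
WORK=${WORK:-fixdisk.work}            # hypothetical empty work directory

step() {
    echo "+ $*"
    [ "${FIXDISK_RUN:-0}" = 1 ] || return 0   # dry run: stop after printing
    sh -c "$*"                                # real run: execute the step
}

stage1() {
    # Extract the cpio archive into the empty work directory.
    step "mkdir $WORK && cd $WORK && cpio -iBcdm < $ARCHIVE" &&
    # Unpack the kernel and copy it into place under its own name.
    step "cd $WORK/kernel && unpack UNIX3.51m" &&
    step "cp $WORK/kernel/UNIX3.51m /UNIX3.51m" &&
    step "chmod 754 /UNIX3.51m && chown root /UNIX3.51m && chgrp sys /UNIX3.51m" &&
    # Keep the old kernel, then link the new one in as /unix.
    step "mv /unix /unix.old" &&
    step "ln /UNIX3.51m /unix"
}

stage1
```

The steps are chained with && so that a real run stops at the first
failure, leaving /unix untouched if the extraction or copy went wrong.
Check the printed commands against the Install script before running
anything for real, then shut down and reboot as described.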

Perhaps someone can answer - does 3.51 require a new format on the
drive that had previously been formatted with, say, 3.0 or 3.5?

As I said, your mileage may vary, but good luck - just proceed slowly
and carefully.

-- 
pat@rwing                                       (Pat Myrto),  Seattle, WA
                            ...!uunet!pilchuck!rwing!pat
      ...!uw-beaver!uw-entropy!dataio!/
WISDOM:    "Travelling unarmed is like boating without a life jacket"

wtm@neoucom.UUCP (Bill Mayhew) (02/10/90)

After having read Frank's article, I thought I'd try
dd </dev/fp002 >/dev/null to see if I might be able to catch some
bad blocks on my disk (yes, I know, I could use the diagnostics
disk).  I ran the command in the background as root, and then
returned to ksh to do some other work such as reading mail.

A few minutes later, I got a page fault from the 3.51m kernel.  I
wasn't doing anything extraordinary that I haven't done many times
before, other than the dd command.

I have a very plain machine with 2 meg RAM, a Miniscribe 6085, and
the stock disk controller.  No 2010s or hardware modifications.  I
was in ksh, and the metermaid display was on.  Just before the
crash, metermaid looked normal:  %serial buffers ~100%, %clists
~90%, %ram pages ~50%, %CPU idle ~0%, %CPU user ~40%, %CPU kernel
~60%, %CPU wait ~0%, printer selected and no errors.

This is what the panic message looked like.  The panic scribbled
over the metermaid and some of the other stuff on the screen, so it
was a bit tough to read.  The crash seems somewhat elusive: I tried
to duplicate the conditions again and have not been able to
reproduce it.

type = 0x02, pid = 17144, pc = 0x6C09, rps = 0x2000, .... 0x4BB5C
GSR = 8D00, BSR0 = 7C07, BSR1 = 2400 PHYSPF = 0
D0 = ff, D1 = 6030, D2 = 301, D3 = 5
D4 = 52, D5 = 400, D6 = C000, D7 = 400
A0 = 30300, A1 = 72400, A2 = 4BB5C, A3 = 70E08
A4 = 70884, A5 = 413B8, A6 = 70820, userA7=2FF138/kernA7=707C4
KI-RAM@6BC:000422D8 51C8FFFC 4E75227C 004000E0
KS-RAM@707C4:
 000101C6 000302FC 00072400 00000400 000270EC 00000000 00000034 08000007
 0B560000 13312002 00051052 000001F4 00400602 00000000 00000001 FFFB7C30


panic: page fault in kernel


I'm not sure what goes in the space where the .... is above.  The
display was too cluttered to make it out.  Also the last number on
the first line might be 0x4BB50 instead of 0x4BB5C.

Anybody have any ideas?


Bill Mayhew  (wtm@neoucom.edu)
North Eastern Ohio Universities College of Medicine
Rootstown, OH  44272-9995       ph:  216-235-2511