[comp.sys.att] possible head crash

gst@spdcc.COM (Gary Trujillo) (01/05/90)

Well, it may have finally happened.  Last night before going to bed, I
snapped the keyboard back onto its retaining posts as usual - and the
screen suddenly went to the pattern of all dots, though the phone,
window, and screen managers all continued to run.  The mouse moved the
arrow cursor, but clicking did nothing, and neither did the keyboard.

I was effectively "dead in the water."  I tried rebooting, both with the
reset button, and with a power cycle.  All I got was repeated little
squares in the upper left-corner of the screen, and sounds of the disk
attempting to recal.

The floppy diagnostics produced:

  Test: Hard disk test (Drive 0)
  Subtest: Read all the disk.
  Pass: 4
  Error: WINCHESTER: Error on check - Read Response=40, Start Block=5080

I disassembled the machine completely, to check all the connectors, etc.
I removed the hard disk from the chassis.  A visual inspection revealed
nothing.  (I'm not claiming taking the machine apart was logical, but hey,
it was 6 AM, and I may not have been thinking clearly.)

After reassembly, the diagnostics produced the same result.  BTW, the
random seek test went OK.

I was able to boot UNIX from the floppy and mount /dev/fp002.  The file
system looks OK, but when I tried doing an fsck, I got:

  CAN NOT READ BLOCK 2 (or some such)

I suspect what happened is that there was a head crash with the heads
positioned over track zero.  Now that I think about it, it never was
a good idea to insist on having the keyboard locked in place while
I sleep.

Is it curtains for my 67meg Miniscribe?  (I think I may install the WD2010
while I have the machine open, though I don't harbor any illusions
about that solving anything.)

If I do lose the disk, I'm gonna have a heck of a time getting stuff off
it, since I know of no way to write to the floppy while running
single-user UNIX off that same floppy.  I have a spare machine with
a 10 meg drive, and may be able to pump stuff through the serial ports,
but it's not going to be fun. :-(

Any ideas, sympathy, etc.?

Please respond via email to:

	gst@ursa-major.spdcc.com or gst@wjh12.harvard.edu

gst@spdcc.COM (Gary Trujillo) (01/05/90)

In article <1108@ursa-major.SPDCC.COM> gst@ursa-major.spdcc.COM
	(Gary Trujillo [me]) writes:
> Well, it may have finally happened...
> [details of possible head crash deleted]

Here's an update on the situation I described recently.

Sorry to burden you with my continued saga of woe, but (heavens
forfend) something like it just might happen to *you* someday,
so you may want to pay attention. :-)  I seem to be able to
access a couple of other machines where I can get at mail and
news using the 7300 I haven't touched in a couple of years.  I
think it would be OK to post any suggestions as followups to my
articles, as it appears I'll be able to read the UNIXpc news groups
reliably via this method.

Fortunately, I have backups of the important stuff that is recent
enough that there's not a lot I'd need to get off the disk in case
it turns out to be unsalvageable.

Well, I thought I might be able to get stuff off the disk by creating
compressed tar or cpio archives in /tmp (where I have > 10K blocks),
and pulling it over to the other machine with kermit.  The problem
turns out to be that I can't write to the disk, it seems.  I've been
able to rename a file, which requires a write to a directory block.
(I did a sync afterwards, just to make sure it was really written,
and not just cached for write.) However, I can't even cp a file: I
hear sounds of attempted recals, a number of clicks, indicating write
attempts, I assume, and then a message from the cp command saying the
copy failed.

So, my new questions are as follows:

1. What's down there at block 2, which fsck reports it gets a read
   error on?  I would have assumed that the bootstrap loader (or
   whatever it's called) lives there.

2. What are the chances I might be able to re-write the boot loader
   with the "ldrcpy" utility, which is in the /etc directory on the
   "floppy file system disk" (3 of 12)?  There are some lines in
   /etc/profile  on this floppy which say:

	# Copy the loader from the floppy to the hard disk
	echo "\nCopying the loader onto the hard disk ....\n"
	ldrcpy /dev/rfp020 /dev/rfp000

   I have my doubts that I'd be able to write *anything* to the boot
   sector of the hard disk, but I'm wondering whether ldrcpy knows
   how much to write, or it assumes the next thing you're going to
   do is an /etc/mkfs (since that's the next thing that happens in the
   /etc/profile script), so it can write as much as it wants, and
   needn't worry about smashing the superblock of /dev/fp002.

   (I tried "ldrcpy /dev/rfp020 tmpfile" - but it seems to require
    a special-device name as its second argument, so I can't seem
    to fake it out to see how much it wants to write.)

3. It turns out the file system is *not* completely OK.  I just ran
   an "ls -R /" for the bittersweet fun of it, and found there were
   several files (~ 10) which were reported "not found," amid the
   buzzing of recal attempts.  I assume the inodes of these files
   are inaccessible due to the damage which I now feel justified
   in imagining has taken place.

4. Speaking of damage, I suddenly realized that I might be wise to
   not run the unit too long, since, if there was a head crash, there's
   probably oxide flying around inside, which is not good on heads
   which are designed to float really, really close to the surface of
   the disk platters.

5. Here's a puzzler for you all-- given the conditions I've described,
   how can I get files off the machine? Here's what I have:

   A. A mostly-readable hard drive which I can't run a multi-user
      system from at present; the only way I can run UNIX is from
      the (writable) floppy filesystem disk I made quite some time
      ago in anticipation of major problems like the present ones.
      I cannot boot from the hard drive even starting from the boot
      floppy (I tried).  I can't seem to write to the drive.

   B. A 7300 with a 10meg drive (mostly filled, but I can probably
      get ~4K blocks free if I work at it).

6. Can anyone recommend a good replacement unit, if it comes to that?
   I have the WD2010, so I can run a drive with (what is it, John
   Milton? - 1400 cylinders?).  I don't have the P5.1 upgrade, but
   I'm sure I could get it easily enough.

7. Is there any point in attempting to reformat the drive, once I've
   gotten as much as I can from it, do you think, or would you just
   leave it in its current state.  (I know it's really up to me, but
   I'd sort of like to know what you'd do and why.)

8. What do you guess are my chances of saving the drive?

Well, that's enough questions for one night!  If anyone wants to
send answers via email, I'll summarize what I get.  Otherwise, please
post your replies.

Thanks to everyone!  (And I hope your new years are happier than mine.)

	Gary

jcm@mtune.ATT.COM (John Mcmillan) (01/06/90)

In article <1134@ursa-major.SPDCC.COM> gst@ursa-major.spdcc.COM (Gary S. Trujillo) writes:
  >Here's an update on the situation I described recently.
:
  >1. What's down there at block 2, which fsck reports it gets a read
  >   error on?  I would have assumed that the bootstrap loader (or
  >   whatever it's called) lives there.

        NOPE!  The bootstrap loader is in the low parts of /dev/rfp000.
        You should be running FSCK on /dev/[r]fp002.

        Logical (1K-byte) BLOCK#2 is the 1st INODE BLOCK.
                /dev/rfp002-  LOGBLK#0 = 1st 1/2 empty, 2nd 1/2==FS SuperBlock
                /dev/rfp002-  LOGBLK#1 = reserved
                /dev/rfp002-  LOGBLK#2 = 1st INODE BLOCK

        Following is A SAMPLE of the use of the 1st INODE BLOCK
        -- my /dev/rfp002
          ("ncheck -i 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 /dev/rfp002"):

        inode
        disk    inode
        index   number  file 'name'
                        ------------------ (1K-byte) block  #2
                        --------- (512-byte) block  #4
        0       -       <reserved>
        1       2       /.
        2       3       /bin/.
        3       4       /usr/lib/terminfo/d/dw4
        4       5       /usr/lib/terminfo/h/.
        5       6       /usr/lib/terminfo/h/hp262x
        6       7       /etc/.
        7       8       /usr/lib/terminfo/i/.

                        --------- (512-byte) block  #5
        8       9       /usr/lib/terminfo/i/intext
        9       10      /etc/lddrv/unix.sym
        10      11      /tmp/EXPORTS
        11      12      /dev/.
        12      13      /usr/lib/terminfo/i/intext2
                        /usr/lib/terminfo/i/intextii
        13      14      /usr/lib/terminfo/l/.
        14      15      /usr/lib/terminfo/l/lp
                        /usr/lib/terminfo/l/lpr
                        /usr/lib/terminfo/p/print
                        /usr/lib/terminfo/p/printer
                        /usr/lib/terminfo/p/printing
        15      16      /lib/.
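The table implies a simple mapping from inode number to disk block: eight inodes per 512-byte block, with inode 1 occupying the reserved first slot of logical block 2. A minimal sketch of that arithmetic follows, assuming 64-byte on-disk inodes (inferred from eight entries per 512-byte block, not stated in the post). The helper name `inode_to_block` is hypothetical, and the `$(( ))` arithmetic needs a POSIX shell rather than the Bourne shell of the period.

```shell
#!/bin/sh
# Assumptions (inferred from the ncheck table, not stated outright):
#   - on-disk inodes are 64 bytes, so 8 fit in a 512-byte block
#   - the inode list starts at logical (1K-byte) block 2
INODE_SIZE=64
FIRST_INODE_BLOCK=2   # in 1K-byte logical blocks

# inode_to_block INUM [BLOCKSIZE]
# Print the block (in units of BLOCKSIZE, default 1024) that holds inode INUM.
inode_to_block() {
    inum=$1
    bsize=${2:-1024}
    per_block=$((bsize / INODE_SIZE))
    first=$((FIRST_INODE_BLOCK * 1024 / bsize))
    # Inode numbers start at 1, so inode N sits at disk index N-1.
    echo $((first + (inum - 1) / per_block))
}
```

For example, `inode_to_block 9 512` lands in 512-byte block 5, matching the second half of the table, while inode 17 would be the first entry of logical block 3.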

  >2. What are the chances I might be able to re-write the boot loader
  >   with the "ldrcpy" utility, which is in the /etc directory on the
  >   "floppy file system disk" (3 of 12)?  There are some lines in
  >   /etc/profile  on this floppy which say:
:
        STOP!  Why are you doing this?!?!?!?  There must be more
        pleasurable ways of inflicting pain on yourself!?

        1) You SEEM to have an INTERMITTENT READ FAILURE.
                a) You CAN mount & LS the File-System, therefore
                        LOGICAL-BLOCK#2 & inode#2 CAN be read
                        some of the time.
                b) Yet you cannot reliably FSCK it.
                [ Note:
                        i)   Are you doing: fsck ... /dev/rfp002?
                                                          ^^^^^^
                                This MAY skip the re-try mechanism.
                        ii)  If so, "fsck ... /dev/fp002;reboot"
                                                   ^^^^^
                                might at least 'work' -- but it
                                is unlikely to FIX the problem.
                ]

        2) I didn't see anything that suggested the LOADER was
                damaged.  More likely, it just can't cope with
                intermittent read errors in trying to locate "/unix".

  >3. It turns out the file system is *not* completely OK.  I just ran
  >   an "ls -R /" for the bittersweet fun of it, and found there were
  >   several files (~ 10) which were reported "not found," amid the
  >   buzzing of recal attempts.  I assume the inodes of these files
  >   are inaccessible due to the damage which I now feel justified
  >   in imagining has taken place.

        Right.
        "ls /" just needs to read the DIRECTORY entries for inode#2.
        "ls -R /" or "ls -l /" needs to examine the INODE entry
                for each DIRECTORY entry.

  >4. Speaking of damage, I suddenly realized that I might be wise to
  >   not run the unit too long, since, if there was a head crash, there's
  >   probably oxide flying around inside, which is not good on heads
  >   which are designed to float really, really close to the surface of
  >   the disk platters.

        You probably DID NOT suffer a head crash: you probably vibrated
        the head to an unacceptable clearance while it was writing --
        thus making the signal inadequate.

  >5. Here's a puzzler for you all-- given the conditions I've described,
  >   how can I get files off the machine? Here's what I have:
  >
  >   A. A mostly-readable hard drive which I can't run a multi-user
        [^^^^^^^^^^^^^^^ -- an unproven statement as I read your notes.]
  >      system from at present; the only way I can run UNIX is from
  >      the (writable) floppy filesystem disk I made quite some time
  >      ago in anticipation of major problems like the present ones.
  >      I cannot boot from the hard drive even starting from the boot
  >      floppy (I tried).  I can't seem to write to the drive.
  >
  >   B. A 7300 with a 10meg drive (mostly filled, but I can probably
  >      get ~4K blocks free if I work at it).

If you can put a compiler thereon, there's a great deal of test-codes
you can build IF it comes down to that... but that's for LATER on.
(You do NOT need libraries... just /lib/crts0.o and the '/lib/*ifile*'
stuff, I believe.)

:
  >7. Is there any point in attempting to reformat the drive, once I've
  >   gotten as much as I can from it, do you think, or would you just
  >   leave it in its current state.  (I know it's really up to me, but
  >   I'd sort of like to know what you'd do and why.)

Yes: reformatting is likely to recover the disk.  But WAIT, unless you
WANT to discard your data.

  >8. What do you guess are my chances of saving the drive?

Pretty good -- if you'll SLOW DOWN !-)
        (Amongst other things: [almost] NEVER try to construct
        files on a damaged system -- until FSCK has successfully run.)

I'd start with verifying your FSCK data:
^^^
        umount /dev/fp002 ; dd bs=1024 count=64 < /dev/fp002 > /dev/null
                                                       ^^^^^
                I'd expect the above to either succeed -- given the
                apparently intermittent nature of the problem --
                or to fail with a note that 2 blocks were copied.
                                            ^

                If the above succeeds -- but with recalibration noises
                -- I'd be tempted to try a correction-by-rewriting.
                   ^^^
                (I'm tired of posting just "why" this often works.)

        umount /dev/fp002 ; dd bs=1024 count=64 < /dev/fp002 > /dev/fp002
                                                       ^^^^^        ^^^^^

                If THIS works... it's time to try the FSCK again!

If the above DOESN'T WORK, or there are other issues, E-mail me.
There are too many scenarios to describe.  (Before contacting me,
run "dd bs=512 count=128 < /dev/rfp002 > /dev/null" and inform me of results.
                                ^
Also, verify you specifically followed "/dev/fp00x" and "/dev/rfp00x" details!)
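The dd read tests above stop at the first error, so they say only that something in the range is bad. One way to narrow that down is to read one block at a time and note which reads fail. This is only a sketch: `scan_blocks` is a hypothetical helper, it assumes a POSIX shell and a dd that exits non-zero on a read error, and on the block device each probe may trigger the driver's retries (and the recal noises that go with them).

```shell
#!/bin/sh
# scan_blocks DEVICE COUNT [BLOCKSIZE]
# Read COUNT blocks from DEVICE one at a time (default 1024 bytes each)
# and print the index of every block whose read fails.
scan_blocks() {
    dev=$1
    count=$2
    bs=${3:-1024}
    i=0
    while [ "$i" -lt "$count" ]; do
        # Read-only probe: data goes to /dev/null, dd's chatter is discarded.
        if ! dd if="$dev" of=/dev/null bs="$bs" skip="$i" count=1 2>/dev/null; then
            echo "$i"
        fi
        i=$((i + 1))
    done
}

# Example, with the file system unmounted first (device name as in the thread):
#   scan_blocks /dev/fp002 64
```

The probe never writes, so it should be safe to run before deciding on any rewrite. Each failing index is in units of the block size given.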

  :

john mcmillan -- att!mtune!jcm -- Speaking for SELF, not AT&T

gst@spdcc.COM (Gary Trujillo) (01/06/90)

Well, it looks as if I'm well on my way to solving my own problem.
I haven't heard from anyone yet, but I suspect that messages are
even now winging their way toward me regarding my problems cleaning
up after a hard disk failure.

I recalled from my PDP-11 UNIX days being able to fork a shell on
a serial port from a single-user console shell, and succeeded on
my 3B1 with the command:

	/mnt/bin/ksh </dev/tty000 >/dev/tty000 2>&1&

That allowed me to talk to the machine via a cable between it and my
spare (a 7300).  [The only problem is that I can't seem to set the
"intr" function on the port, so I can't break out of a command. :-(]

I discovered an "-s -" option in kermit that allows it to send from
standard input, rather than a file, so I can use:

	tar cf - * | compress | kermit -s -

Unfortunately, this version of kermit only supports speeds up to
9600 baud, whereas the ports will go up to 19.2K baud.  Sigh.

Oh well, at least I'm able to recover stuff.
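Before trusting a streamed archive over a slow serial line, the same pack-and-list round trip can be rehearsed locally. The sketch below is not the command used above: it substitutes gzip for compress (an assumption, since compress is often missing on current systems) and leaves kermit out entirely. `pack_tree` and `list_archive` are hypothetical names.

```shell
#!/bin/sh
# pack_tree DIR
# Write a compressed tar archive of DIR to standard output,
# with paths stored relative to DIR.
pack_tree() {
    (cd "$1" && tar cf - .) | gzip -c
}

# list_archive
# Read a compressed tar archive on standard input and list its contents.
list_archive() {
    gzip -dc | tar tf -
}

# Example round trip, no serial line involved:
#   pack_tree /some/dir | list_archive
```

If the listing comes back complete, the sending half of the pipeline is sound, and any trouble afterwards is more likely on the transfer side.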

BTW, I came across a message from John B. Milton that he posted
back in April (which I'd cleverly written on my handmade floppy
file system disk) where he describes how to fix up /etc/inittab
to free up the floppy drive.  However, I think it relies on being
able to boot from the hard disk (right, jbm?):

| From: jbm@uncle.UUCP (John B. Milton)
| Newsgroups: comp.sys.att,unix-pc.general
| Subject: Re: DISK CRASH! (quick boot info)
| Message-ID: <513@uncle.UUCP>
| Date: 20 Apr 89 06:12:05 GMT
| References: <8904181249.AA22934@zorch.UU.NET>
| Organization: U.N.C.L.E.
| 
| In article <8904181249.AA22934@zorch.UU.NET> scott@zorch.UU.NET
|		(Scott Hazen Mueller) writes: (for David Melman)
| ...
| >It will not boot off the hard disk, but will boot off the floppy.
| >The injured disk can be mounted, and some files can be read.
| >
| >How can the system allow the removal of the boot floppy (and stay up)
| >so the readable files off the hard disk can be backed up to the floppy
| >drive?
| 
| ...
| When booted from the floppy, cd to /mnt/etc and add these lines:
| is:1:initdefault:
| cn:1:respawn:/bin/sh >/dev/console </dev/console 2>&1
| and add a ":" to the beginning of these two lines:
| is:2:initdefault:
| rc::bootwait:/etc/rc > /dev/null 2>&1
| 
| Yes, you will have to use vi in open mode!
| 
| Save out and try to reboot to the hard disk. You should get a # prompt about
| 5 seconds after the "... Main board is ..." stuff. No window driver, no disk
| check. This should free up the floppy drive so you can get stuff off.
| 
| John
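The inittab changes John describes can also be written as a non-interactive edit. This is a sketch only: it assumes sed is available on whatever system is doing the editing (the floppy file system may not carry it, hence the remark about vi in open mode), and `patch_inittab` is a hypothetical name.

```shell
#!/bin/sh
# patch_inittab FILE
# Disable the two original entries by prefixing them with ":" and
# append the single-user console entries from the quoted message.
patch_inittab() {
    f=$1
    sed -e 's/^is:2:initdefault:/:&/' \
        -e 's/^rc::bootwait:/:&/' "$f" > "$f.new" &&
    printf '%s\n' \
        'is:1:initdefault:' \
        'cn:1:respawn:/bin/sh >/dev/console </dev/console 2>&1' >> "$f.new" &&
    mv "$f.new" "$f"
}

# Example, with the hard disk mounted on /mnt as in the message:
#   patch_inittab /mnt/etc/inittab
```

Like the manual edit, this only helps if the hard disk accepts writes, which is exactly what was failing here.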

Anyway, it appears things are gradually coming under control, even if
my methods seem a bit primitive (I'm still open to better suggestions,
though).

I'd still like ideas on a good replacement drive, if I'm unable to
reformat the Miniscribe.  Thanks!

	Gary

jbm@uncle.UUCP (John B. Milton) (01/11/90)

In article <1108@ursa-major.SPDCC.COM> gst@ursa-major.spdcc.COM (Gary Trujillo) writes:
[bad news]
>  Test: Hard disk test (Drive 0)
>  Subtest: Read all the disk.
>  Pass: 4
>  Error: WINCHESTER: Error on check - Read Response=40, Start Block=5080
Partition 0: start Track=0, size (in Blocks)=72
Partition 1: start Track=9, size (in Blocks)=5000
Partition 2: start Track=634, size (in Blocks)=60952

These are the sizes of the partitions on my disk, which seem to match yours.
The diagnostics say "Start Block=5080", and 5080-(72+5000)=8, as in the 8th
logical (1k) block into the file system partition.
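The subtraction above generalizes to any block number the diagnostics report. Here is a small sketch, assuming the partitions are laid out back to back in the order listed and that the diagnostics count blocks from the start of the disk in the same units as the partition sizes. `locate_block` is a hypothetical helper and needs a POSIX shell.

```shell
#!/bin/sh
# locate_block ABSBLOCK SIZE0 SIZE1 ...
# Given an absolute block number and the partition sizes in order,
# print which partition the block falls in and its offset there.
locate_block() {
    abs=$1
    shift
    part=0
    start=0
    for size in "$@"; do
        if [ "$abs" -lt $((start + size)) ]; then
            echo "partition $part offset $((abs - start))"
            return 0
        fi
        start=$((start + size))
        part=$((part + 1))
    done
    echo "block $abs is beyond the last partition" >&2
    return 1
}

# The case from the diagnostics, with the partition sizes listed above:
locate_block 5080 72 5000 60952
```

With those sizes the call reports partition 2, offset 8, the same figure as the hand calculation.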

[...]
>system looks OK, but when I tried doing an fsck, I got:
>
>  CAN NOT READ BLOCK 2 (or some such)
rut roh

>I suspect what happened is that there was a head crash with the heads
>positioned over track zero.
Well, some kind of error

[...]
>In article <1108@ursa-major.SPDCC.COM> gst@ursa-major.spdcc.COM
>	(Gary Trujillo [me]) writes:
>> Well, it may have finally happened...
>> [details of possible head crash deleted]
>
>Here's an update on the situation I described recently.
>
[...]
>1. What's down there at block 2, which fsck reports it gets a read
>   error on?  I would have assumed that the bootstrap loader (or
>   whatever it's called) lives there.
Nope, first few blocks of the file system. The UNIXpc is unlike old UNIX
systems (which did have a boot loader at the beginning of the file system), in
that it has only one boot loader, at the beginning of partition 0 (5000 blocks
away).

[ldrcpy]
Nothing will change, as it's 5000 blocks away.
>5. Here's a puzzler for you all-- given the conditions I've described,
>   how can I get files off the machine?
Hmm, well you could use my board and another drive...
"Oh shut up Milton, I don't have it"

I seem to remember someone (over Icus way maybe) that found out that the
ROM boot loader checks EVERY partition looking for a file system with UNIX
on it. Hmm. How about we boot off the swap partition?:

Do this ahead of time: (unpack and run as root)

---
#! /bin/sh
# This is a shell archive.  Remove anything before this line, then unpack
# it by saving it into a file and typing "sh file".  To overwrite existing
# files, type "sh file -c".  You can also feed this as standard input via
# unshar, or by typing "sh <file", e.g..  If this archive is complete, you
# will see the following message at the end:
#		"End of shell archive."
# Contents:  mkrecov
# Wrapped by jbm@uncle on Thu Jan 11 06:16:27 1990
PATH=/bin:/usr/bin:/usr/ucb ; export PATH
if test -f 'mkrecov' -a "${1}" != "-c" ; then 
  echo shar: Will not clobber existing file \"'mkrecov'\"
else
echo shar: Extracting \"'mkrecov'\" \(870 characters\)
sed "s/^X//" >'mkrecov' <<'END_OF_FILE'
X# this must be run as super user (mknod)
XPATH=/bin:/usr/bin:/etc
Xecho "Put a copy of the \"Floppy File System Disk\" in the drive and press Return"
Xread line
Xmount /dev/fp021 /mnt
Xfor i in cat cp cpio echo ln cp mv ls mkdir pwd rm wnlessmsg; do
X	rm /mnt/bin/$i
Xdone
Xfor i in convert dismount ldrcpy nospace.txt profile profile.fd profile.hd rm3.0stuff update.txt; do
X	rm /mnt/etc/$i
Xdone
Xrm -rf /mnt/etc/convert
Xcp /usr/bin/pcat /mnt/bin
Xcp /unix .; pack ./unix; mv ./unix.z /mnt
Xmknod /mnt/dev/rfp001 c 4 1	# add a raw device for the swap partition
Xln /mnt/dev/swap /mnt/dev/fp001	# this was already here
Xmkdir /mnt/new
Xcat >/mnt/etc/profile <<EOF
Xset -x
XPATH=:/mnt/bin:/bin:/etc; export PATH
Xmkfs /dev/rfp001
Xmount /dev/fp001 /new
Xpcat /unix.z > /new/unix
Xsync; sync; sync
Xumount /dev/fp001 > /dev/null 2>&1
Xreboot "Please open the floppy door."
XEOF
Xumount /dev/fp021
END_OF_FILE
if test 870 -ne `wc -c <'mkrecov'`; then
    echo shar: \"'mkrecov'\" unpacked with wrong size!
fi
# end of 'mkrecov'
fi
echo shar: End of shell archive.
exit 0
---

This will create a disk which, when used as disk 3 of 12 (Floppy File System)
during boot, will set up the swap partition with a copy of unix. When the
system is rebooted (part of the above program), the boot loader will find
a file system on the swap partition and, since it's the first one with a
file /unix, will load and run it. Since the root file system mount is
hard-coded, partition 2 will still be the root file system and partition 1
will still be the swap partition. Lenny found out about the "first file
system" boot (I think by accident) quite a while ago. I tested all this out
on a 3.51 system. If you have 3.5 or 3.0, you will have to tune the shell
script.

Sorry I'm so late in responding to this, Gary.

[...]
>6. Can anyone recommend a good replacement unit, if it comes to that?
>   I have the WD2010, so I can run a drive with (what is it, John
>   Milton? - 1400 cylinders?).  I don't have the P5.1 upgrade, but
yes (/usr/include/sys/gdisk.h)

>   I'm sure I could get it easily enough.
[...]

Hard disk speed test results next...

John
-- 
John Bly Milton IV, jbm@uncle.UUCP, n8emr!uncle!jbm@osu-cis.cis.ohio-state.edu
(614) h:252-8544, w:469-1990; N8KSN, AMPR: 44.70.0.52; Don't FLAME, inform!

gst@gnosys.svle.ma.us (Gary S. Trujillo) (01/15/90)

Thanks to John Milton, John McMillan, Judy Scheltema, and everyone else
who wrote or posted to express sympathy and/or to make suggestions on how
to solve my disk problem.  My problem was solved last week, due mainly to
the patience and perseverance of John McMillan, who worked with me and
provided guidance and encouragement.

John's theory as to what actually happened is that the disk was in the
process of doing a write operation when it was jarred, and that the heads
were momentarily far enough out of position to cause a too-weak signal to
be laid down in one particular block.  His recommended solution was a
procedure to copy a block of zeroes over the faulty block.  I did so,
ran fsck, and "Hey, presto!" I was able to salvage what was left, with
very little loss (within an hour, after restoring a handful of lost files
from backups, I was back up in multi-user mode - good as new!).
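The exact commands John sent are not reproduced in this thread, so what follows is only a sketch of the general idea: overwrite one block in place with zeroes so the drive lays down a fresh, full-strength signal, then let fsck clean up. `zero_block` is a hypothetical helper, `/dev/zero` is assumed to exist (it may not on older systems), and whatever was stored in the target block is lost.

```shell
#!/bin/sh
# zero_block TARGET BLOCKNUM [BLOCKSIZE]
# Overwrite exactly one block of TARGET with zeroes, in place.
# conv=notrunc keeps dd from truncating TARGET when it is a regular file.
zero_block() {
    target=$1
    blk=$2
    bs=${3:-1024}
    dd if=/dev/zero of="$target" bs="$bs" seek="$blk" count=1 conv=notrunc 2>/dev/null
}

# Example (block number N is a placeholder, not a value from the thread;
# unmount first and run fsck afterwards):
#   zero_block /dev/fp002 N
```

Trying the helper on a scratch file first is a cheap way to confirm that `seek` and the block size point at the intended block before touching a real device.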

John Milton gets some credit here too, for his posting from back in April
or so where he describes a way to fix up /etc/inittab (which he forgot to
mention by name :-) to boot from the hard disk in single-user mode.  Thanks
also to whoever it was (Lenny T.?) who first suggested how to create a
boot floppy from the distribution disks that permits one to mount /dev/fp002
in order to do fixup work on it.

As my penance (in addition to the many lashes with wet noodles I have
administered to myself for being so stooopid as to THWACKK the keyboard
into place), I have agreed with John McMillan to go over the notes he
sent me to put them into the form of a tutorial describing what I have
learned from the experience, so some of you might be spared my grief.

Again, thanks to all, and watch this space for the posting from John and me.

	Gary

-- 
Gary S. Trujillo                              gst@gnosys.svle.ma.us
Somerville, Massachusetts                     {wjh12,spdcc,ima,cdp}!gnosys!gst