[comp.sys.next] fixing bad superblocks

corey@blake.acs.washington.edu (Corey Satten) (03/14/89)

Fixing trashed optical disk superblocks on NeXT 0.8.

I have just fixed the 3rd (on campus) trashed superblock on a NeXT
optical disk.  My current hypothesis is that shutting down via the
power button instead of running /etc/halt is the cause.  Fortunately,
it seems that the solution is relatively painless if you can put the
optical disk in a machine which is either booted off a hard disk or
booted diskless off another cube.

First, verify that your problem is really a bad superblock by typing as
root:

	/etc/fsck -n /dev/rod0a

It will say something about not being able to read block 8.  Then fix
it by:

	/etc/fsck -b 16 /dev/rod0a

This repairs the file system (and restores the "bad" superblock) using
information in one of the many alternate copies of the superblock that the
designers of the Berkeley fast file system thoughtfully included.  Note that
the NeXT 0.8 manpage for fsck is incorrect:  block 32 is *not* an
alternate superblock.  Two other suitable alternates are 848 and 1680.
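If you want to try the alternates one at a time before letting fsck write
anything, a loop like the following is one way to do it.  It is a minimal
sketch, untested on a real cube: try_alternates is a made-up name, FSCK is
an override added only so it can be exercised without a disk, and it
assumes that the 0.8 /etc/fsck accepts -n and -b together the way 4.3BSD
fsck does and that /bin/sh supports shell functions.  The block numbers
are the ones reported in this thread for the NeXT optical; a disk with a
different file system layout keeps its alternates elsewhere.

```sh
#!/bin/sh
# Minimal sketch, untested on a real cube.  try_alternates is a
# hypothetical helper name; FSCK is an override for testing only.
# ASSUMES /etc/fsck accepts -n together with -b, as 4.3BSD fsck does.

try_alternates() {
    dev=${1:-/dev/rod0a}           # optical partition used in the post
    fsck=${FSCK:-/etc/fsck}
    for sb in 16 848 1680; do      # alternates reported in this thread
        echo "=== alternate superblock $sb"
        $fsck -n -b "$sb" "$dev"   # -n: answer "no" to everything, write nothing
    done
    echo "If one of these checked cleanly, repair with:"
    echo "    $fsck -b <that block> $dev"
}
```

Run it as root (try_alternates /dev/rod0a), read which block checks
cleanly, and only then run the real /etc/fsck -b on that block.  It
deliberately leaves the judgment to you rather than trusting fsck's exit
status.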

Note:  even though the error messages might lead you to believe there
is an actual problem with the optical medium, I believe this is not the
case and the disks I have restored show no sign of further trouble.

Corey Satten
corey@cac.washington.edu

P.S.  The above posting is for the convenience of the network community.
It is not a "comment" on NeXT quality.  In my experience NeXTs are very
well behaved.  The two I use every day have not crashed in recent memory
(~2 months).  Quite impressive for an initial pre-release!  Sure, I've
sent in bugs and suggestions.  What else is new?

gerrit@nova.cc.purdue.edu (Gerrit) (03/14/89)

In article <1208@blake.acs.washington.edu> corey@blake.acs.washington.edu (Corey Satten) writes:
>Fixing trashed optical disk superblocks on NeXT 0.8.
>
>I have just fixed the 3rd (on campus) trashed superblock on a NeXT
>optical disk.  My current hypothesis is that shutting down via the
>power button instead of running /etc/halt is the cause.  Fortunately,

I remember a comment about this at the 3rd NeXT developer's camp.  When
they were going through "basic training" on the machine (booting, mounting,
shutting down, etc.) they told people to let the machine sit for 60 seconds at
some point near the end of a shutdown.  I believe it was either just before
powering off the machine or just after typing halt (I'm one of those who
never shuts down my workstation, so my memory of the timing is fuzzy).
Maybe someone else who was there will find this a refresher and fill in
the details I'm forgetting.

Anyway, the problem was a quirk of Mach that prevented data from actually
being flushed to disk until some long timeout (something less than 60
seconds) had passed.  The moral:  warn people not to power off their
machine without a 60-second pause *somewhere* near the end.  Again, I
apologize for the lapse in memory, and I hope someone remembers what I'm
not.

Gerrit Huizenga,
Purdue University Computing Center
NeXT Workstation Coordinator
gerrit@mentor.cc.purdue.edu

ali@polya.Stanford.EDU (Ali T. Ozer) (03/15/89)

In article <1740@mentor.cc.purdue.edu> gerrit@nova.cc.purdue.edu writes:
>I remember a comment about this at the 3rd NeXT developer's camp.  When
>they were going through "basic training" on the machine (booting, mounting,
>shutting down, etc) they told people let the machine sit for 60 seconds at
>some point near the end of a shutdown.

My understanding is that you should wait at least 60 seconds after
unmounting the disk before ejecting it. What I usually do after unmounting
an optical is to

  sleep 60; disk -e /dev/rod0a

just to make sure. This of course means that if you shut off the machine
without unmounting the optical there's a chance you might trash it...
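Ali's sequence, with the unmount folded in, might look like this as a
shell function.  It is a minimal sketch, untested on a 0.8 cube: eject_od
is a made-up name, the /etc/umount path is an assumption borrowed from
4.3BSD, the 60-second wait is simply the figure given above, it assumes a
/bin/sh with shell functions, and the UMOUNT/EJECT/OD_DELAY variables
exist only so it can be exercised without hardware.

```sh
#!/bin/sh
# Minimal sketch, untested on a real cube.  eject_od is a hypothetical
# helper; /etc/umount is an ASSUMED path (as on 4.3BSD), and
# UMOUNT/EJECT/OD_DELAY are overrides for testing only.

eject_od() {
    target=$1                      # whatever your umount accepts
    dev=${2:-/dev/rod0a}
    delay=${OD_DELAY:-60}          # seconds to wait, per the advice above
    umount=${UMOUNT:-/etc/umount}
    eject=${EJECT:-disk}           # NeXT's disk command, as used above
    sync                           # flush buffers; umount does this too
    $umount "$target" || {
        echo "umount $target failed; not ejecting" >&2
        return 1
    }
    sleep "$delay"                 # give the unmounted disk its 60 seconds
    $eject -e "$dev"
}
```

It refuses to wait or eject if the unmount fails, since the whole point
is to give an already-unmounted disk its 60 seconds before it leaves the
drive.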

This all pertains to 0.8 only, of course...

Ali Ozer, NeXT Developer Support

feldman@umd5.umd.edu (Mark Feldman) (03/15/89)

In article <1208@blake.acs.washington.edu> corey@blake.acs.washington.edu (Corey Satten) writes:
>Fixing trashed optical disk superblocks on NeXT 0.8.
>
>I have just fixed the 3rd (on campus) trashed superblock on a NeXT
>optical disk.  My current hypothesis is that shutting down via the
>power button instead of running /etc/halt is the cause.  Fortunately,
>it seems that the solution is relatively painless if you can put the
>optical disk in a machine which is either booted off a hard disk or
>booted diskless off another cube.
>

We haven't had many optical-related problems since we are running most of
our NeXTs off SCSI Winchesters.

Our service-trained person had blocks eight and nine of one of his ODs
destroyed several times by what appears to be a faulty drive.  fsck with a
valid alternate superblock -- -b 16, 848, or 1680 (as Corey noted) fixed
it every time.

>
>Corey Satten
>corey@cac.washington.edu
>
>P.S.  The above posting is for the convenience of the network community.
>It is not a "comment" on NeXT quality.  In my experience NeXTs are very
>well behaved.  The two I use every day have not crashed in recent memory
>(~2 months).  Quite impressive for an initial pre-release!  Sure, I've
>sent in bugs and suggestions.  What else is new?

If you avoid the known bugs -- which should go away when 0.9 hits the
streets sometime soon -- it isn't difficult to keep a NeXT up and running.

Looking forward to 0.9...

	Mark

jbn@glacier.STANFORD.EDU (John B. Nagle) (03/16/89)

In article <1740@mentor.cc.purdue.edu> gerrit@nova.cc.purdue.edu (Gerrit) writes:
->I remember a comment about this at the 3rd NeXT developer's camp.  When
->they were going through "basic training" on the machine (booting, mounting,
->shutting down, etc) they told people let the machine sit for 60 seconds at
->some point near the end of a shutdown.  I believe it was either just before
->powering off the machine or just after typing halt (I'm one of those who
->never shuts down my workstation, so my memory of the timing is fuzzy).
->Maybe someone else who was there will find this a refresher and fill in
->the details I'm forgetting.
->
->Anyway, the problem was that some quirk of Mach which prevents data from
->getting actually flushed to disk until some long timeout (something less
->than 60 seconds) had passed.  The moral:  warn people not to power off
->their machine without a 60 second pause *somewhere* near the end.  

       This on a machine with a software-controlled power switch.
Jobs has gone from the "appliance" concept to the other extreme, a cult
object with obscure rituals which must be carried out correctly.  One
is reminded of Arthur C. Clarke's remark that "any sufficiently advanced
technology is indistinguishable from magic," but I don't think that this
is quite what Clarke had in mind.


					John Nagle