[comp.sys.sgi] block errors, /debug gone for a walk...

Claude.P.Cantin@NRC.CA (01/10/91)

One of my users is having problems with her Personal IRIS 4D/25G, running
IRIX 3.2.3.  A while back, she got an "uncorrectable" block error.  The
block number was 7160.

The error was happening on the /usr partition, so we used "fx" to map out
the block (we used the offset of /usr + 7160 as the block number).

Now, that error has re-appeared.  Did we need to add the offset of /usr to
the original block number???????

Second question/problem:  that same system has now "lost" its "/debug"
directory!!!!!  What may have caused this??  Simple programs run fine, but
(I assume) as soon as swap space is needed, an error message appears...
Any suggestions??????  (yes, "df" does NOT show /debug!!! and yes, this is
causing MAJOR problems...   and yes, we're planning to upgrade SOON
to IRIX 3.3.1...  and yes we would like to solve those problems BEFORE the
upgrade so that we don't carry in those problems...)

Thank you very much for your help,

        Claude Cantin
        National Research Council of Canada

        cantin@vm.nrc.ca, cantin@nrcvm01.bitnet, cantin@nrccsb3.di.nrc.ca...

olson@anchor.esd.sgi.com (Dave Olson) (01/10/91)

In <9101092200.aa07113@VMB.BRL.MIL> Claude.P.Cantin@NRC.CA writes:

| One of my users is having problems with her Personal IRIS 4D/25G, running
| IRIX 3.2.3.  A while back, she got an "uncorrectable" block error.  The
| block number was 7160.
| 
| The error was happening on the /usr partition, so we used "fx" to map out
| the block (we used the offset of /usr + 7160 as the block number).
| 
| Now, that error has re-appeared.  Did we need to add the offset of /usr to
| the original block number???????

In 3.2, you did the correct thing, IF the error was on the usr
partition.  In 3.2 all block errors were reported relative to the
start of the partition; in 3.3, both the partition-relative and the
absolute block numbers are reported, since fx wants the 'absolute'
block #.
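
As a rough illustration of that arithmetic (a sketch only: the
function name and the partition start value below are made up for the
example, so read the real start block from your own partition table):

```python
def absolute_block(partition_start, relative_block):
    """Convert a partition-relative block number to a whole-disk one."""
    if partition_start < 0 or relative_block < 0:
        raise ValueError("block numbers must be non-negative")
    return partition_start + relative_block

# Hypothetical start block of the usr partition; NOT from a real disk.
USR_START = 50000

# 7160 is the partition-relative block number from the 3.2 error message.
print(absolute_block(USR_START, 7160))  # prints 57160
```

The sum is the kind of number fx expects when sparing a block by hand.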

I STRONGLY recommend using the exercise option in fx to spare
bad blocks.  Use the readonly mode on the entire drive, and
fx will automatically forward unreadable blocks.  It is possible
that the block # reported by the drive wasn't really the failing
block; some drive firmware revs reported the first block in
the i/o request, rather than the failing block.

| Second question/problem:  that same system has now "lost" its "/debug"
| directory!!!!!  What may have caused this??  Simple programs run fine, but
| (I assume) as soon as swap space is needed, an error message appears...

ALWAYS run fsck after sparing bad blocks, since the data isn't
preserved (if you could get the data, you wouldn't have to spare the
block, barring persistent soft errors).  This may be the cause of
your problem.  On the other hand, perhaps you simply don't have a
/debug directory.

/debug is only needed by the debugger (dbx); it is just a window
onto virtual memory, process structures, etc.  Its absence has no
impact on swapping or paging.

| Any suggestions??????  (yes, "df" does NOT show /debug!!! and yes, this is
| causing MAJOR problems...

What kind of problems are being caused by not having /debug mounted
(or were you referring to the bad block problems)?
--

	Dave Olson

Life would be so much easier if we could just look at the source code.