[comp.sys.nsc.32k] SCSI errors...

news@bungi.com.mu.edu (05/13/91)

Wow, now that I have Minix 1.3 on the machine (THANKS HEAPS
Bruce!) I have noticed quite a lot of:

SCSI ok with recovery. code 0x17, logical address 0x<some address, varies>

and a few

SCSI failure, key 0x3, code 0x11, log adr 0x31e, sense buf 0xc462

messages. The first one seems to occur after a few hours of running the
drive and often appears up to 10 times in a row.

The second message occurs five to ten minutes after the first type of message
starts and can lead to a kernel panic.

Is it just the driver (remember it's Minix 1.3) or could there be a problem
with the drive (a Mini-Scribe)?

I have held off compiling the 1.5h version of the OS until I can be sure
of where the problem lies.

Thanks for any help,

marcb
------------------------------------------------------------------------------
Marc A. Boschma                                       marcb@img.uu.oz.AU
Systems Development                                   img Consultants
                                                      GPO Box 3304GG
                                                      Melbourne, Victoria 3001
                                                      Australia

culberts@hplwbc.hpl.hp.com (Bruce Culbertson) (05/16/91)

> From: daver!uunet!munnari!eyrie.img.uu.oz.au!marcb@mips.com (Marc A. Boschma)
>
> Wow, now that I have Minix 1.3 on the machine (THANKS HEAPS
> Bruce!) I have noticed quite a lot of:
> SCSI ok with recovery. code 0x17, logical address 0x<some address, varies>
> 
> and a few
> 
> SCSI failure, key 0x3, code 0x11, log adr 0x31e, sense buf 0xc462

"Ok with recovery" is a "soft error" -- some random noise or a power
glitch caused a disk operation to a healthy block to fail.  Both Minix
and most SCSI disks retry operations which fail.  This message means
Minix eventually was successful in performing a disk operation which
initially failed.  Minix is trying to say that it is happy and its
file system is intact, but something funny happened with your disk
which you might want to know about.

It is normal and expected that you will see soft errors occasionally
but if you are seeing several a day, you have a problem.  A typical
cause is a defect in the disk surface which makes reading the block
unreliable.  The standard Minix distribution includes a tool for testing
all the blocks on a disk.  Another tool builds a file of all the bad
blocks so that the blocks will not be allocated to files you care about.

If you get frequent retry messages and the block numbers are truly
random, then you have a problem in the drive electronics, its power
supply, or your pc532.  Debugging it might require some creativity.
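
One way to tell the two cases apart is to note down the logical addresses
from the retry messages and count how many of them come up more than once.
The following is only a rough sketch: count_repeated() is a made-up helper
name, not part of Minix, and it assumes you have already copied the
addresses from the console into an array by hand.

```c
#include <stddef.h>

/*
 * Hypothetical helper (not part of Minix): return the number of distinct
 * addresses in addrs[0..n-1] that appear more than once.
 */
size_t count_repeated(const unsigned long *addrs, size_t n)
{
    size_t repeated = 0;

    for (size_t i = 0; i < n; i++) {
        int seen_before = 0;
        size_t later = 0;

        /* Skip addresses already handled at an earlier index. */
        for (size_t j = 0; j < i; j++) {
            if (addrs[j] == addrs[i]) {
                seen_before = 1;
                break;
            }
        }
        if (seen_before)
            continue;

        /* Count further occurrences of this address. */
        for (size_t j = i + 1; j < n; j++) {
            if (addrs[j] == addrs[i])
                later++;
        }
        if (later > 0)
            repeated++;
    }
    return repeated;
}
```

If a handful of addresses account for most of the messages, suspect the
disk surface at those blocks.  If almost nothing repeats, the electronics,
power supply, or bus are the more likely culprits.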

"SCSI failure" means Minix cannot talk to your disk.  This usually
results in a panic.  If Minix has been successfully talking to your
disk and then suddenly gets a "SCSI failure", then your file system
is likely to be corrupted.  Cross your fingers and run fsck after you
debug and correct the problem.  If your file system is really in bad
shape but you are desperate to save your data, you might have some
success with the disk editor "de".
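
It can also help to decode the numbers in the messages.  The sketch below
assumes that the "key" and "code" fields printed by the driver are the
standard SCSI sense key and additional sense code; that mapping is an
assumption about the driver, and the function names are made up for
illustration.  The tables cover only a few values from the SCSI standard.

```c
/*
 * Hypothetical decoders, assuming "key" is the SCSI sense key and
 * "code" is the SCSI additional sense code.  Partial tables only.
 */
const char *sense_key_name(unsigned key)
{
    switch (key & 0x0f) {   /* the sense key is a 4-bit field */
    case 0x0: return "NO SENSE";
    case 0x1: return "RECOVERED ERROR";
    case 0x2: return "NOT READY";
    case 0x3: return "MEDIUM ERROR";
    case 0x4: return "HARDWARE ERROR";
    case 0x5: return "ILLEGAL REQUEST";
    case 0x6: return "UNIT ATTENTION";
    default:  return "other";
    }
}

const char *sense_code_name(unsigned code)
{
    switch (code) {
    case 0x11: return "unrecovered read error";
    case 0x17: return "recovered data, no error correction applied";
    case 0x18: return "recovered data, error correction applied";
    default:   return "not in this table";
    }
}
```

Read this way, key 0x3 with code 0x11 would be a medium error with an
unrecovered read at the logged address, and code 0x17 would be data that
was recovered by retrying, without error correction.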

> Is it just the driver (remember it's Minix 1.3) or could there be a problem
> with the drive (a Mini-Scribe)?

1.3 has a pretty good SCSI driver, though not perfect.  Many people
have used it with Mini-Scribe drives.  I do not think the 1.5h driver
is substantially different from the 1.3 driver.

Bruce Culbertson

s861298@minyos.xx.rmit.oz.au (Marc A. Boschma) (05/18/91)

culberts@hplwbc.hpl.hp.com (Bruce Culbertson) writes:

>> From: daver!uunet!munnari!eyrie.img.uu.oz.au!marcb@mips.com (Marc A. Boschma)
>>
>> Wow, now that I have Minix 1.3 on the machine (THANKS HEAPS
>> Bruce!) I have noticed quite a lot of:
>> SCSI ok with recovery. code 0x17, logical address 0x<some address, varies>
>> 
>> and a few
>> 
>> SCSI failure, key 0x3, code 0x11, log adr 0x31e, sense buf 0xc462

>"Ok with recovery" is a "soft error" -- some random noise or a power
>glitch caused a disk operation to a healthy block to fail.  Both Minix
>and most SCSI disks retry operations which fail.  This message means
>Minix eventually was successful in performing a disk operation which
>initially failed.  Minix is trying to say that it is happy and its
>file system is intact, but something funny happened with your disk
>which you might want to know about.

>It is normal and expected that you will see soft errors occasionally
>but if you are seeing several a day, you have a problem.  A typical
>cause is a defect in the disk surface which makes reading the block
>unreliable.  The standard Minix distribution includes a tool for testing
>all the blocks on a disk.  Another tool builds a file of all the bad
>blocks so that the blocks will not be allocated to files you care about.

>If you get frequent retry messages and the block numbers are truly
>random, then you have a problem in the drive electronics, its power
>supply, or your pc532.  Debugging it might require some creativity.

The soft errors only occur for a given block once or twice so
I hope there is only some noise on the SCSI bus.

I'm thinking of doing a low level format and trying again. These
problems occurred after the machine had been on for about a day.
Maybe better cooling is needed.


>"SCSI failure" means Minix cannot talk to your disk.  This usually
>results in a panic.  If Minix has been successfully talking to your
>disk and then suddenly gets a "SCSI failure", then your file system
>is likely to be corrupted.  Cross your fingers and run fsck after you
>debug and correct the problem.  If your file system is really in bad
>shape but you are desperate to save your data, you might have some
>success with the disk editor "de".

fsck has managed to clean it twice now, though I lost 6 blocks somewhere.

>> Is it just the driver (remember it's Minix 1.3) or could there be a problem
>> with the drive (a Mini-Scribe)?

>1.3 has a pretty good SCSI driver, though not perfect.  Many people
>have used it with Mini-Scribe drives.  I do not think the 1.5h driver
>is substantially different from the 1.3 driver.

Ok, so I'll start debugging the hardware if the drive doesn't work after
the format.

>Bruce Culbertson