[comp.sys.sgi] disk timeout, SCSI reset...

Claude.P.Cantin@nrc.CA (04/06/91)

PROBLEM SUMMARY:
---------------
WREN VI hard drive as "dks0d2".

    sc0,2,0: timeout after 30 sec.  Resetting SCSI BUS
    dksc0d2s7: retrying request
    dksc0d2s7: retrying request
    dksc0d2s7: retrying request

    Apr  5 16:27:24 nrcbs3 grcond[471]: CIO: dks0d2s7: retrying request
    Apr  5 16:27:24 nrcbs3 grcond[471]: CIO: sc0,2,0: timeout after 30 sec.  Res

    bru "filename" warning - X block checksum error
    bru "filename2" warning - block sequence error
    bru "filename3" warning - file synchronization error - attempting recovery.


BACKGROUND LEADING TO PROBLEM:
-----------------------------
We received a Seagate Wren VI drive from PARITY systems, already formatted
and partitioned for our Personal IRIS (4D/35).  It was even set up to be
used as disk "2" (dks0d2).

I installed it on the PI, used "fx" to be sure it was partitioned.  It
is partitioned in the same way SGI partitions theirs (16 MB for root,
50 for swap, and the rest for "/usr").

Partition 7 representing the entire area ("/" + swap + "/usr"), I used

    mkfs /dev/dsk/dks0d2s7

That worked fine.  I then issued

    ln /dev/dsk/dks0d2s7 /dev/usr2

and the same for the raw device.

    mount /dev/usr2 /usr2

was also successful, as was writing small files to the disk.

I then wanted to move all users from /usr/people to /usr2/people.  For
some reason, I felt like using bru, so I issued

    cd /usr
    bru -cvf /usr2/bru.dat people

This worked fine and created a file of 77 MB.

    cd /usr2
    bru -xvf bru.dat

started normally, BUT at several intervals (at 13, 21, 23, 30.6, 30.8, 32.3,
38.4, 47.5, 59, 65.3, 75.9 Megabytes), the extraction from the file stopped,
then when it started again the messages

    sc0,2,0: timeout after 30 sec.  Resetting SCSI BUS
    dksc0d2s7: retrying request
    dksc0d2s7: retrying request
    dksc0d2s7: retrying request

would appear on the console.  The /usr/adm/SYSLOG file (the last 20 lines)
looks like:

    Apr  5 16:27:24 nrcbs3 grcond[471]: CIO: dks0d2s7: retrying request
    Apr  5 16:27:24 nrcbs3 grcond[471]: CIO: dks0d2s7: retrying request
    Apr  5 16:27:24 nrcbs3 grcond[471]: CIO: sc0,2,0: timeout after 30 sec.  Res
    Apr  5 16:27:24 nrcbs3 grcond[471]: CIO: dks0d2s7: retrying request
    Apr  5 16:27:24 nrcbs3 grcond[471]: CIO: sc0,2,0: timeout after 30 sec.  Res
    Apr  5 16:27:24 nrcbs3 grcond[471]: CIO: dks0d2s7: retrying request
    Apr  5 16:27:24 nrcbs3 grcond[471]: CIO: sc0,2,0: timeout after 30 sec.  Res
    Apr  5 16:27:24 nrcbs3 grcond[471]: CIO: dks0d2s7: retrying request
    Apr  5 16:27:24 nrcbs3 grcond[471]: CIO: sc0,2,0: timeout after 30 sec.  Res
    Apr  5 16:27:24 nrcbs3 grcond[471]: CIO: dks0d2s7: retrying request
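[Editor's sketch, not part of the original post.] To see which bus target and which partition are generating the messages, the log lines can be tallied. The following is a minimal sketch that assumes only the two message shapes quoted in this thread; the function name `summarize` and the regular expressions are illustrative, not part of any SGI tool.

```python
import re
from collections import Counter

# Matches both timeout shapes seen in this thread, e.g.
#   "sc0,2,0: timeout after 30 sec.  Res"
#   "sc0,1: Resetting SCSI bus: timeout after 30 sec"
TIMEOUT_RE = re.compile(r"\b(sc\d+,\d+(?:,\d+)?):.*timeout", re.IGNORECASE)

# Matches retry lines, with or without a mount point in parentheses, e.g.
#   "dks0d2s7: retrying request"
#   "dks0d1s0 (/): retrying request"
RETRY_RE = re.compile(r"\b(dks\w+)(?: \([^)]*\))?: retrying request")


def summarize(lines):
    """Count timeout messages per SCSI target and retries per disk partition."""
    timeouts = Counter()
    retries = Counter()
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            timeouts[match.group(1)] += 1
            continue
        match = RETRY_RE.search(line)
        if match:
            retries[match.group(1)] += 1
    return timeouts, retries


if __name__ == "__main__":
    # Sample lines copied from the SYSLOG excerpt above.
    sample = [
        "Apr  5 16:27:24 nrcbs3 grcond[471]: CIO: dks0d2s7: retrying request",
        "Apr  5 16:27:24 nrcbs3 grcond[471]: CIO: sc0,2,0: timeout after 30 sec.  Res",
    ]
    timeouts, retries = summarize(sample)
    print("timeouts:", dict(timeouts))
    print("retries:", dict(retries))
```

To run it against a real log, pass an open file object instead of the sample list; `summarize` only needs an iterable of lines.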

ALSO, as bru was extracting from the file "bru.dat", the following messages
would show up at irregular intervals (and not necessarily in the following
order):

    bru "filename" warning - X block checksum error
    bru "filename2" warning - block sequence error
    bru "filename3" warning - file synchronization error - attempting recovery.

*** The same SCSI timeout errors happened when I used
***
***     cp -r /usr/people/* /usr2/people/
***

The hard disk is "auto-terminating", so it does not need a SCSI terminator.

That system is on one of our satellite campuses, so it's hard to keep carrying
equipment back and forth (i.e. carry the disk here and try on our own PI,
or come back here and get another cable, or terminator, or anything else...)

What is wrong??  anyone have any clue??  Any suggestions??

Thank you for your suggestions,

     Claude Cantin
     National Research Council

olson@anchor.esd.sgi.com (Dave Olson) (04/07/91)

In <9104051754.aa05599@VMB.BRL.MIL> Claude.P.Cantin@nrc.CA writes:


| PROBLEM SUMMARY:
| ---------------
| WREN VI hard drive as "dks0d2".
| 
|     sc0,2,0: timeout after 30 sec.  Resetting SCSI BUS
|     dksc0d2s7: retrying request
|     dksc0d2s7: retrying request
|     dksc0d2s7: retrying request
| 
|     Apr  5 16:27:24 nrcbs3 grcond[471]: CIO: dks0d2s7: retrying request
|     Apr  5 16:27:24 nrcbs3 grcond[471]: CIO: sc0,2,0: timeout after 30 sec.  Res
| BACKGROUND LEADING TO PROBLEM:
| -----------------------------
| We received a Seagate Wren VI drive from PARITY systems, already formatted
| and partitioned for our Personal IRIS (4D/35).  It was even set up to be
| used as disk "2" (dks0d2).
| 
| I installed it on the PI, used "fx" to be sure it was partitioned.  It
| is partitioned in the same way SGI partitions theirs (16 MB for root,
| 50 for swap, and the rest for "/usr").
| 
...
| The hard disk is "auto-terminating", so it does not need a SCSI terminator.
| 
| That system is on one of our satellite campuses, so it's hard to keep carrying
| equipment back and forth (i.e. carry the disk here and try on our own PI,
| or come back here and get another cable, or terminator, or anything else...)
| 
| What is wrong??  anyone have any clue??  Any suggestions??
| 
| Thank you for your suggestions,

There is no such thing as an 'auto terminating' drive.  Either it has
a terminator or it doesn't.  Unless it is a completely external drive
that is the furthest scsi device from the system, it should not have
a terminator.  My guess is termination problems, or possibly a bad
cable.
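
[Editor's sketch, not part of the original post.] The rule above can be written down as a small check: only the device farthest from the system should carry a terminator. The `Device` class, the `termination_problems` function, and the device names in the usage note are hypothetical, and the sketch assumes the host end of the bus is terminated at the controller.

```python
from dataclasses import dataclass


@dataclass
class Device:
    name: str          # illustrative label, e.g. "dks0d2"
    terminated: bool   # True if a terminator is installed on this device


def termination_problems(chain):
    """Return a list of termination problems for a SCSI chain.

    `chain` is ordered from the device nearest the system to the
    device farthest from it.
    """
    problems = []
    if not chain:
        return problems
    # No device before the far end of the bus should be terminated.
    for dev in chain[:-1]:
        if dev.terminated:
            problems.append(
                f"{dev.name}: terminated but not the last device on the bus"
            )
    # The far end of the bus must be terminated.
    last = chain[-1]
    if not last.terminated:
        problems.append(
            f"{last.name}: last device on the bus but not terminated"
        )
    return problems
```

For example, `termination_problems([Device("dks0d1", False), Device("dks0d2", True), Device("tape", False)])` reports two problems: a terminator in the middle of the bus and none at the end.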
--

	Dave Olson

Life would be so much easier if we could just look at the source code.

dave@ADCS00.FNAL.GOV (David Richardson) (04/07/91)

  I think there may be more to this than just coincidence.  We have had these
same messages show up on our machines since the upgrade to 3.3.2.  There have
not been any obvious problems other than the annoying messages.

  On one machine:

Apr  4 09:05:42 adchl1 unix: sc0,1: Resetting SCSI bus: timeout after 30 sec
Apr  4 09:05:42 adchl1 unix: dks0d1s0 (/): retrying request
Apr  4 09:06:54 adchl1 unix: dks0d4s10: retrying request

  On another:

Apr  5 16:57:09 adcs00 grcond[27111]: CIO: sc0,5: Resetting SCSI bus: timeout after 120 sec
Apr  5 16:57:09 adcs00 grcond[27111]: CIO: dks0d1s0: retrying request
Apr  5 16:57:09 adcs00 grcond[27111]: CIO: dks0d4s10: retrying request

  We did not make any changes to the hardware during or since the upgrade.
I'm inclined to think the clue lies in the fact that both of these machines
have Gigatape 8mm tape drives.  Our machines without the 8mm tapes do not
record these messages.  What's up?

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Dave Richardson  dave@adcs00.fnal.gov           
Fermi National Accelerator Laboratory
Batavia, IL     60510
(708) 840-3354

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

olson@anchor.esd.sgi.com (Dave Olson) (04/08/91)

In <9104062100.AA28279@adcs00.fnal.gov> dave@ADCS00.FNAL.GOV (David Richardson) writes:
|   I think there may be more to this than just coincidence.  We have had these
| same messages show up on our machines since the upgrade to 3.3.2.  There have
| not been any obvious problems other than the annoying messages.
| Apr  4 09:05:42 adchl1 unix: sc0,1: Resetting SCSI bus: timeout after 30 sec
| Apr  4 09:05:42 adchl1 unix: dks0d1s0 (/): retrying request
| Apr  4 09:06:54 adchl1 unix: dks0d4s10: retrying request
| 
| Apr  5 16:57:09 adcs00 grcond[27111]: CIO: sc0,5: Resetting SCSI bus: timeout after 120 sec
| Apr  5 16:57:09 adcs00 grcond[27111]: CIO: dks0d1s0: retrying request
| Apr  5 16:57:09 adcs00 grcond[27111]: CIO: dks0d4s10: retrying request
| 
|   We did not make any changes to the hardware during or since the upgrade.
| I'm inclined to think the clue lies in the fact that both of these machines
| have Gigatape 8mm tape drives.  Our machines without the 8mm tapes do not
| record these messages.  What's up?

What OS release were you running before the upgrade, and what hardware
do you have (cpu and SCSI devices)?  The hinv output might help.

I'm beginning to sound like a broken record, but for 3rd party hardware,
your best bet is to contact the vendor.  My guess is still cabling or
termination problems.  There were some OS changes in 3.3 that affected the
average size of disk i/o for scsi drives, and that could be exposing
a problem that has always been there.

If you were running pre 3.2 software before, scsi devices didn't run in
sync mode, and that could have something to do with the problem, but
that doesn't sound likely, since the 8mm drives weren't really
supported prior to 3.2 (there was a 3.1g maint release to support them,
but not many people had that).
--

	Dave Olson

Life would be so much easier if we could just look at the source code.

bernie@umbc3.umbc.edu (Bernard J. Duffy) (04/16/91)

These kinds of "disk timeout" SYSLOG messages happened on our 4D/25 and
4D/220 systems, and the only thing that stopped them was swapping out
some cables and in one case changing the SCSI bus device ordering.  Do
make sure you have no (internal, since it's real hard to have
external ones in the middle of the bus ;-) ) terminators on the devices
in the middle or the SGI disk drive inside the power/disk tower/cabinet.
After you make sure of that, use shorter and shorter cables.  It
might help to do a quick test with the shortest cables you have and work
up from there as space and cabinets require (can't use a real short
cable out of the PI boxes with the case on it).  You can also try
some tests with different arrangements of the SCSI bus.  In one case
I had to put the 8mm Exabyte on the end of the bus in order to get
things working quietly.  
    Not all of these disk-timeout errors are fatal; they just put
a nasty pause in I/O.  For each of your arrangements/cabling, fire off
a tar/bru read/write operation.  This should run error-free and
pause-free on a lightly loaded system.  I usually fire off two tar/bru
jobs to separate drives on the same SCSI bus for my final verification
test.  This is an extreme test since most daily operations are not
this SCSI I/O intensive.  Our 4D/220 will occasionally pause under
this test, but the I/O operation continues and the bru operation doesn't
fail (I used bru's check/verification option).
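
[Editor's sketch, not part of the original post.] The same kind of test can be approximated without tar or bru: write a large file to each drive at the same time, note any write that stalls, then read the file back and compare checksums. This is a rough stand-in under stated assumptions, not the bru-based procedure described above. The names `write_and_verify` and `run_parallel`, the temporary file name, and the default thresholds are all illustrative. Note that the read-back may be served from the buffer cache rather than from the disk.

```python
import hashlib
import os
import threading
import time


def write_and_verify(directory, total_bytes, chunk_bytes=1 << 20, stall_secs=5.0):
    """Write a test file in chunks, record slow chunks, then verify it."""
    path = os.path.join(directory, "scsi_stress.tmp")  # hypothetical file name
    chunk = os.urandom(chunk_bytes)
    written = hashlib.sha256()
    stalls = []  # (offset, seconds) for every chunk that took too long
    offset = 0
    with open(path, "wb") as f:
        while offset < total_bytes:
            n = min(chunk_bytes, total_bytes - offset)
            start = time.monotonic()
            f.write(chunk[:n])
            f.flush()
            os.fsync(f.fileno())  # push the data to the device before timing
            elapsed = time.monotonic() - start
            written.update(chunk[:n])
            if elapsed >= stall_secs:
                stalls.append((offset, elapsed))
            offset += n
    # Read the file back and compare checksums.
    read_back = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk_bytes), b""):
            read_back.update(block)
    os.remove(path)
    return {
        "directory": directory,
        "stalls": stalls,
        "verified": written.digest() == read_back.digest(),
    }


def run_parallel(directories, total_bytes, **kwargs):
    """Run write_and_verify on every directory at once, one thread each."""
    results = [None] * len(directories)

    def worker(index, directory):
        results[index] = write_and_verify(directory, total_bytes, **kwargs)

    threads = [
        threading.Thread(target=worker, args=(i, d))
        for i, d in enumerate(directories)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

A call such as `run_parallel(["/usr2/tmp", "/usr3/tmp"], 77 * 2**20)` (both paths hypothetical, one per drive on the same bus) should come back with empty `stalls` lists and `verified` set to True if the bus is healthy.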

Good luck,

-- 
Bernie Duffy   Systems Programmer II | Bitnet    :  BERNIE@UMBC2
Academic Computing Services - L005e  | Internet  :  BERNIE@UMBC2.UMBC.EDU
Univ. of Maryland Baltimore County   | UUCP      :  ...!uunet!umbc3!bernie
Baltimore, MD  21228   (U.S.A.)      | W: (301) 455-3231  H: (301) 744-2954