Claude.P.Cantin@nrc.CA (04/06/91)
PROBLEM SUMMARY:
---------------
WREN VI hard drive as "dks0d2".

   sc0,2,0: timeout after 30 sec.  Resetting SCSI BUS
   dksc0d2s7: retrying request
   dksc0d2s7: retrying request
   dksc0d2s7: retrying request

   Apr 5 16:27:24 nrcbs3 grcond[471]: CIO: dks0d2s7: retrying request
   Apr 5 16:27:24 nrcbs3 grcond[471]: CIO: sc0,2,0: timeout after 30 sec. Res

   bru "filename" warning - X block checksum error
   bru "filename2" warning - block sequence error
   bru "filename3" warning - file synchronization error - attempting recovery.

BACKGROUND LEADING TO PROBLEM:
-----------------------------
We received a Seagate Wren VI drive from PARITY systems, already formatted
and partitioned for our Personal IRIS (4D/35).  It was even set up to be
used as disk "2" (dks0d2).

I installed it on the PI and used "fx" to be sure it was partitioned.  It
is partitioned the same way SGI partitions theirs (16 MB for root,
50 for swap, and the rest for "/usr").

Since partition 7 represents the entire area ("/" + swap + "/usr"), I used

   mkfs /dev/dsk/dks0d2s7

That worked fine.  I then issued

   ln /dev/dsk/dks0d2s7 /dev/usr2

and the same for the raw device.

   mount /dev/usr2 /usr2

was also successful, as was writing small files to the disk.

I then wanted to move all users from /usr/people to /usr2/people.
For some reason, I felt like using bru, so I issued

   cd /usr
   bru -cvf /usr2/bru.dat people

This worked fine and created a file of 77 MB.

   cd /usr2
   bru -xvf bru.dat

started normally, BUT at several points (at 13, 21, 23, 30.6, 30.8, 32.3,
38.4, 47.5, 59, 65.3, 75.9 Megabytes), the extraction from the file stopped;
when it started again, the messages

   sc0,2,0: timeout after 30 sec.  Resetting SCSI BUS
   dksc0d2s7: retrying request
   dksc0d2s7: retrying request
   dksc0d2s7: retrying request

would appear on the console.

The /usr/adm/SYSLOG file (the last 20 lines) looks like:

   Apr 5 16:27:24 nrcbs3 grcond[471]: CIO: dks0d2s7: retrying request
   Apr 5 16:27:24 nrcbs3 grcond[471]: CIO: dks0d2s7: retrying request
   Apr 5 16:27:24 nrcbs3 grcond[471]: CIO: sc0,2,0: timeout after 30 sec. Res
   Apr 5 16:27:24 nrcbs3 grcond[471]: CIO: dks0d2s7: retrying request
   Apr 5 16:27:24 nrcbs3 grcond[471]: CIO: sc0,2,0: timeout after 30 sec. Res
   Apr 5 16:27:24 nrcbs3 grcond[471]: CIO: dks0d2s7: retrying request
   Apr 5 16:27:24 nrcbs3 grcond[471]: CIO: sc0,2,0: timeout after 30 sec. Res
   Apr 5 16:27:24 nrcbs3 grcond[471]: CIO: dks0d2s7: retrying request
   Apr 5 16:27:24 nrcbs3 grcond[471]: CIO: sc0,2,0: timeout after 30 sec. Res
   Apr 5 16:27:24 nrcbs3 grcond[471]: CIO: dks0d2s7: retrying request

ALSO, as bru was extracting from the file "bru.dat", the following
messages would show up at irregular intervals (and not necessarily in the
following order):

   bru "filename" warning - X block checksum error
   bru "filename2" warning - block sequence error
   bru "filename3" warning - file synchronization error - attempting recovery.

*** The same SCSI timeout errors happened when I used ***
***      cp -r /usr/people/* /usr2/people/            ***

The hard disk is "auto-terminating", so it does not need a SCSI terminator.

That system is on one of our satellite campuses, so it's hard to keep carrying
equipment back and forth (i.e. carry the disk here and try it on our own PI,
or come back here and get another cable, or terminator, or anything else...)

What is wrong??  Anyone have any clue??  Any suggestions??

Thank you for your suggestions,

Claude Cantin
National Research Council
olson@anchor.esd.sgi.com (Dave Olson) (04/07/91)
In <9104051754.aa05599@VMB.BRL.MIL> Claude.P.Cantin@nrc.CA writes:

| PROBLEM SUMMARY:
| ---------------
| WREN VI hard drive as "dks0d2".
|
|    sc0,2,0: timeout after 30 sec.  Resetting SCSI BUS
|    dksc0d2s7: retrying request
|    dksc0d2s7: retrying request
|    dksc0d2s7: retrying request
|
|    Apr 5 16:27:24 nrcbs3 grcond[471]: CIO: dks0d2s7: retrying request
|    Apr 5 16:27:24 nrcbs3 grcond[471]: CIO: sc0,2,0: timeout after 30 sec. Res
| BACKGROUND LEADING TO PROBLEM:
| -----------------------------
| We received a Seagate Wren VI drive from PARITY systems, already formatted
| and partitioned for our Personal IRIS (4D/35).  It was even set up to be
| used as disk "2" (dks0d2).
|
| I installed it on the PI, used "fx" to be sure it was partitioned.  It
| is partitioned in the same way SGI partitions theirs (16 MB for root,
| 50 for swap, and the rest for "/usr").
| ...
| The hard disk is "auto-terminating", so it does not need a SCSI terminator.
|
| That system is on one of our satellite campuses, so it's hard to keep carrying
| equipment back and forth (i.e. carry the disk here and try on our own PI,
| or come back here and get another cable, or terminator, or anything else...)
|
| What is wrong??  anyone have any clue??  Any suggestions??
|
| Thank you for your suggestions,

There is no such thing as an 'auto terminating' drive.  Either it
has a terminator or it doesn't.  Unless it is a completely external
drive that is the furthest SCSI device from the system, it should
not have a terminator.

My guess is termination problems, or possibly a bad cable.
--

	Dave Olson

Life would be so much easier if we could just look at the source code.
dave@ADCS00.FNAL.GOV (David Richardson) (04/07/91)
I think there may be more to this than just coincidence.  We have had these
same messages show up on our machines since the upgrade to 3.3.2.  There have
not been any obvious problems other than the annoying messages.

On one machine:

Apr 4 09:05:42 adchl1 unix: sc0,1: Resetting SCSI bus: timeout after 30 sec
Apr 4 09:05:42 adchl1 unix: dks0d1s0 (/): retrying request
Apr 4 09:06:54 adchl1 unix: dks0d4s10: retrying request

On another:

Apr 5 16:57:09 adcs00 grcond[27111]: CIO: sc0,5: Resetting SCSI bus: timeout after 120 sec
Apr 5 16:57:09 adcs00 grcond[27111]: CIO: dks0d1s0: retrying request
Apr 5 16:57:09 adcs00 grcond[27111]: CIO: dks0d4s10: retrying request

We did not make any changes to the hardware during or since the upgrade.
I'm inclined to think the clue lies in the fact that both of these machines
have Gigatape 8mm tape drives.  Our machines without the 8mm tapes do not
record these messages.  What's up?

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Dave Richardson                       dave@adcs00.fnal.gov
Fermi National Accelerator Laboratory
Batavia, IL 60510                     (708) 840-3354
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
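
One way to test the "it's the machines with 8mm tapes" hunch is to tally the timeout messages by SCSI target and see whether they cluster on one ID. The following is only a rough sketch in present-day Python, not an SGI tool. It assumes the "scC,T" prefix in these messages means controller C, target T (with an optional third field, as in "sc0,2,0"), which is an interpretation of the log lines quoted in this thread rather than documented behaviour. The function name is made up for illustration.

```python
import re
from collections import Counter

# ASSUMPTION: "sc0,2,0" / "sc0,5" is read as controller, target[, third field].
# The pattern accepts both message layouts quoted in this thread:
#   "sc0,2,0: timeout after 30 sec. Res..."
#   "sc0,5: Resetting SCSI bus: timeout after 120 sec"
TIMEOUT_RE = re.compile(r"\bsc(\d+),(\d+)(?:,\d+)?:.*?timeout after (\d+) sec")


def count_timeouts(lines):
    """Return a Counter keyed by (controller, target, timeout_seconds)."""
    counts = Counter()
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            controller, target, seconds = (int(g) for g in match.groups())
            counts[(controller, target, seconds)] += 1
    return counts


if __name__ == "__main__":
    # Sample lines copied from the messages posted in this thread.
    sample = [
        "Apr 4 09:05:42 adchl1 unix: sc0,1: Resetting SCSI bus: timeout after 30 sec",
        "Apr 4 09:05:42 adchl1 unix: dks0d1s0 (/): retrying request",
        "Apr 5 16:57:09 adcs00 grcond[27111]: CIO: sc0,5: Resetting SCSI bus: timeout after 120 sec",
        "Apr 5 16:27:24 nrcbs3 grcond[471]: CIO: sc0,2,0: timeout after 30 sec. Res",
    ]
    for key, n in sorted(count_timeouts(sample).items()):
        print("controller %d target %d timeout %ds: %d" % (key + (n,)))
```

Fed the whole of /usr/adm/SYSLOG (for example `count_timeouts(open("/usr/adm/SYSLOG"))`), this gives a per-target count that can be compared against the hinv listing of what sits at each ID.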
olson@anchor.esd.sgi.com (Dave Olson) (04/08/91)
In <9104062100.AA28279@adcs00.fnal.gov> dave@ADCS00.FNAL.GOV (David Richardson) writes:

| I think there may be more to this than just coincidence.  We have had these
| same messages show up on our machines since the upgrade to 3.3.2.  There have
| not been any obvious problems other than the annoying messages.

| Apr 4 09:05:42 adchl1 unix: sc0,1: Resetting SCSI bus: timeout after 30 sec
| Apr 4 09:05:42 adchl1 unix: dks0d1s0 (/): retrying request
| Apr 4 09:06:54 adchl1 unix: dks0d4s10: retrying request
|
| Apr 5 16:57:09 adcs00 grcond[27111]: CIO: sc0,5: Resetting SCSI bus: timeout
| after 120 sec
| Apr 5 16:57:09 adcs00 grcond[27111]: CIO: dks0d1s0: retrying request
| Apr 5 16:57:09 adcs00 grcond[27111]: CIO: dks0d4s10: retrying request
|
| We did not make any changes to the hardware during or since the upgrade.
| I'm inclined to think the clue lies in the fact that both of these machines
| have Gigatape 8mm tape drives.  Our machines without the 8mm tapes do not
| record these messages.  What's up?

What OS release were you running before the upgrade, and what
hardware do you have (CPU and SCSI devices)?  The hinv output
might help.

I'm beginning to sound like a broken record, but for 3rd party
hardware, your best bet is to contact the vendor.  My guess
is still cabling or termination problems.  There were some OS
changes in 3.3 that affected the average size of disk I/O for
SCSI drives, and that could be exposing a problem that has
always been there.

If you were running pre-3.2 software before, SCSI devices didn't
run in sync mode, and that could have something to do with the
problem, but that doesn't sound likely, since the 8mm drives
weren't really supported prior to 3.2 (there was a 3.1g maint
release to support them, but not many people had that).
--

	Dave Olson

Life would be so much easier if we could just look at the source code.
bernie@umbc3.umbc.edu (Bernard J. Duffy) (04/16/91)
These kinds of "disk timeout" SYSLOG messages happened on our 4D/25 and
4D/220 systems, and the only thing that stopped them was swapping out some
cables and, in one case, changing the SCSI bus device ordering.

Do make sure you have no (internal, since it's real hard to have external
ones in the middle of the bus ;-) ) terminators on the devices in the
middle of the bus or on the SGI disk drive inside the power/disk
tower/cabinet.  After you make sure of that, use shorter and shorter
cables.  It might help to do a quick test with the shortest cables you have
and work up from there as space and cabinets require (you can't use a real
short cable out of the PI boxes with the case on).

You can also try some tests with different arrangements of the SCSI bus.
In one case I had to put the 8mm Exabyte on the end of the bus in order to
get things working quietly.  Not all of these disk-timeout errors are
fatal; some just put a nasty pause in I/O.

For each of your arrangements/cablings, fire off a tar/bru read/write
operation.  This should run error-free and pause-free on a lightly loaded
system.  I usually fire off two tar/bru jobs to separate drives on the same
SCSI bus for my final verification test.  This is an extreme test, since
most daily operations are not this SCSI I/O intensive.  Our 4D/220 will
occasionally pause under this test, but the I/O operation continues and the
bru operation doesn't fail (I used bru's check/verification option).

Good luck,
--
Bernie Duffy   Systems Programmer II    | Bitnet   : BERNIE@UMBC2
Academic Computing Services - L005e     | Internet : BERNIE@UMBC2.UMBC.EDU
Univ. of Maryland Baltimore County      | UUCP     : ...!uunet!umbc3!bernie
Baltimore, MD  21228   (U.S.A.)         | W: (301) 455-3231  H: (301) 744-2954
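
The "two jobs to separate drives at once, then verify" test described above can be imitated without tar or bru. What follows is a minimal sketch in present-day Python, offered as an illustration only: it writes a scratch file of random data into each directory concurrently, reads it back, and compares SHA-256 digests. The function names, the scratch file name, and the default sizes are all invented for this example. Note the limitation that the read-back may be served from the buffer cache rather than the disk, so it is weaker than bru's own verification pass; use a file size well above available memory if you want the reads to hit the drive.

```python
import hashlib
import os
import threading


def write_and_verify(directory, blocks=8, block_size=1 << 20):
    """Write `blocks` random blocks to a scratch file, read back, compare digests."""
    path = os.path.join(directory, "scsi_stress.tmp")  # hypothetical scratch name
    written = hashlib.sha256()
    read_back = hashlib.sha256()
    try:
        with open(path, "wb") as f:
            for _ in range(blocks):
                block = os.urandom(block_size)
                written.update(block)
                f.write(block)
            f.flush()
            os.fsync(f.fileno())  # push the data out to the device
        with open(path, "rb") as f:
            for block in iter(lambda: f.read(block_size), b""):
                read_back.update(block)
    finally:
        if os.path.exists(path):
            os.remove(path)
    return written.digest() == read_back.digest()


def stress(directories, blocks=8, block_size=1 << 20):
    """Run write_and_verify on every directory at once; return {directory: ok}."""
    results = {}

    def worker(directory):
        try:
            results[directory] = write_and_verify(directory, blocks, block_size)
        except OSError:
            # An I/O error during the run counts as a failed verification.
            results[directory] = False

    threads = [threading.Thread(target=worker, args=(d,)) for d in directories]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Calling something like `stress(["/usr/tmp", "/usr2/tmp"], blocks=64)` (the paths are placeholders for one directory on each drive sharing the bus) while watching the console for "retrying request" gives a repeatable load for each cable and termination arrangement.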