[comp.sys.sun] disk sequencer error

roy@umix.cc.umich.edu (Roy Richter) (01/16/89)

I also have a Sun 3/260. In September I started getting messages similar
to yours: disk sequencer errors.  At first it was once a month; by the
end of November, once a week; and in mid-December, once a day.  Finally I
had a disk failure over Xmas.  Sun installed a new disk, and there have
been no problems since.

Roy Richter                      UUCP:  {umix,edsews,mcf}!rphroy!roy
Physics Dept, GM Research        CSNet: rrichter@gmr.com
                                 Internet: roy%rphroy.uucp@umix.cc.umich.edu

dinah@shell.UUCP (Dinah Anderson) (01/16/89)

> xy0a: read retry (disk sequencer error) -- blk #495, abs blk #495

>Is this a serious disk problem that I should worry about?  The system
>seems to be working well.  What is a disk sequencer error?  Would diag
>have to be used to fix this problem?  

We have received these errors on several different systems (several
different controller/disk combinations.) Sometimes the disk stops working
and sometimes we get a couple of the error messages and then things
proceed normally. 

I would like to know what the errors mean and under what circumstances
they occur. I would also like to know what we should do about them.

Dinah Anderson 
Shell Oil Company, Information Center (713) 795-3287
...!{sun,psuvax,soma,rice,ut-sally,ihnp4}!shell!dinah

msommer@watson.bbn.com (01/19/89)

> > xy0a: read retry (disk sequencer error) -- blk #495, abs blk #495
> 
> We have received these errors on several different systems (several
> different controller/disk combinations.) Sometimes the disk stops working
> and sometimes we get a couple of the error messages and then things
> proceed normally. 
>...

Dinah,

We were besieged with a bunch of these errors a few months ago on our
3/160, with a 2351A Eagle disk, and a Xylogics 450 controller. 

Our problem was caused by a low voltage near the controller board's
backplane slot. Luckily, our Sun field service rep had heard reports of
such a problem (though he had never seen it before, himself), and knew
exactly what to check. Once he found the problem, he just tweaked the
voltage and, presto, the errors disappeared.

If this is the cause of your (and others') problems, and you've called Sun
about them without any success, it suggests there's a lack of
communication among Sun f.s. reps.  

I should note our 3/160 was originally a 1/160 or 2/160. The CPU board and
power supply are both Sun 3 models. We upgraded our power supply after
suffering repeated crashes (several months after the CPU board was
upgraded, when we tried to add a memory board). The backplane, however,
has never been upgraded. My intuition tells me this may have something to
do with the voltage problems. People should probably avoid upgrading their
systems in such a piecemeal fashion. 

mark sommer
msommer@bbn.com

rlk@think.com (Robert L. Krawitz) (01/19/89)

We had a lot of these once, and it indicated a bad spot on the disk (we
also had hard errors at the same time, though).  Reformatting the
partition didn't help us; we had to move the affected partition to another
disk.

harvard >>>>>>  |	Robert Krawitz <rlk@think.com>	245 First St.
bloom-beacon >  |think!rlk	(postmaster)		Cambridge, MA  02142
topaz >>>>>>>>  .	Thinking Machines Corp.		(617)876-1111

eap@bu-it.bu.edu (Eric A. Pearce) (01/25/89)

tomc@dftsrv.gsfc.nasa.gov (Tom Corsetti):

 >Recently, our Sun 3/260 crashed because of a power outage....
 >Well, today, almost a
 >week later, I shutdown and rebooted, and got the message:
 >  xy0a: read retry (disk sequencer error) -- blk #495, abs blk #495
 >Is this a serious disk problem that I should worry about?...

dinah@shell.UUCP (Dinah Anderson):

 >...
 >I would like to know what the errors mean and under what circumstances
 >they occur. I would also like to know what we should do about them.

I looked up the error in my Xylogics 451 manual:

"Disk Sequencer Error - The disk sequencer did not finish its operation
within the allowed time.  Several factors may cause this problem. 

  - The 451 did not receive the servo clock signal from the selected
    disk drive.  Check the B cable; if the connection is good, try a
    different B cable port on the 451.

  - The 451 is not receiving any read data from the selected drive. Check
    the B cable.

  - The Multibus may be preventing the 451 from gaining proper access."

The manual entry I quote from above suggests the problem could be with the
cabling or the controller itself, but this has not been the case for us.
A bad controller usually spews out large numbers of errors with random
block numbers over more than one disk.  A bad cable will produce random
block errors on one drive (since it's unlikely that more than one cable
would crap out at a time.)  We had drive cable problems on some
rack-mounted systems (3/180's and 3/280's). I believe they were caused by
repeated flexing of the drive cables by the doors on the back of the
cabinets.  The older rack setups have several feet of cable that dangles
out of the back of the cabinet and moves every time you open the door. (The
doors have since been removed; I have not seen any cooling problems so
far.)

A bad disk usually will have errors that give sequential block numbers or
at least repeat them numerous times.  If you only get an occasional disk
error, such as one a week, you might be safe to just map or slip the bad
spots, but in my experience, any errors that occur with regularity are
indicative of future trouble.  
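
The rules of thumb above (scattered blocks across several drives point at
the controller, scattered blocks on one drive at a cable, repeated or
neighbouring blocks at the disk) could be roughly sketched in Python as
follows.  This is only an illustration: the function names, the `near`
threshold of 16 blocks, and the minimum of three errors are my own
assumptions, not anything from the Xylogics manual or the Sun driver, and
the log-line pattern is modelled only on the single message quoted in this
thread.

```python
import re
from collections import defaultdict

# Modelled on the one message quoted in this thread, e.g.
#   xy0a: read retry (disk sequencer error) -- blk #495, abs blk #495
# "xy0" is taken as the drive and the trailing letter as the partition.
LINE_RE = re.compile(
    r"^(?P<drive>[a-z]+\d+)[a-h]: .*-- blk #\d+, abs blk #(?P<abs>\d+)"
)


def parse(lines):
    """Return {drive: [absolute block numbers]} for the lines that match."""
    blocks = defaultdict(list)
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m:
            blocks[m.group("drive")].append(int(m.group("abs")))
    return dict(blocks)


def classify(blocks, near=16):
    """Rough guess at the failing part; thresholds are assumptions."""
    total = sum(len(nums) for nums in blocks.values())
    if total < 3:
        return "too few errors to tell"

    def clustered(nums):
        # True if a block repeats or two errors fall within `near` blocks.
        s = sorted(nums)
        return any(b - a <= near for a, b in zip(s, s[1:]))

    if any(clustered(nums) for nums in blocks.values()):
        return "suspect disk"        # repeated or sequential blocks
    if len(blocks) > 1:
        return "suspect controller"  # scattered blocks on several drives
    return "suspect cable"           # scattered blocks on one drive
```

Feeding it the console messages collected over a few weeks, for example
`classify(parse(open("messages")))` where "messages" is a hypothetical saved
log, gives a first guess only; it is no substitute for checking the B cable
or calling field service.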

If you have a Sun hardware contract, I would have them replace it as soon
as possible.  If they balk at replacing a drive with only a few errors,
push them a bit.  It *is* possible for systems to run for long periods
without disk problems.           

I would do a full level 0 dump of the disk as soon as possible.  If you act
before a crisis, you can have a scheduled downtime for a drive
replacement.  You would do a level 0 dump and Sun would come in and
replace it.  This would make the restore much easier, as you would not
have to worry about multi-level backups, not to mention the time you would
save.

I have seen this error on Fujitsu 2351's ("single" Eagle) and 2361's
("double" or "super" Eagle).  It was always accompanied by a massive
number of disk errors.  

Our local Sun field service will replace single Eagles as a whole but they
replace only parts of double Eagles (in this case the HDA and the servo
board).  

The "Eagle" series of drives seems to be rather sensitive to power
fluctuations.  The newer Hitachi DK815-10 and NEC D2363 seem to be more
tolerant.

 -e

 Eric Pearce                                   ARPANET eap@bu-it.bu.edu
 Boston University Information Technology      CSNET   eap%bu-it@bu-cs
 111 Cummington Street                         JNET    jnet%"ep@buenga" 
 Boston MA 02215                               UUCP    !harvard!bu-cs!bu-it!eap 
 617-353-2780 voice  617-353-6260 fax          BITNET  ep@buenga