[net.unix] Ra81's and bad blocks

andy@cheviot.UUCP (Andy Linton) (03/19/85)
 There has been a lot of traffic about UDA-50 devices on the net and I
 was confused about bad blocks etc. on them. I asked our Dec Service
 Engineer for more info and he produced the following:

**************************
	 
       ALL YOU WANT TO KNOW ABOUT BAD BLOCK REPLACEMENT AND MORE
 
    Introduction
 
    The purpose of this transmission is to inform the readership of the
 differences between Dec standard 144 and 166 disk media.
 
 Dec Standard 144 media: Rx01/2, Rl01/2, Rk05J/F, Rk06/7, Rp04/5/6/7,
 R80, Rm02/3/5/80.
 
    Above are some of the media that fall into the Dec standard 144
 classification. A general rule of thumb is that any massbus disk media
 conforms to this Standard. The rule may change, but in general it
 holds true. The above also includes serial and parallel drive
 subsystems, i.e. the Rl01/2, Rk06/7 and Rk05, Rx01/2.
 
 Dec Standard 166 media: Ra60/1, Ra80/1/2.
 
    Above are some of the media that fall into the Dec standard 166
 classification. A general rule here is that if it plugs into a UDA-50
 or a UDA-50 emulator, it is 166 media.
 
 Differences with respect to bad blocking:
 
    Bad blocking, by definition, is the generation by a software
 utility of a file that records the pattern sensitive or unreadable
 areas of the media under test.
 
    With the exception of Rx and Rk05 media, the manufacturer tests the
 media and creates a manufacturer's bad block area. One of the major
 differences between these media is that on 166, unlike on 144,
 additions to the manufacturer's area are not allowed. Another major
 difference between these standards is the number of bad blocks
 allowed, i.e. 61 entries (Rp06) on 144 vs 17 thousand (Ra81) on 166.
 
    The manufacturer's areas described above differ greatly between
 these two media. On the Ra series this table is known as the Factory
 Control Table, or the FCT. This table could be loosely compared with
 the Rp series manufacturer's bad block table, i.e. they both contain
 bad blocks found during manufacture, but the comparison is misleading.
 During the initialisation process on an RPxx pack, the manufacturer's
 bad block table is read and normally, dependent on the operating
 system, two separate files are generated. During the initialisation
 process on an Ra pack the FCT is not readable, and we therefore create
 two files with null entries, assuming our initialisation process
 doesn't know about 166 media.
 
    If we compare what occurs on the major operating systems during bad
 block detection, I hope the reader can make sense of the above
 statement. Suppose a bad block develops on an initialised RPxx on a
 running system. The drive subsystem reports a hard ECC error to the
 operating system. The action taken by the operating system on receipt
 of this error normally takes the form of x number of retries with
 offsets. If, after the retries, the error reported by the subsystem is
 still an ECH (hard uncorrectable ECC), an entry is added to a file,
 let's say badblock.log. Two things result: one, the data in that block
 is lost, and two, that block is never used again during write
 operations. How this is accomplished, again dependent on the operating
 system, is that at mount time or on detection the badblock files are
 read and stored in memory. In other words, it becomes a system
 overhead.
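
    The host-side handling described above can be sketched roughly as
 follows. This is only a toy model, not any operating system's actual
 code: the class name, the retry count of 3 and the drive_read callback
 are all assumptions made for illustration.

```python
class Std144Volume:
    """Toy model of host-side bad block handling on a Standard 144 pack."""

    MAX_RETRIES = 3  # hypothetical; the text only says "x number of retries"

    def __init__(self, factory_bad, logged_bad=()):
        # The two bad block files generated at initialisation time.
        self.factory_bad = set(factory_bad)
        self.logged_bad = set(logged_bad)  # stands in for badblock.log
        self.in_memory = set()             # filled in at mount time

    def mount(self):
        # At mount time both files are read and kept in memory.
        # This is the "system overhead" the text refers to.
        self.in_memory = self.factory_bad | self.logged_bad

    def read(self, block, drive_read):
        # drive_read(block, offset) is a hypothetical callback returning
        # the data, or None when the drive reports a hard ECC error.
        for offset in range(self.MAX_RETRIES):
            data = drive_read(block, offset)
            if data is not None:
                return data
        # Still a hard error after all retries: the data is lost and the
        # block is logged so it is never used for writes again.
        self.logged_bad.add(block)
        self.in_memory.add(block)
        return None

    def usable_for_write(self, block):
        return block not in self.in_memory
```

 The point of the sketch is that every bad block costs the host a table
 entry and a lookup, which is the overhead the RAx series avoids.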
 
    This differs with respect to the RAx series. At mount time the same
 action occurs as on the RPx subsystem, but since there are (let's say)
 no entries, no harm is done here. As we write information onto this
 structure, the RAx microprocessor notes from the target header that
 the block is bad and consults the Re-Vector Control Table, RCT, to
 find where the data should be written, i.e. where the block has been
 re-vectored. The write then goes to the re-vectored block without
 system intervention.
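
    A rough sketch of that drive-side lookup, under stated assumptions:
 the class below is hypothetical, the RCT is modelled as a plain
 dictionary, and replacement blocks are named with made-up ("rbn", n)
 tuples. The real on-disk layout is not described in this note.

```python
class RaDriveModel:
    """Toy model of transparent re-vectoring inside an RAx subsystem."""

    def __init__(self, rct=None):
        # RCT modelled as: bad logical block -> replacement block.
        self.rct = dict(rct or {})
        # Blocks whose header is marked bad (here: every RCT entry).
        self.bad_headers = set(self.rct)
        self.media = {}  # physical block -> data

    def _target(self, lbn):
        # The header says "bad", so consult the RCT for the new location.
        if lbn in self.bad_headers:
            return self.rct[lbn]
        return lbn

    def write(self, lbn, data):
        self.media[self._target(lbn)] = data

    def read(self, lbn):
        return self.media.get(self._target(lbn))


drive = RaDriveModel(rct={10: ("rbn", 0)})
drive.write(10, b"payload")   # the host asks for block 10 ...
print(drive.media)            # ... but the data lands in the replacement
```

 The host only ever names block 10; the redirection is invisible to it.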
 
    The RCT is a direct copy of the FCT. Neither table is directly
 accessible, at present, by any operating system, only by the
 applicable engineering diagnostic. If during a read operation the
 subsystem reports an ECC error and the operating system supports Bad
 Block Replacement, BBR, the system, dependent on the reported error,
 i.e. 1-8 symbol ECC errors, can determine when it wants to re-vector
 the block before the data degrades to unreadable. If it is
 ascertained that the data is unreadable (worst case, ECH), a four
 phase process is started.
 
    Phase one: the error, hard or recoverable, is reported via the
 UDA-50 to the operating system. If the error is hard or the limit is
 reached, the system starts phase two: recover data, test block, and
 report findings on the suspect block. What happens during this phase
 is that the data is read and written into a scratch area of the RCT,
 and a test pattern similar to the read data is re-written. After a
 re-read, error information is passed back to the system: "yes, bad
 block". The system says go, phase three please: find and test a
 primary or secondary replacement block, mark the header of the bad
 block as bad, add the block to the RCT, and report errors or
 completion. Go phase four: write the data to the re-vectored block.
 If an ECH occurred during the read, write good ECC but invert the EDC
 bit to notify the system that a forced replacement occurred, i.e. the
 data had an ECH and must be re-written or restored from whatever
 backup media was last used.
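
    The four phases can be sketched as a single routine. This is a
 simplified model under several assumptions: the Disk fields and the
 replace_block name are invented for illustration, block testing is
 reduced to a set lookup, and the "block tests good" early return is a
 guess at a case the note does not cover.

```python
from dataclasses import dataclass, field


@dataclass
class Disk:
    blocks: dict = field(default_factory=dict)      # block -> data
    failing: set = field(default_factory=set)       # blocks that fail testing
    unreadable: set = field(default_factory=set)    # blocks that gave an ECH
    spares: list = field(default_factory=list)      # candidate replacements
    rct: dict = field(default_factory=dict)         # bad block -> replacement
    bad_headers: set = field(default_factory=set)   # headers marked bad
    forced_error: set = field(default_factory=set)  # blocks with inverted EDC
    scratch: object = None                          # scratch area of the RCT


def replace_block(disk, lbn):
    # Phase 1 happened already: the error was reported via the controller
    # and the host decided it was hard or over the limit.

    # Phase 2: save whatever data was read, then test the suspect block.
    data_lost = lbn in disk.unreadable
    disk.scratch = disk.blocks.get(lbn)
    if lbn not in disk.failing:
        return "block tests good"  # assumption: no replacement needed

    # Phase 3: find and test a replacement, mark the header, update RCT.
    in_use = set(disk.rct.values())
    replacement = next(
        (s for s in disk.spares if s not in disk.failing and s not in in_use),
        None,
    )
    if replacement is None:
        return "no replacement available"
    disk.bad_headers.add(lbn)
    disk.rct[lbn] = replacement

    # Phase 4: write the saved data to the re-vectored block. If the
    # original read was an ECH, flag the block (inverted EDC) so the
    # host knows the contents must be re-written or restored.
    disk.blocks[replacement] = disk.scratch
    if data_lost:
        disk.forced_error.add(lbn)
    return "replaced"
```

 In this model a block replaced after a recoverable error keeps its
 data with no flag set, while one replaced after an ECH ends up in
 forced_error until it is rewritten.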
 
 Summary
 
    I hope the readership can tell from the above that BBR, if
 implemented, will protect data to a level not previously thought
 possible. Drive microcode determines the error limit at which BBR
 kicks off.
 
    Note from the above that only the RCT is updated with additional
 bad blocks, not the FCT. If a reformat is done on the device, the RCT
 data is zeroed and replaced with the FCT information. This means any
 additional re-vectored blocks are handed back to your users, assuming
 the formatter doesn't find these pattern sensitive areas. I would
 only recommend that a re-format be done after gross numbers of
 re-vectors due to read/write problems. If inverted EDCs are a
 problem, have engineering write to the customer area using
 /sec:manual. Your data will be lost, but this action will re-invert
 the EDCs and will not lose the good information held in the RCT.
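
    As a small illustration of why a reformat is risky, the effect on
 the tables can be modelled with sets. The function name and the
 formatter_finds parameter are hypothetical, and treating blocks the
 formatter rediscovers as new RCT entries is an assumption.

```python
def reformat(fct_entries, rct_entries, formatter_finds=()):
    """Model a reformat: the RCT is rebuilt from the FCT alone.

    Returns (new_rct, returned_to_users). The second set holds the
    field-added re-vectors that the reformat has forgotten about.
    """
    new_rct = set(fct_entries) | set(formatter_finds)
    returned_to_users = set(rct_entries) - new_rct
    return new_rct, returned_to_users


# Blocks 50 and 60 were re-vectored in the field; the formatter happens
# to catch only block 60 again.
new_rct, back_in_service = reformat({1, 2}, {1, 2, 50, 60}, formatter_finds={60})
print(sorted(new_rct), sorted(back_in_service))
```

 Block 50 in this example is back in normal use even though it was
 replaced earlier, which is the risk the paragraph above warns about.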

Regards

Ed Merrill

Country support Engineer
Internal Consultancy group
Basingstoke, England
44-256-56101 ext 3778

******************

I hope this is of some interest to those of you who have problems
with Ra81's (as I do).
Andy


Aindrias Mac Giolla Fhionntain - Computing Lab., U of Newcastle upon Tyne 
	ARPA	: andy%cheviot%newcastle.mailnet@MIT-MULTICS.ARPA	     
	UUCP	: UK!ukc!cheviot!andy					     

***  Ni fui moran beagan d'aon rud, ach is fui moran beagan ceille.  ***