[comp.sys.att] The 3B1 and the Bad Block

jeff@cjsa.wa.com (Jeffery Small) (01/12/91)

I guess I need the help of some of you experts!

Environment:	3B1, 2MB RAM, (1) 67MB internal disk,
		(2) RS232 expansion cards, unix-3.51m, WD2010

I have been getting a number of the following HDERR reports in
/usr/adm/unix.log (long lines have been wrapped for readability):


    HDERR ST:51 EF:40 CL:E3 CH:3 SN:7 SC:1 SDH:22 DMACNT:FFFF DCRREG:9A
    MCRREG:8F00 Thu Dec 27 11:11:00 1990

    WD2010 ST=/Sekg/Err/ EF=/CRC/ cy=995. sc=7. hd=2. dr#=0. MCR2:0x0
    Thu Dec 27 11:11:04 1990
    
    drv:0 part:2 blk:58635 rpts:1 Fri Jan 11 07:19:14 1991


So, I used Brant Cheikes' great "bf" program to determine that block number
58635 was allocated to inode #2583.  ncheck then told me that this was
currently assigned to my 1MB-sized Cnews "history" file.  Next, I tried to
make a copy of this file and after about 8 attempts I had a successful read
of this data block.  I renamed the bad file and installed the good copy as
the "history" file.

Although I have three of these machines (one dating back to 1985), I have
never had a repeatable HD error before (just lucky, I guess), so I was now
ready to add my first block to the Bad Block Table.  I booted the revised
(WD2010) s4diag diskette, selected test 3 "Enter Bad Blocks" and when
prompted, selected option #3 to specify by logical block number.  At the
prompt for the block number I entered:  58635.  The diagnostic routine then
responded with [approximately]:

    Added bad block: Cylinder 916, Track 1, Sector 6.
    Used Track 7329 as the alternative.

Cylinder 916?  Shouldn't this be 995?  At the next prompt to
(Add, Delete or Ignore) I entered an "I" and then thought I would try
this procedure again.  I re-ran the test and re-entered block #58635 and
everything occurred as described above but now the report read:

    Added bad block: Cylinder 916, Track 1, Sector 7.
    Used Track 7329 as the alternative.

Running the expert subtest #6,12 to display the BBT, I now saw that I had
16 BBT entries with the last two being the 916,1,6 and 916,1,7 as reported
above.

Not sure what to do next, I rebooted the machine.  As you might expect, the 
problem was not resolved.  The bad file still contained the bad block and
"cat file" yielded additional error reports to unix.log.

So my questions are:

1:  Why didn't the block number in the error report (58635) work?  What
    (probably obvious) idea am I missing and how should I properly fix this
    problem?

2:  Did I just (potentially) hose some disk files by entering the two good
    sectors into the BBT or did the contents of these sectors get copied to
    the alternate track by the diagnostic routine?  If the data was not
    copied, is there a way that I could determine the files (if any) which
    were damaged?

3:  Can I use the same diagnostic routine to recover the use of these two
    sectors by "Deleting" them from the BBT?  If not, what is the "Delete"
    option for?


Any help would be greatly appreciated.  Thanks.
--
Jeff Small                     C. Jeffery Small & Associates    (206) 232-3338
uunet!nwnexus!cjsa!jeff        7000 E Mercer Way,  Mercer Island, WA     98040

thad@cup.portal.com (Thad P Floryan) (01/13/91)

jeff@cjsa.wa.com (Jeffery Small) in <1991Jan12.014524.300@cjsa.wa.com> writes:

	I have been getting a number of the following HDERR reports in
	/usr/adm/unix.log (long lines have been wrapped for readability):

[A]	HDERR ST:51 EF:40 CL:E3 CH:3 SN:7 SC:1 SDH:22 DMACNT:FFFF DCRREG:9A
	MCRREG:8F00 Thu Dec 27 11:11:00 1990

[B]	WD2010 ST=/Sekg/Err/ EF=/CRC/ cy=995. sc=7. hd=2. dr#=0. MCR2:0x0
	Thu Dec 27 11:11:04 1990
    
[C]	drv:0 part:2 blk:58635 rpts:1 Fri Jan 11 07:19:14 1991

	So my questions are:

	1:  Why didn't the block number in the error report (58635) work?  What
	    (probably obvious) idea am I missing and how should I properly fix
	    this problem?

The "blk:58635" is relative to the start of the THIRD HD partition ("part:2",
counting from 0).  You need to convert that to a block number relative to the
beginning of the HD for use with the bad-block mapper in s4diag.
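
As a rough sketch of that conversion (not the hdhelp code itself): add up the
sizes of all the partitions in front of the one named in the error report, then
add the reported block number.  The function name is made up for illustration,
and it ASSUMES the partitions sit back to back from the start of the disk.

```c
#include <assert.h>

/* Sketch only: convert a block number that is relative to partition
 * `part` into a block number relative to the start of the disk.
 * part_blocks[i] is the size of partition i in 1K blocks; partitions
 * are ASSUMED to be contiguous, starting at block 0 of the disk.
 */
long abs_block_from_partition(const long *part_blocks, int nparts,
                              int part, long blk)
{
    long offset = 0;
    int i;

    assert(part >= 0 && part < nparts);
    assert(blk >= 0 && blk < part_blocks[part]);

    for (i = 0; i < part; i++)
        offset += part_blocks[i];   /* skip every earlier partition */

    return offset + blk;
}
```

With invented sizes of 1000, 4000 and 60000 blocks for partitions 0, 1 and 2,
"part:2 blk:58635" would come out as 1000 + 4000 + 58635 = 63635.  Those sizes
are NOT from a real disk; use the figures from your own partition report.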

I have a program "hdhelp" that performs the calculation of the "real" block
number given items [A] and/or [C] above from /usr/adm/unix.log along with the
"-t" partitioning report data from "iv".  The program is still in an "ALPHA"
stage because I'm still "playing" with it to handle other error reports and to
report the bad block in several different forms including byte-offset.
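
The calculation from item [A] can be sketched like this.  It follows the same
arithmetic as "method 1" in the hdhelp.c source at the end of this posting, but
the function name is made up for illustration, and the head count is something
you have to supply for your own drive.

```c
#include <assert.h>

#define BLOCKS_PER_TRACK 8   /* eight 1K blocks per track, as in hdhelp.c */

/* Sketch of hdhelp.c's "method 1": rebuild the whole-disk 1K block
 * number from the controller registers logged on the HDERR line.
 * Only the low byte of CL, CH and SN and the low nybble of SDH are
 * significant.  The head count is drive-specific (caller supplies it).
 */
long abs_block_from_hderr(unsigned cl, unsigned ch, unsigned sn,
                          unsigned sdh, int heads)
{
    long cylinder = (long)(ch & 0xFF) * 256 + (cl & 0xFF);
    long head     = sdh & 0x0F;        /* low nybble of SDH is the head */
    long block    = (sn & 0xFF) >> 1;  /* two 512-byte sectors per block */

    assert(heads > 0 && head < heads);

    return (cylinder * heads + head) * BLOCKS_PER_TRACK + block;
}
```

For "CL:E3 CH:3 SN:7 SDH:22" this decodes to cylinder 0x3E3 = 995 and head 2,
which agrees with the "cy=995. hd=2." fields in item [B].  ASSUMING an 8-head
drive (my guess, not something stated in the report), the block number works
out to (995 * 8 + 2) * 8 + 3 = 63699.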

I have also thought that "hdhelp" might eventually be adapted to "correct" the
problem (by mapping out bad block(s)) "online" ... but there are potential
nasties with doing this on a mounted file system, and I haven't yet given any
thought to a strategy for it.

To get the partition information needed, you can run "iv" su'd online thusly:

	# iv  -t  /dev/rfp000

or you can request the same information from one of s4diag's menus.

I've included a "shar" of the present version 0.1 "hdhelp" at the end of this
posting since it IS useful in its present form.  NO DOCS are yet available but
it should be easy to follow the code and the comments.  I've already used it
to calculate the bad blocks to be mapped-out on 5 systems, but the program will
change considerably before the version 1.0 "official" release.  One note: the
program should be compiled and run on a system OTHER than the one with the
problem!  :-)   I believe it'll even compile and run with the C on my C-64.

	2:  Did I just (potentially) hose some disk files by entering the two
	    good sectors into the BBT or did the contents of these sectors get
	    copied to the alternate track by the diagnostic routine?  If the
	    data was not copied, is there a way that I could determine the
	    files (if any) which were damaged?

Yep, you hosed 'em real good!  Reads of those two sectors will presumably now
be redirected to the alternate track.  But I seriously doubt the s4diag
bad-block mapper copies or zaps anything, so the information in the original
blocks should still be intact.  Given that you already know how to run "bf":

	So, I used Brant Cheikes' great "bf" program to determine that block
	number 58635 was allocated to inode #2583.  ncheck then told me that
	this was currently assigned to my 1Mb-sized Cnews "history" file.

you could do the same thing to determine the file(s) to which the blocks you
did map out were assigned.  In this case, you have to convert the partition 0
block number (which is what s4diag uses) to a partition 2 block number for bf
to do its thing.  This calculation is the INVERSE of what "hdhelp" does.  I
don't know if you have to un-bad-block-map the two "good" blocks you canned,
but you can try it both ways.
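
A sketch of that inverse calculation, under the same ASSUMPTION that the
partitions are contiguous from the start of the disk.  The function name is
hypothetical; it is not part of hdhelp or bf.

```c
#include <assert.h>

/* Sketch only: the inverse of the hdhelp calculation.  Given a block
 * number relative to the start of the disk, find which partition it
 * falls in and its block number within that partition.  Partitions
 * are ASSUMED contiguous from block 0.  Returns the partition index,
 * or -1 if the block lies beyond the last partition.
 */
int partition_from_abs_block(const long *part_blocks, int nparts,
                             long abs_blk, long *rel_blk)
{
    long start = 0;   /* first whole-disk block of partition i */
    int i;

    assert(abs_blk >= 0 && rel_blk != 0);

    for (i = 0; i < nparts; i++) {
        if (abs_blk < start + part_blocks[i]) {
            *rel_blk = abs_blk - start;
            return i;
        }
        start += part_blocks[i];
    }
    return -1;
}
```

With invented partition sizes of 1000, 4000 and 60000 blocks, whole-disk block
58635 would land in partition 2 at block 58635 - 5000 = 53635, and that is the
number you would hand to bf.  Substitute the sizes from your own partition
report.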

	3:  Can I use the same diagnostic routine to recover the use of these
	    two sectors by "Deleting" them from the BBT?  If not, what is the
	    "Delete" option for?

Yes, you "should" be able to recover the two erroneously mapped-out blocks
using the "Delete" option.

One thing that "bothers" me with your posting is that you didn't indicate
whether you mapped out both sectors of a logical block or whether you just
mapped out by single sector.  If you simply use "Delete" to undo what you
originally did, you should be OK.  But keep in mind that a 1K logical block
on the 3B1 comprises two sectors (physical 512-byte blocks).
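
To see which two sectors a given 1K block covers, something along these lines
should do.  It is only a sketch: the struct and function names are made up, the
eight-blocks-per-track figure is the one hdhelp.c uses, and the head count is
an input you must supply for your own drive.

```c
#include <assert.h>

#define BLOCKS_PER_TRACK  8  /* 1K blocks per track, as in hdhelp.c */
#define SECTORS_PER_BLOCK 2  /* 512-byte sectors per 1K block */

struct chs {
    long cylinder;
    int  head;
    int  first_sector;   /* the block also covers first_sector + 1 */
};

/* Sketch only: split a whole-disk 1K block number into cylinder,
 * head and the first of its two 512-byte sectors on that track.
 */
struct chs chs_from_abs_block(long abs_blk, int heads)
{
    struct chs r;
    long track;

    assert(abs_blk >= 0 && heads > 0);

    track          = abs_blk / BLOCKS_PER_TRACK;
    r.cylinder     = track / heads;
    r.head         = (int)(track % heads);
    r.first_sector = (int)(abs_blk % BLOCKS_PER_TRACK) * SECTORS_PER_BLOCK;
    return r;
}
```

ASSUMING 8 heads, block 58635 taken as a whole-disk number gives track 7329,
i.e. cylinder 916, head 1, sectors 6 and 7.  That matches the two entries
s4diag said it added, whereas the WD2010 line puts the real fault on cylinder
995, head 2.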

Thad Floryan [ thad@cup.portal.com ]

---- Cut Here and feed the following to sh ----
#!/bin/sh
# This is a shell archive (produced by shar 3.49)
# To extract the files from this archive, save it to a file, remove
# everything above the "!/bin/sh" line above, and type "sh file_name".
#
# made 01/13/1991 08:17 UTC by thad@thadlabs
# Source directory /u/thad/temp
#
# existing files will NOT be overwritten unless -c is specified
#
# This shar contains:
# length  mode       name
# ------ ---------- ------------------------------------------
#    291 -rw-r--r-- Makefile
#   6512 -rw-r--r-- hdhelp.c
#
if touch 2>&1 | fgrep 'amc' > /dev/null
 then TOUCH=touch
 else TOUCH=true
fi
# ============= Makefile ==============
if test -f 'Makefile' -a X"$1" != X"-c"; then
	echo 'x - skipping Makefile (File already exists)'
else
echo 'x - extracting Makefile (Text)'
sed 's/^X//' << 'SHAR_EOF' > 'Makefile' &&
X# 3B1 makefile for hdhelp
X#
XCC	=	cc
XCFLAGS	=	-O
XLDFLAGS	=	-s
XLIBS	=	/lib/crt0s.o /lib/shlib.ifile
XNAME	=	hdhelp
XOBJS	=	hdhelp.o
XDEST	=	/usr/local/bin
X
X$(NAME)	:	$(OBJS)
X		$(LD) $(LDFLAGS) -o $(NAME) $(OBJS) $(LIBS)
X
Xinstall :	$(NAME)
X		mv $(NAME)  $(DEST)/.
X
Xclean	:
X		rm -f $(OBJS) core *~
SHAR_EOF
$TOUCH -am 0113001591 'Makefile' &&
chmod 0644 Makefile ||
echo 'restore of Makefile failed'
Wc_c="`wc -c < 'Makefile'`"
test 291 -eq "$Wc_c" ||
	echo 'Makefile: original size 291, current size' "$Wc_c"
fi
# ============= hdhelp.c ==============
if test -f 'hdhelp.c' -a X"$1" != X"-c"; then
	echo 'x - skipping hdhelp.c (File already exists)'
else
echo 'x - extracting hdhelp.c (Text)'
sed 's/^X//' << 'SHAR_EOF' > 'hdhelp.c' &&
X/*	hdhelp
X *
X *	This program helps identify the bad block(s) reported in the file
X *	/usr/adm/unix.log and/or on the screen of the UNIXPC/3B1/PC7300.
X *
X *	Usage:
X *
X *		hdhelp  [ -# ]
X *
X *	Where:	# = method number if both are not desired
X *
X *	The format of HD errors with kernels up to and including 3.51a is:
X *
X *		HDERR ST:11 EF:40 CL:4241 CH:4201 SN:420C SC:4202 SDH:4223 \
X *		DMACNT:FFFF DCRREG:93 MCRREG:8100 Tue Dec 27 02:23:51 1988
X *
X *		drv:0 part:2 blk:15510 rpts:1 Tue Dec 27 02:23:53 1988
X *
X *	The bad block can be calculated using two methods, each as a check
X *	on the other, depending on the available data.
X *
X *	The first method uses the ...
X */
X
X#include <stdio.h>
X
Xstatic char *version = "@(#) hdhelp 0.1 Thad Floryan 17-Oct-1990";
X
Xmain(argc, argv)
X	int	argc;
X	char	*argv[];
X{
X	extern int scanf();
X
X	int	method1 = 0;
X	int	method2 = 0;
X	int	choice;
X	int	CL;	/* Cylinder LOW: only lower byte is significant */
X	int	CH;	/* Cylinder HIGH: only lower byte is significant */
X	int	SN;	/* Sector Number: only lower byte is significant */
X	int	SDH;	/* Head Number: only lower nybble is significant */
X	int	num_heads;		/* number of HD heads */
X	int	part_num;		/* current partition number */
X	int	block_num;		/* HD block number */
X	int	track;			/* HD track number */
X	int	part_blocks[17];	/* partition data from s4test DIAG */
X	int	part_index;		/* subscript for part_blocks[] */
X	int	block1 = 0;		/* method 1 results */
X	int	sector1 = 0;		/* method 1 results */
X	int	block2 = 0;		/* method 2 results */
X	int	sector2 = 0;		/* method 2 results */
X	int	blocks_per_track = 8;	/* UNIXPC has eight 1024-byte blocks */
X					/* same as sixteen 512-byte sectors */
X					/* with one spare per track */
X	int	sectors_per_block = 2;	/* UNIXPC with 1K file system (std) */
X
X	if (argc == 1)
X	{
X		method1 = method2 = 1;
X	}
X	else if (argc == 2)
X	{
X		choice = -atoi(argv[1]);
X		if (choice == 1)
X		{
X			method1 = 1;
X		}
X		else if (choice == 2)
X		{
X			method2 = 1;
X		}
X		else
X		{
X			DoUsage(argv[0]);
X		}
X	}
X	else
X	{
X		DoUsage(argv[0]);
X	}
X
X	printf(
X"You will be asked to supply several values from the HD error report found\n");
X	printf(
X"in /usr/adm/unix.log and/or from the s4test DIAG report; enter each value\n");
X	printf(
X"followed by a RETURN.  If the data available is only that which appears\n");
X	printf(
X"on your UNIXPC's screen, select method 1.  Be SURE to read this program's\n");
X	printf(
X"accompanying documentation!  You use this program at your own risk.  The\n");
X	printf(
X"program's author believes this program to be correct, but, in ALL cases,\n");
X	printf(
X"you, the user, are responsible for the (mis)use, (mis)interpretation, and\n");
X	printf(
X"(mis)application of this program's calculations.  Be forewarned!\n");
X
X	PromptDec("\n\tNumber of HD heads? ", &num_heads);
X
X	if (method1 != 0)
X	{
X		printf("\nMETHOD 1 DATA INPUT:\n\n");
X
X		printf(
X"The values for the next 4 inputs can be found in the /usr/adm/unix.log\n");
X		printf(
X"on the long line which begins \"HDERR ST: ...\"\n\n");
X
X		PromptHex("\tvalue of  CL:", &CL);
X		PromptHex("\tvalue of  CH:", &CH);
X		PromptHex("\tvalue of  SN:", &SN);
X		PromptHex("\tvalue of SDH:", &SDH);
X
X		block1  =   (CH  & 0xFF) * 256 * num_heads * blocks_per_track
X			  + (CL  & 0xFF)       * num_heads * blocks_per_track
X			  + (SDH & 0x0F)                   * blocks_per_track
X			  + ((SN & 0xFF) >> 1);
X
X		sector1 =   (CH  & 0xFF) * 256 * num_heads * blocks_per_track
X			  + (CL  & 0xFF)       * num_heads * blocks_per_track
X			  + (SDH & 0x0F)                   * blocks_per_track;
X		sector1 *=  sectors_per_block;
X		sector1 +=  (SN  & 0xFF);
X	}
X
X	if (method2 != 0)
X	{
X		printf("\nMETHOD 2 DATA INPUT:\n\n");
X
X		printf(
X"The values for the next 2 inputs can be found in the /usr/adm/unix.log\n");
X		printf(
X"on the line which looks like \"drv:0 part:2 blk:25916 rpts:1 ...\"\n");
X		printf(
X"The prompt calculations are assuming %d heads as previously entered.\n\n",
X			num_heads);
X
X		PromptDec("\tpart:", &part_num);
X		PromptDec("\t blk:", &block_num);
X
X		printf(
X"\nThe values for the next %d inputs are from the s4test DIAG disk report\n\n",
X			part_num + 1);
X
X/*
X *	Ask for one more partition than needed just so no-one feels
X *	queasy about not entering everything on the s4test report.
X *	Believe me, this is important user psychology.
X */
X		for (part_index=track=0; part_index <= part_num; part_index++)
X		{
X			printf("\tPartition %d: start Track=%d, ",
X				part_index, track);
X
X			PromptDec("size (in Blocks)=",
X				&part_blocks[part_index]);
X
X			track += (part_blocks[part_index] / blocks_per_track);
X
X			if (part_index < part_num)
X			{
X				block2 += part_blocks[part_index];
X			}
X		}
X		block2 += block_num;
X		sector2 = block2 * sectors_per_block;
X	}
X
X	if (method1 != 0)
X	{
X		printf("\nMETHOD 1 RESULTS:\n\n");
X		printf("For a HD with %d heads and error report per ",
X			num_heads);
X		printf("\"CL:%04X CH:%04X SN:%04X SDH:%04X\"\n\n",
X			CL, CH, SN, SDH);
X		printf("\tThe partition 0 block number is %d\n", block1);
X		printf("\tThe partition 0 sector number is %d\n", sector1);
X	}
X
X	if (method2 != 0)
X	{
X		printf("\nMETHOD 2 RESULTS:\n\n");
X		printf(
X"For a HD error on \"part:%d blk:%d\" and partitioned per:\n",
X			part_num, block_num);
X		for (part_index=track=0; part_index <= part_num; part_index++)
X		{
X			printf(
X"\tPartition %d: start Track=%d, size (in Blocks)=%d\n",
X				part_index, track, part_blocks[part_index]);
X			track += (part_blocks[part_index] / blocks_per_track);
X		}
X		printf("\n\tThe partition 0 block number is %d\n", block2);
X		printf("\tThe partition 0 sector numbers are %d and %d\n",
X			sector2, sector2 + 1);
X	}
X
X	if (method1 != 0 && method2 != 0)
X	{
X		if ((block1 == block2) &&
X			((sector1 == sector2) || (sector1 == sector2 + 1)))
X		{
X			printf(
X"\nThe two methods concur, so you can proceed per the documentation.\n");
X		}
X		else
X		{
X			printf(
X"\nThe values for the blocks disagree; please check your data input.\n");
X		}
X	}
X}
X
X
XPromptDec(msg, val)
X	char	*msg;
X	int	*val;
X{
X	extern int strlen();
X
X	char	inbuf[81];
X
X	printf(msg);
X	fgets(inbuf, 80, stdin);
X	inbuf[strlen(inbuf) - 1] = '\0';	/* null out newline */
X	sscanf(inbuf, "%d", val);
X}
X
X
XPromptHex(msg, val)
X	char	*msg;
X	int	*val;
X{
X	extern int strlen();
X
X	char	inbuf[81];
X
X	printf(msg);
X	fgets(inbuf, 80, stdin);
X	inbuf[strlen(inbuf) - 1] = '\0';	/* null out newline */
X	sscanf(inbuf, "%x", val);
X}
X
X
XDoUsage(pname)
X	char	*pname;
X{
X	printf("usage: %s  [ -# ]\n", pname);
X	printf("where: # is either 1 or 2; see program docs\n%s\n", version+5);
X	exit(1);
X}
SHAR_EOF
$TOUCH -am 0113001591 'hdhelp.c' &&
chmod 0644 hdhelp.c ||
echo 'restore of hdhelp.c failed'
Wc_c="`wc -c < 'hdhelp.c'`"
test 6512 -eq "$Wc_c" ||
	echo 'hdhelp.c: original size 6512, current size' "$Wc_c"
fi
exit 0