[comp.unix.admin] Problem with dump

bbraden@mesrx.UUCP (Bill Braden) (01/31/91)

Using dump under Ultrix V3.1 and trying to do a level 0 dump of /usr
I am getting the following result:
-------------------------------------------------------------------------
# dump 0sf 4800 /dev/nrmt0h /usr
  DUMP: Date of this level 0 dump: Wed Jan 30 08:47:29 1991
  DUMP: Date of last level 0 dump: the epoch
  DUMP: Dumping /dev/rra0c (/usr) to /dev/nrmt0h
  DUMP: Mapping (Pass I) [regular files]
  DUMP: Mapping (Pass II) [directories]

  DUMP: Estimates based on 4800 feet of tape at a density of 10240
BPI...
  DUMP: This dump will occupy 13296 (10240 byte) blocks on 0.39 tape(s).

  DUMP: Dumping (Pass III) [directories]
  DUMP: Dumping (Pass IV) [regular files]

		********************
  DUMP: (This should not happen)bread from /dev/rra0c [block 69872]:
count=8192, got=-1
		********************

  DUMP: 13295 tape blocks were dumped on 1 tape(s)

  DUMP: Dump is done
---------------------------------------------------------------------------
Apparently this should not happen, but it does.  The problem only occurs on
level 0 dumps.  fsck finds no problems with the file system.  I am doing
this with the file system mounted (and probably active).  The system is a
MicroVAX II and the disk is an RD54; /usr is the only file system on the
disk and covers the entire disk.  Are these dumps worth the tape they
are written on?  Is there a way to clear this?

Please no lectures about dumping active file systems.  That subject has
already been covered.  Besides, life is not without risk.

Thanks in advance for any help.

Bill Braden
!uunet!mesrx!bbraden		"Onward Through The Fog"
-------------------------------------------------------------------------
Any opinions expressed belong to me not Measurex.
-------------------------------------------------------------------------

vixie@decwrl.dec.com (Paul A Vixie) (02/01/91)

Sounds like a hard error.  What does "uerf" say?  Try bringing
the system down to single-user and using "radisk -s 0 -1 /dev/rra0c".

Fsck doesn't read data blocks, so it wouldn't find this.

Cheers,
--
Paul Vixie
DEC Western Research Lab	<vixie@wrl.dec.com>
Palo Alto, California		...!decwrl!vixie

rbj@uunet.UU.NET (Root Boy Jim) (02/01/91)

In article <529@mesrx.UUCP> bbraden@mesrx.UUCP (Bill Braden) writes:
>Using dump under Ultrix V3.1 and trying to do a level 0 dump of /usr
>I am getting the following result:
>  DUMP: This dump will occupy 13296 (10240 byte) blocks on 0.39 tape(s).
>  DUMP: (This should not happen)bread from /dev/rra0c [block 69872]:
>count=8192, got=-1
>  DUMP: 13295 tape blocks were dumped on 1 tape(s)
>  DUMP: Dump is done

OK, I won't flame you, but if you really want to know, you may have
to dump with the FS unmounted, or read only. What happens when you
dump the block device? Perhaps your FS is screwed up. Can you write
a C program (or use dd) to read block 69872? Can you take the tape
and restore it somewhere else (different machine or partition)?
You will note that only one block was missed. Use the duck test.
If your dump tape walks, talks, and quacks like a dump tape, then it's
probably a dump tape. Probably. As you say, life is not without risk.
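
The "read block 69872" test suggested above could be sketched roughly as
follows. This is only an illustration: the function name read_block is made
up here, and it assumes the block number printed by dump counts 512-byte
device blocks, which the thread does not confirm for Ultrix V3.1.

```python
import os

DEV_BSIZE = 512  # ASSUMPTION: dump's block number is in 512-byte units


def read_block(path, block, count=8192, block_size=DEV_BSIZE):
    """Seek to `block` and try to read `count` bytes; return the bytes read.

    An OSError from the seek or the read corresponds to the failure that
    dump reports as got=-1.
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        os.lseek(fd, block * block_size, os.SEEK_SET)
        return os.read(fd, count)
    finally:
        os.close(fd)
```

On the system in question the call would be something like
read_block("/dev/rra0c", 69872).  If it raises OSError every time, a hard
error on the disk is the likely explanation.
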
-- 

	Root Boy Jim Cottrell <rbj@uunet.uu.net>
	Close the gap of the dark year in between

stevel@Autodesk.COM (Steve Litras) (02/06/91)

In article <529@mesrx.UUCP> bbraden@mesrx.UUCP (Bill Braden) writes:
>  DUMP: (This should not happen)bread from /dev/rra0c [block 69872]:
>count=8192, got=-1

We've had this problem too. In fact we just replaced the disk, so I don't know
yet whether that will help (I'm assuming it will fix it, but you never know).
According to our Sun engineer, it is a fairly harmless problem (soft error),
but I have had it produce bogus backups.

grr@cbmvax.commodore.com (George Robbins) (02/06/91)

In article <2528@autodesk.COM> stevel@Autodesk.COM (Steve Litras) writes:
> In article <529@mesrx.UUCP> bbraden@mesrx.UUCP (Bill Braden) writes:
> >  DUMP: (This should not happen)bread from /dev/rra0c [block 69872]:
> >count=8192, got=-1
> 
> We've had this problem too. In fact we just replaced the disk, so I don't know
> yet whether that will help (I'm assuming it will fix it, but you never know).
> According to our Sun engineer, it is a fairly harmless problem (soft error),
> but I have had it produce bogus backups.

Well, there's no way it's a "soft" error: the data that dump was trying to read
isn't read, and junk is written on the tape.  In some cases, it may be that the data
is "don't care"...

There seem to be several major causes for the problem.  One is an actual read
error, in which case you should review the console and uerf output to find/fix
the problem.  Another is when a filesystem has been corrupted such that there
are pointers outside the partition in the structure.  Running fsck should find/
"fix" this sort of problem.

Another cause that has been mentioned from time to time is when a filesystem
completely fills a partition and the partition size isn't a multiple of some
magic number of blocks (I forget the exact excuse).  In this case when the
partition fills, dump tries to do a multi-block read of the last chunk of
data and fails because the multi-block region crosses the partition boundary.
If the block in error corresponds to one of the last blocks in the partition,
this might be your problem.
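
A quick way to check whether the failing read is near enough to the end of
the partition for this to apply is simple arithmetic, sketched below.  The
function name is made up for illustration, the 512-byte block size is an
assumption, and the partition size in the usage note is a hypothetical
figure; take the real one from the disk's partition table.

```python
SECTOR = 512  # ASSUMPTION: device block size in bytes


def crosses_partition_end(block, count, partition_sectors, sector=SECTOR):
    """True if a read of `count` bytes starting at `block` runs past the
    last block of a partition that is `partition_sectors` blocks long."""
    sectors_needed = -(-count // sector)  # ceiling division
    return block + sectors_needed > partition_sectors
```

With a hypothetical partition of 311200 blocks,
crosses_partition_end(69872, 8192, 311200) is false, so a read at block
69872 would be nowhere near the boundary; a read at block 311190 would
cross it.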

-- 
George Robbins - now working for,     uucp:   {uunet|pyramid|rutgers}!cbmvax!grr
but no way officially representing:   domain: grr@cbmvax.commodore.com
Commodore, Engineering Department     phone:  215-431-9349 (only by moonlite)

torek@elf.ee.lbl.gov (Chris Torek) (02/19/91)

>>In article <529@mesrx.UUCP> bbraden@mesrx.UUCP (Bill Braden) writes:
>>>  DUMP: (This should not happen)bread from /dev/rra0c [block 69872]:
>>>count=8192, got=-1

>In article <2528@autodesk.COM> stevel@Autodesk.COM (Steve Litras) writes:
>>... According to our Sun engineer, it is a fairly harmless problem
>>(soft error) ....

In article <18621@cbmvax.commodore.com> grr@cbmvax.commodore.com
(George Robbins) writes:
>Well, there's no way it's a "soft" error,

Depending on definitions and exact circumstances, it could be; but read on:

>the data that dump was trying to read isn't [read], and junk is written
>on the tape.  In some cases, it may be that the data is "don't care"...

>There seem to be several major causes for the problem.
>One is an actual read error [on the disk drive] ...

This is probably the most common cause.  Since many read errors can be
recovered simply by persistence, you may be able to get a good copy of
the file or block in question.  The drive should be repaired and/or the
bad sector forwarded.
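
"Persistence" here just means repeating the same read until it succeeds.
A minimal sketch of that idea follows; the function name, the number of
attempts and the delay are arbitrary choices for illustration, not anything
dump or the driver does.

```python
import os
import time


def read_with_retries(path, offset, count, attempts=5, delay=0.5):
    """Try the same read up to `attempts` times.

    Returns the data once a full `count` bytes come back, or None if
    every attempt fails or comes up short.
    """
    for attempt in range(attempts):
        if attempt:
            time.sleep(delay)  # pause between attempts, not before the first
        try:
            fd = os.open(path, os.O_RDONLY)
            try:
                os.lseek(fd, offset, os.SEEK_SET)
                data = os.read(fd, count)
            finally:
                os.close(fd)
        except OSError:
            continue  # read error: try again
        if len(data) == count:
            return data
    return None
```

For the block in this thread, and assuming 512-byte block units, the call
would be read_with_retries("/dev/rra0c", 69872 * 512, 8192).  A copy
recovered this way should be saved elsewhere before the drive is repaired.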

To find the name of the file, use icheck -b <block number> followed by
ncheck -i <inode number>.  The <block number> you need for icheck is the
number shown in square brackets (here 69872).  See icheck(8) and ncheck(8)
for details.

Note that the block may be the final block of a large file (one that no
longer ends in a fragment), in which case the data that dump could not
read may be irrelevant.  You should still fix the problem before the
file is extended in place.

>Another is when a filesystem has been corrupted such that there are
>pointers outside the partition in the structure.  Running fsck should
>find/"fix" this sort of problem.

Correct, unless the reason the file system appeared to be damaged was
a synchronization error caused by dumping a `live' file system (one that
is actively being modified).  In this case a second dump will not show
the error (hence it can be called `soft').

>Another cause that has been mentioned from time to time is when a filesystem
>completely fills a partition and the partition size isn't a multiple of some
>magic number of blocks (I forget the exact excuse).  In this case when the
>partition fills, dump tries to do a multi-block read of the last chunk of
>data and fails because the multi-block region crosses the partition boundary.
>If the block in error corresponds to one of the last blocks in the partition,
>this might be your problem.

Dump is (and has been since 4.2BSD; the comment is signed `mkm 9/25/83')
smart enough to `back off' after such an error; it will not complain
about `bread from %s [block %d]...'.  An over-eager kernel hacker might
have a driver that logs the attempt to read past the end of the partition,
but dump itself will recover.
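
The back-off behaviour described above can be sketched as below.  This is
an illustration of the idea only, not dump's source: read_at and fs_blocks
are hypothetical names, and the 512-byte block size is an assumption.

```python
DEV_BSIZE = 512  # ASSUMPTION: device block size in bytes


def read_backing_off(read_at, block, count, fs_blocks, bsize=DEV_BSIZE):
    """Read `count` bytes at `block`, shrinking the request one device
    block at a time while it overruns the end of the file system.

    `read_at(offset, nbytes)` is supplied by the caller and returns the
    bytes it could read (fewer than asked, or b"", on failure).
    Returns the data, or None when the read fails for some other reason,
    which is the case reported as "This should not happen".
    """
    while count > 0:
        data = read_at(block * bsize, count)
        if len(data) == count:
            return data
        if block + count // bsize > fs_blocks:
            count -= bsize  # ran past the end: ask for one block less
            continue
        return None  # failure inside the file system: a real error
    return None
```

A failure that survives this loop is therefore not the end-of-partition
case, which is why the message in the original post points to something
else.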

Along the same lines, fsck can fail on a block file system under 4.2BSD
if a partition size is not a multiple of 2048 (BLKDEV_IOSIZE) bytes.
This was fixed by the time of the 4.3BSD-tahoe distribution.
-- 
In-Real-Life: Chris Torek, Lawrence Berkeley Lab EE div (+1 415 486 5427)
Berkeley, CA		Domain:	torek@ee.lbl.gov