wolman@crl.dec.com (Alec Wolman) (05/01/91)
In article <1005@aerodec.anu.edu.au> tridge@aerodec.anu.edu.au (Andrew Tridgell) writes:
>I have recently struck a really weird file system problem under Ultrix
>4.0 on a DS3100. The root of the problem was a "CANNOT READ BLK :
>567664" reported from fsck. This prevented unattended rebooting as a

Don't run fsck on the block device (e.g. /dev/rz1g); run it only on the raw device (e.g. /dev/rrz1g). Your problem will then go away. There is nothing wrong with the disk (at least as far as fsck is concerned).

Alec
jtkohl@MIT.EDU (John T Kohl) (05/01/91)
In article <1005@aerodec.anu.edu.au> tridge@aerodec.anu.edu.au (Andrew Tridgell) writes:
> I have recently struck a really weird file system problem under Ultrix
> 4.0 on a DS3100. The root of the problem was a "CANNOT READ BLK :
> 567664" reported from fsck.
> :pg#567666:bg#8192:fg#1024:\
> Basically my questions to the net are :
> - what caused the problem in the first case

The symptoms you describe exactly match my experience from about 5 years ago with a 4.2BSD filesystem. The 4.xBSD fsck wants to read its disk blocks in groups of 4. The partition size you have (by default, unfortunately) is not a multiple of four. When reading the last two blocks, fsck gets a short read and gets confused. Normally this isn't a problem, EXCEPT when there's something (like a directory block) allocated in the leftover sectors at the end of the disk. You probably have a fairly full partition, which ensures those blocks get used.

> - how can dump/restore or tar transfer a bad block between drives?

This also explains why the problem apparently "followed" the data to the other disks, and why it went away when you changed the partition size to an integral multiple of 4.

> BUT (there has to be a but)
> I now have an inconsistency between the partition table and the file
> system. The partition table thinks there are 567666 sectors, the file
> system thinks there are 567660. Could this cause a problem in the future?

It should not cause problems unless you do another newfs(8) on that filesystem without adjusting the partition size. You may want to use chpt(8) to adjust the system's idea of the filesystem size to avoid this in the future.

> - have I now got a time bomb waiting to go off?

Not as far as I can determine from your descriptions.

--
John Kohl <jtkohl@MIT.EDU>
Digital Equipment Corporation/Project Athena
(The above opinions are MINE. Don't put my words in somebody else's mouth!)
___This signature printed on recycled bits___ [not original; heard 2nd hand]
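The short-read arithmetic in the reply above can be sketched in a few lines of Python. This is only an illustration of the reasoning, not fsck's actual code: the function name `trailing_group` is hypothetical, and the sketch assumes the partition sizes from the posted disktab are in the same units as the "groups of 4" described above.

```python
# Illustration of the "groups of 4" reasoning; NOT fsck's real implementation.
GROUP = 4  # group size as stated in the post (assumed to share units with the sizes)


def trailing_group(size, group=GROUP):
    """Return (first block of the last group, number of blocks that exist in it)."""
    leftover = size % group
    if leftover == 0:
        # The size is a multiple of the group size: the last group is complete.
        return size - group, group
    # Otherwise the last group is cut off by the end of the partition.
    return size - leftover, leftover


for size in (567666, 567660):
    start, available = trailing_group(size)
    status = "complete" if available == GROUP else "short read"
    print(f"size {size}: last group starts at {start}, "
          f"{available} of {GROUP} blocks exist ({status})")
```

With the default size of 567666 the last group starts at 567664 and only 2 of its 4 blocks exist, which matches the block number in the fsck message. With 567660 the last group is complete.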
tridge@aerodec.anu.edu.au (Andrew Tridgell) (05/01/91)
I have recently struck a really weird file system problem under Ultrix 4.0 on a DS3100. The root of the problem was a "CANNOT READ BLK : 567664" reported from fsck. This prevented unattended rebooting, as a reboot required manual operation of fsck, which is a real pain. It also meant that whenever the file at 567664 was accessed I got a kernel panic! For example, an "ls -l" in the directory would cause a panic.

We have an rz56 disk (600Mb) with /dev/rz1a as root, /dev/rz1g as /usr and /dev/rz1h as /u. Here is the disktab for an rz56 in case you don't have one:

    rz56|RZ56|DEC RZ56 Winchester:\
        :ty=winchester:ns#54:nt#15:nc#1632:\
        :pa#32768:ba#8192:fa#1024:\
        :pb#131072:bb#4096:fb#1024:\
        :pc#1299174:bc#8192:fc#1024:\
        :pd#292530:bd#8192:fd#1024:\
        :pe#292530:be#8192:fe#1024:\
        :pf#550274:bf#8192:ff#1024:\
        :pg#567666:bg#8192:fg#1024:\
        :ph#567668:bh#8192:fh#1024:

Read on for more weirdness:

The problem started when I moved a few files from /usr to /u. Soon afterwards I noticed that doing "ls -l" in a certain subdirectory I'd been moving files from caused a panic (it was a subdirectory of /usr/include, I think). I moved this directory to /usr/BAD and removed all read and write permissions from it to prevent users crashing the system. Next time we rebooted I got this:

    ** /dev/rz1g
    ** Last Mounted on /usr
    ** Phase 1 - Check Blocks and Sizes
    ** Phase 2 - Check Pathnames
    CANNOT READ: BLK 567664
    CONTINUE?

Continuing did no good. At this stage a check in the errorlog showed nothing. I tried using rzdisk to reassign the block, with no luck: rzdisk said the block was OK and asked if I would like to continue anyway. I tried it both ways with no change to the situation.

Next I got DEC to lend us another identical brand-new disk.
With both mounted side by side I did a newfs on the new disk (all 3 partitions), then with the new disk's g partition mounted as /nusr I did this:

    dump 0f - /usr | (cd /nusr ; restore xf - )

to transfer everything to the new disk. (I repeated this for the a and h partitions, on /nroot and /nu.)

It didn't work! Everything transferred OK but fsck reported the same problem on the new g partition. Thus dump/restore had taken the problem with it to the new disk. This was a real surprise to me. It showed the problem was not hardware but was in fact software.

My next step was to use tar to transfer instead of dump/restore, thinking that tar doesn't save any inode info but only transfers files and permissions. I did this:

    newfs /dev/rrz3g rz56
    fsck /dev/rz3g    #(which reported no problems)
    (cd /usr ; tar -cf - . ) | ( cd /nusr ; tar -xf - )

Once again the problem was transferred to the new disk! A fsck on the new disk reported the same error as above. Now I was really confused. If it was a file system error then how did tar transfer it? Tar only knows about things like filenames, ownership and permissions, yet fsck reported the same block number (567664)?

Now I got really desperate. I ran newfs with the -v option on /dev/rrz3g so I could see how it was calling mkfs, then I ran mkfs manually with 6 fewer sectors. That is, I changed the second parameter in the call to mkfs from 567666 to 567660. The bad block is thus excluded from the file system. I then used tar as above to try yet again to transfer the stuff.

Success! fsck reports no problems on the new g partition!

BUT (there has to be a but)

I now have an inconsistency between the partition table and the file system. The partition table thinks there are 567666 sectors, the file system thinks there are 567660. Could this cause a problem in the future? I asked DEC support and they don't think so, but maybe......

Basically my questions to the net are :
 - what caused the problem in the first case
 - how can dump/restore or tar transfer a bad block between drives?
 - have I now got a time bomb waiting to go off?

I have the second drive for 2 days. If you ask me to experiment with something then it must be BEFORE the weekend. After that I'm back to 1 drive and I am not willing to try anything esoteric.

Thanks!

--
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Andrew Tridgell                 CSLab, Research School of Physical Science
tridge@aerodec.anu.edu.au       Australian National University
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
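
For anyone wanting to check the other partitions in the posted rz56 entry against the "multiple of 4" explanation given in John Kohl's reply, here is a rough Python sketch. The helper names `partition_sizes` and `misaligned` are made up for illustration, the parsing only handles the `pX#N` size fields of this one entry, and the group size of 4 is taken from that reply rather than verified against fsck itself.

```python
import re

# The rz56 disktab entry exactly as posted in this thread.
DISKTAB = r"""
rz56|RZ56|DEC RZ56 Winchester:\
    :ty=winchester:ns#54:nt#15:nc#1632:\
    :pa#32768:ba#8192:fa#1024:\
    :pb#131072:bb#4096:fb#1024:\
    :pc#1299174:bc#8192:fc#1024:\
    :pd#292530:bd#8192:fd#1024:\
    :pe#292530:be#8192:fe#1024:\
    :pf#550274:bf#8192:ff#1024:\
    :pg#567666:bg#8192:fg#1024:\
    :ph#567668:bh#8192:fh#1024:
"""


def partition_sizes(entry):
    """Map partition letter -> size, taken from the ':pX#N' fields."""
    return {m.group(1): int(m.group(2))
            for m in re.finditer(r":p([a-h])#(\d+)", entry)}


def misaligned(sizes, group=4):
    """Return {partition: leftover} for sizes that are not a multiple of group."""
    return {part: size % group
            for part, size in sizes.items() if size % group}


sizes = partition_sizes(DISKTAB)
for part, leftover in sorted(misaligned(sizes).items()):
    rounded = sizes[part] - leftover
    print(f"partition {part}: size {sizes[part]} leaves {leftover} over; "
          f"nearest multiple of 4 below is {rounded}")
```

By this check the a, b and h partitions are multiples of 4, while c, d, e, f and g each leave 2 over. The 567660 passed to mkfs is a multiple of 4, as is 567664.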