gordon@cs.UAlberta.CA (Gordon Atwood) (01/17/90)
I have encountered an interesting problem on the MIPS machine. I would be
most interested if anyone out there can provide an explanation and/or
solution.

We have two machines in our department, an m1000 and a recently installed
m120-5. The former was converted to RISC/os 4.0 (Sys V mode) several months
ago, while the latter (the m120-5) was simply brought up with RISC/os 4.0
(Sys V mode).

Both machines experience what appears to be an identical problem. They both
crash when a filesystem is dumped using the raw device name.

As you will recall, the disks can be referenced in either block device or
character (raw) device form. The latter allows for uninterrupted I/O and
bypasses the filesystem software.

I noticed that the new Sys V flavor didn't provide raw device names for the
filesystems and that the supplied dump program (/etc/dump.ffs) used the block
device. There is, however, no reason that I can think of for not using the
raw device, so I tried the dump program with a raw device filesystem name.
The machine promptly crashed (an orderly shutdown and reboot). This happens
on both the m1000 and the m120-5.

Even more interesting is that when the m1000 was running the BSD 4.3 OS, the
corresponding dump program was quite happy with both the block and the raw
device filesystems.

The only other clue that I can provide is the error message which appears on
the console just prior to the crash. The message was

    "assertion failed !pg_ismod(pd), file: fault.c line 284"

I have two suspicions:

1) Since the filesystem is active, perhaps the OS is detecting that a disk
   block has been read which has already been modified in real memory
   (i.e. flagged as modified).

2) The raw device I/O has a bug.

Any thoughts would be gratefully received.

Gordon Atwood
Programmer/Analyst
Department of Computing Science
University of Alberta
rogerk@mips.COM (Roger B.A. Klorese) (01/18/90)
In article <1990Jan16.220024.2485@alberta.uucp> gordon@cs.UAlberta.CA (Gordon Atwood) writes:
>Both machines experience what appears to be an identical problem. They both
>crash when a filesystem is dumped using the raw device name.

>I noticed that the new Sys V flavor didn't provide raw device names
>for the filesystems...

This is not true. The devices merely follow the System V organization for
devices. /dev/dsk contains the block devices, and /dev/rdsk contains the
raw devices.

>Even more interesting is that when the m1000 was running the BSD 4.3 OS
>the corresponding dump program was quite happy with both the block and the
>raw device filesystems.

That's because these are totally different ports and kernels. We didn't
build RISC/os from UMIPS-BSD. We built it from UMIPS-V and re-ported many
BSD commands, system calls, etc. to it.

>The only other clue that I can provide is the error message which appears
>on the console just prior to the crash. The message was
> "assertion failed !pg_ismod(pd), file: fault.c line 284"

Thanks for this piece of information. It's an *extremely* important one.
Please be sure to report anything you think could even possibly be remotely
relevant to a problem; often it's not clear where the real clue is.

>I have two suspicions: 1) Since the filesystem is active, perhaps the OS
>is detecting that a disk block has been read which has already been modified
>in real memory (i.e. flagged as modified).
>
>2) The raw device I/O has a bug.
>
>Any thoughts would be gratefully received.

OK, the answer is:

3) In 4.0-based releases, a bug was introduced which could cause the
in-memory page tables and the TLB to become inconsistent under some
circumstances. The two most visible symptoms are either the assertion
failure on !pg_ismod (in a few different code locations) or hangs of 5 to 15
minutes' duration where the system "goes catatonic" (no apparent action at
all) and spontaneously revives.
It does *not*, however, produce permanent or hidden effects such as broken
filesystems or incorrect calculations.

This will be fixed in the forthcoming release 4.50. In addition, the fix
will be provided in a "patch release" now being tested, which should become
available in two to three weeks.

Since patch releases do not undergo the same full QA cycle as major and
minor releases, we prefer that users who are not strongly inconvenienced by
the problems addressed in the patch release not install it, as there is
always the possibility that they will turn up some code regression that our
testing did not detect. So we are asking that people who are getting
unavoidable panics from this problem, or the hangs, install the patch, but
that users whose problems have a workaround (such as using the block device
for dump) wait for 4.50.

By the way, not that we're hiding this issue, but if you have a support
contract for RISC/os, you probably would have gotten a more direct and
immediate response from the CRC at 1-800-443-MIPS.

-- 
ROGER B.A. KLORESE      MIPS Computer Systems, Inc.      phone: +1 408 720-2939
928 E. Arques Ave.  Sunnyvale, CA  94086                        rogerk@mips.COM
{ames,decwrl,pyramid}!mips!rogerk
"Two guys, one cart, fresh pasta... *you* figure it out." -- Suzanne Sugarbaker
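
The block/raw distinction and the /dev/dsk and /dev/rdsk layout discussed in
this thread can be illustrated in a few lines. This is a minimal sketch in
Python, not anything shipped with RISC/os. The function names and the device
name "example0" are hypothetical, and it assumes that a raw device under
/dev/rdsk has a block counterpart under /dev/dsk with the same leaf name,
which is the usual System V convention but is not spelled out above.

```python
import os
import stat


def device_kind(mode):
    """Classify a st_mode value as 'block', 'raw' (character), or 'other'."""
    if stat.S_ISBLK(mode):
        return "block"
    if stat.S_ISCHR(mode):
        return "raw"
    return "other"


def kind_of_path(path):
    """Classify an existing path by asking the OS for its file type."""
    return device_kind(os.stat(path).st_mode)


def block_name_for(path):
    """Map a raw-device path under /dev/rdsk to the matching /dev/dsk path.

    Assumption: both directories use the same leaf names. Paths that are
    not under /dev/rdsk are returned unchanged.
    """
    prefix = "/dev/rdsk/"
    if path.startswith(prefix):
        return "/dev/dsk/" + path[len(prefix):]
    return path


if __name__ == "__main__":
    # "example0" is a made-up leaf name used only for illustration.
    print(block_name_for("/dev/rdsk/example0"))
```

A wrapper around dump could call block_name_for() on its argument while the
workaround (dumping via the block device) is in force, and kind_of_path() to
confirm which kind of device a given name really is.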