[comp.sys.mips] Dump program causes system crash

gordon@cs.UAlberta.CA (Gordon Atwood) (01/17/90)

I have encountered an interesting problem on the mips machine.  I would be
most interested if anyone out there can provide an explanation and/or
solution.  We have two machines in our department, an m1000 and a recently
installed m120-5.  The former was converted to RISC/os 4.0 (Sys V mode)
several months ago while the latter (the m120-5) was simply brought up with
RISC/os 4.0 (Sys V mode).

Both machines experience what appears to be an identical problem.  They both
crash when a filesystem is dumped using the raw device name.

As you will recall the disks can be referenced in either block device or
character (raw) device form.  The latter allows for uninterrupted I/O and
bypasses the filesystem software.  
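
To make the block/raw distinction concrete, here is a minimal sketch of how a script could tell the two forms apart from a file's mode bits. It assumes a POSIX system with Python available; the helper names `device_kind` and `path_kind` are hypothetical and are not part of RISC/os.

```python
import os
import stat


def device_kind(mode):
    """Classify an st_mode value as 'block', 'raw' (character), or 'other'."""
    if stat.S_ISBLK(mode):
        return "block"
    if stat.S_ISCHR(mode):
        return "raw"
    return "other"


def path_kind(path):
    """Classify an existing path by looking at its mode bits (hypothetical helper)."""
    return device_kind(os.stat(path).st_mode)
```

Calling `path_kind` on a device node reports which of the two forms it is; a regular file or directory reports "other".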

I noticed that the new Sys V flavor didn't provide raw device names
for the filesystems and that the supplied dump program (/etc/dump.ffs) used
the block device.  There is, however, no reason that I can think of for not
using the raw device, so I tried the dump program with a raw device
filesystem name.  The machine promptly crashed (an orderly shutdown and
reboot).  This happens on both the m1000 and the m120-5.

Even more interesting is that when the m1000 was running the BSD 4.3 OS
the corresponding dump program was quite happy with both the block and the
raw device filesystems.

The only other clue that I can provide is the error message which appears
on the console just prior to the crash.  The message was
     "assertion failed !pg_ismod(pd), file:  fault.c line 284"

I have two suspicions:  1)  Since the filesystem is active, perhaps the OS
is detecting that a disk block has been read that has already been modified
in real memory (i.e. flagged as modified).

2) The raw device I/O has a bug.

Any thoughts would be gratefully received.

Gordon Atwood
Programmer/Analyst
Department of Computing Science
University of Alberta

rogerk@mips.COM (Roger B.A. Klorese) (01/18/90)

In article <1990Jan16.220024.2485@alberta.uucp> gordon@cs.UAlberta.CA (Gordon Atwood) writes:
>Both machines experience what appears to be an identical problem.  They both
>crash when a filesystem is dumped using the raw device name.
>I noticed that the new Sys V flavor didn't provide raw device names
>for the filesystems...

This is not true.  The devices merely follow the System V organization for
devices.  /dev/dsk contains the block devices, and /dev/rdsk contains the
raw devices.
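
As an illustration of that layout, here is a minimal sketch that derives the raw device name from a block device name. It assumes the raw node has the same file name under /dev/rdsk as the block node has under /dev/dsk, which should be checked on the actual system. The helper `raw_name` and the device name `example0` are hypothetical.

```python
import posixpath

BLOCK_DIR = "/dev/dsk"   # block devices (System V layout, per the text above)
RAW_DIR = "/dev/rdsk"    # raw (character) devices


def raw_name(block_path):
    """Map /dev/dsk/<name> to /dev/rdsk/<name>.

    Assumes both directories use the same file names; this is an
    illustrative assumption, not a documented guarantee.
    """
    directory, name = posixpath.split(block_path)
    if directory != BLOCK_DIR or not name:
        raise ValueError("not a path under %s: %r" % (BLOCK_DIR, block_path))
    return posixpath.join(RAW_DIR, name)
```

For example, `raw_name("/dev/dsk/example0")` yields the corresponding path under /dev/rdsk, and any path outside /dev/dsk is rejected rather than guessed at.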

>Even more interesting is that when the m1000 was running the BSD 4.3 OS
>the corresponding dump program was quite happy with both the block and the
>raw device filesystems.

That's because these are totally different ports and kernels.  We didn't
build RISC/os from UMIPS-BSD.  We built it from UMIPS-V and re-ported many
BSD commands, system calls, etc. to it.

>The only other clue that I can provide is the error message which appears
>on the console just prior to the crash.  The message was
>     "assertion failed !pg_ismod(pd), file:  fault.c line 284"

Thanks for this piece of information.  It's an *extremely* important one.
Please be sure to report anything you think could even possibly be remotely
relevant to a problem; often it's not clear where the real clue is.

>I have two suspicions:  1)  Since the filesystem is active, perhaps the OS
>is detecting that a disk block has been read that has already been modified
>in real memory (i.e. flagged as modified).
>
>2) The raw device I/O has a bug.
>
>Any thoughts would be gratefully received.

OK, the answer is:

3) 4.0-based releases introduced a bug that can leave the in-memory page
   tables and the TLB inconsistent under some circumstances.  The two most
   visible symptoms are the assertion failure on !pg_ismod (in a few
   different code locations) and hangs of 5 to 15 minutes' duration where
   the system "goes catatonic" (no apparent action at all) and then
   spontaneously revives.  It does *not*, however, produce permanent or
   hidden effects such as broken filesystems or incorrect calculations.
   The fix will be in the forthcoming release 4.50.  It will also be
   provided in a "patch release" now being tested, which should become
   available in two to three weeks.

   Since patch releases do not undergo the same full QA cycle as major and
   minor releases, we prefer that users who are not strongly inconvenienced
   by the problems addressed in the patch release not install it, as there
   is always the possibility that they will turn up some code regression
   that our testing did not detect.  So we are asking that people who are
   getting unavoidable panics or hangs from this problem install the patch,
   but that users whose problems have a workaround (such as using the block
   device for dump) wait for 4.50.

   By the way, not that we're hiding this issue, but if you have a support
   contract for RISC/os, you probably would have gotten a more direct and
   immediate response from the CRC at 1-800-443-MIPS.
-- 
ROGER B.A. KLORESE      MIPS Computer Systems, Inc.      phone: +1 408 720-2939
928 E. Arques Ave.  Sunnyvale, CA  94086                        rogerk@mips.COM
{ames,decwrl,pyramid}!mips!rogerk
"Two guys, one cart, fresh pasta... *you* figure it out." -- Suzanne Sugarbaker