[comp.sys.sun] SunOS 4.1 multi-user dump causes crashes

fuat@cunixf.cc.columbia.edu (Fuat C. Baran) (06/02/90)

Last weekend we upgraded our Sun-4/280's from SunOS 4.0.1 to SunOS 4.1.
Since then they have been crashing (panic: writeback error) every time we
try to backup the disks in multi-user mode using dump (our usual procedure
for daily and weekly backups).  Backups are done to a 1/2 inch tape drive
on a Xylogics 472 tape controller.

The systems crash with the GENERIC kernel as well as a custom config'ed
kernel.  Hardware configuration is 4/280 with rev 26 CPU's (PROM 3.0) (as
well as a rev 22 CPU with PROM 2.8.4 and a rev 14 CPU with PROM 1.7), three
16 Mb memory boards, one ALM-II, one Xylogics 472 tape controller with one
tape drive, one Xylogics 450/451(?) controller, 2 Hitachi DK815-10 drives.
Most of the time the system hangs after the panic, though once we were
able to get a core dump.

Output on the console at crash time is (addresses vary slightly):

Memory Error Register 1d4<INTR,INTENA,CE_ENA,WBACKERR>
DVMA=1, context=0, virtual address=fff3cfc0
pme=0, physical address=fc0
panic: writeback error
syncing file system...  {at this point it hangs and we have to reset
			 from the cpu board, though in one of the 20
                         or so crashes it saved a core image}

stack backtrace of the vmcore file shows:

_panic(0xf80d1272,0x0,0x1bdc,0xfff3fbdc,0x0,0xf80bcf20) + 6c
_ecc_error(0xffff6004,0xf80a3120,0xc000,0xf80e86f0,0x0,0xf80d1272) + 1c4
_memerr(0x0,0x0,0xffff8000,0x1f0,0xc0,0xd4) + 80
memory_err(?)
_splx(0xf817fc74,0xff005f74,0xff005f74,0x0,0x1,0x64c000) + 14
_hat_pagesync(?)
_page_sortadd(0xf81c4d84,0xf817fc9c,0x80,0x0,0x566000,0xf817fbd4) + 1c8
_pvn_getdirty(0xf817fc9c,0xf81c4d84,0x0,0x12000,0x566000,0xff005f74) + 29c
_pvn_vplist_dirty(0xff005f74,0x0,0x100,0x0,0xf817fcc4,0xf817fc9c) + 110
_spec_putpage(0xff005f74,0x0,0x0,0x100,0x0,0xf8128348) + 1dc
_spec_sync(0x0,0xf80cab90,0xf80cb850,0xf80de9d8,0xff005f70,0xff0fd234) + 98
_sync(0xf81c4fe0,0x120,0xf80c85f8,0xf80c8718,0xf81c5000,0xf80cab48) + 3c
_syscall(0xf81c5000) + 3b4
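
[Editorial sketch, not part of the original post.]  The trace above lists the
innermost frame first, so the path from the system call down to the panic is
easier to follow when reversed.  The short Python sketch below does only that:
it pulls the function names out of the trace lines as posted and prints them
outermost-first.  The names `BACKTRACE`, `FRAME_RE` and `call_chain` are
illustrative, and the script makes no claim about what each kernel routine
does.

```python
import re

# Backtrace lines copied verbatim from the post (innermost frame first).
BACKTRACE = """\
_panic(0xf80d1272,0x0,0x1bdc,0xfff3fbdc,0x0,0xf80bcf20) + 6c
_ecc_error(0xffff6004,0xf80a3120,0xc000,0xf80e86f0,0x0,0xf80d1272) + 1c4
_memerr(0x0,0x0,0xffff8000,0x1f0,0xc0,0xd4) + 80
memory_err(?)
_splx(0xf817fc74,0xff005f74,0xff005f74,0x0,0x1,0x64c000) + 14
_hat_pagesync(?)
_page_sortadd(0xf81c4d84,0xf817fc9c,0x80,0x0,0x566000,0xf817fbd4) + 1c8
_pvn_getdirty(0xf817fc9c,0xf81c4d84,0x0,0x12000,0x566000,0xff005f74) + 29c
_pvn_vplist_dirty(0xff005f74,0x0,0x100,0x0,0xf817fcc4,0xf817fc9c) + 110
_spec_putpage(0xff005f74,0x0,0x0,0x100,0x0,0xf8128348) + 1dc
_spec_sync(0x0,0xf80cab90,0xf80cb850,0xf80de9d8,0xff005f70,0xff0fd234) + 98
_sync(0xf81c4fe0,0x120,0xf80c85f8,0xf80c8718,0xf81c5000,0xf80cab48) + 3c
_syscall(0xf81c5000) + 3b4
"""

# A frame line starts with an optional leading underscore, then the
# symbol name, then an opening parenthesis.
FRAME_RE = re.compile(r"^_?(\w+)\(")


def call_chain(trace):
    """Return the function names in the trace, outermost caller first."""
    names = []
    for line in trace.splitlines():
        match = FRAME_RE.match(line.strip())
        if match:
            names.append(match.group(1))
    names.reverse()  # the trace is printed innermost-first
    return names


if __name__ == "__main__":
    print(" -> ".join(call_chain(BACKTRACE)))
```

Read this way, the chain starts at syscall and sync, passes through spec_sync,
spec_putpage and the pvn_* routines, and ends in memerr, ecc_error and panic.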

Since our summer semester started on Tuesday, we haven't had the
opportunity to do exhaustive tests such as single-user vs. multi-user, tar
vs dump, remote dumps, etc., though we have used rdump on our Encore
Multimax systems to back them up onto the Sun tape drives successfully.

Sun software support is currently "working on it".  We made enough of a
fuss so they have given it "high priority".  The first response I got was
"All dumps have to be done single-user, and multi-user dumps are not
supported.  If you want, we can design a custom program to do it, though
you'll have to contract us to develop it," though they retracted this when
I asked for that statement in writing.  Since it crashes the OS, it is a
bug regardless of what is and isn't supported in the application, and they
have finally begun to look into it.  So far they haven't gotten back to me
with an analysis, a fix, or an estimate of how long either will take.

Does anyone else have a similarly configured system running SunOS 4.1?
Can you do backups with the system multi-user?

Does anyone have any ideas as to what the problem is?  We currently are
forced to take the systems standalone to do backups.  Needless to say,
these machines are in constant use 24 hours a day by students working on
homework, and they don't appreciate a 2-3 hour interruption of service for
backups, no matter what time of day or night we schedule it for.  One
other alternative is to give up and downgrade to SunOS 4.0.1, which for
the most part worked (ignoring such things as NFS bugs, VNODE hangs,
etc.)...

Any help, suggestions, or reports of similar occurrences would be
appreciated.

Internet: fuat@columbia.edu         U.S. MAIL: Columbia University
BITNET: fuat@cunixf                            Center for Computing Activities
UUCP: ...!rutgers!columbia!cunixf!fuat         712 Watson Labs, 612 W115th St.
Phone: (212) 854-5128    Fax: (212) 662-6442   New York, NY 10025

fuat@cunixf.cc.columbia.edu (Fuat C. Baran) (08/09/90)

Summary [you can skip to the end if you already know the story]:

25-May-90:
    
Upgrade from SunOS 4.0.1 to SunOS 4.1 on Sun-4/280's (with 1 ALM-II, 2
Hitachi disks on a Xylogics 451 controller, 1 tape drive on a Xylogics 472
controller, two 8 Mb and one 32 Mb memory boards).  During the first
post-upgrade multi-user (logins disabled) full dump, the system crashed with:

    Memory Error Register 1d4<INTR,INTENA,CE_ENA,WBACKERR>
    DVMA=1, context=0, virtual address=fff3cfc0
    pme=0, physical address=fc0
    panic: writeback error
    syncing file system...  {at this point it hangs and we have to reset
                             from the cpu board, though in one of the 20
                             or so crashes it saved a core image}

1-Jun-90:

My first message to sun-spots/sun-managers.  Got a few responses
describing similar occurrences, but no suggested solution worked.

20-Jun-90:

Frustrated by Sun's lack of responsiveness in looking into the problem
(hardware support people worked hard, swapping boards, building test
systems, etc. despite their suspicions that the problem was software
related), I posted my second message to sun-spots/sun-managers, and
received even more reports of similar problems, including one other site
that received a similar brush-off ("multi-user dumps aren't supported").

31-Jul-90: 

After repeated calls to Sun and getting various managers involved and
having the problem "escalated" even further, the problem was finally
identified.

**********************************************************************

Fix:

Remove from /etc/fstab the line:

	/dev/xy0b	swap	swap	rw	0 0

Apparently in SunOS 4.1, if you have an fstab entry for the default swap
partition, then when you go multi-user and run swapon(8) the default swap
gets added again.  This eventually leads to the kernel crashing when dump
runs and causes the system to swap.  This is an unconfirmed theory (we are
still waiting for our sources), but removing the fstab entry stopped the
system from crashing.  We are now back to daily multi-user incremental
dumps on our systems.  Now all we have to do is get one of our machines,
whose disk got trashed when a faulty disk controller was swapped in during
one of numerous experiments, back into full service.
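
[Editorial sketch, not part of the original post.]  For anyone who wants to
try the same workaround on several machines, the edit can be scripted.  The
Python sketch below is only an illustration under stated assumptions: it works
on fstab text passed in as a string, it comments the swap line out rather than
deleting it so the change is easy to revert, and the helper name
`comment_out_swap` and the sample root entry are hypothetical.  Only the
/dev/xy0b line is taken from the post.  It does not test the theory about
swapon(8) described above.

```python
def comment_out_swap(fstab_text, device):
    """Return fstab_text with the swap entry for `device` commented out.

    A line is treated as the swap entry when its first field is `device`
    and its third field (the filesystem type) is "swap".  Lines that are
    already comments are left alone.
    """
    out = []
    for line in fstab_text.splitlines():
        fields = line.split()
        already_comment = line.lstrip().startswith("#")
        if (not already_comment and len(fields) >= 3
                and fields[0] == device and fields[2] == "swap"):
            out.append("#" + line)
        else:
            out.append(line)
    return "\n".join(out) + "\n"


# Sample input: the swap line is the one quoted in the post; the root
# entry is a made-up example so there is something to leave untouched.
SAMPLE_FSTAB = (
    "/dev/xy0a\t/\t4.2\trw\t1 1\n"
    "/dev/xy0b\tswap\tswap\trw\t0 0\n"
)

if __name__ == "__main__":
    print(comment_out_swap(SAMPLE_FSTAB, "/dev/xy0b"), end="")
```

On a real system one would read /etc/fstab, keep a backup copy, and write the
result back by hand after checking it; the sketch deliberately leaves that
step out.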

Thanks to everyone who responded with suggestions and reports of similar
occurrences.  It helped put the pressure on Sun to get them to look at the
problem seriously.

	--Fuat

Internet: fuat@columbia.edu          U.S. MAIL: Columbia University
 BITNET: fuat@cunixf                           Center for Computing Activities
   UUCP: ...!rutgers!columbia!cunixf!fuat      712 Watson Labs, 612 W115th St.
   Phone: (212) 854-5128  Fax: (212) 662-6442   New York, NY 10025

steve@maths.warwick.ac.uk (Steve Rumsby) (08/13/90)

> Apparently in SunOS 4.1, if you have an fstab entry for the default swap
> partition, then when you go multi-user and run swapon(8) the default swap
> gets added again.  This eventually leads to the kernel crashing when dump
> runs and causes the system to swap.  This is an unconfirmed theory (we are
> still waiting for our sources), but removing the fstab entry stopped the
> system from crashing. 

No, I don't believe this! There are lots of machines around here running
with the default swap partitions mentioned in /etc/fstab. This has never
caused a problem. Surely this is standard practice? I've always done it
this way. Yes, the machines are running 4.1, and yes, I do dumps while
running multi-user. If Sun had broken this (and I guess I wouldn't be
*completely* surprised) there would have been many more people suffering
because of it.

It is kind of curious that removing the fstab line stopped the crashing
though...

UUCP:	 ...!ukc!warwick!steve		Internet: steve@maths.warwick.ac.uk
JANET:	 steve@uk.ac.warwick.maths	PHONE:	 +44 203 524657