fuat@cunixf.cc.columbia.edu (Fuat C. Baran) (06/02/90)
Last weekend we upgraded our Sun-4/280's from SunOS 4.0.1 to SunOS 4.1. Since then they have been crashing (panic: writeback error) every time we try to back up the disks in multi-user mode using dump (our usual procedure for daily and weekly backups). Backups are done to a 1/2 inch tape drive on a Xylogics 472 tape controller. The systems crash with the GENERIC kernel as well as a custom config'ed kernel.

Hardware configuration is 4/280 with rev 26 CPU's (PROM 3.0) (as well as a rev 22 CPU with PROM 2.8.4 and a rev 14 CPU with PROM 1.7), three 16 Mb memory boards, one ALM-II, one Xylogics 472 tape controller with one tape drive, one Xylogics 450/451(?) controller, two Hitachi DK815-10 drives.

Most of the time the system hangs after the panic, though once we were able to get a core dump. Output on the console at crash time is (addresses vary slightly):

    Memory Error Register 1d4<INTR,INTENA,CE_ENA,WBACKERR>
    DVMA=1, context=0, virtual address=fff3cfc0
    pme=0, physical address=fc0
    panic: writeback error
    syncing file system...

{at this point it hangs and we have to reset from the cpu board, though in one of the 20 or so crashes it saved a core image}

A stack backtrace of the vmcore file shows:

    _panic(0xf80d1272,0x0,0x1bdc,0xfff3fbdc,0x0,0xf80bcf20) + 6c
    _ecc_error(0xffff6004,0xf80a3120,0xc000,0xf80e86f0,0x0,0xf80d1272) + 1c4
    _memerr(0x0,0x0,0xffff8000,0x1f0,0xc0,0xd4) + 80
    memory_err(?)
    _splx(0xf817fc74,0xff005f74,0xff005f74,0x0,0x1,0x64c000) + 14
    _hat_pagesync(?)
    _page_sortadd(0xf81c4d84,0xf817fc9c,0x80,0x0,0x566000,0xf817fbd4) + 1c8
    _pvn_getdirty(0xf817fc9c,0xf81c4d84,0x0,0x12000,0x566000,0xff005f74) + 29c
    _pvn_vplist_dirty(0xff005f74,0x0,0x100,0x0,0xf817fcc4,0xf817fc9c) + 110
    _spec_putpage(0xff005f74,0x0,0x0,0x100,0x0,0xf8128348) + 1dc
    _spec_sync(0x0,0xf80cab90,0xf80cb850,0xf80de9d8,0xff005f70,0xff0fd234) + 98
    _sync(0xf81c4fe0,0x120,0xf80c85f8,0xf80c8718,0xf81c5000,0xf80cab48) + 3c
    _syscall(0xf81c5000) + 3b4

Since our summer semester started on Tuesday, we haven't had the opportunity to do exhaustive tests such as single-user vs. multi-user, tar vs. dump, remote dumps, etc., though we have used rdump on our Encore Multimax systems to back them up onto the Sun tape drives successfully.

Sun software support is currently "working on it". We made enough of a fuss that they have given it "high priority". The first response I got was "All dumps have to be done single-user, and multi-user dumps are not supported. If you want, we can design a custom program to do it, though you'll have to contract us to develop it," though they retracted this when I asked for that statement in writing. Since it crashes the OS, it is a bug regardless of what is and isn't supported in the application, and they have finally begun to look into it. So far they haven't gotten back to me with an analysis, a fix, or an estimate of how long either will take.

Does anyone else have a similarly configured system running SunOS 4.1? Can you do backups with the system multi-user? Does anyone have any ideas as to what the problem is?

We currently are forced to take the systems standalone to do backups. Needless to say, these machines are in constant use 24 hours a day by students working on homework, and they don't appreciate a 2-3 hour interruption of service for backups, no matter what time of day or night we schedule it for.
One other alternative is to give up and downgrade to SunOS 4.0.1, which for the most part worked (ignoring such things as NFS bugs, VNODE hangs, etc.)...

Any help, suggestions, or reports of similar occurrences would be appreciated.

Internet: fuat@columbia.edu                U.S. MAIL: Columbia University
BITNET:   fuat@cunixf                      Center for Computing Activities
UUCP:     ...!rutgers!columbia!cunixf!fuat 712 Watson Labs, 612 W115th St.
Phone: (212) 854-5128  Fax: (212) 662-6442 New York, NY 10025
fuat@cunixf.cc.columbia.edu (Fuat C. Baran) (08/09/90)
Summary [you can skip to the end if you already know the story]:

25-May-90: Upgrade from SunOS 4.0.1 to SunOS 4.1 on Sun-4/280's (with one ALM-II, two Hitachi disks on a Xylogics 451 controller, one tape drive on a Xylogics 472 controller, two 8 Mb and one 32 Mb memory boards). During the first post-upgrade multi-user (logins disabled) full dump, the system crashed with:

    Memory Error Register 1d4<INTR,INTENA,CE_ENA,WBACKERR>
    DVMA=1, context=0, virtual address=fff3cfc0
    pme=0, physical address=fc0
    panic: writeback error
    syncing file system...

{at this point it hangs and we have to reset from the cpu board, though in one of the 20 or so crashes it saved a core image}

1-Jun-90: My first message to sun-spots/sun-managers. Got a few responses describing similar occurrences, but no suggested solution worked.

20-Jun-90: Frustrated by Sun's lack of responsiveness in looking into the problem (hardware support people worked hard, swapping boards, building test systems, etc. despite their suspicions that the problem was software related), I posted my second message to sun-spots/sun-managers, and received even more reports of similar problems, including one other site that received a similar brush-off ("multi-user dumps aren't supported").

31-Jul-90: After repeated calls to Sun, getting various managers involved, and having the problem "escalated" even further, the problem was finally identified.

**********************************************************************
Fix: Remove from /etc/fstab the line:

    /dev/xy0b swap swap rw 0 0

Apparently in SunOS 4.1, if you have an fstab entry for the default swap partition, then when you go multi-user and run swapon(8) the default swap gets added again. This eventually leads to the kernel crashing when dump runs and causes the system to swap. This is an unconfirmed theory (we are still waiting for our sources), but removing the fstab entry stopped the system from crashing. We are now back to daily multi-user incremental dumps on our systems.
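For anyone wanting to check their own machines for such an entry, here is a rough sketch in Python. It is only an illustration: the helper names find_swap_entries and comment_out_device are made up for this example, the sample fstab contents other than the /dev/xy0b swap line are hypothetical, and it assumes the usual whitespace-separated fstab layout with the filesystem type in the third field. It works on text passed to it and does not touch /etc/fstab itself.

```python
def find_swap_entries(fstab_text):
    """Return (line_number, device) for each active line whose type field is 'swap'."""
    entries = []
    for lineno, line in enumerate(fstab_text.splitlines(), start=1):
        stripped = line.strip()
        # Skip blank lines and comments.
        if not stripped or stripped.startswith("#"):
            continue
        fields = stripped.split()
        # Assumed layout: device, mount point, type, options, freq, pass.
        if len(fields) >= 3 and fields[2] == "swap":
            entries.append((lineno, fields[0]))
    return entries


def comment_out_device(fstab_text, device):
    """Return a copy of the text with the swap entry for `device` commented out."""
    out = []
    for line in fstab_text.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[0] == device and fields[2] == "swap":
            out.append("#" + line)
        else:
            out.append(line)
    return "\n".join(out) + "\n"


# Hypothetical sample; only the swap line is taken from the posting.
SAMPLE = """\
/dev/xy0a / 4.2 rw 1 1
/dev/xy0b swap swap rw 0 0
/dev/xy0g /usr 4.2 rw 1 2
"""

if __name__ == "__main__":
    print(find_swap_entries(SAMPLE))
    print(comment_out_device(SAMPLE, "/dev/xy0b"), end="")
```

Commenting the line out rather than deleting it keeps the original entry around in case it needs to be restored later.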
Now all we have to do is get one of our machines, whose disk got trashed when a faulty disk controller was swapped in during one of numerous experiments, back into full service.

Thanks to everyone who responded with suggestions and reports of similar occurrences. It helped put the pressure on Sun to get them to look at the problem seriously.

--Fuat

Internet: fuat@columbia.edu                U.S. MAIL: Columbia University
BITNET:   fuat@cunixf                      Center for Computing Activities
UUCP:     ...!rutgers!columbia!cunixf!fuat 712 Watson Labs, 612 W115th St.
Phone: (212) 854-5128  Fax: (212) 662-6442 New York, NY 10025
steve@maths.warwick.ac.uk (Steve Rumsby) (08/13/90)
> Apparently in SunOS 4.1, if you have an fstab entry for the default swap
> partition, then when you go multi-user and run swapon(8) the default swap
> gets added again. This eventually leads to the kernel crashing when dump
> runs and causes the system to swap. This is an unconfirmed theory (we are
> still waiting for our sources), but removing the fstab entry stopped the
> system from crashing.

No, I don't believe this! There are lots of machines around here running with the default swap partitions mentioned in /etc/fstab. This has never caused a problem. Surely this is standard practice? I've always done it this way. Yes, the machines are running 4.1, and yes, I do dumps while running multi-user.

If Sun had broken this (and I guess I wouldn't be *completely* surprised) there would have been many more people suffering because of it.

It is kind of curious that removing the fstab line stopped the crashing though...

UUCP:     ...!ukc!warwick!steve
Internet: steve@maths.warwick.ac.uk
JANET:    steve@uk.ac.warwick.maths
PHONE:    +44 203 524657