[comp.bugs.4bsd] Bus error crash during dumps

rich@cfi.COM (rich) (01/14/88)
We have recently encountered a bug in the kernel, apparently documented by
Sun in the May 1987 Software Technical Bulletin, that causes our file server
(a 3/280) to crash with a bus error while running a full system dump using
a third-party tape backup system.  I am looking for more information about
the bug; in particular what causes it and how we might get around it.  Now
for the details.

We have a Sun network with two servers (a 3/180 with a single Eagle disk and
a 3/280 with two Super Eagles), five 3/160s (dual 70MB disks), a 3/60 (single
70MB disk), and three 3/50s (diskless).  They are all running SunOS 3.4.
We perform system backups using the UBACKUP package from Unitech Software (a
nice package, by the way).  The package uses the Sun-supplied tar (it can use
cpio) at the lowest level to perform the dumps (full dump on the weekend,
incrementals during the week).  We dump the entire network (with some excluded
directories) through NFS mounts.

This backup system worked fine for about 8 months.  About a month ago, we
decided to change the way we access remote machines, reducing the mount list
from over 80 entries to about 15 and increasing the use of symbolic links.
After making this change (no physical files were moved), we tried to perform
a full dump, and we started getting the bus error.  After Sun took a look at
our core dump, they determined that we were encountering a reported bug:
Ref# 1004002 in the May '87 STB (page 134).  They also said that the bug has
been fixed in 4.0.  The synopsis is:  "*crfreelist in kern_prot.c gets
trashed."  The description is:  "When doing extensive ethernet/disk activity
(time of occurrence ranges from 2 to 12 hours) the system may trap on a bus
error condition."  The crash usually occurs near the beginning of the third
tape (sure enough, about 2 1/2 hours into the dump), but not always at the same
place in the file system.  It does not crash when dumping individual machines
(e.g., /remote/<machinename>/u and /remote/<machinename>/u2), probably because
that takes less than two hours.

The obvious fix would be to go back to the old way of doing our mounts.
However, that was getting incredibly unwieldy as we were mounting every new
login directory and every new project development area.  Backtracking is the
least desirable solution for us.  It would also take a lot of time.

We have tried running the backup when the entire network is idle (weekend, no
users or background processes); we've also tried running the backup on the
slower file server (the 3/180), with no luck.  It has now been a month since our
last successful backup, and we're getting very nervous.  We could try
partitioning the backup process into under 2-hour chunks, but that would
greatly increase the number of tapes and time required for dumps, not to
mention the increased confusion resulting from this diddling.
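
For what it's worth, the partitioning idea could be sketched roughly as
below.  This is only an illustration, not anything UBACKUP provides: the
function name, the directory names, and the per-directory time estimates
are all hypothetical, and it assumes the crash really does depend on run
length, which is a guess based on the 2-to-12-hour figure in the STB.  It
groups dump targets greedily (largest first) so that no single run exceeds
a chosen time budget.

```python
def split_into_chunks(estimates, budget_minutes):
    """Group dump targets so each group's estimated time fits the budget.

    estimates: dict mapping a directory path to its estimated dump time
    in minutes (hypothetical numbers supplied by the operator).
    Uses first-fit-decreasing: take targets largest first and place each
    one in the first existing group that still has room for it.
    """
    chunks = []  # each entry is [total_minutes, [paths]]
    for path, minutes in sorted(estimates.items(), key=lambda kv: -kv[1]):
        if minutes > budget_minutes:
            raise ValueError(f"{path} alone exceeds the time budget")
        for chunk in chunks:
            if chunk[0] + minutes <= budget_minutes:
                chunk[0] += minutes
                chunk[1].append(path)
                break
        else:
            # No existing group has room, so start a new one.
            chunks.append([minutes, [path]])
    return [paths for _, paths in chunks]


# Made-up estimates for two machines; the budget stays under 2 hours.
example = {
    "/remote/a/u": 70,
    "/remote/a/u2": 50,
    "/remote/b/u": 60,
    "/remote/b/u2": 40,
}
groups = split_into_chunks(example, budget_minutes=110)
```

Each resulting group would then be dumped as a separate run.  The catch
remains the one noted above: more runs mean more tapes and more bookkeeping.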

Our ideal solution (do I hear laughing out there?) would be to get the patch
from Sun that fixes this problem.  (BTW: We do not have a source license.)
Our next-to-ideal solution would be to get enough info about this bug from all
you wizards out there to enable us to create a simpler "work-around".  Although
I've been in this business long enough to know better than to ask, I will
anyway:  Why did we not get this bug before?  We are backing up using the
same mechanism as before; it's just that the files go on tape in a different
order and with different pathnames.  (There's a thought:  I think the old
system probably interspersed NFS directories with local directories...)

Any other sufferers of this bug out there?

Rich Baughman     The Consumer Financial Institute:  617-899-6500
rich@CFI.COM
{decvax!yale|allegra|ihnp4|ucbvax!cbosgd}!ima!cfisun!rich

P.S. Should this also be posted to comp.sys.sun?  If so, how do you post to
a moderated group (I'm new to this posting business)?  Just postnews to
comp.sys.sun, or mail direct to the moderator (phil@Rice.edu, or
Sun-Spots@rice.edu)?