rich@cfi.COM (rich) (01/14/88)
We have recently encountered a bug in the kernel, apparently documented by Sun in the May 1987 Software Technical Bulletin, that causes our file server (a 3/280) to crash with a bus error while running a full system dump using a third-party tape backup system. I am looking for more information about the bug; in particular, what causes it and how we might get around it.

Now for the details. We have a Sun network with two servers (a 3/180 with a single Eagle disk and a 3/280 with two Super Eagles), five 3/160s (dual 70MB disks), a 3/60 (single 70MB disk), and three 3/50s (diskless). They are all running SunOS 3.4. We perform system backups using the UBACKUP package from Unitech Software (a nice package, by the way). The package uses the Sun-supplied tar (it can also use cpio) at the lowest level to perform the dumps (a full dump on the weekend, incrementals during the week). We dump the entire network (with some excluded directories) through NFS mounts.

This backup system worked fine for about 8 months. About a month ago, we decided to change the way we access remote machines, reducing the mount list from over 80 entries to about 15 and increasing the use of symbolic links. After making this change (no physical files were moved), we tried to perform a full dump, and we started getting the bus error.

After Sun took a look at our core dump, they determined that we were encountering a reported bug: Ref# 1004002 in the May '87 STB (page 134). They also said that the bug has been fixed in 4.0. The synopsis is: "*crfreelist in kern_prot.c gets trashed." The description is: "When doing extensive ethernet/disk activity (time of occurrence ranges from 2 to 12 hours) the system may trap on a bus error condition." The crash usually occurs near the beginning of the third tape (sure enough, about 2 1/2 hours into the dump), but not always at the same place in the file system.
It does not crash when dumping individual machines (e.g., /remote/<machinename>/u and /remote/<machinename>/u2), probably because that takes less than two hours.

The obvious fix would be to go back to the old way of doing our mounts. However, that was getting incredibly unwieldy, as we were mounting every new login directory and every new project development area. Backtracking is the least desirable solution for us. It would also take a lot of time.

We have tried running the backup when the entire network is idle (weekend, no users or background processes); we've also tried running the backup on the slower file server (the 3/180), with no luck. It has now been a month since our last successful backup, and we're getting very nervous. We could try partitioning the backup process into under 2-hour chunks, but that would greatly increase the number of tapes and the time required for dumps, not to mention the increased confusion resulting from this diddling.

Our ideal solution (do I hear laughing out there?) would be to get the patch from Sun that fixes this problem. (BTW: We do not have a source license.) Our next-to-ideal solution would be to get enough info about this bug from all you wizards out there to enable us to create a simpler "work-around".

Although I've been in this business long enough to know better than to ask, I will anyway: Why did we not get this bug before? We are backing up using the same mechanism as before; it's just that the files go on tape in a different order and with different pathnames. (There's a thought: I think the old system probably interspersed NFS directories with local directories...)

Any other sufferers of this bug out there?

Rich Baughman
The Consumer Financial Institute: 617-899-6500
rich@CFI.COM {decvax!yale|allegra|ihnp4|ucbvax!cbosgd}!ima!cfisun!rich

P.S. Should this also be posted to comp.sys.sun? If so, how do you post to a moderated group (I'm new to this posting business)?
Just postnews to comp.sys.sun, or mail direct to the moderator (phil@Rice.edu, or Sun-Spots@rice.edu)?
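
For anyone weighing the partitioning option mentioned above (splitting the dump into chunks that each finish in under two hours), here is a minimal sketch of how the chunks might be planned. It is not a fix for the kernel bug and has nothing to do with UBACKUP's own interface. The directory names, the per-directory time estimates, and the 110-minute limit (a guessed safety margin under the 2-hour lower bound quoted in the STB) are all hypothetical. The grouping uses a simple first-fit-decreasing heuristic, which keeps the chunk count low but does not guarantee the minimum.

```python
def plan_chunks(estimates, limit_minutes=110):
    """Group directories into dump chunks whose estimated time fits the limit.

    estimates: dict mapping a directory path to its estimated dump time in
    minutes (hypothetical figures; they would have to be measured).
    Uses first-fit decreasing: longest directories are placed first, each
    into the first chunk that still has room.
    """
    chunks = []  # each chunk: {"minutes": running total, "paths": [...]}
    # Sort by time descending, then by path so the result is deterministic.
    ordered = sorted(estimates.items(), key=lambda kv: (-kv[1], kv[0]))
    for path, minutes in ordered:
        if minutes > limit_minutes:
            raise ValueError(f"{path} alone exceeds the {limit_minutes}-minute limit")
        for chunk in chunks:
            if chunk["minutes"] + minutes <= limit_minutes:
                chunk["paths"].append(path)
                chunk["minutes"] += minutes
                break
        else:
            # No existing chunk has room, so start a new one.
            chunks.append({"minutes": minutes, "paths": [path]})
    return chunks


# Hypothetical machines and timings, for illustration only.
example_estimates = {
    "/remote/hostA/u": 50,
    "/remote/hostA/u2": 40,
    "/remote/hostB/u": 70,
    "/remote/hostB/u2": 30,
    "/remote/hostC/u": 60,
}

example_plan = plan_chunks(example_estimates)

if __name__ == "__main__":
    for number, chunk in enumerate(example_plan, start=1):
        print(number, chunk["minutes"], " ".join(chunk["paths"]))
```

Each resulting chunk would then be handed to a separate backup run. The cost noted above still applies: more runs mean more tapes and more bookkeeping.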